# Compute
Deploy open-weight models like Gemma 4 with vLLM on Chalk Compute.
Running your own inference endpoint gives you full control over model selection, GPU allocation, and cost. Chalk Compute supports autoscaling containers with GPU access — deploy a vLLM server once and let Chalk scale it from zero to your peak traffic.
This tutorial deploys Gemma 4 using vLLM with autoscaling and persistent model caching.
Large model files (multi-GB) should live in a Volume so they persist across container
restarts and are shared across replicas. This avoids re-downloading weights every time a
container scales up.
```python
from chalkcompute import Volume

vol = Volume(name="gemma4-weights")
```

On first boot, vLLM downloads the model into the Hugging Face cache directory. By mounting the volume at that path, subsequent containers — including new replicas created by autoscaling — start serving immediately.
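As a sanity check on the mount path: absent an `HF_HOME` override, the Hugging Face cache root defaults to `~/.cache/huggingface`, which expands to `/root/.cache/huggingface` when the container runs as root. A minimal sketch of that resolution (plain Python, not a Chalk API):

```python
import os

# Default Hugging Face cache root: $HF_HOME if set, else ~/.cache/huggingface.
# In a container running as root, "~" expands to /root, which is why the
# volume is mounted at /root/.cache/huggingface to intercept vLLM's downloads.
def hf_cache_root() -> str:
    return os.environ.get(
        "HF_HOME",
        os.path.join(os.path.expanduser("~"), ".cache", "huggingface"),
    )

print(hf_cache_root())
```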
Create deploy_gemma4.py:
```python
from chalkcompute import Container, Image, Volume

vol = Volume(name="gemma4-weights")

image = (
    Image.base("vllm/vllm-openai:latest")
    .run_commands(
        "pip install huggingface_hub",
    )
)

container = Container(
    image=image,
    name="gemma4-vllm",
    env={
        "HF_TOKEN": "hf_...",  # Hugging Face access token
        "HUGGING_FACE_HUB_TOKEN": "hf_...",
    },
    port=8000,
    volumes={"gemma4-weights": "/root/.cache/huggingface"},
    min_instances=1,
    max_instances=4,
    max_concurrent_requests=32,
    entrypoint=[
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "google/gemma-3-27b-it",
        "--host", "0.0.0.0",
        "--port", "8000",
        "--tensor-parallel-size", "1",
        "--max-model-len", "8192",
        "--dtype", "auto",
    ],
).run()

print(f"Inference endpoint: {container.info.web_url}")
```

| Parameter | Purpose |
|---|---|
| `min_instances=1` | Keep at least one replica warm — no cold starts. |
| `max_instances=4` | Scale up to 4 replicas under load. |
| `max_concurrent_requests=32` | Each replica handles up to 32 concurrent requests before Chalk routes to another. |
| `volumes={...}` | Mount the weight cache so new replicas skip the download. |
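Together, these limits bound the fleet's peak throughput: Chalk can hold at most `max_instances × max_concurrent_requests` requests in flight before requests queue. A quick back-of-the-envelope check (plain Python, not a Chalk API):

```python
# Peak in-flight capacity implied by the autoscaling settings above.
max_instances = 4
max_concurrent_requests = 32

peak_in_flight = max_instances * max_concurrent_requests
print(peak_in_flight)  # 128 concurrent requests across the fleet
```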
```shell
chalk compute deploy deploy_gemma4.py
# ✓ Container created successfully
# Container ID: c9d4e71a-5f23-48b6-a0e3-7824bc19d5f6
# Name: gemma4-vllm
# Status: Running
# Pod Name: chalk-container-gemma4-vllm
# URL: https://c9d4e71a-5f23-48b6-a0e3-7824bc19d5f6.compute.chalk.ai
```

vLLM exposes an OpenAI-compatible API. Point any OpenAI client at your container URL:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://c9d4e71a-5f23-48b6-a0e3-7824bc19d5f6.compute.chalk.ai/v1",
    api_key="not-needed",  # no auth required within Chalk
)

response = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    messages=[
        {"role": "user", "content": "Explain feature stores in two sentences."},
    ],
)

print(response.choices[0].message.content)
```

Or with curl:
```shell
curl https://c9d4e71a-5f23-48b6-a0e3-7824bc19d5f6.compute.chalk.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-27b-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Chalk monitors the request queue across all replicas. When `max_concurrent_requests` is
reached on every running instance, a new replica spins up — pulling weights from the
shared volume instead of re-downloading them.
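The scale-up rule described above can be sketched as a ceiling division of in-flight requests by per-replica capacity, clamped to the configured bounds (a simplification for intuition; Chalk's actual scheduler internals aren't documented here):

```python
import math

def replicas_needed(in_flight: int, per_replica: int = 32,
                    min_instances: int = 1, max_instances: int = 4) -> int:
    """Replicas required so no instance exceeds per_replica concurrent requests."""
    wanted = math.ceil(in_flight / per_replica) if in_flight > 0 else 0
    return max(min_instances, min(max_instances, wanted))

print(replicas_needed(10))   # 1: fits on the single warm replica
print(replicas_needed(100))  # 4: 100 / 32 rounds up to 4
print(replicas_needed(500))  # 4: demand exceeds capacity, capped at max_instances
```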
When traffic drops, Chalk scales back down to `min_instances`. Set `min_instances=0`
for development workloads where cold starts are acceptable.
```python
# Dev configuration — scale to zero when idle
container = Container(
    image=image,
    name="gemma4-dev",
    port=8000,
    volumes={"gemma4-weights": "/root/.cache/huggingface"},
    min_instances=0,
    max_instances=2,
    max_concurrent_requests=8,
    entrypoint=[...],
).run()
```