Overview

Running your own inference endpoint gives you full control over model selection, GPU allocation, and cost. Chalk Compute supports autoscaling containers with GPU access — deploy a vLLM server once and let Chalk scale it from zero to your peak traffic.

This tutorial deploys Gemma 3 (google/gemma-3-27b-it) using vLLM with autoscaling and persistent model caching.


Cache model weights with a volume

Large model files (multi-GB) should live in a Volume so they persist across container restarts and are shared across replicas. This avoids re-downloading weights every time a container scales up.

from chalkcompute import Volume

vol = Volume(name="gemma4-weights")

On first boot, vLLM downloads the model into the Hugging Face cache directory. By mounting the volume at that path, subsequent containers (including new replicas created by autoscaling) skip the multi-GB download and begin serving as soon as the model loads into GPU memory.
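
You can check whether a container will hit the cache by inspecting the mount. The Hugging Face hub cache stores repos under hub/models--{org}--{name}; the helper below is an illustrative sketch, not part of the chalkcompute API:

```python
from pathlib import Path

def weights_cached(cache_dir: str, repo_id: str) -> bool:
    """Return True if the Hugging Face hub cache under cache_dir already
    holds snapshot files for repo_id (layout: hub/models--{org}--{name})."""
    repo_dir = Path(cache_dir) / "hub" / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())

# e.g. weights_cached("/root/.cache/huggingface", "google/gemma-3-27b-it")
```

Run this inside the container (or against the volume) to confirm a replica will start without re-downloading.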


Define the container

Create deploy_gemma4.py:

from chalkcompute import Container, Image, Volume

vol = Volume(name="gemma4-weights")

image = (
    Image.base("vllm/vllm-openai:latest")
    .run_commands(
        "pip install huggingface_hub",
    )
)

container = Container(
    image=image,
    name="gemma4-vllm",
    env={
        "HF_TOKEN": "hf_...",                    # Hugging Face access token
        "HUGGING_FACE_HUB_TOKEN": "hf_...",
    },
    port=8000,
    volumes={"gemma4-weights": "/root/.cache/huggingface"},
    min_instances=1,
    max_instances=4,
    max_concurrent_requests=32,
    entrypoint=[
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "google/gemma-3-27b-it",
        "--host", "0.0.0.0",
        "--port", "8000",
        "--tensor-parallel-size", "1",
        "--max-model-len", "8192",
        "--dtype", "auto",
    ],
).run()

print(f"Inference endpoint: {container.info.web_url}")

Key parameters

Parameter                      Purpose
-----------------------------  ------------------------------------------------------------
min_instances=1                Keep at least one replica warm — no cold starts.
max_instances=4                Scale up to 4 replicas under load.
max_concurrent_requests=32     Each replica handles up to 32 concurrent requests before Chalk routes to another.
volumes={...}                  Mount the weight cache so new replicas skip the download.

Deploy it

chalk compute deploy deploy_gemma4.py
# ✓ Container created successfully
# Container ID: c9d4e71a-5f23-48b6-a0e3-7824bc19d5f6
# Name: gemma4-vllm
# Status: Running
# Pod Name: chalk-container-gemma4-vllm
# URL: https://c9d4e71a-5f23-48b6-a0e3-7824bc19d5f6.compute.chalk.ai

Query the endpoint

vLLM exposes an OpenAI-compatible API. Point any OpenAI client at your container URL:

from openai import OpenAI

client = OpenAI(
    base_url="https://c9d4e71a-5f23-48b6-a0e3-7824bc19d5f6.compute.chalk.ai/v1",
    api_key="not-needed",  # no auth required within Chalk
)

response = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    messages=[
        {"role": "user", "content": "Explain feature stores in two sentences."},
    ],
)

print(response.choices[0].message.content)

Or with curl:

curl https://c9d4e71a-5f23-48b6-a0e3-7824bc19d5f6.compute.chalk.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-27b-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
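
The request body is plain JSON, so any HTTP client works. A minimal sketch of building the same payload programmatically (the build_chat_request helper is illustrative, not part of any SDK):

```python
import json

def build_chat_request(model: str, user_content: str) -> str:
    """Serialize an OpenAI-style chat completions payload."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }
    return json.dumps(payload)

body = build_chat_request("google/gemma-3-27b-it", "Hello!")
```

POST the resulting string to /v1/chat/completions with a Content-Type: application/json header, exactly as the curl example does.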

Scaling behavior

Chalk monitors the request queue across all replicas. When max_concurrent_requests is reached on every running instance, a new replica spins up — pulling weights from the shared volume instead of re-downloading them.
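
While a new replica boots and loads weights, in-flight requests can fail transiently, so clients benefit from retrying with backoff. A minimal sketch (the with_retries helper and delay values are illustrative; in practice send() would issue the chat completion call):

```python
import time

def with_retries(send, attempts=4, base_delay=0.5):
    """Call send(); on failure, back off exponentially and retry.
    Useful while a new replica is still spinning up."""
    for attempt in range(attempts):
        try:
            return send()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts — surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

Wrap your client call, e.g. with_retries(lambda: client.chat.completions.create(...)).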

When traffic drops, Chalk scales back down to min_instances. Set min_instances=0 for development workloads where cold starts are acceptable.

# Dev configuration — scale to zero when idle
container = Container(
    image=image,
    name="gemma4-dev",
    port=8000,
    volumes={"gemma4-weights": "/root/.cache/huggingface"},
    min_instances=0,
    max_instances=2,
    max_concurrent_requests=8,
    entrypoint=[...],
).run()
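
With min_instances=0, the first request after idle waits for a container to boot. One way to handle this is to poll for readiness before sending traffic; a sketch, assuming the probe hits vLLM's /health endpoint and returns True on HTTP 200 (the wait_until_ready helper is illustrative):

```python
import time

def wait_until_ready(probe, timeout=300.0, interval=2.0):
    """Poll probe() until it returns True or timeout elapses.
    probe would typically GET {container_url}/health and
    report whether the server answered with HTTP 200."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

Call this once at client startup for dev endpoints; production endpoints with min_instances >= 1 stay warm and don't need it.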