Scaling Groups

A scaling group is a replicated, HTTP-fronted service managed by Chalk. Each group is backed by a Kubernetes Deployment with rolling updates, automatic service discovery, and CPU-utilization-based autoscaling between configurable min and max replica counts. Use scaling groups for ML inference servers, internal APIs, agent backends, and any long-lived service that needs to be reachable from outside the cluster.

For workloads that don’t need replication or HTTP fronting, use a Container instead. For serverless, function-shaped invocations, use Functions.

You can drive scaling groups in two ways: the Python SDK (chalkcompute.ScalingGroup) or the chalk scaling-group CLI. Both surfaces target the same underlying service, so a group created through one is visible and manageable from the other.

Quick start (SDK)

from chalkcompute import ScalingGroup, Image

img = (
    Image.debian_slim("3.12")
    .pip_install(["flask"])
    .add_local_file("./app.py", "/app/app.py")
    .entrypoint(["python", "/app/app.py"])
)

sg = ScalingGroup(
    image=img,
    name="hello-api",
    port=8080,
    min_replicas=1,
    max_replicas=3,
).deploy().wait_ready()

resp = sg.call("/health", method="GET")
print(resp.status_code, resp.text)

sg.delete()

ScalingGroup(...).deploy() builds the image, uploads any local files declared via add_local_file / add_local_dir, creates the scaling group, and waits until it reaches Running. Chain .wait_ready() to also wait for at least one replica to be ready to serve traffic.
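
The quick start assumes an ./app.py that listens on port 8080 and answers /health, but that file is not shown. The following is a minimal stand-in that uses only the Python standard library (the flask package installed above would work equally well). The names route, HealthHandler, and build_server, and the JSON bodies, are illustrative assumptions rather than anything Chalk requires.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def route(path):
    """Map a request path to (status_code, JSON-serializable body)."""
    if path == "/health":
        return 200, {"status": "ok"}
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    """Serves GET requests by delegating to route()."""

    def do_GET(self):
        status, payload = route(self.path)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging to keep the example quiet.
        pass


def build_server(port=8080):
    # Bind to all interfaces so traffic routed to the replica can reach it.
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

To use this as app.py, end the file with build_server().serve_forever() so the process keeps serving on the port passed to ScalingGroup.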

Quick start (CLI)

$ chalk scaling-group create \
    --image=my-registry/my-inference-server:v1 \
    --name=inference \
    --port=8080

This creates a Kubernetes Deployment with a single replica running your image, a headless Service, and an HTTPRoute that exposes the service over HTTPS.

You can tune replicas and resource limits at creation time:

$ chalk scaling-group create \
    --image=my-registry/my-inference-server:v1 \
    --name=inference \
    --port=8080 \
    --replicas=3 \
    --cpu=4 \
    --memory=8Gi

Replicas and autoscaling

The SDK exposes the autoscaler controls directly on the ScalingGroup constructor:

Parameter	Default	Description
`min_replicas`	`1`	Lower bound for the autoscaler. Set to `0` to allow scale-to-zero.
`max_replicas`	`1`	Upper bound for the autoscaler.
`target_cpu_utilization_percentage`	unset (cluster default)	Average CPU utilization the autoscaler targets.
`scaling_interval_seconds`	unset	How often the autoscaler re-evaluates.
`shutdown_delay_seconds`	unset	Grace period when removing replicas, to drain in-flight requests.

sg = ScalingGroup(
    image=img,
    name="inference",
    port=8080,
    min_replicas=2,
    max_replicas=20,
    target_cpu_utilization_percentage=70,
    shutdown_delay_seconds=30,
).deploy().wait_ready()

From the CLI, --replicas=N sets a fixed replica count. For dynamic autoscaling, use the SDK or edit the underlying spec.
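
To get a feel for how min_replicas, max_replicas, and target_cpu_utilization_percentage interact, here is a rough sketch of the replica calculation. It assumes the autoscaler follows the standard Kubernetes Horizontal Pod Autoscaler rule (desired = ceil(current * observed / target)), which this page does not confirm; the real controller also applies tolerances and stabilization windows that are left out. The function name desired_replicas is hypothetical.

```python
def desired_replicas(current_replicas, observed_cpu_percent,
                     target_cpu_percent, min_replicas, max_replicas):
    """Estimate the replica count an HPA-style autoscaler would pick.

    Assumption: standard Kubernetes HPA formula, clamped to the
    configured bounds. Inputs are integers (percentages, not fractions).
    """
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    # Integer ceiling division: ceil(current * observed / target).
    raw = -(-current_replicas * observed_cpu_percent // target_cpu_percent)
    # Clamp to the configured range.
    return max(min_replicas, min(max_replicas, raw))


# Bounds and target taken from the example above: min 2, max 20, target 70%.
print(desired_replicas(4, 90, 70, 2, 20))    # load above target: scale out
print(desired_replicas(10, 20, 70, 2, 20))   # load below target: scale in
print(desired_replicas(10, 5, 70, 2, 20))    # clamped at min_replicas
print(desired_replicas(15, 140, 70, 2, 20))  # clamped at max_replicas
```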

CPU and memory resources

The --cpu and --memory flags (CLI) and the cpu / memory constructor arguments (SDK) control the resource requests and limits for each replica. When specified, both the Kubernetes request and limit are set to the same value, guaranteeing your workload the resources it asks for.

$ chalk scaling-group create \
    --image=my-registry/my-app:v1 \
    --name=my-service \
    --port=8080 \
    --cpu=2 \
    --memory=4Gi

If you omit these flags, the following defaults are applied:

Resource	Request	Limit
CPU	`100m`	`1`
Memory	`256Mi`	`1Gi`

CPU values follow Kubernetes conventions: 1 means one full core, 500m means half a core. Memory values use standard suffixes: Mi (mebibytes) and Gi (gibibytes).
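
If you compute resource values in code (for example, to size replicas from a model's footprint), it can help to normalize these strings first. The helpers below are an illustrative sketch, not part of the Chalk SDK; they cover only the forms mentioned above (whole or fractional cores, the m suffix, and the Mi and Gi suffixes).

```python
_MEMORY_FACTORS = {"Mi": 1024 ** 2, "Gi": 1024 ** 3}


def cpu_to_millicores(value):
    """Convert a CPU quantity such as '500m', '2', or '0.5' to millicores."""
    text = str(value).strip()
    if text.endswith("m"):
        return int(text[:-1])
    return round(float(text) * 1000)


def memory_to_bytes(value):
    """Convert a memory quantity such as '256Mi' or '4Gi' to bytes."""
    text = str(value).strip()
    for suffix, factor in _MEMORY_FACTORS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    # A bare number is interpreted as a byte count.
    return int(text)


# The default CPU limit ('1') is ten times the default request ('100m').
print(cpu_to_millicores("1") // cpu_to_millicores("100m"))
# The default memory limit ('1Gi') is four times the default request ('256Mi').
print(memory_to_bytes("1Gi") // memory_to_bytes("256Mi"))
```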

GPU support

Scaling groups support GPU-accelerated workloads. The SDK and CLI both accept a type:count value (or just count) for the GPU request:

sg = ScalingGroup(
    image=img,
    name="gpu-inference",
    port=8080,
    cpu="8",
    memory="32Gi",
    gpu="nvidia-l4:1",
    min_replicas=1,
    max_replicas=4,
).deploy().wait_ready()

$ chalk scaling-group create \
    --image=my-registry/my-gpu-model:v1 \
    --name=gpu-inference \
    --port=8080 \
    --gpu=nvidia-tesla-t4:1 \
    --cpu=4 \
    --memory=16Gi

The type portion (e.g. nvidia-tesla-t4) is used to select the right node pool via a Kubernetes node selector (cloud.google.com/gke-accelerator), while the count sets the nvidia.com/gpu resource request and limit on the container. A toleration for nvidia.com/gpu is applied automatically so the pods can schedule onto GPU-tainted nodes.

If you don’t need to target a specific GPU type (for example on EKS where node selection is handled differently), pass just the count:

$ chalk scaling-group create \
    --image=my-registry/my-gpu-model:v1 \
    --name=gpu-inference \
    --port=8080 \
    --gpu=1
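
The two accepted forms (type:count and a bare count) and the scheduling settings described above can be sketched as follows. This is an illustration of the mapping, not Chalk's implementation: the function names are hypothetical, and the toleration's operator value is an assumption, since only the toleration key is documented here.

```python
def parse_gpu(spec):
    """Split 'nvidia-l4:1' into ('nvidia-l4', 1); a bare '1' gives (None, 1)."""
    gpu_type, _, count_text = str(spec).rpartition(":")
    count = int(count_text)
    if count < 1:
        raise ValueError("GPU count must be at least 1")
    return (gpu_type or None, count)


def gpu_pod_settings(spec):
    """Build the pod-level settings the section above describes."""
    gpu_type, count = parse_gpu(spec)
    settings = {
        # The count becomes both the request and the limit.
        "resources": {
            "requests": {"nvidia.com/gpu": count},
            "limits": {"nvidia.com/gpu": count},
        },
        # Assumption: an 'Exists' toleration on the nvidia.com/gpu taint.
        "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists"}],
    }
    if gpu_type is not None:
        # Only a typed request pins the pod to a specific node pool.
        settings["nodeSelector"] = {"cloud.google.com/gke-accelerator": gpu_type}
    return settings


print(gpu_pod_settings("nvidia-tesla-t4:1"))
print(gpu_pod_settings("1"))
```

Note that the count-only form produces no node selector, which matches the case where node selection is handled elsewhere.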

Available GPU types depend on what node pools are configured in your cluster. Common GKE values include:

GPU type	Description
`nvidia-tesla-t4`	NVIDIA T4 (cost-effective inference)
`nvidia-tesla-a100`	NVIDIA A100 (high-performance training and inference)
`nvidia-l4`	NVIDIA L4 (balanced inference)
`nvidia-tesla-v100`	NVIDIA V100 (training)

To request multiple GPUs, increase the count:

$ chalk scaling-group create \
    --image=my-registry/my-training-server:v1 \
    --name=multi-gpu \
    --port=8080 \
    --gpu=nvidia-tesla-a100:4

DNS and routing

Every scaling group is automatically assigned a TLS-terminated DNS hostname through the cluster’s gateway. The pattern is:

<environment-id>-<scaling-group-name>.<gateway-domain>

For example, a scaling group named inference in environment abc123 with gateway domain gw.chalk.ai would be reachable at:

https://abc123-inference.gw.chalk.ai

From the SDK, read it off the deployed group:

print(sg.web_url)
# https://abc123-inference.gw.chalk.ai

From the CLI, the URL is shown as Web URL in the output of chalk scaling-group get and chalk scaling-group list. Traffic arriving at this hostname is routed to the port you specified with --port (or the port constructor argument).

Because the hostname includes the scaling group name, you can reference it by a stable, human-readable URL rather than an opaque UUID.
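
If you need the URL before a group exists (for example, to template a client's configuration), the pattern can be reproduced in a few lines. The helper below is a hypothetical sketch; prefer web_url when you have a deployed group. The length check reflects the general DNS limit of 63 characters per label and is an assumption about what could go wrong, not a documented Chalk validation.

```python
def scaling_group_url(environment_id, name, gateway_domain):
    """Compose https://<environment-id>-<scaling-group-name>.<gateway-domain>."""
    label = f"{environment_id}-{name}"
    # DNS limits each dot-separated label to 63 characters.
    if len(label) > 63:
        raise ValueError(f"hostname label {label!r} exceeds 63 characters")
    return f"https://{label}.{gateway_domain}"


print(scaling_group_url("abc123", "inference", "gw.chalk.ai"))
```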

Invoking from the SDK

The .call() helper composes paths against web_url and returns an httpx.Response. Use it for ad-hoc invocations from the SDK; production traffic should hit web_url directly.

resp = sg.call(
    "/predict",
    method="POST",
    json={"text": "the cat sat on the mat"},
    timeout=60.0,
)
print(resp.json())
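
For callers that hit web_url directly without the SDK, a plain HTTP client is enough. The sketch below uses only the standard library and mirrors the /predict call above. It assumes the endpoint accepts and returns JSON, and it omits any authentication headers your gateway may require, since those are not described here. The function names are illustrative.

```python
import json
import urllib.request


def build_predict_request(web_url, payload, path="/predict"):
    """Prepare a JSON POST against a scaling group's web URL."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=web_url.rstrip("/") + path,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request, timeout=60.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_predict_request(
    "https://abc123-inference.gw.chalk.ai",
    {"text": "the cat sat on the mat"},
)
print(req.get_method(), req.full_url)
```

Call send(req) to perform the request once the group is deployed and reachable.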

Environment, secrets, and volumes

These are wired identically to containers:

from chalkcompute import ScalingGroup, Image, Secret

sg = ScalingGroup(
    image=img,
    name="api",
    port=8080,
    env={"LOG_LEVEL": "INFO"},
    secrets=[Secret.from_env("OPENAI_API_KEY")],
    volumes=[("training-data", "/data")],
    min_replicas=1,
    max_replicas=3,
).deploy().wait_ready()

Secrets are resolved at deploy time and injected as environment variables in every replica. Volume mounts are shared across all replicas, so use them for read-mostly data: if multiple replicas write to the same volume path concurrently, the last writer wins on the next sync.
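
Inside the replica, the application reads these values like any other environment variable. The sketch below shows one way the service from the example above might load its settings at startup. It assumes the secret is exposed under the same name it was declared with (OPENAI_API_KEY), which is not stated explicitly here; load_settings is a hypothetical helper.

```python
import os


def load_settings(environ=os.environ):
    """Read configuration injected through env= and secrets=."""
    api_key = environ.get("OPENAI_API_KEY")
    if not api_key:
        # Fail fast so a misconfigured replica never reports ready.
        raise RuntimeError("OPENAI_API_KEY is not set; check the secrets= argument")
    return {
        "openai_api_key": api_key,
        # Fall back to INFO when the variable is absent.
        "log_level": environ.get("LOG_LEVEL", "INFO"),
    }


# Passing a plain dict keeps the example runnable outside a replica.
settings = load_settings({"OPENAI_API_KEY": "placeholder", "LOG_LEVEL": "DEBUG"})
print(settings["log_level"])
```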

Managing scaling groups

Listing

$ chalk scaling-group list

Displays a table of all scaling groups in your environment with their ID, name, image, status, replica counts, URL, and creation time.

from chalk.client import ChalkClient

client = ChalkClient()
response = client.list_scaling_groups()

for scaling_group in response.scaling_groups:
    print(f"Scaling group: {scaling_group.name}")
    print(f"  Status: {scaling_group.status}")

Inspecting

$ chalk scaling-group get --name=inference

Shows detailed information including the spec, replica status (desired, ready, available), tags, and URL.

from chalkcompute import ScalingGroup

# Attach to an existing group by name
sg = ScalingGroup.from_name("inference")
# or by ID
sg = ScalingGroup.from_id("scaling-group-id")

info = sg.refresh()
print(info.status)             # 'Running', 'Available', 'Failed', ...
print(info.ready_replicas)
print(info.available_replicas)
print(sg.web_url)
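
When you attach to an existing group and want to block until it settles, a small polling loop around refresh() is one option. The helper below is a generic sketch, not an SDK function: it takes any zero-argument callable that returns a status string, and only the 'Running' and 'Failed' values are taken from the comment above.

```python
import time


def wait_for_status(fetch_status, wanted=("Running",), failed=("Failed",),
                    timeout=300.0, interval=5.0):
    """Poll fetch_status() until it returns a wanted or failed status."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in wanted:
            return status
        if status in failed:
            raise RuntimeError(f"scaling group entered status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        time.sleep(interval)


# Stand-in status source so the example runs without a deployed group.
fake_statuses = iter(["Available", "Running"])
print(wait_for_status(lambda: next(fake_statuses), interval=0))
```

With a real group, pass lambda: sg.refresh().status as fetch_status.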

Deleting

$ chalk scaling-group delete --name=inference

sg.delete()

Both surfaces remove the Kubernetes Deployment, Service, and HTTPRoute associated with the scaling group. The SDK additionally cleans up any temporary volumes created from add_local_file uploads. The database record is soft-deleted so the group still appears in history.

Entrypoint override

To override the image’s default entrypoint, pass entrypoint from the SDK or --entrypoint from the CLI:

sg = ScalingGroup(
    image="python:3.12",
    name="custom-server",
    port=8000,
    entrypoint=["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"],
).deploy()

$ chalk scaling-group create \
    --image=python:3.12 \
    --name=custom-server \
    --port=8000 \
    --entrypoint="python,-m,uvicorn,main:app,--host,0.0.0.0,--port,8000"

The CLI takes comma-separated arguments; the SDK takes a list.
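
Converting between the two forms is a join and a split. The sketch below assumes the CLI splits on every comma and offers no escaping, which is not spelled out here; under that assumption, an argument that itself contains a comma cannot be expressed through --entrypoint, so the helper rejects it. Both function names are hypothetical.

```python
def entrypoint_to_cli(args):
    """Turn an SDK-style list into the CLI's comma-separated string."""
    for arg in args:
        if "," in arg:
            # Assumption: no escape syntax, so this would be split apart.
            raise ValueError(f"argument {arg!r} contains a comma")
    return ",".join(args)


def entrypoint_from_cli(value):
    """Turn the CLI's comma-separated string back into a list."""
    return value.split(",")


sdk_form = ["python", "-m", "uvicorn", "main:app",
            "--host", "0.0.0.0", "--port", "8000"]
cli_form = entrypoint_to_cli(sdk_form)
print(cli_form)
print(entrypoint_from_cli(cli_form) == sdk_form)
```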

Full example

A GPU-accelerated inference service with custom resource limits, a fixed count of two replicas, tags, and an entrypoint override:

$ chalk scaling-group create \
    --image=my-registry/llm-serving:v3 \
    --name=llm-inference \
    --port=8080 \
    --replicas=2 \
    --gpu=nvidia-l4:1 \
    --cpu=8 \
    --memory=32Gi \
    --tags="model=llama3,team=ml-platform" \
    --entrypoint="python,-m,vllm.entrypoints.openai.api_server,--model,/models/llama3"

After creation, the service is reachable at its assigned DNS name and can be referenced by resolvers or other services running in the same Chalk environment.
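
If you create groups from a deploy script rather than by hand, the same command can be assembled in Python. The sketch below uses only the flags that appear on this page and only builds the argument list; create_command is a hypothetical helper, and the key=value,key=value tag format is inferred from the example above.

```python
def create_command(image, name, port, replicas=None, gpu=None,
                   cpu=None, memory=None, tags=None, entrypoint=None):
    """Build the argv for `chalk scaling-group create`."""
    argv = [
        "chalk", "scaling-group", "create",
        f"--image={image}", f"--name={name}", f"--port={port}",
    ]
    optional = {"replicas": replicas, "gpu": gpu, "cpu": cpu, "memory": memory}
    for flag, value in optional.items():
        if value is not None:
            argv.append(f"--{flag}={value}")
    if tags:
        # Tags are written as comma-separated key=value pairs.
        argv.append("--tags=" + ",".join(f"{k}={v}" for k, v in tags.items()))
    if entrypoint:
        # The CLI expects the entrypoint as comma-separated arguments.
        argv.append("--entrypoint=" + ",".join(entrypoint))
    return argv


argv = create_command(
    image="my-registry/llm-serving:v3",
    name="llm-inference",
    port=8080,
    replicas=2,
    gpu="nvidia-l4:1",
    cpu=8,
    memory="32Gi",
    tags={"model": "llama3", "team": "ml-platform"},
    entrypoint=["python", "-m", "vllm.entrypoints.openai.api_server",
                "--model", "/models/llama3"],
)
print(" ".join(argv))
```

Passing the list to subprocess.run(argv, check=True) runs the command without shell quoting concerns.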

When to use a Scaling Group vs other compute primitives

See "Choosing the right primitive" on the Compute overview page for the full decision table.
