Scaling groups let you run long-lived, replicated services inside your Chalk environment’s Kubernetes cluster. Each scaling group is backed by a Kubernetes Deployment, so you get rolling updates, replica management, and automatic service discovery out of the box.

Common use cases include hosting ML inference servers, running sidecar services that your resolvers depend on, and deploying internal APIs that need to scale independently from your feature pipelines.


Creating a scaling group

Use the chalk scaling-group create command to launch a new scaling group. At a minimum you need an image, a name, and a port:

$ chalk scaling-group create \
    --image=my-registry/my-inference-server:v1 \
    --name=inference \
    --port=8080

This creates a Kubernetes Deployment with a single replica running your image, a headless Service, and an HTTPRoute that exposes the service over HTTPS.

You can tune replicas and resource limits at creation time:

$ chalk scaling-group create \
    --image=my-registry/my-inference-server:v1 \
    --name=inference \
    --port=8080 \
    --replicas=3 \
    --cpu=4 \
    --memory=8Gi

GPU support

Scaling groups support GPU-accelerated workloads. Use the --gpu flag with a type:count value to request GPU resources:

$ chalk scaling-group create \
    --image=my-registry/my-gpu-model:v1 \
    --name=gpu-inference \
    --port=8080 \
    --gpu=nvidia-tesla-t4:1 \
    --cpu=4 \
    --memory=16Gi

The type portion (e.g. nvidia-tesla-t4) is used to select the right node pool via a Kubernetes node selector (cloud.google.com/gke-accelerator), while the count sets the nvidia.com/gpu resource request and limit on the container. A toleration for nvidia.com/gpu is also applied so the pods can schedule onto GPU-tainted nodes.

If you don’t need to target a specific GPU type (for example on EKS where node selection is handled differently), you can pass just the count:

$ chalk scaling-group create \
    --image=my-registry/my-gpu-model:v1 \
    --name=gpu-inference \
    --port=8080 \
    --gpu=1 \
    --cpu=4 \
    --memory=16Gi

This sets the nvidia.com/gpu resource request without adding a GPU-type node selector.

Available GPU types depend on what node pools are configured in your cluster. Common GKE values include:

GPU type             Description
nvidia-tesla-t4      NVIDIA T4 (cost-effective inference)
nvidia-tesla-a100    NVIDIA A100 (high-performance training and inference)
nvidia-l4            NVIDIA L4 (balanced inference)
nvidia-tesla-v100    NVIDIA V100 (training)

To request multiple GPUs, increase the count:

$ chalk scaling-group create \
    --image=my-registry/my-training-server:v1 \
    --name=multi-gpu \
    --port=8080 \
    --gpu=nvidia-tesla-a100:4

CPU and memory resources

The --cpu and --memory flags control the resource requests and limits for each replica. When specified, both the Kubernetes request and limit are set to the same value, guaranteeing your workload the resources it asks for. Setting requests equal to limits for both CPU and memory also places the pods in the Guaranteed QoS class, making them the last to be evicted under node pressure.

$ chalk scaling-group create \
    --image=my-registry/my-app:v1 \
    --name=my-service \
    --port=8080 \
    --cpu=2 \
    --memory=4Gi

If you omit these flags, the following defaults are applied:

Resource    Request    Limit
CPU         100m       1
Memory      256Mi      1Gi

CPU values follow Kubernetes conventions: 1 means one full core, 500m means half a core. Memory values use standard suffixes: Mi (mebibytes) and Gi (gibibytes).
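These conventions can be made concrete with two small conversion helpers. These are illustrative sketches of the Kubernetes quantity notation, not Chalk CLI code:

```python
def cpu_to_cores(value: str) -> float:
    """Convert a Kubernetes CPU quantity to cores: "1" -> 1.0, "500m" -> 0.5."""
    if value.endswith("m"):
        # "m" denotes millicores: 1000m == 1 core.
        return int(value[:-1]) / 1000
    return float(value)

def memory_to_bytes(value: str) -> int:
    """Convert a memory quantity like "256Mi" or "4Gi" to bytes.

    Mi and Gi are binary (1024-based) units: mebibytes and gibibytes.
    """
    units = {"Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    # No recognized suffix: treat as plain bytes.
    return int(value)
```

For example, the default CPU request of 100m is one tenth of a core, and the default memory request of 256Mi is 268,435,456 bytes.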


DNS and routing

Every scaling group is automatically assigned a DNS hostname and exposed over HTTPS through your cluster’s gateway. The hostname follows this pattern:

<environment-id>-<scaling-group-name>.<gateway-domain>

For example, a scaling group named inference in environment abc123 with gateway domain gw.chalk.ai would be reachable at:

https://abc123-inference.gw.chalk.ai

This hostname is shown as the Web URL in the output of chalk scaling-group get and chalk scaling-group list. Traffic arriving at this hostname is routed to the port you specified with --port.

Because the hostname includes the scaling group name, you can reference it by a stable, human-readable URL rather than an opaque UUID.
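The hostname pattern above can be sketched as a one-line helper. The function name is hypothetical; only the URL pattern comes from the documentation:

```python
def scaling_group_url(environment_id: str, name: str, gateway_domain: str) -> str:
    """Build a scaling group's HTTPS URL following the pattern
    <environment-id>-<scaling-group-name>.<gateway-domain>.
    """
    return f"https://{environment_id}-{name}.{gateway_domain}"
```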


Managing scaling groups

Listing scaling groups

$ chalk scaling-group list

This displays a table of all scaling groups in your environment with their ID, name, image, status, replica counts, URL, and creation time.

Inspecting a scaling group

$ chalk scaling-group get --name=inference

This shows detailed information about the scaling group including its spec, replica status (desired, ready, available), tags, and URL.

Deleting a scaling group

$ chalk scaling-group delete --name=inference

This removes the Kubernetes Deployment, Service, and HTTPRoute associated with the scaling group. The database record is soft-deleted so it still appears in history.


Tags

You can attach arbitrary key-value tags to a scaling group for organization and filtering:

$ chalk scaling-group create \
    --image=my-registry/my-app:v1 \
    --name=my-service \
    --port=8080 \
    --tags="team=ml,version=2.1"

Tags are applied as Kubernetes labels with the prefix chalk.ai/tag-, making them visible in standard Kubernetes tooling.
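The mapping from a --tags string to prefixed Kubernetes labels can be sketched as follows. The parsing details (whitespace handling, missing values) are assumptions; only the chalk.ai/tag- prefix comes from the documentation:

```python
def tags_to_labels(tags: str) -> dict[str, str]:
    """Convert a --tags string like "team=ml,version=2.1" into
    Kubernetes labels prefixed with chalk.ai/tag-.
    """
    labels = {}
    for pair in tags.split(","):
        # partition tolerates a missing "=", yielding an empty value.
        key, _, value = pair.partition("=")
        labels[f"chalk.ai/tag-{key.strip()}"] = value.strip()
    return labels
```

With the prefix applied, the tags can be used in standard label selectors, e.g. kubectl get pods -l chalk.ai/tag-team=ml.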


Entrypoint override

If you need to override the Docker image’s default entrypoint, use the --entrypoint flag with comma-separated arguments:

$ chalk scaling-group create \
    --image=python:3.12 \
    --name=custom-server \
    --port=8000 \
    --entrypoint="python,-m,uvicorn,main:app,--host,0.0.0.0,--port,8000"

Full example

Here is a complete example that provisions a GPU-accelerated inference service with custom resource limits, multiple replicas, and tags:

$ chalk scaling-group create \
    --image=my-registry/llm-serving:v3 \
    --name=llm-inference \
    --port=8080 \
    --replicas=2 \
    --gpu=nvidia-l4:1 \
    --cpu=8 \
    --memory=32Gi \
    --tags="model=llama3,team=ml-platform" \
    --entrypoint="python,-m,vllm.entrypoints.openai.api_server,--model,/models/llama3"

After creation, the service is reachable at its assigned DNS name and can be referenced by resolvers or other services running in the same Chalk environment.