Scaling groups let you run long-lived, replicated services inside your Chalk environment’s Kubernetes cluster. Each scaling group is backed by a Kubernetes Deployment, so you get rolling updates, replica management, and automatic service discovery out of the box.

Common use cases include hosting ML inference servers, running sidecar services that your resolvers depend on, and deploying internal APIs that need to scale independently from your feature pipelines.


Creating a scaling group

Use the chalk scaling-group create command to launch a new scaling group. At a minimum you need an image, a name, and a port:

$ chalk scaling-group create \
    --image=my-registry/my-inference-server:v1 \
    --name=inference \
    --port=8080

This creates a Kubernetes Deployment with a single replica running your image, a headless Service, and an HTTPRoute that exposes the service over HTTPS.

You can tune replicas and resource limits at creation time:

$ chalk scaling-group create \
    --image=my-registry/my-inference-server:v1 \
    --name=inference \
    --port=8080 \
    --replicas=3 \
    --cpu=4 \
    --memory=8Gi

GPU support

Scaling groups support GPU-accelerated workloads. Use the --gpu flag with a type:count value to request GPU resources:

$ chalk scaling-group create \
    --image=my-registry/my-gpu-model:v1 \
    --name=gpu-inference \
    --port=8080 \
    --gpu=nvidia-tesla-t4:1 \
    --cpu=4 \
    --memory=16Gi

The type portion (e.g. nvidia-tesla-t4) is used to select the right node pool via a Kubernetes node selector (cloud.google.com/gke-accelerator), while the count sets the nvidia.com/gpu resource request and limit on the container. A toleration for nvidia.com/gpu is also applied so the pods can schedule onto GPU-tainted nodes.

If you don’t need to target a specific GPU type (for example on EKS where node selection is handled differently), you can pass just the count:

$ chalk scaling-group create \
    --image=my-registry/my-gpu-model:v1 \
    --name=gpu-inference \
    --port=8080 \
    --gpu=1 \
    --cpu=4 \
    --memory=16Gi

This sets the nvidia.com/gpu resource request without adding a GPU-type node selector.

Available GPU types depend on what node pools are configured in your cluster. Common GKE values include:

GPU type             Description
nvidia-tesla-t4      NVIDIA T4 (cost-effective inference)
nvidia-tesla-a100    NVIDIA A100 (high-performance training and inference)
nvidia-l4            NVIDIA L4 (balanced inference)
nvidia-tesla-v100    NVIDIA V100 (training)

To request multiple GPUs, increase the count:

$ chalk scaling-group create \
    --image=my-registry/my-training-server:v1 \
    --name=multi-gpu \
    --port=8080 \
    --gpu=nvidia-tesla-a100:4

CPU and memory resources

The --cpu and --memory flags control the resource requests and limits for each replica. When specified, both the Kubernetes request and limit are set to the same value, guaranteeing your workload the resources it asks for. Setting requests equal to limits for both CPU and memory also places the pods in the Guaranteed QoS class, making them the last to be evicted under node pressure.

$ chalk scaling-group create \
    --image=my-registry/my-app:v1 \
    --name=my-service \
    --port=8080 \
    --cpu=2 \
    --memory=4Gi

If you omit these flags, the following defaults are applied:

Resource    Request    Limit
CPU         100m       1
Memory      256Mi      1Gi

CPU values follow Kubernetes conventions: 1 means one full core, 500m means half a core. Memory values use standard suffixes: Mi (mebibytes) and Gi (gibibytes).
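These conventions can be made concrete with two small conversion helpers. These are illustrative sketches of the Kubernetes quantity notation, not Chalk CLI code:

```python
def cpu_to_cores(value: str) -> float:
    """Convert a Kubernetes CPU quantity to cores: "1" -> 1.0, "500m" -> 0.5."""
    if value.endswith("m"):
        # "m" denotes millicores: 1000m == 1 core.
        return int(value[:-1]) / 1000
    return float(value)

def memory_to_bytes(value: str) -> int:
    """Convert a memory quantity like "256Mi" or "4Gi" to bytes.

    Mi and Gi are binary (1024-based) units: mebibytes and gibibytes.
    """
    units = {"Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    # No recognized suffix: treat as plain bytes.
    return int(value)
```

For example, the default CPU request of 100m is one tenth of a core, and the default memory request of 256Mi is 268,435,456 bytes.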


DNS and routing

Every scaling group is automatically assigned a DNS hostname and exposed over HTTPS through your cluster’s gateway. The hostname follows this pattern:

<environment-id>-<scaling-group-name>.<gateway-domain>

For example, a scaling group named inference in environment abc123 with gateway domain gw.chalk.ai would be reachable at:

https://abc123-inference.gw.chalk.ai

This hostname is shown as the Web URL in the output of chalk scaling-group get and chalk scaling-group list. Traffic arriving at this hostname is routed to the port you specified with --port.

Because the hostname includes the scaling group name, you can reference it by a stable, human-readable URL rather than an opaque UUID.
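The hostname pattern above can be sketched as a one-line helper. The function name is hypothetical; only the URL pattern comes from the documentation:

```python
def scaling_group_url(environment_id: str, name: str, gateway_domain: str) -> str:
    """Build a scaling group's HTTPS URL following the pattern
    <environment-id>-<scaling-group-name>.<gateway-domain>.
    """
    return f"https://{environment_id}-{name}.{gateway_domain}"
```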


Managing scaling groups

Listing scaling groups

$ chalk scaling-group list

This displays a table of all scaling groups in your environment with their ID, name, image, status, replica counts, URL, and creation time.

Inspecting a scaling group

$ chalk scaling-group get --name=inference

This shows detailed information about the scaling group including its spec, replica status (desired, ready, available), tags, and URL.

Deleting a scaling group

$ chalk scaling-group delete --name=inference

This removes the Kubernetes Deployment, Service, and HTTPRoute associated with the scaling group. The database record is soft-deleted so it still appears in history.


Tags

You can attach arbitrary key-value tags to a scaling group for organization and filtering:

$ chalk scaling-group create \
    --image=my-registry/my-app:v1 \
    --name=my-service \
    --port=8080 \
    --tags="team=ml,version=2.1"

Tags are applied as Kubernetes labels with the prefix chalk.ai/tag-, making them visible in standard Kubernetes tooling.
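The mapping from a --tags string to prefixed Kubernetes labels can be sketched as follows. The parsing details (whitespace handling, missing values) are assumptions; only the chalk.ai/tag- prefix comes from the documentation:

```python
def tags_to_labels(tags: str) -> dict[str, str]:
    """Convert a --tags string like "team=ml,version=2.1" into
    Kubernetes labels prefixed with chalk.ai/tag-.
    """
    labels = {}
    for pair in tags.split(","):
        # partition tolerates a missing "=", yielding an empty value.
        key, _, value = pair.partition("=")
        labels[f"chalk.ai/tag-{key.strip()}"] = value.strip()
    return labels
```

With the prefix applied, the tags can be used in standard label selectors, e.g. kubectl get pods -l chalk.ai/tag-team=ml.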


Entrypoint override

If you need to override the Docker image’s default entrypoint, use the --entrypoint flag with comma-separated arguments:

$ chalk scaling-group create \
    --image=python:3.12 \
    --name=custom-server \
    --port=8000 \
    --entrypoint="python,-m,uvicorn,main:app,--host,0.0.0.0,--port,8000"

Full example

Here is a complete example that provisions a GPU-accelerated inference service with custom resource limits, multiple replicas, and tags:

$ chalk scaling-group create \
    --image=my-registry/llm-serving:v3 \
    --name=llm-inference \
    --port=8080 \
    --replicas=2 \
    --gpu=nvidia-l4:1 \
    --cpu=8 \
    --memory=32Gi \
    --tags="model=llama3,team=ml-platform" \
    --entrypoint="python,-m,vllm.entrypoints.openai.api_server,--model,/models/llama3"

After creation, the service is reachable at its assigned DNS name and can be referenced by resolvers or other services running in the same Chalk environment.