Infrastructure
Run long-lived services with replicas, GPU support, and automatic DNS routing in your Chalk environment.
Scaling groups let you run long-lived, replicated services inside your Chalk environment’s Kubernetes cluster. Each scaling group is backed by a Kubernetes Deployment, so you get rolling updates, replica management, and automatic service discovery out of the box.
Common use cases include hosting ML inference servers, running sidecar services that your resolvers depend on, and deploying internal APIs that need to scale independently from your feature pipelines.
Use the `chalk scaling-group create` command to launch a new scaling group. At a minimum you need an image, a name, and a port:
```bash
$ chalk scaling-group create \
    --image=my-registry/my-inference-server:v1 \
    --name=inference \
    --port=8080
```

This creates a Kubernetes Deployment with a single replica running your image, a headless Service, and an HTTPRoute that exposes the service over HTTPS.
You can tune replicas and resource limits at creation time:
```bash
$ chalk scaling-group create \
    --image=my-registry/my-inference-server:v1 \
    --name=inference \
    --port=8080 \
    --replicas=3 \
    --cpu=4 \
    --memory=8Gi
```

Scaling groups support GPU-accelerated workloads. Use the `--gpu` flag with a `type:count` value to request GPU resources:
```bash
$ chalk scaling-group create \
    --image=my-registry/my-gpu-model:v1 \
    --name=gpu-inference \
    --port=8080 \
    --gpu=nvidia-tesla-t4:1 \
    --cpu=4 \
    --memory=16Gi
```

The type portion (e.g. `nvidia-tesla-t4`) is used to select the right node pool via a Kubernetes node selector (`cloud.google.com/gke-accelerator`), while the count sets the `nvidia.com/gpu` resource request and limit on the container. A toleration for `nvidia.com/gpu` is also applied so the pods can schedule onto GPU-tainted nodes.
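To make that mapping concrete, here is an illustrative Python sketch (not Chalk's actual implementation; the function name is hypothetical) of how a `--gpu` value translates into the pod-spec scheduling fields described above:

```python
def gpu_scheduling_fields(gpu_flag: str) -> dict:
    """Illustrative: map a --gpu value ("type:count" or bare "count")
    onto the pod-spec fields described in the docs above."""
    if ":" in gpu_flag:
        gpu_type, count = gpu_flag.rsplit(":", 1)
    else:
        gpu_type, count = None, gpu_flag
    fields = {
        # the count sets both the request and the limit on the container
        "resources": {
            "requests": {"nvidia.com/gpu": int(count)},
            "limits": {"nvidia.com/gpu": int(count)},
        },
        # toleration so pods can schedule onto GPU-tainted nodes
        "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists"}],
    }
    if gpu_type:
        # GKE node selector targeting the requested accelerator type
        fields["nodeSelector"] = {"cloud.google.com/gke-accelerator": gpu_type}
    return fields

print(gpu_scheduling_fields("nvidia-tesla-t4:1")["nodeSelector"])
# {'cloud.google.com/gke-accelerator': 'nvidia-tesla-t4'}
```

Note that when no type is given, no node selector is emitted, matching the count-only form shown below.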
If you don’t need to target a specific GPU type (for example on EKS where node selection is handled differently), you can pass just the count:
```bash
$ chalk scaling-group create \
    --image=my-registry/my-gpu-model:v1 \
    --name=gpu-inference \
    --port=8080 \
    --gpu=1 \
    --cpu=4 \
    --memory=16Gi
```

This sets the `nvidia.com/gpu` resource request without adding a GPU-type node selector.
Available GPU types depend on what node pools are configured in your cluster. Common GKE values include:
| GPU type | Description |
|---|---|
| `nvidia-tesla-t4` | NVIDIA T4 (cost-effective inference) |
| `nvidia-tesla-a100` | NVIDIA A100 (high-performance training and inference) |
| `nvidia-l4` | NVIDIA L4 (balanced inference) |
| `nvidia-tesla-v100` | NVIDIA V100 (training) |
To request multiple GPUs, increase the count:
```bash
$ chalk scaling-group create \
    --image=my-registry/my-training-server:v1 \
    --name=multi-gpu \
    --port=8080 \
    --gpu=nvidia-tesla-a100:4
```

The `--cpu` and `--memory` flags control the resource requests and limits for each replica. When specified, both the Kubernetes request and limit are set to the same value, guaranteeing your workload the resources it asks for.
```bash
$ chalk scaling-group create \
    --image=my-registry/my-app:v1 \
    --name=my-service \
    --port=8080 \
    --cpu=2 \
    --memory=4Gi
```

If you omit these flags, the following defaults are applied:
| Resource | Request | Limit |
|---|---|---|
| CPU | 100m | 1 |
| Memory | 256Mi | 1Gi |
CPU values follow Kubernetes conventions: `1` means one full core, `500m` means half a core. Memory values use standard suffixes: `Mi` (mebibytes) and `Gi` (gibibytes).
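These conventions can be sketched as two small parsers (illustrative helpers, not part of the Chalk CLI), covering only the forms used on this page:

```python
def parse_cpu(value: str) -> float:
    """Convert a Kubernetes CPU quantity to cores: "1" -> 1.0, "500m" -> 0.5."""
    if value.endswith("m"):
        # "m" means millicores: 1000m == 1 core
        return int(value[:-1]) / 1000
    return float(value)

def parse_memory(value: str) -> int:
    """Convert a Kubernetes memory quantity with an Mi or Gi suffix to bytes."""
    units = {"Mi": 1024 ** 2, "Gi": 1024 ** 3}
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # bare integers are already bytes

print(parse_cpu("500m"))    # 0.5
print(parse_memory("4Gi"))  # 4294967296
```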
Every scaling group is automatically assigned a DNS hostname and exposed over HTTPS through your cluster’s gateway. The hostname follows this pattern:

```
<environment-id>-<scaling-group-name>.<gateway-domain>
```

For example, a scaling group named `inference` in environment `abc123` with gateway domain `gw.chalk.ai` would be reachable at:

```
https://abc123-inference.gw.chalk.ai
```
This hostname is shown as the Web URL in the output of `chalk scaling-group get` and `chalk scaling-group list`. Traffic arriving at this hostname is routed to the port you specified with `--port`.
Because the hostname includes the scaling group name, you can reference it by a stable, human-readable URL rather than an opaque UUID.
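The pattern is simple enough to construct client-side; a minimal sketch (the helper name is hypothetical, the pattern is the one documented above):

```python
def scaling_group_url(environment_id: str, name: str, gateway_domain: str) -> str:
    """Build the HTTPS URL for a scaling group following the pattern
    <environment-id>-<scaling-group-name>.<gateway-domain>."""
    return f"https://{environment_id}-{name}.{gateway_domain}"

print(scaling_group_url("abc123", "inference", "gw.chalk.ai"))
# https://abc123-inference.gw.chalk.ai
```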
```bash
$ chalk scaling-group list
```

This displays a table of all scaling groups in your environment with their ID, name, image, status, replica counts, URL, and creation time.
```bash
$ chalk scaling-group get --name=inference
```

This shows detailed information about the scaling group including its spec, replica status (desired, ready, available), tags, and URL.
```bash
$ chalk scaling-group delete --name=inference
```

This removes the Kubernetes Deployment, Service, and HTTPRoute associated with the scaling group. The database record is soft-deleted so it still appears in history.
You can attach arbitrary key-value tags to a scaling group for organization and filtering:
```bash
$ chalk scaling-group create \
    --image=my-registry/my-app:v1 \
    --name=my-service \
    --port=8080 \
    --tags="team=ml,version=2.1"
```

Tags are applied as Kubernetes labels with the prefix `chalk.ai/tag-`, making them visible in standard Kubernetes tooling.
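The tag-to-label translation can be sketched in a few lines of Python (an illustrative helper, not the CLI's actual code):

```python
def tags_to_labels(tags: str) -> dict:
    """Convert a --tags value like "team=ml,version=2.1" into
    Kubernetes labels carrying the chalk.ai/tag- prefix."""
    labels = {}
    for pair in tags.split(","):
        key, _, value = pair.partition("=")
        labels[f"chalk.ai/tag-{key.strip()}"] = value.strip()
    return labels

print(tags_to_labels("team=ml,version=2.1"))
# {'chalk.ai/tag-team': 'ml', 'chalk.ai/tag-version': '2.1'}
```

With labels in this form, you can filter in standard tooling, e.g. `kubectl get pods -l chalk.ai/tag-team=ml`.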
If you need to override the Docker image’s default entrypoint, use the `--entrypoint` flag with comma-separated arguments:

```bash
$ chalk scaling-group create \
    --image=python:3.12 \
    --name=custom-server \
    --port=8000 \
    --entrypoint="python,-m,uvicorn,main:app,--host,0.0.0.0,--port,8000"
```

Here is a complete example that provisions a GPU-accelerated inference service with custom resource limits, multiple replicas, and tags:
```bash
$ chalk scaling-group create \
    --image=my-registry/llm-serving:v3 \
    --name=llm-inference \
    --port=8080 \
    --replicas=2 \
    --gpu=nvidia-l4:1 \
    --cpu=8 \
    --memory=32Gi \
    --tags="model=llama3,team=ml-platform" \
    --entrypoint="python,-m,vllm.entrypoints.openai.api_server,--model,/models/llama3"
```

After creation, the service is reachable at its assigned DNS name and can be referenced by resolvers or other services running in the same Chalk environment.
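For instance, a caller in the same environment could address the service by its DNS name over plain HTTPS. A minimal sketch, assuming the vLLM OpenAI-compatible server above serves `/v1/completions` (the helper name and request shape are illustrative; the request is built but not sent here):

```python
import json
import urllib.request

def build_inference_request(environment_id: str, gateway_domain: str,
                            payload: dict) -> urllib.request.Request:
    """Build a POST request to the llm-inference scaling group from the
    example above, using the documented hostname pattern."""
    url = f"https://{environment_id}-llm-inference.{gateway_domain}/v1/completions"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_inference_request("abc123", "gw.chalk.ai", {"prompt": "hello"})
print(req.full_url)
# https://abc123-llm-inference.gw.chalk.ai/v1/completions
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) would route through the gateway to port 8080 on one of the two replicas.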