With Chalk, you can deploy machine learning models as isolated services running in dedicated scaling groups. This approach allows your models to run with their own compute resources, auto-scaling policies, and independent lifecycle management—separate from the Chalk engine itself.

This is different from the traditional approach of including models directly in Chalk feature resolvers. Instead of embedding model inference within your feature computation, model deployments host your models as standalone services that can be called from resolvers or external applications.


When to Use Model Deployments

Model deployments are ideal when you want to:

  • Isolate model resources: Give models their own CPU, memory, and GPU resources independent of the engine
  • Scale models independently: Auto-scale models based on inference demand without affecting other services
  • Version and update models separately: Deploy new model versions without redeploying your entire Chalk system
  • Run containerized models: Deploy models as Docker images without converting to Python objects
  • Enable high-throughput inference: Run multiple replicas of your model in parallel

Registering Models for Deployment

To deploy models to scaling groups, register them with a Docker image reference instead of local model files or Python model objects.

from chalk.client import ChalkClient
import pyarrow as pa

client = ChalkClient()

# Register the model version with a Docker image
client.register_model_version(
    name="my-model",
    input_schema={"text": pa.large_string()},
    output_schema={"entities": pa.large_string()},
    model_image="my-model-image:latest",
)

Deploying to Scaling Groups

Once registered, deploy a model version to a scaling group with resource specifications and auto-scaling policies.

from chalk.client import ChalkClient
from chalk.scaling import AutoScalingSpec, ScalingGroupResourceRequest

client = ChalkClient()

# Deploy the model version to a scaling group
scaling_group = client.deploy_model_version_to_scaling_group(
    name="my-model-scaling-group",
    model_name="my-model",
    model_version=1,
    scaling=AutoScalingSpec(
        min_replicas=1,
        max_replicas=2,
        target_cpu_utilization_percentage=70,
    ),
    resources=ScalingGroupResourceRequest(
        cpu="2",
        memory="4Gi",
    ),
)

Auto-Scaling Configuration

Control how your model deployment scales based on demand using AutoScalingSpec.

from chalk.scaling import AutoScalingSpec

# Configure auto-scaling behavior
scaling = AutoScalingSpec(
    min_replicas=1,                          # Minimum number of replicas
    max_replicas=5,                          # Maximum number of replicas
    target_cpu_utilization_percentage=70,    # Target CPU utilization (optional)
)

Chalk automatically scales the number of replicas based on inference request load and CPU utilization, staying within your min/max bounds. This ensures your models handle traffic spikes efficiently without wasting resources during quiet periods.
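
Chalk doesn't publish the exact scaling algorithm, but Kubernetes-style horizontal autoscaling gives a reasonable mental model. A minimal sketch, assuming HPA-style semantics (the formula and function name here are illustrative, not part of Chalk's API):

```python
import math

def desired_replicas(current_replicas: int,
                     current_cpu_pct: float,
                     target_cpu_pct: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Kubernetes-style horizontal autoscaling target, clamped to bounds."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))

# At 140% average CPU against a 70% target, 2 replicas scale to 4;
# at 35% they scale back down to the min_replicas floor.
print(desired_replicas(2, 140, 70, 1, 5))  # → 4
print(desired_replicas(2, 35, 70, 1, 5))   # → 1
```

Under this model, min_replicas acts as a floor during quiet periods and max_replicas caps spend during traffic spikes.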


Resource Configuration

Specify CPU, memory, and GPU resources for each replica of your model using ScalingGroupResourceRequest.

from chalk.scaling import ScalingGroupResourceRequest

# Request resources per replica
resources = ScalingGroupResourceRequest(
    cpu="2",                          # CPU allocation per replica
    memory="4Gi",                     # Memory allocation per replica
    gpu="nvidia-tesla-t4:1",          # Optional: GPU type and count
)

Each replica gets the specified resources. When Chalk scales from 1 to 3 replicas, total resource usage is multiplied accordingly (e.g., 3 replicas × 2 CPU = 6 CPU total).
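
For capacity planning, you can compute the total footprint at max scale yourself. A small sketch (the helper is illustrative; it assumes plain core counts and Gi memory suffixes as in the examples above):

```python
def total_resources(cpu: str, memory: str, replicas: int) -> tuple[float, str]:
    """Total CPU cores and memory across all replicas.

    Assumes cpu is a plain core count (e.g. "2") and memory uses a
    Gi suffix (e.g. "4Gi"), matching the examples in this document.
    """
    total_cpu = float(cpu) * replicas
    mem_gi = float(memory.removesuffix("Gi")) * replicas
    return total_cpu, f"{mem_gi:g}Gi"

# 3 replicas at 2 CPU / 4Gi each.
print(total_resources("2", "4Gi", 3))  # → (6.0, '12Gi')
```

Running this against your max_replicas setting tells you the worst-case reservation your cluster must be able to satisfy.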


Calling Deployed Models

Models deployed to scaling groups can be called from Chalk feature resolvers using the catalog_call function with the scaling group name.

from chalk.features import features, _
from chalk import functions as F

@features
class Document:
    id: int
    text: str
    entities: str = F.catalog_call(
        "model.my-model-scaling-group",
        _.text
    )

The catalog call format is: model.{scaling_group_name}

You can pass multiple inputs by providing them as additional arguments:

@features
class Request:
    id: int
    input_a: str
    input_b: str
    output: str = F.catalog_call(
        "model.my-model-scaling-group",
        _.input_a,
        _.input_b
    )

The order of arguments must match the order of fields in your model’s input_schema.
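
Conceptually, arguments are paired with input_schema fields by position, much like zipping the two sequences. A simplified illustration (not Chalk's actual implementation):

```python
# input_schema field names in declaration order (types elided).
input_schema = ["input_a", "input_b"]

# Positional arguments from catalog_call, in the order you passed them.
args = ["value for a", "value for b"]

# Fields and arguments are paired by position, so reordering the
# arguments would silently swap which field receives which value.
event = dict(zip(input_schema, args))
print(event)  # → {'input_a': 'value for a', 'input_b': 'value for b'}
```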


Managing Deployments

Updating a Deployment

Deploy a new version of a model to an existing scaling group:

# Register a new model version
new_version = client.register_model_version(
    name="my-model",
    model_image="my-model:v2.0",
    input_schema={"text": pa.large_string()},
    output_schema={"entities": pa.large_string()},
)

# Update the scaling group with the new version
client.deploy_model_version_to_scaling_group(
    name="my-model-scaling-group",
    model_name="my-model",
    model_version=new_version.model_version,
)

For more information on listing, inspecting, and deleting scaling groups, see the Scaling Groups page.

Structuring Your Model Deployment Code

Model registration and deployment scripts should be run manually, separate from your feature definitions. Either:

  1. Add them to .chalkignore to prevent them from running during chalk apply.
  2. Keep them in a separate repository dedicated to model management, independent of your Chalk feature code.

chalk apply will fail if it tries to run model registration and deployment code.

Organize your project to keep model management separate from feature definitions:

my-chalk-project/
|- models/                          # Model deployment code (add to .chalkignore)
|  |- Dockerfile
|  |- model.py
|  |- requirements.txt
|  `- deploy_model.py               # Registration + deployment script
|
|- features/                        # Feature definitions (synced with chalk apply)
|  |- __init__.py
|  `- user_features.py
|
|- .chalkignore
`- chalk.yaml

Put the following line in your .chalkignore so chalk apply skips everything under models/.

models/

Docker Image Requirements

Model deployments use the chalk-remote-call-python shim to handle request routing and PyArrow serialization. Your Docker image should:

  1. Install chalk-remote-call-python: Provides the server and request handling
  2. Define a handler function: Receives PyArrow Arrays, returns PyArrow Arrays
  3. Optionally define on_startup: Initialize resources like loading models
  4. Use chalk-remote-call as entrypoint: Runs your handler on a specified port

Example: NER Model

Here’s a complete example using spaCy for named entity recognition:

Dockerfile:

FROM python:3.11-slim

WORKDIR /app

RUN pip install --no-cache-dir chalk-remote-call-python spacy
RUN python -m spacy download en_core_web_sm

COPY model.py /app/model.py

ENV PYTHONPATH=/app

EXPOSE 8080

ENTRYPOINT ["chalk-remote-call", "--handler", "model.handler", "--port", "8080"]

Build and push to a registry:

docker build --platform linux/amd64 -t my-model:latest .
docker push my-model:latest

model.py:

"""Model using spaCy — chalk-remote-call handler convention.

This is an example customer model that uses chalk-remote-call-python.
The handler receives PyArrow Arrays and returns results.
"""

import json
import pyarrow as pa
import spacy

nlp = None


def on_startup():
    """Load the spaCy model once at startup."""
    global nlp
    print("Loading model...")
    nlp = spacy.load("en_core_web_sm")
    print("Model loaded!")


def handler(event: dict[str, pa.Array], context: dict) -> pa.Array:
    """Extract named entities from text.

    Parameters
    ----------
    event
        Dictionary of PyArrow Arrays. Keys match your input_schema.
        Example: {"text": pa.Array of strings}
    context
        Request metadata (peer address, headers, etc.)

    Returns
    -------
    pa.Array
        Output must be a single PyArrow Array matching your output_schema
    """
    texts = event["text"].to_pylist()
    results = []

    # spaCy's nlp.pipe cannot handle None values, so substitute empty
    # strings for batching and restore nulls in the output below.
    safe_texts = [t if t is not None else "" for t in texts]

    for text, doc in zip(texts, nlp.pipe(safe_texts, batch_size=32)):
        if text is None:
            results.append(None)
            continue

        entities = [
            {
                "text": ent.text,
                "label": ent.label_,
                "start": ent.start_char,
                "end": ent.end_char,
            }
            for ent in doc.ents
        ]

        results.append(json.dumps({"text": text, "entities": entities}))

    return pa.array(results, type=pa.utf8())
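
Since this handler encodes each result as a JSON string, downstream consumers decode the entities back out. A minimal sketch of parsing one output value (the payload below is hand-built in the shape the handler produces):

```python
import json

# One output value in the shape produced by the handler above.
payload = json.dumps({
    "text": "Apple hired Tim.",
    "entities": [{"text": "Apple", "label": "ORG", "start": 0, "end": 5}],
})

decoded = json.loads(payload)
labels = [e["label"] for e in decoded["entities"]]
print(labels)  # → ['ORG']
```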

Handler Function Signature

Your handler function must follow this signature:

def handler(event: dict[str, pa.Array], context: dict) -> pa.Array:
    """
    Parameters
    ----------
    event : dict[str, pa.Array]
        Input data as PyArrow Arrays. Keys correspond to your input_schema fields.
    context : dict
        Request metadata (peer address, headers, etc.)

    Returns
    -------
    pa.Array
        Single PyArrow Array output matching your output_schema
    """
    # Your model inference logic here
    pass

Optional Startup Hook

Define an on_startup() function to initialize resources when the container starts:

def on_startup():
    """Called once when the model service starts."""
    global nlp
    print("Initializing model...")
    nlp = spacy.load("en_core_web_sm")
    print("Model ready!")

This is useful for:

  • Loading large models into memory
  • Establishing database connections
  • Warming up caches
  • One-time initialization tasks

Benefits of Model Deployments

  • Resource isolation: Models don’t compete with the engine for compute
  • Independent scaling: Scale models up or down based on their specific load
  • Easy updates: Deploy new model versions without downtime
  • Language agnostic: Run models in any language/framework as Docker containers
  • Observability: Monitor each model’s performance and resource usage separately