Model Training

Chalk provides single-node and distributed training capabilities that leverage your existing feature infrastructure and generate Model Artifacts. These Model Artifacts can be registered in the Model Registry.

Training Overview

Chalk supports two primary training approaches:

  • Single-Node Training: For smaller datasets that fit in memory on a single machine
  • Distributed Training: For large-scale datasets that require distributed processing across multiple workers

Both approaches integrate seamlessly with your existing Chalk feature infrastructure, automatically register trained models in the Model Registry, and can be run from a Jupyter notebook or as part of a CI/CD pipeline. In this tutorial, we walk through both training methods from a Jupyter notebook.


Single-Node Training

Single-node training is ideal for smaller datasets that can fit comfortably in memory. This approach is simpler to set up and debug, making it perfect for experimentation and smaller production workloads.

Basic Training Setup

Here’s a complete example of single-node training with PyTorch. Define your training code using chalk_train and chalk_logger for logging, checkpointing, and dataset loading.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.feature_extraction.text import TfidfVectorizer

import chalk.ml.train as chalk_train
from chalk import chalk_logger


class SpamClassifier(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(input_size, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x):
        return self.fc(x)


# Define the preprocessing and data loading code
def preprocess_data(dataset_name: str, config: dict):
    # Use chalk_train to load the dataset for training
    df = chalk_train.load_dataset(dataset_name=dataset_name).to_pandas()

    # TfidfVectorizer expects an iterable of raw strings, so select the column as a Series
    data = df['sms_spam.content']
    labels = df['sms_spam.label'].map({'ham': 1, 'spam': 0}).values

    tfidf = TfidfVectorizer(max_features=50)

    X = torch.tensor(tfidf.fit_transform(data).toarray(), dtype=torch.float32)
    y = torch.tensor(labels, dtype=torch.long)

    dataset = TensorDataset(X, y)
    dataloader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)
    return dataloader, X.shape[1]


# Define the training code for the model
def train(dataset_name: str, config: dict):
    dataloader, input_size = preprocess_data(dataset_name=dataset_name, config=config)
    model = SpamClassifier(input_size)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

    for epoch in range(config["num_epochs"]):
        total_loss = 0
        correct = 0
        total = 0

        for batch_X, batch_y in dataloader:
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == batch_y).sum().item()
            total += batch_y.size(0)

        if epoch % 5 == 0:
            accuracy = correct / total

            chalk_logger.info(f'Epoch {epoch}: Loss={total_loss/len(dataloader):.3f}, Acc={accuracy:.3f}')

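            # Save a checkpoint with the current metrics attached as metadata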
            chalk_train.checkpoint(
                model,
                metadata=dict(
                    accuracy=accuracy,
                    epoch=epoch,
                )
            )
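
Before launching a full training job, you can sanity-check the model and preprocessing locally. The snippet below is a minimal sketch with hypothetical toy data; it reuses the SpamClassifier defined above and makes no Chalk calls:

import torch
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the 'sms_spam.content' column
texts = ["win a free prize now", "are we still on for lunch"]

tfidf = TfidfVectorizer(max_features=50)
X = torch.tensor(tfidf.fit_transform(texts).toarray(), dtype=torch.float32)

model = SpamClassifier(input_size=X.shape[1])
logits = model(X)            # shape: (2, 2) -- one row per message, one score per class
print(logits.argmax(dim=1))  # predicted class indices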

Once your training function is defined, you can run a training job using the client.train_model method:

from chalk import ResourceRequests

# Create a training run; `client` is your Chalk client and `dataset` is a
# previously generated Chalk dataset
training_run = client.train_model(
    train_fn=train,
    dataset_name=dataset.dataset_name,
    model_name="spam_model",
    config=dict(
        lr=0.01,
        num_epochs=50,
        batch_size=32
    ),
    resources=ResourceRequests(
        cpu=15,
        memory="15Gi",
        resource_group="gpu"
    )
)

If you’ve defined a resource group with GPU access, you can leverage it for training by passing that group’s name as resource_group, as in the sketch below.
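
For example, a GPU-backed run might look like the following. The resource group name "gpu-a100" and the sizing are hypothetical; use a group configured in your environment:

training_run = client.train_model(
    train_fn=train,
    dataset_name=dataset.dataset_name,
    model_name="spam_model",
    config=dict(lr=0.01, num_epochs=50, batch_size=32),
    resources=ResourceRequests(
        cpu=8,                      # hypothetical sizing
        memory="32Gi",
        resource_group="gpu-a100",  # hypothetical GPU-enabled resource group
    ),
)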


Next Steps

After training your models, you can:

  1. Register Models: Use the Model Registry to version and track your trained models
  2. Deploy Models: Load models into Chalk deployments for inference
  3. Monitor Performance: Track model performance and feature distributions over time
  4. Iterate: Use the training infrastructure to experiment with new architectures and hyperparameters, as sketched below
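
As an illustration of step 4, the sketch below reuses the train function and client.train_model call from this tutorial to launch one run per candidate learning rate. The model names and resource sizing are hypothetical:

from chalk import ResourceRequests

# Hypothetical sweep: one Chalk training run per learning rate
for lr in [0.1, 0.01, 0.001]:
    client.train_model(
        train_fn=train,
        dataset_name=dataset.dataset_name,
        model_name=f"spam_model_lr_{lr}",
        config=dict(lr=lr, num_epochs=50, batch_size=32),
        resources=ResourceRequests(cpu=4, memory="8Gi"),  # illustrative sizing
    )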