Models
Learn how to train ML models in Chalk
Chalk provides single-node and distributed training capabilities that leverage your existing feature infrastructure and generate Model Artifacts, which can be registered in the Model Registry.
Chalk supports two primary training approaches:

- Single-node training, for datasets that fit in memory on a single machine
- Distributed training, for larger workloads that scale across multiple nodes

Both approaches integrate seamlessly with your existing Chalk feature infrastructure, automatically register trained models in the Model Registry, and can be run from a Jupyter notebook or as part of a CI/CD pipeline. In this tutorial, we cover both training methods, run through a Jupyter notebook.
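The examples below assume a Chalk client for launching training jobs. As a minimal sketch (assuming your Chalk credentials are already configured, for example via `chalk login` or the standard environment variables), you might create one at the top of your notebook:

```python
from chalk.client import ChalkClient

# Picks up ambient credentials (environment variables or the config
# written by `chalk login`); you can also pass client_id and
# client_secret explicitly.
client = ChalkClient()
```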
Single-node training is ideal for smaller datasets that can fit comfortably in memory. This approach is simpler to set up and debug, making it perfect for experimentation and smaller production workloads.
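If you're not sure whether a dataset is small enough for single-node training, one rough check is to materialize it and measure its in-memory footprint. A quick sketch, assuming the `sms_spam_dataset` used later in this tutorial:

```python
import chalk.ml.train as chalk_train

# Materialize the Chalk dataset as a pandas DataFrame and estimate its
# in-memory size; if this is a small fraction of available RAM,
# single-node training is a reasonable choice.
df = chalk_train.load_dataset(dataset_name="sms_spam_dataset").to_pandas()
size_gb = df.memory_usage(deep=True).sum() / 1e9
print(f"{len(df)} rows, ~{size_gb:.2f} GB in memory")
```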
Here’s a complete example of single-node training with PyTorch. Use the `chalk_train` module for checkpointing and accessing Chalk datasets.
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.feature_extraction.text import TfidfVectorizer

import chalk.ml.train as chalk_train


class SpamClassifier(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(input_size, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x):
        return self.fc(x)


# Define the preprocessing and data loading code
def preprocess_data(dataset_name: str, config: dict):
    # Use Chalk Train to load the dataset for training
    df = chalk_train.load_dataset(dataset_name=dataset_name).to_pandas()
    texts = df['sms_spam.content']
    labels = df['sms_spam.label'].map({'ham': 1, 'spam': 0}).values

    # TfidfVectorizer expects an iterable of strings, so pass the text
    # column itself rather than a single-column DataFrame
    tfidf = TfidfVectorizer(max_features=50)
    X = torch.tensor(tfidf.fit_transform(texts).toarray(), dtype=torch.float32)
    y = torch.tensor(labels, dtype=torch.long)

    dataset = TensorDataset(X, y)
    dataloader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)
    return dataloader, X.shape[1]


# Define the training code for the model
def train(config: dict):
    dataloader, input_size = preprocess_data(dataset_name=config['dataset_name'], config=config)
    model = SpamClassifier(input_size)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

    for epoch in range(config["num_epochs"]):
        total_loss = 0
        correct = 0
        total = 0
        for batch_X, batch_y in dataloader:
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == batch_y).sum().item()
            total += batch_y.size(0)

        if epoch % 5 == 0:
            accuracy = correct / total
            print(f"Epoch {epoch}: Loss={total_loss/len(dataloader):.3f}, Acc={accuracy:.3f}")

            # Checkpoint the model with Chalk Train, attaching the
            # current metrics as metadata
            chalk_train.checkpoint(
                model,
                metadata=dict(
                    accuracy=accuracy,
                    epoch=epoch,
                ),
            )
```

Once your training function is defined, you can run a training job using the `client.train_model` method:
```python
from chalk import ResourceRequests

# Create a training run
training_run = client.train_model(
    experiment_name="spam_model",
    train_fn=train,
    config=dict(
        lr=0.01,
        num_epochs=50,
        batch_size=32,
        dataset_name="sms_spam_dataset",
    ),
    resources=ResourceRequests(
        cpu=15,
        memory="15Gi",
        resource_group="gpu",
    ),
)
```

If you’ve defined a resource group with GPU access, you can train on GPUs by passing that group’s name as the `resource_group` in your `ResourceRequests`, as in the example above.
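Note that requesting a GPU resource group schedules the job on GPU hardware, but your training function is still responsible for placing the model and tensors on the device. A minimal sketch of the standard PyTorch pattern (plain PyTorch, not a Chalk-specific API):

```python
import torch

# Inside train(config): select the GPU when available, falling back to
# CPU, and move the model and each batch onto that device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SpamClassifier(input_size).to(device)

for batch_X, batch_y in dataloader:
    batch_X, batch_y = batch_X.to(device), batch_y.to(device)
    outputs = model(batch_X)
    # ... loss, backward, and optimizer steps as before
```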
After training your models, the resulting Model Artifacts are automatically registered in the Model Registry, where you can manage, compare, and deploy them.