Integrations overview

Chalk integrates seamlessly with your underlying systems—querying your data sources directly, eliminating the need for ETL!

This unlocks several key benefits:

  • alleviates the need to move data across multiple systems
    • single source of truth (define once and use everywhere)
    • prevents data drift (same feature logic for offline and online workloads)
    • reduces data duplication and storage costs
  • optimizes compute by only ever fetching exactly what you need when you need it (dynamic query planning)
  • enables real-time data delivery by satisfying strict (under 5ms) latency requirements

Cloud Platforms

Anywhere that you can run Kubernetes, you can run Chalk—Chalk is cloud-agnostic.

Chalk deploys into your VPC, co-located with your data sources, for the lowest latency and cost. Multi-cloud deployments are supported for high availability and disaster recovery.


SQL data sources and data warehouses

Chalk has native drivers and integrations for a variety of SQL data sources and query engines, and provides a unified interface for adding new ones. Adding a new SQL source is as simple as providing a connection string and a few configuration options through your Chalk dashboard. Once it has been added to your Chalk deployment, you can start querying it right away with SQL Resolvers.

-- resolves: User
-- source: postgres
select
    id,
    name
from users

The features in a feature class can be hydrated from multiple SQL sources. For example, a user’s social security number can be pulled from a separate database with stricter access controls.

-- resolves: User
-- source: restricted_postgres
select
    id,
    ssn
from sensitive_user_data

In addition, Chalk can reverse-ETL features from your data warehouses into Chalk’s online store for low-latency access. Chalk integrates natively (via C++ drivers) with the following data sources and pushes filters and projections down into the SQL queries it issues for more efficient data fetching.
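
To make pushdown concrete, here is a simplified, hypothetical sketch (not Chalk's actual query planner) of how requested columns and predicates can be folded into the SQL sent to the source, so only the needed data is fetched:

```python
# Hypothetical sketch of projection/filter pushdown: instead of fetching
# whole rows and filtering in the application, the requested columns and
# predicates are folded into the SQL sent to the source.

def push_down(table: str, columns: list[str], filters: dict[str, object]) -> tuple[str, list[object]]:
    """Build a parameterized query that fetches only what was asked for."""
    projection = ", ".join(columns)
    if filters:
        where = " where " + " and ".join(f"{col} = %s" for col in filters)
        params = list(filters.values())
    else:
        where = ""
        params = []
    return f"select {projection} from {table}{where}", params

sql, params = push_down("users", ["id", "name"], {"id": 42})
# sql == "select id, name from users where id = %s", params == [42]
```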

Data Warehouses

Azure:

  • Database for PostgreSQL
  • Database for MySQL

Streaming / Real-Time Data Systems

We provide stream resolvers for integrating with Kafka-compatible systems and other streaming data sources.

  • Kafka-compatible
    • Confluent
    • Redpanda
  • Kinesis (AWS)
  • Pub/Sub (GCP)
  • Event Hubs (Azure)

Streams can also be filtered, processed, and materialized as a step in Chalk’s feature computation pipelines.

@stream(source=KafkaSource(name='transactions_stream'))
def process_transaction_topic(
    value: TransactionMsg,
) -> Features[Transaction.id, Transaction.user_id, Transaction.amount]:
    return Transaction(
        id=value.id,
        user_id=value.user_id,
        amount=value.amount,
    )

Caching expensive features with Redis/Valkey, Memcached, DynamoDB, and more

Chalk makes it easy to cache features for low-latency access with the max_staleness keyword argument. Cached features skip expensive API calls and are fetched directly from the online store.

@features
class User:
    id: int
    name: str
    ssn: int
    credit_score: int = feature(max_staleness="30d")

We support a variety of caching backends:

  • Redis / Valkey
  • Memcached
  • DynamoDB
  • Amazon ElastiCache
  • Google Cloud Memorystore
  • Azure Cache for Redis
  • Azure Cosmos DB
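
Conceptually, max_staleness caching works like the following simplified sketch (an illustration of the semantics, not Chalk's implementation): serve the cached value while it is younger than the staleness window, and only fall back to the expensive resolver when it is stale or missing.

```python
import time

# Simplified illustration of max_staleness semantics (not Chalk's
# implementation): serve the cached value if it is younger than the
# staleness window, otherwise recompute and refresh the cache.

_cache: dict[str, tuple[float, object]] = {}  # key -> (written_at, value)

def get_with_staleness(key: str, compute, max_staleness_s: float):
    entry = _cache.get(key)
    now = time.monotonic()
    if entry is not None and now - entry[0] <= max_staleness_s:
        return entry[1]          # fresh enough: skip the expensive call
    value = compute()            # stale or missing: recompute
    _cache[key] = (now, value)
    return value

calls = 0
def expensive():
    global calls
    calls += 1
    return 700

get_with_staleness("user:1:credit_score", expensive, max_staleness_s=60)
get_with_staleness("user:1:credit_score", expensive, max_staleness_s=60)
# calls == 1: the second lookup is served from the cache
```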

APIs & Microservices

Call internal APIs, third-party services, and microservices with built-in retry logic and circuit breakers:

import requests

@online
def get_credit_score(ssn: User.ssn) -> User.credit_score:
    response = requests.get(
        f"https://api.creditbureau.com/score/{ssn}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=2.0,
    )
    return response.json()["score"]

Chalk’s Symbolic Python Interpreter can accelerate libraries like requests, so this function runs in C++.


Object Storage and Iceberg Catalogs

AWS (Amazon Web Services)

  • Amazon S3
  • AWS Glue Data Catalog

GCP (Google Cloud)

  • Google Cloud Storage
  • Cloud Data Catalog
  • BigLake

Microsoft

  • Azure Blob Storage
  • Microsoft Purview
  • Azure Data Lake

Chalk is Iceberg native and can write to your underlying object storage and catalog directly from offline queries.

from chalk.integrations import GlueCatalog

catalog = GlueCatalog(
    name="aws_glue_catalog",
    aws_region="us-west-2",
    catalog_id="123",
    aws_role_arn="arn:aws:iam::123456789012:role/YourCatalogueAccessRole",
)
# `results` is the result set returned by an offline query
results.write_to(destination="database.table_name", catalog=catalog)

AI & ML Services

Access traditional machine learning models like scikit-learn, XGBoost, and your own models directly within feature definitions using Chalk Expressions:

  • scikit-learn functions
  • ONNX (Open Neural Network Exchange) models through Chalk’s model registry

Integrating unstructured data with LLMs (large language models) or computing embeddings is straightforward with Chalk’s built-in integrations. Easily run evals, swap models and providers, and reference the features you need in your prompts without having to configure complex pipelines.

  • OpenAI
  • Anthropic
  • AWS Bedrock
  • Azure OpenAI
  • Google Vertex
  • Any OpenAI-compatible chat completion model
    • Cerebras
    • Groq
    • Ollama Cloud
    • Together.ai

You can override the base URL and API key to connect to any OpenAI-compatible endpoint.
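
Under the hood, pointing at an OpenAI-compatible provider just means changing where the chat-completions request is sent. A minimal, standard-library-only sketch of such a request (the base URL, API key, and model name are placeholders, not Chalk defaults):

```python
import json
import urllib.request

# Minimal sketch of an OpenAI-compatible chat completion request built
# with only the standard library. The base URL, API key, and model are
# placeholders: swap them for any OpenAI-compatible provider (Groq,
# Together.ai, a local Ollama server, etc.).

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    url = base_url.rstrip("/") + "/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "http://localhost:11434/v1",  # e.g. a local Ollama server
    "placeholder-key",
    "llama3",
    "Classify this item.",
)
# req.full_url == "http://localhost:11434/v1/chat/completions"
```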

@features
class Item:
    id: int
    title: str
    description: str
    llm: P.PromptResponse = P.completion(
        model="gpt-4o-mini-2024-07-18",
        messages=[
            P.message(
                role="user",
                content=F.jinja(
                    """
                    Classify the following item category using its title and description:
                    Item title: {{ Item.title }}
                    Item description: {{ Item.description }}
                    """,
                ),
            ),
        ],
        # StructuredOutput is a user-defined model describing the expected output
        output_structure=StructuredOutput,
    )

You can just as easily compute embeddings for items, users, or any other entity using built-in integrations:

@features
class VectorSearch:
    q: Primary[str]
    # from chalk.features import embed
    vector: Vector = embed(
        input=lambda: VectorSearch.q,
        provider="vertexai",
        model="text-embedding-005",
    )
    query_type: str = "vector"

    results: "DataFrame[ItemDocument]"
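
To make the vector field concrete: nearest-neighbor results like those in the results field are typically ranked by cosine similarity between the query embedding and candidate embeddings. A minimal sketch (real deployments rank against an index in the online store, not a linear scan):

```python
import math

# Minimal sketch of how vector-search results are ranked: cosine
# similarity between a query embedding and candidate item embeddings.
# Real deployments use a vector index, not a full scan.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = [1.0, 0.0]  # embedding of the search string `q`
items = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]}
ranked = sorted(items, key=lambda k: cosine_similarity(query, items[k]), reverse=True)
# ranked == ["doc_a", "doc_b"]
```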

Get started today

With dozens of native integrations across cloud platforms, databases, streaming systems, caching layers, and AI services, Chalk eliminates the complexity of building and maintaining production machine learning systems.

Whether you’re pulling user data from PostgreSQL, processing real-time events from Kafka, caching expensive feature computations in Redis, or extracting features from unstructured data with LLMs, Chalk’s unified platform handles it all.

The result? Faster time to production, lower operational overhead, and consistent feature logic across your entire ML stack.