Chalk home page
  1. Features
  2. Feature Types


Features can be any primitive Python type:

from enum import Enum

class Genre(Enum):

class Book:
    id: int
    name: str
    publish_date: date
    copyright_ended_at: datetime | None
    genre: Genre

Lists and Sets

Features can be a list or set of scalar or struct types.

class Chapter:
    start_page: int
    end_page: int

class Book:
    authors: list[str]
    categories: set[str]
    chapters: list[Chapter]


Features can be a struct, which is a collection that maps a fixed set of keys to sub-fields. You can use dataclasses, Pydantic models, and attrs classes to represent struct features. Struct features can be used recursively within list features or other struct features.

Struct types should be used for objects that don’t have ids. If an object has an id, consider using has-one.


You can use any dataclass as a struct feature.

class JacketInfo:
    title: str
    subtitle: str
    body: str

class Book:
    id: int
    jacket_info: JacketInfo

Pydantic models

Pydantic models can also be used for stucts.

from pydantic import BaseModel, constr

class TitleInfo(BaseModel):
    heading: constr(min_length=2)
    subheading: Optional[str]

class Book:
    title: TitleInfo


Alternatively, you can use attrs.

import attrs

class TableOfContentsItem:
    foo: str
    bar: int

class Book:
    table_of_contents: list[TableOfContentsItem]


Both dataclass and Pydantic structs are implemented using the PyArrow serialization format, a high-performance schema for data serialization.
This data is stored “value only”, i.e. without keys, so any change to these structs over time will invalidate historical data. To support feature values where the schema changes over time, we introduced the Document struct type.
Documents are serialized as JSON and supports changes to schema over time, at the cost of a small performance penalty.

from pydantic import BaseModel
from chalk import Document

class AuthorInfo(BaseModel):
    first_name: str
    last_name: str

class Book:
    title: Document[AuthorInfo]


Features can be vectors (fixed sized arrays of floats), such as the output from an embedding model. Unlike list or set features, vector features are compatible with embedding functions and nearest neighbor similarity search.

from chalk.features import features, Vector

class Document:
    embedding: Vector[1536]  # Defines a vector with 1536 dimensions

When using the built-in embedding functions, then the vector dimensions don’t need to be specified, as Chalk will automatically infer it from the embedding model.

from chalk import features, embedding, vector

class Document:
    content: str
    # Chalk knows that text-embedding-ada-002 has 1536 dimensions
    embedding: Vector = embedding(

By default, vectors are persisted as 16 bit floats for efficiency. However, Chalk also supports persisting vectors as 32-bit or 64-bit floats via the keywords “f32” and “f64”, respectively.

from chalk import Vector

class Document:
    # Defines a vector with 1536 dimensions that will be persisted with 32 bit precision 
    embedding: Vector["f32", 1536]

Custom serializers

Finally, if you have an object that you want to serialize that isn’t from dataclass, attrs, or pydantic, you can write a custom codec. This custom codec must target a type that can be serialized to a PyArrow data type, which is the underlying serialization format for all features.

Consider the custom class below:

class CustomStruct:
    def __init__(self, foo: str, bar: int) -> None: = foo = bar

    def __eq__(self, other: object) -> bool:
        return (
            isinstance(other, CustomStruct)
            and ==
            and ==

    def __hash__(self) -> int:
        return hash((,

Here, we use the custom class as a feature, and provide an encoder and decoder. The encoder takes an instance of the custom type and outputs a Python object, and the decoder takes output of the encoder and creates an instance of the custom type

class Book:
    custom_field: CustomStruct = feature(
        encoder=lambda x: dict(,,
        decoder=lambda x: CustomStruct(**x),