Chalk home page
Docs
API
CLI
  1. Features
  2. Feature Types

Scalars

Features can be any primitive Python type:

from enum import Enum

class Genre(Enum):
    FICTION = "FICTION"
    NONFICTION = "NONFICTION"
    DRAMA = "DRAMA"
    POETRY = "POETRY"

@features
class Book:
    id: int
    name: str
    publish_date: date
    copyright_ended_at: datetime | None
    genre: Genre

Lists and Sets

Features can be a list or set of scalar or struct types.

@dataclass
class Chapter:
    start_page: int
    end_page: int

@features
class Book:
    authors: list[str]
    categories: set[str]
    chapters: list[Chapter]

Structs

Features can be a struct, which is a collection that maps a fixed set of keys to sub-fields. You can use dataclasses, Pydantic models, and attrs classes to represent struct features. Struct features can be used recursively within list features or other struct features.

Struct types should be used for objects that don’t have ids. If an object has an id, consider using has-one.

Dataclass

You can use any dataclass as a struct feature.

@dataclass
class JacketInfo:
    title: str
    subtitle: str
    body: str

@features
class Book:
    id: int
    jacket_info: JacketInfo

Pydantic models

Pydantic models can also be used for stucts.

from pydantic import BaseModel, constr

class TitleInfo(BaseModel):
    heading: constr(min_length=2)
    subheading: Optional[str]

@features
class Book:
    title: TitleInfo
    ...

Attrs

Alternatively, you can use attrs.

import attrs

@attrs.define
class TableOfContentsItem:
    foo: str
    bar: int

@features
class Book:
    table_of_contents: list[TableOfContentsItem]
    ...

Document

Both dataclass and Pydantic structs are implemented using the PyArrow serialization format, a high-performance schema for data serialization.
This data is stored “value only”, i.e. without keys, so any change to these structs over time will invalidate historical data. To support feature values where the schema changes over time, we introduced the Document struct type.
Documents are serialized as JSON and supports changes to schema over time, at the cost of a small performance penalty.

from pydantic import BaseModel
from chalk import Document

class AuthorInfo(BaseModel):
    first_name: str
    last_name: str

@features
class Book:
    title: Document[AuthorInfo]
    ...

Vectors

Features can be vectors (fixed sized arrays of floats), such as the output from an embedding model. Unlike list or set features, vector features are compatible with embedding functions and nearest neighbor similarity search.

from chalk.features import features, Vector

@features
class Document:
    embedding: Vector[1536]  # Defines a vector with 1536 dimensions

When using the built-in embedding functions, then the vector dimensions don’t need to be specified, as Chalk will automatically infer it from the embedding model.

from chalk import features, embedding, vector

@features
class Document:
    content: str
    # Chalk knows that text-embedding-ada-002 has 1536 dimensions
    embedding: Vector = embedding(
        feature="content", 
        provider="openai", 
        model="text-embedding-ada-002"
    )

By default, vectors are persisted as 16 bit floats for efficiency. However, Chalk also supports persisting vectors as 32-bit or 64-bit floats via the keywords “f32” and “f64”, respectively.

from chalk import Vector

@features
class Document:
    # Defines a vector with 1536 dimensions that will be persisted with 32 bit precision 
    embedding: Vector["f32", 1536]

Custom serializers

Finally, if you have an object that you want to serialize that isn’t from dataclass, attrs, or pydantic, you can write a custom codec. This custom codec must target a type that can be serialized to a PyArrow data type, which is the underlying serialization format for all features.

Consider the custom class below:

class CustomStruct:
    def __init__(self, foo: str, bar: int) -> None:
        self.foo = foo
        self.bar = bar

    def __eq__(self, other: object) -> bool:
        return (
            isinstance(other, CustomStruct)
            and self.foo == other.bar
            and self.bar == other.bar
        )

    def __hash__(self) -> int:
        return hash((self.foo, self.bar))

Here, we use the custom class as a feature, and provide an encoder and decoder. The encoder takes an instance of the custom type and outputs a Python object, and the decoder takes output of the encoder and creates an instance of the custom type

@features
class Book:
    custom_field: CustomStruct = feature(
        encoder=lambda x: dict(foo=x.foo, bar=x.bar),
        decoder=lambda x: CustomStruct(**x),
    )