Features can be any primitive Python type:
```python
from datetime import date, datetime
from enum import Enum

from chalk.features import features

class Genre(Enum):
    FICTION = "FICTION"
    NONFICTION = "NONFICTION"
    DRAMA = "DRAMA"
    POETRY = "POETRY"

@features
class Book:
    id: int
    name: str
    publish_date: date
    copyright_ended_at: datetime | None
    genre: Genre
```
Features can also be collections of primitives or structs, such as lists and sets:

```python
from dataclasses import dataclass

from chalk.features import features

@dataclass
class Chapter:
    start_page: int
    end_page: int

@features
class Book:
    authors: list[str]
    categories: set[str]
    chapters: list[Chapter]
```
Features can be structs: collections that map a fixed set of keys to sub-fields. You can use dataclasses, Pydantic models, and attrs classes to represent struct features. Struct features can be used recursively within list features or other struct features.
Struct types should be used for objects that don’t have ids. If an object has an id, consider using a has-one relationship instead.
You can use any dataclass as a struct feature:
```python
from dataclasses import dataclass

from chalk.features import features

@dataclass
class JacketInfo:
    title: str
    subtitle: str
    body: str

@features
class Book:
    id: int
    jacket_info: JacketInfo
```
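As noted above, struct features can also nest: a dataclass struct may contain other structs or lists of structs. A minimal sketch of the recursive case (the Footnote and Appendix classes here are illustrative, not part of the examples above):

```python
from dataclasses import dataclass

from chalk.features import features

@dataclass
class Footnote:
    marker: str
    text: str

@dataclass
class Appendix:
    title: str
    footnotes: list[Footnote]  # a list of structs nested inside another struct

@features
class Book:
    id: int
    appendices: list[Appendix]  # a list of structs, each containing more structs
```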
Pydantic models can also be used for structs:
```python
from typing import Optional

from pydantic import BaseModel, constr

from chalk.features import features

class TitleInfo(BaseModel):
    heading: constr(min_length=2)
    subheading: Optional[str]

@features
class Book:
    title: TitleInfo
    ...
```
Alternatively, you can use attrs classes:

```python
import attrs

from chalk.features import features

@attrs.define
class TableOfContentsItem:
    foo: str
    bar: int

@features
class Book:
    table_of_contents: list[TableOfContentsItem]
    ...
```
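Whichever flavor you choose, struct feature values are constructed like ordinary Python objects. A minimal sketch reusing the dataclass example above (the field values are illustrative):

```python
# Construct a Book feature instance whose jacket_info field is a struct.
book = Book(
    id=1,
    jacket_info=JacketInfo(
        title="The Voyage Out",
        subtitle="A Novel",
        body="...",
    ),
)
```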
Dataclass, Pydantic, and attrs structs are all serialized with PyArrow, a high-performance format for data serialization.
This data is stored “value only”, i.e. without keys, so any change to these structs over time will invalidate historical data. To support feature values whose schema changes over time, we introduced the Document struct type.
Documents are serialized as JSON and support changes to the schema over time, at the cost of a small performance penalty.
```python
from pydantic import BaseModel

from chalk import Document
from chalk.features import features

class AuthorInfo(BaseModel):
    first_name: str
    last_name: str

@features
class Book:
    author: Document[AuthorInfo]
    ...
```
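As an illustration of that flexibility, a field can later be added to the model without invalidating previously stored values, provided it has a default. A sketch under that assumption (the middle_name field below is hypothetical):

```python
from typing import Optional

from pydantic import BaseModel

class AuthorInfo(BaseModel):
    first_name: str
    last_name: str
    # Added after historical values were written; older JSON payloads
    # simply lack this key, so decoding falls back to the default.
    middle_name: Optional[str] = None
```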
Features can be vectors (fixed-size arrays of floats), such as the output of an embedding model. Unlike list or set features, vector features are compatible with embedding functions and nearest-neighbor similarity search.
```python
from chalk.features import features, Vector

@features
class Document:
    embedding: Vector[1536]  # Defines a vector with 1536 dimensions
```
When using the built-in embedding functions, the vector dimensions don’t need to be specified; Chalk automatically infers them from the embedding model.
```python
from chalk.features import embedding, features, Vector

@features
class Document:
    content: str
    # Chalk knows that text-embedding-ada-002 has 1536 dimensions
    embedding: Vector = embedding(
        feature="content",
        provider="openai",
        model="text-embedding-ada-002",
    )
```
By default, vectors are persisted as 16-bit floats for efficiency. However, Chalk also supports persisting vectors as 32-bit or 64-bit floats via the keywords “f32” and “f64”, respectively. At 1536 dimensions, for example, that is roughly 3 KB per vector at 16-bit precision versus roughly 6 KB at 32-bit.
```python
from chalk.features import features, Vector

@features
class Document:
    # Defines a vector with 1536 dimensions, persisted with 32-bit precision
    embedding: Vector["f32", 1536]
```
Finally, if you have an object that you want to serialize that isn’t a dataclass, Pydantic model, or attrs class, you can write a custom codec. The codec must encode to a type that can be serialized as a PyArrow data type, since PyArrow is the underlying serialization format for all features.
Consider the custom class below:
```python
class CustomStruct:
    def __init__(self, foo: str, bar: int) -> None:
        self.foo = foo
        self.bar = bar

    def __eq__(self, other: object) -> bool:
        return (
            isinstance(other, CustomStruct)
            and self.foo == other.foo
            and self.bar == other.bar
        )

    def __hash__(self) -> int:
        return hash((self.foo, self.bar))
```
Here, we use the custom class as a feature and provide an encoder and decoder. The encoder takes an instance of the custom type and outputs a serializable Python object, and the decoder takes the output of the encoder and reconstructs an instance of the custom type:
```python
from chalk.features import feature, features

@features
class Book:
    custom_field: CustomStruct = feature(
        encoder=lambda x: dict(foo=x.foo, bar=x.bar),
        decoder=lambda x: CustomStruct(**x),
    )
```
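A quick sanity check for a codec pair is that decoding an encoded value reproduces the original. A minimal sketch in plain Python, using the same encoder and decoder logic as above (this runs outside of Chalk entirely):

```python
def encode(x: CustomStruct) -> dict:
    return dict(foo=x.foo, bar=x.bar)

def decode(d: dict) -> CustomStruct:
    return CustomStruct(**d)

original = CustomStruct(foo="preface", bar=12)
assert decode(encode(original)) == original  # round-trip preserves equality
```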