Features can be any primitive Python type:
```python
from datetime import date, datetime
from enum import Enum

from chalk.features import features

class Genre(Enum):
    FICTION = "FICTION"
    NONFICTION = "NONFICTION"
    DRAMA = "DRAMA"
    POETRY = "POETRY"

@features
class Book:
    id: int
    name: str
    publish_date: date
    copyright_ended_at: datetime | None
    genre: Genre
```
Features can also be collections of primitives or structs, such as lists and sets:

```python
from dataclasses import dataclass

from chalk.features import features

@dataclass
class Chapter:
    start_page: int
    end_page: int

@features
class Book:
    authors: list[str]
    categories: set[str]
    chapters: list[Chapter]
```
Features can be structs: collections that map a fixed set of keys to sub-fields. You can use dataclasses, Pydantic models, and attrs classes to represent struct features. Struct features can be used recursively within list features or other struct features.
Struct types should be used for objects that don’t have ids. If an object has an id, consider using a has-one relationship instead.
You can use any dataclass as a struct feature:
```python
from dataclasses import dataclass

from chalk.features import features

@dataclass
class JacketInfo:
    title: str
    subtitle: str
    body: str

@features
class Book:
    id: int
    jacket_info: JacketInfo
```
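As noted above, struct features can also nest: a dataclass struct may contain other structs or lists of structs. A minimal sketch of the recursive case (the Footnote and Appendix classes here are illustrative, not part of the examples above):

```python
from dataclasses import dataclass

from chalk.features import features

@dataclass
class Footnote:
    marker: str
    text: str

@dataclass
class Appendix:
    title: str
    footnotes: list[Footnote]  # a list of structs nested inside another struct

@features
class Book:
    id: int
    appendices: list[Appendix]  # a list of structs, each containing more structs
```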
Pydantic models can also be used for structs:
```python
from typing import Optional

from pydantic import BaseModel, constr

from chalk.features import features

class TitleInfo(BaseModel):
    heading: constr(min_length=2)
    subheading: Optional[str]

@features
class Book:
    title: TitleInfo
    ...
```
Alternatively, you can use attrs classes:

```python
import attrs

from chalk.features import features

@attrs.define
class TableOfContentsItem:
    foo: str
    bar: int

@features
class Book:
    table_of_contents: list[TableOfContentsItem]
    ...
```
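Whichever flavor you choose, struct feature values are constructed like ordinary Python objects. A minimal sketch reusing the dataclass example above (the field values are illustrative):

```python
# Construct a Book feature instance whose jacket_info field is a struct.
book = Book(
    id=1,
    jacket_info=JacketInfo(
        title="The Voyage Out",
        subtitle="A Novel",
        body="...",
    ),
)
```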
Dataclass, Pydantic, and attrs structs are all serialized with PyArrow, a high-performance format for data serialization.
This data is stored “value only”, i.e. without keys, so any change to these structs over time will invalidate historical data. To support feature values whose schema changes over time, we introduced the Document struct type.
Documents are serialized as JSON and support changes to the schema over time, at the cost of a small performance penalty.
```python
from pydantic import BaseModel

from chalk import Document
from chalk.features import features

class AuthorInfo(BaseModel):
    first_name: str
    last_name: str

@features
class Book:
    author: Document[AuthorInfo]
    ...
```
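As an illustration of that flexibility, a field can later be added to the model without invalidating previously stored values, provided it has a default. A sketch under that assumption (the middle_name field below is hypothetical):

```python
from typing import Optional

from pydantic import BaseModel

class AuthorInfo(BaseModel):
    first_name: str
    last_name: str
    # Added after historical values were written; older JSON payloads
    # simply lack this key, so decoding falls back to the default.
    middle_name: Optional[str] = None
```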
Features can be vectors (fixed-size arrays of floats), such as the output of an embedding model. Unlike list or set features, vector features are compatible with embedding functions and nearest-neighbor similarity search.
```python
from chalk.features import features, Vector

@features
class Document:
    embedding: Vector[1536]  # Defines a vector with 1536 dimensions
```
When using the built-in embedding functions, the vector dimensions don’t need to be specified; Chalk automatically infers them from the embedding model.
```python
from chalk.features import embedding, features, Vector

@features
class Document:
    content: str
    # Chalk knows that text-embedding-ada-002 has 1536 dimensions
    embedding: Vector = embedding(
        feature="content",
        provider="openai",
        model="text-embedding-ada-002",
    )
```
By default, vectors are persisted as 16-bit floats for efficiency. However, Chalk also supports persisting vectors as 32-bit or 64-bit floats via the keywords “f32” and “f64”, respectively. At 1536 dimensions, for example, that is roughly 3 KB per vector at 16-bit precision versus roughly 6 KB at 32-bit.
```python
from chalk.features import features, Vector

@features
class Document:
    # Defines a vector with 1536 dimensions, persisted with 32-bit precision
    embedding: Vector["f32", 1536]
```
Finally, if you have an object that you want to serialize that isn’t a dataclass, Pydantic model, or attrs class, you can write a custom codec. The codec must encode to a type that can be serialized as a PyArrow data type, since PyArrow is the underlying serialization format for all features.
Consider the custom class below:
```python
class CustomStruct:
    def __init__(self, foo: str, bar: int) -> None:
        self.foo = foo
        self.bar = bar

    def __eq__(self, other: object) -> bool:
        return (
            isinstance(other, CustomStruct)
            and self.foo == other.foo
            and self.bar == other.bar
        )

    def __hash__(self) -> int:
        return hash((self.foo, self.bar))
```
Here, we use the custom class as a feature and provide an encoder and decoder. The encoder takes an instance of the custom type and outputs a serializable Python object, and the decoder takes the output of the encoder and reconstructs an instance of the custom type:
```python
from chalk.features import feature, features

@features
class Book:
    custom_field: CustomStruct = feature(
        encoder=lambda x: dict(foo=x.foo, bar=x.bar),
        decoder=lambda x: CustomStruct(**x),
    )
```
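A quick sanity check for a codec pair is that decoding an encoded value reproduces the original. A minimal sketch in plain Python, using the same encoder and decoder logic as above (this runs outside of Chalk entirely):

```python
def encode(x: CustomStruct) -> dict:
    return dict(foo=x.foo, bar=x.bar)

def decode(d: dict) -> CustomStruct:
    return CustomStruct(**d)

original = CustomStruct(foo="preface", bar=12)
assert decode(encode(original)) == original  # round-trip preserves equality
```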