Features
Automatically calculate embeddings from existing features
Embedding models are generally used to compute a vector feature. Chalk includes built-in support for common embedding models, or you can define your own embedding model through a resolver.
Chalk includes built-in support for both open-source and hosted embedding models via the embedding function. We recommend using this function when possible, as Chalk will automatically handle batching and retries, and you don’t need to specify the vector size. The main arguments for this function are:
input (required): A lambda that returns the feature that will be embedded. If the embedding model takes multiple inputs (such as with INSTRUCTOR, which requires the instruction along with the content), then this lambda should return a tuple of the feature references (or a string and a feature reference, if the instruction is constant). See the INSTRUCTOR example below. If you would like to use multiple features as input, you can define a resolver to combine these features into one and then reference the combined feature as the input; a sketch of this pattern follows the argument list below.
provider (required): The embedding model provider. The currently supported providers are sentence-transformers, instructor, openai, and cohere. Chalk may add more providers in the future.
model (required): The name of the model to use. Each provider has a different set of models that are supported.
max_staleness (optional): The duration for which the embedding will be cached. By default, the embedding vector will be cached for the same duration as the feature. If you would like different behavior, you can specify this argument explicitly.
For the complete signature, please see the API docs.
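For example, here is a minimal sketch of combining multiple features into a single embedding input, using a hypothetical Profile feature class: a resolver concatenates the name and bio features into a combined embedding_input feature, which the input lambda then references (the class, feature, and resolver names here are illustrative, not part of Chalk's API).

from chalk.features import embedding, features, online, Vector

@features
class Profile:
    name: str
    bio: str
    # Hypothetical combined feature that serves as the embedding input.
    embedding_input: str
    embedding: Vector = embedding(
        input=lambda: Profile.embedding_input,
        provider="sentence-transformers",
        model="all-MiniLM-L6-v2",
    )

@online
def combine_embedding_input(name: Profile.name, bio: Profile.bio) -> Profile.embedding_input:
    # Concatenate the underlying features into the single string to embed.
    return f"{name}: {bio}"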
Chalk supports all models that are part of the sentence-transformers framework. It is recommended to use the all-MiniLM-L6-v2 model, though all pre-trained models are supported.
from chalk.features import embedding, features, Vector

@features
class Document:
    content: str
    embedding: Vector = embedding(
        input=lambda: Document.content,
        provider="sentence-transformers",
        model="all-MiniLM-L6-v2",
    )
Chalk supports INSTRUCTOR embedding models. When using this provider, the input lambda should return a tuple of the instruction and feature to encode. See the available models here. If the instruction is the same for every row, you can use a literal (constant) string.
from chalk.features import embedding, features, Vector

@features
class Document:
    content: str
    embedding: Vector = embedding(
        input=lambda: ("Represent the Legal document: ", Document.content),
        provider="instructor",
        model="hkunlp/instructor-base",
    )
However, if you have multiple types of documents, you can use another feature to represent the instruction and define a resolver to compute it.
from chalk.features import embedding, features, online, Vector

@features
class Document:
    content: str
    document_type: str
    instruction: str
    embedding: Vector = embedding(
        input=lambda: (Document.instruction, Document.content),
        provider="instructor",
        model="hkunlp/instructor-base",
    )

@online
def generate_instruction(document_type: Document.document_type) -> Document.instruction:
    return f"Represent the {document_type} document: "
Chalk can proxy calls to the OpenAI Embeddings API. It is recommended to use the text-embedding-ada-002 model, though all OpenAI embedding models are supported. If you don’t already have an OpenAI account, sign up here, and then create an OpenAI Integration in Chalk. All OpenAI requests will be attributed to your API key. To minimize usage, we highly recommend specifying an appropriate max staleness in Chalk, which will ensure that embeddings are cached.
from chalk.features import embedding, features, Vector

@features
class Document:
    content: str
    embedding: Vector = embedding(
        input=lambda: Document.content,
        provider="openai",
        model="text-embedding-ada-002",
        max_staleness="infinity",
    )
Chalk can proxy calls to Cohere Embed. To use this integration, first sign up for a Cohere account, and then create a Cohere Integration in Chalk. All Cohere requests will be attributed to your API key. To minimize usage, we highly recommend specifying an appropriate max staleness in Chalk, which will ensure that embeddings are cached.
from chalk.features import embedding, features, Vector

@features
class Document:
    content: str
    embedding: Vector = embedding(
        input=lambda: Document.content,
        provider="cohere",
        model="embed-english-v2.0",
        max_staleness="infinity",
    )
If you would like to run your own embedding model, you can define a custom resolver that computes the embedding from existing features in the feature class. For performance, we recommend storing the model weights in an object store (such as AWS S3 or GCS) rather than including them in your source code, and loading the model with a boot hook.
from chalk.features import before_all, DataFrame, features, online, Vector

# MyModel is a placeholder for your own model class.
my_model = MyModel()

@before_all
def load_my_model():
    # Load the model weights from object storage when the deployment boots.
    my_model.initialize("s3://my-bucket/my-checkpoint.pt")

@features
class Document:
    content: str
    # When using a custom embedding function, the size of the vector must be specified.
    embedding: Vector[1536]

@online
def my_embedding_function(content: DataFrame[Document.content]) -> DataFrame[Document.embedding]:
    return my_model.embed(content.to_arrow()['document.content'])
Chalk will then call my_embedding_function whenever an embedding is needed.
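Once the embedding feature is defined, whether through a built-in provider or a custom resolver, it can be requested like any other feature. The following is a rough sketch using the Chalk Python client; it assumes the Document feature class is fully defined (including a primary key) and that content can be supplied directly as query input.

from chalk.client import ChalkClient

client = ChalkClient()
# Request the embedding for a document; Chalk runs the embedding
# resolver (built-in or custom) to produce the vector.
result = client.query(
    input={Document.content: "An example legal document..."},
    output=[Document.embedding],
)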