Features
Automatically calculate embeddings from existing features
Embedding models are generally used to compute a vector feature. Chalk includes built-in support for common embedding models, or you can define your own embedding model through a resolver.
Chalk includes built-in support for both open-source and hosted embedding models via the embedding function. We recommend using this function when possible, as Chalk will automatically handle batching and retries, and you don’t need to specify the vector size. The main arguments for this function are:
input (required): A lambda that returns the feature that will be embedded. If the embedding model takes multiple inputs (such as with INSTRUCTOR, which requires the instruction along with the content), then this lambda should return a tuple of the feature references (or a string and a feature reference, if the instruction is constant). See the INSTRUCTOR example below. If you would like to use multiple features as input, you can define a resolver to combine these features into one and then reference the combined feature as the input; a sketch of this pattern follows the argument list below.
provider (required): The embedding model provider. The currently supported providers are sentence-transformers, instructor, openai, and cohere. Chalk may add more providers in the future.
model (required): The name of the model to use. Each provider has a different set of models that are supported.
max_staleness (optional): The duration for which the embedding will be cached. By default, the embedding vector will be cached for the same duration as the feature. If you would like different behavior, you can specify this argument explicitly.
For the complete signature, please see the API docs.
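For example, here is a minimal sketch of combining multiple features into a single embedding input, using a hypothetical Profile feature class: a resolver concatenates the name and bio features into a combined embedding_input feature, which the input lambda then references (the class, feature, and resolver names here are illustrative, not part of Chalk's API).

from chalk.features import embedding, features, online, Vector

@features
class Profile:
    name: str
    bio: str
    # Hypothetical combined feature that serves as the embedding input.
    embedding_input: str
    embedding: Vector = embedding(
        input=lambda: Profile.embedding_input,
        provider="sentence-transformers",
        model="all-MiniLM-L6-v2",
    )

@online
def combine_embedding_input(name: Profile.name, bio: Profile.bio) -> Profile.embedding_input:
    # Concatenate the underlying features into the single string to embed.
    return f"{name}: {bio}"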
Chalk supports all models that are part of the sentence-transformers framework. It is recommended to use the all-MiniLM-L6-v2 model, though all pre-trained models are supported.
from chalk.features import embedding, features, Vector

@features
class Document:
    content: str
    embedding: Vector = embedding(
        input=lambda: Document.content,
        provider="sentence-transformers",
        model="all-MiniLM-L6-v2",
    )
Chalk supports INSTRUCTOR embedding models. When using this provider, the input lambda should return a tuple of the instruction and feature to encode. See the available models here. If the instruction is the same for every row, you can use a literal (constant) string.
from chalk.features import embedding, features, Vector

@features
class Document:
    content: str
    embedding: Vector = embedding(
        input=lambda: ("Represent the Legal document: ", Document.content),
        provider="instructor",
        model="hkunlp/instructor-base",
    )
However, if you have multiple types of documents, you can use another feature to represent the instruction and define a resolver to compute it.
from chalk.features import embedding, features, online, Vector

@features
class Document:
    content: str
    document_type: str
    instruction: str
    embedding: Vector = embedding(
        input=lambda: (Document.instruction, Document.content),
        provider="instructor",
        model="hkunlp/instructor-base",
    )

@online
def generate_instruction(document_type: Document.document_type) -> Document.instruction:
    return f"Represent the {document_type} document: "
Chalk can proxy calls to the OpenAI Embeddings API. It is recommended to use the text-embedding-ada-002 model, though all OpenAI embedding models are supported. If you don’t already have an OpenAI account, sign up here, and then create an OpenAI Integration in Chalk. All OpenAI requests will be attributed to your API key. To minimize usage, we highly recommend specifying an appropriate max staleness in Chalk, which will ensure that embeddings are cached.
from chalk.features import embedding, features, Vector

@features
class Document:
    content: str
    embedding: Vector = embedding(
        input=lambda: Document.content,
        provider="openai",
        model="text-embedding-ada-002",
        max_staleness="infinity",
    )
Chalk can proxy calls to Cohere Embed. To use this integration, first sign up for a Cohere account, and then create a Cohere Integration in Chalk. All Cohere requests will be attributed to your API key. To minimize usage, we highly recommend specifying an appropriate max staleness in Chalk, which will ensure that embeddings are cached.
from chalk.features import embedding, features, Vector

@features
class Document:
    content: str
    embedding: Vector = embedding(
        input=lambda: Document.content,
        provider="cohere",
        model="embed-english-v2.0",
        max_staleness="infinity",
    )
If you would like to run your own embedding model, you can define a custom resolver that computes the embedding from existing features in the feature class. For performance, we recommend storing the model weights in an object store (such as AWS S3 or GCS) rather than including them in your source code, and loading the model with a boot hook.
from chalk.features import before_all, DataFrame, features, online, Vector

# MyModel is a placeholder for your own model class.
my_model = MyModel()

@before_all
def load_my_model():
    # Load the model weights from object storage when the deployment boots.
    my_model.initialize("s3://my-bucket/my-checkpoint.pt")

@features
class Document:
    content: str
    # When using a custom embedding function, the size of the vector must be specified.
    embedding: Vector[1536]

@online
def my_embedding_function(content: DataFrame[Document.content]) -> DataFrame[Document.embedding]:
    return my_model.embed(content.to_arrow()['document.content'])
Chalk will then call my_embedding_function whenever an embedding is needed.
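Once the embedding feature is defined, whether through a built-in provider or a custom resolver, it can be requested like any other feature. The following is a rough sketch using the Chalk Python client; it assumes the Document feature class is fully defined (including a primary key) and that content can be supplied directly as query input.

from chalk.client import ChalkClient

client = ChalkClient()
# Request the embedding for a document; Chalk runs the embedding
# resolver (built-in or custom) to produce the vector.
result = client.query(
    input={Document.content: "An example legal document..."},
    output=[Document.embedding],
)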