Chalk home page
Docs
API
CLI
  1. Features
  2. Nearest Neighbors (Vector Search)

A feature class can be linked to the closest examples of another feature class. This functionality can be useful for search and retrieval applications.

Nearest neighbor relationships are only supported for vector features.

We recommend to first take a look through Chalk's support for vector features and embedding functions.

To illustrate how to use nearest neighbor relationships, we’ll walk through an example for Chalk can power FAQ search. In this example, the SearchQuery feature class represents an incoming request, and the FAQDocument feature class represents our collection of frequently asked questions and answers. Our goal is to return the five most relevant FAQ entries for the given search query.


Defining nearest neighbor relationships

Using the has_many function and the is_near method, we can express a relationship where we want the nearest documents for each query.

from chalk import embedding
from chalk.features import DataFrame, features, has_many, Vector

@features
class SearchQuery:
  query: str
  product_version: int
  embedding: Vector = embedding(...)
  nearest_documents: DataFrame[FAQDocument] = has_many(
      lambda: SearchQuery.embedding.is_near(
          FAQDocument.embedding
      )
  )
  response: str

@features
class FAQDocument:
  question: str
  product_version: int
  question_embedding: Vector = embedding(...)
  link: str

The lambda solves forward references, letting you reference SearchQuery andFAQDocument before they are defined.

Distance Measure

Nearest neighbor relationships use a distance function to measure closeness. By default, Chalk uses L2 distance, though inner product and cosine similarity are also supported. To change the distance function, use the distance argument:

from chalk.features import DataFrame, features, has_many

@features
class SearchQuery:
  nearest_documents: DataFrame[Document] = has_many(
    lambda: SearchQuery.embedding.is_near(
      FAQDocument.embedding,
      distance="cosine",  # or "inner product"
    )
  )

Nearest neighbors as resolver inputs

It’s possible to use this relationship as a has-many resolver input. The resulting documents will be returned as a Chalk DataFrame. Because the search is approximate, the number of documents to return must be specified via a slice expression.

from chalk import online

@online
def generate_response(
  # Query for the five most relevant documents, and select their links
  nearest_documents: SearchQuery.nearest_documents[FAQDocument.link, :5],
) -> SearchQuery.response:
  return "\n".join(nearest_documents[FAQDocument.link])

Filtering

Inside the input argument signature, we can include filters for more accurate results. The filters will be applied before the limit is applied.

When using a nearest neighbor relationship, do not filter within the resolver.

Filtering inside the resolver will be performed after the limit is applied, which may filter out all returned neighbors if none of them match the filter expression.

Don't filter like this

from chalk import online

@online
def generate_response(
  version: SearchQuery.product_version,
  nearest_documents: SearchQuery.nearest_documents[
    FAQDocument.link,
    FAQDocument.product_version,
    :5,
  ],
) -> SearchQuery.response:
  # Don't do this! If the nearest five documents are all for a different product version,
  # then filtered_nearest_documents will be empty
  filtered_nearest_documents = nearest_documents[FAQDocument.product_version == version]
  return "\n".join(filtered_nearest_documents[FAQDocument.link])

Instead, specify the filter conditions in the resolver signature. This will ensure that the filter is applied before the limit, meaning that the nearest five documents that much all of the filters will be returned.

Filter like this

from chalk import online

@online
def generate_response(
  filtered_nearest_documents: SearchQuery.nearest_documents[
    FAQDocument.link,
    FAQDocument.product_version == SearchQuery.product_version,
    :5,
  ],
) -> SearchQuery.response:
  return "\n".join(filtered_nearest_documents[FAQDocument.link])