Features
Find the nearest neighbors across a vector relationship
A feature class can be linked to the closest examples of another feature class. This functionality can be useful for search and retrieval applications.
Nearest neighbor relationships are only supported for vector features.
We recommend first reading about Chalk's support for vector features and embedding functions.
To illustrate how to use nearest neighbor relationships, we’ll walk through an example of how Chalk can power FAQ search.
In this example, the SearchQuery feature class represents an incoming request, and the FAQDocument feature class represents our collection of frequently asked questions and answers. Our goal is to return the five most relevant FAQ entries for the given search query.
Using the has_many function and the is_near method, we can express a relationship where we want the nearest documents for each query.
from chalk import embedding
from chalk.features import DataFrame, features, has_many, Vector


@features
class SearchQuery:
    query: str
    product_version: int
    embedding: Vector = embedding(...)
    # Quoted so that FAQDocument, defined below, is not evaluated at class creation
    nearest_documents: "DataFrame[FAQDocument]" = has_many(
        lambda: SearchQuery.embedding.is_near(
            FAQDocument.embedding
        )
    )
    response: str


@features
class FAQDocument:
    question: str
    product_version: int
    # Named to match the FAQDocument.embedding reference in the relationship
    embedding: Vector = embedding(...)
    link: str
The lambda solves forward references, letting you reference SearchQuery and FAQDocument before they are defined.
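The forward-reference behavior can be seen in plain Python, without Chalk. The following is a minimal sketch: the Relationship class is a hypothetical stand-in that only stores a callable, and is not how has_many is implemented. It shows that names inside a lambda are looked up when the lambda is called, not when it is written.

```python
class Relationship:
    """Hypothetical stand-in for a relationship declaration; not Chalk's has_many."""

    def __init__(self, condition):
        # Store the callable without invoking it.
        self._condition = condition

    def resolve(self):
        # Names inside the lambda are looked up only at this point.
        return self._condition()


class Query:
    label = "query"
    # Neither Query nor Document is bound while this class body runs,
    # but the lambda body has not executed yet, so no NameError is raised.
    nearest = Relationship(lambda: (Query.label, Document.label))


class Document:
    label = "document"


# Both classes exist by the time the relationship is resolved.
resolved = Query.nearest.resolve()
print(resolved)
```

Writing Relationship((Query.label, Document.label)) without the lambda would fail with a NameError, because the tuple would be built while the Query class body is still executing.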
Nearest neighbor relationships use a distance function to measure closeness. By default, Chalk uses L2 distance, though inner product and cosine similarity are also supported. To change the distance function, use the distance argument:
from chalk.features import DataFrame, features, has_many


@features
class SearchQuery:
    # Other SearchQuery features are unchanged from the previous example
    nearest_documents: "DataFrame[FAQDocument]" = has_many(
        lambda: SearchQuery.embedding.is_near(
            FAQDocument.embedding,
            distance="cosine",  # or "inner product"
        )
    )
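To make the three options concrete, here is a small plain-Python sketch of the textbook definitions of L2 distance, inner product, and cosine similarity. The function names and sample vectors are illustrative assumptions, and the sketch says nothing about how Chalk computes or indexes these measures internally. It shows that the choice of measure can change which document counts as nearest.

```python
import math


def l2_distance(a, b):
    # Euclidean distance: smaller means closer.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    # Dot product: larger means more similar; sensitive to vector length.
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Dot product of the normalized vectors: larger means more similar,
    # and vector length is ignored.
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # points the same way as the query, but longer
orthogonal = [0.0, 1.0]      # unrelated direction, but close in space

# By L2 distance the orthogonal vector is nearer to the query...
l2_prefers_orthogonal = l2_distance(query, orthogonal) < l2_distance(query, same_direction)
# ...while cosine similarity ranks the same-direction vector higher.
cosine_prefers_same = cosine_similarity(query, same_direction) > cosine_similarity(query, orthogonal)
print(l2_prefers_orthogonal, cosine_prefers_same)
```

When embeddings are normalized to unit length, the three measures produce the same ranking, so the choice matters most for unnormalized vectors.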
It’s possible to use this relationship as a has-many resolver input. The resulting documents will be returned as a Chalk DataFrame. Because the search is approximate, the number of documents to return must be specified via a slice expression.
from chalk import online


@online
def generate_response(
    # Query for the five most relevant documents, and select their links
    nearest_documents: SearchQuery.nearest_documents[FAQDocument.link, :5],
) -> SearchQuery.response:
    return "\n".join(nearest_documents[FAQDocument.link])
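Conceptually, the slice in the resolver signature asks for the k closest documents. The sketch below is an exact, brute-force version of that idea in plain Python, using L2 distance. It is only an illustration: the nearest_k helper and the sample data are hypothetical, and an approximate search may return slightly different results than this exhaustive scan.

```python
import heapq
import math


def nearest_k(query_vec, documents, k):
    """Return the links of the k documents closest to query_vec by L2 distance.

    documents is a list of (link, vector) pairs.
    """
    def distance_to_query(doc):
        _, vec = doc
        return math.dist(query_vec, vec)

    closest = heapq.nsmallest(k, documents, key=distance_to_query)
    return [link for link, _ in closest]


# Hypothetical two-dimensional embeddings, for illustration only.
documents = [
    ("/faq/a", [0.0, 0.0]),
    ("/faq/b", [1.0, 1.0]),
    ("/faq/c", [5.0, 5.0]),
    ("/faq/d", [0.5, 0.0]),
]

links = nearest_k([0.0, 0.0], documents, k=2)
response = "\n".join(links)
print(response)
```

The scan compares the query against every document, which is why large collections typically rely on approximate search and an explicit limit instead.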
Filters can be included directly in the resolver's input signature for more accurate results. These filters are applied before the limit.
When using a nearest neighbor relationship, do not filter within the resolver.
Filtering inside the resolver will be performed after the limit is applied, which may filter out all returned neighbors if none of them match the filter expression.
Don't filter like this
from chalk import online


@online
def generate_response(
    version: SearchQuery.product_version,
    nearest_documents: SearchQuery.nearest_documents[
        FAQDocument.link,
        FAQDocument.product_version,
        :5,
    ],
) -> SearchQuery.response:
    # Don't do this! If the nearest five documents are all for a different product version,
    # then filtered_nearest_documents will be empty
    filtered_nearest_documents = nearest_documents[FAQDocument.product_version == version]
    return "\n".join(filtered_nearest_documents[FAQDocument.link])
Instead, specify the filter conditions in the resolver signature. This ensures that the filter is applied before the limit, so the nearest five documents that match all of the filters are returned.
Filter like this
from chalk import online


@online
def generate_response(
    filtered_nearest_documents: SearchQuery.nearest_documents[
        FAQDocument.link,
        FAQDocument.product_version == SearchQuery.product_version,
        :5,
    ],
) -> SearchQuery.response:
    return "\n".join(filtered_nearest_documents[FAQDocument.link])
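The difference between the two resolvers can be reproduced in plain Python. The sketch below uses hypothetical documents and a brute-force nearest helper, neither of which is part of Chalk, to compare limiting before filtering with filtering before limiting.

```python
import math

# Hypothetical documents: the version 1 entries sit closest to the query.
documents = [
    {"link": "/v1/install", "product_version": 1, "vec": [0.1, 0.0]},
    {"link": "/v1/login", "product_version": 1, "vec": [0.2, 0.0]},
    {"link": "/v2/install", "product_version": 2, "vec": [0.9, 0.0]},
    {"link": "/v2/login", "product_version": 2, "vec": [1.0, 0.0]},
]


def nearest(query_vec, docs, limit):
    # Exact nearest neighbors by L2 distance, truncated to `limit` results.
    return sorted(docs, key=lambda d: math.dist(query_vec, d["vec"]))[:limit]


query_vec = [0.0, 0.0]
version = 2

# Limit first, then filter: both nearest documents are version 1,
# so nothing survives the filter.
post_filtered = [
    d["link"]
    for d in nearest(query_vec, documents, limit=2)
    if d["product_version"] == version
]

# Filter first, then limit: the two nearest version 2 documents are returned.
matching = [d for d in documents if d["product_version"] == version]
pre_filtered = [d["link"] for d in nearest(query_vec, matching, limit=2)]

print(post_filtered)
print(pre_filtered)
```

Raising the limit in the first approach only hides the problem, since any fixed limit can still be filled entirely by non-matching neighbors.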