Features define the data that you want to compute and store. Your features are defined as Python classes that look like dataclasses. For example:
from chalk.features import features, DataFrame
from chalk import _

@features
class CreditCard:
    id: int
    user_id: "User.id"
    limit: float

@features
class User:
    id: int
    name: str
    email: str
    credit_cards: DataFrame[CreditCard]
    total_limit: float = _.credit_cards[_.limit].sum()
Features can be nested, as with credit_cards
above.
In this section, we'll dive into API components that make up the
building blocks of features.
The individual or team responsible for these features. The Chalk Dashboard will display this field, and alerts can be routed to owners.
Added metadata for features for use in filtering, aggregations, and visualizations. For example, you can use tags to assign features to a team and find all features for a given team.
When True, Chalk copies this feature into the online environment when it is computed in offline resolvers.
Setting etl_offline_to_online on a feature class assigns it to all features on the class which do not explicitly specify etl_offline_to_online.
When a feature is expensive or slow to compute, you may wish to cache its value.
Chalk uses the terminology "maximum staleness" to describe how recently a feature
value needs to have been computed to be returned without re-running a resolver.
Assigning a max_staleness
to the feature class assigns it to all features on the
class which do not explicitly specify a max_staleness
value of their own.
@features(
    owner="andy@chalk.ai",
    max_staleness="30m",
    etl_offline_to_online=True,
    tags="user-group",
)
class User:
    id: str
    # Comments here appear in the web!
    # :tags: pii
    name: str | None
    # :owner: userteam@mycompany.com
    location: LatLng
Add metadata and configuration to a feature.
You may also specify which person or group is responsible for a feature. The owner tag will be available in Chalk's web portal. Alerts that do not otherwise have an owner will be assigned to the owner of the monitored feature. Read more at Owner
from chalk.features import features, feature
from datetime import date
@features
class User:
id: str
# :owner: user-team@company.com
name: str
dob: date = feature(owner="user-team@company.com")
Add metadata to a feature for use in filtering, aggregations, and visualizations. For example, you can use tags to assign features to a team and find all features for a given team. Read more at Tags
from chalk.features import features, feature
from datetime import date

@features
class User:
    id: str
    # :tags: pii
    name: str
    dob: date = feature(tags=["pii"])
The maximum version for a feature. Versioned features can be
referred to with the @
operator:
@features
class User:
id: str
score: int = feature(version=2)
str(User.score @ 2)
"user.score@2"
See more at Versioning
The default version for a feature. When you reference a versioned feature without the @ operator, you reference the default_version. Set to 1 by default.
@features
class User:
id: str
score: int = feature(version=2, default_version=2)
str(User.score)
"user.score"
See more at Default versions
When a feature is expensive or slow to compute, you may wish to cache its value. Chalk uses the terminology "maximum staleness" to describe how recently a feature value needs to have been computed to be returned without re-running a resolver. Read more at Caching
When True (default), Chalk will cache all values, including nulls.
When False, Chalk will not update the null entry in the cache.
When "evict_nulls", Chalk will evict the entry that would have been null from the cache, if it exists.
Concretely, suppose the current state of a database is {a: 1, b: 2}, and you write a row {a: 2, b: None}. Here is the expected result in the db:
- {a: 2, b: None} when cache_nulls=True (default)
- {a: 2, b: 2} when cache_nulls=False
- {a: 2} when cache_nulls="evict_nulls"
If cache_defaults is set, it will override the value of cache_nulls.
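As a sketch, this behavior is configured per feature via the feature() function, assuming cache_nulls is accepted there alongside max_staleness:
from chalk.features import features, feature

@features
class User:
    id: str
    # A computed null evicts any previously cached value for this feature.
    email: str | None = feature(max_staleness="1d", cache_nulls="evict_nulls")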
When True (default), Chalk will cache all values, including default values.
When False, Chalk will not update the default entry in the cache.
When "evict_defaults", Chalk will evict the entry that would have been a default value from the cache, if it exists.
Concretely, suppose the current state of a database is {a: 1, b: 2}, you write a row {a: 2, b: "default"}, and the default value for feature b is "default". Here is the expected result in the db:
- {a: 2, b: "default"} when cache_defaults=True
- {a: 2, b: 2} when cache_defaults=False
- {a: 2} when cache_defaults="evict_defaults"
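A parallel sketch for default values, again assuming cache_defaults is passed to feature():
from chalk.features import features, feature

@features
class User:
    id: str
    # Computed default values are not written back to the cache.
    plan: str = feature(default="free", max_staleness="1d", cache_defaults=False)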
When True, Chalk copies this feature into the online environment when it is computed in offline resolvers.
Read more at Reverse ETL
If specified, when this feature is computed, Chalk will check that x >= min.
from chalk.features import features, feature
@features
class User:
id: str
fico_score: int = feature(min=300, max=850)
If specified, when this feature is computed, Chalk will check that x <= max.
from chalk.features import features, feature
@features
class User:
id: str
fico_score: int = feature(min=300, max=850)
If specified, when this feature is computed, Chalk will check that len(x) >= min_length.
from chalk.features import features, feature
@features
class User:
id: str
name: str = feature(min_length=1)
If specified, when this feature is computed, Chalk will check that len(x) <= max_length.
from chalk.features import features, feature

@features
class User:
    id: str
    name: str = feature(max_length=1000)
If True and this feature does not meet the validation criteria, Chalk will not persist the feature value and will treat it as failed.
A list of validations to apply to this feature. Generally, max, min, max_length, and min_length are more convenient, but the parameter strict applies to all of those parameters. Use this parameter if you want to mix strict and non-strict validations.
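For example, a feature might combine a strict range with a narrower non-strict warning band (a sketch; the Validation import path is assumed here):
from chalk.features import features, feature, Validation

@features
class User:
    id: str
    fico_score: int = feature(
        validations=[
            # Hard bounds: values outside this range fail validation.
            Validation(min=300, max=850, strict=True),
            # Soft bounds: values outside this range only warn.
            Validation(min=350, max=800, strict=False),
        ]
    )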
The backing pyarrow.DataType
for the feature. This parameter can
be used to control the storage format of data. For example, if you
have a lot of data that could be represented as smaller data types,
you can use this parameter to save space.
import pyarrow as pa
from chalk.features import features, feature

@features
class WatchEvent:
    id: str
    duration_hours: float = feature(dtype=pa.float16())
The default value of the feature if it otherwise can't be computed.
If you don't need to specify other metadata, you can also assign a default in the same way you would assign a default to a dataclass:
from chalk.features import features

@features
class User:
    num_purchases: int = 0
An underscore expression for defining the feature. Typically, this value is assigned directly to the feature without needing to use the feature(...) function. However, if you want to define other properties, like a default or max_staleness, you'll want to use the expression keyword argument.
from chalk.features import features, feature
from chalk import _

@features
class Receipt:
    subtotal: int
    tax: int = 0  # Default value, without other metadata
    total: int = feature(expression=_.subtotal + _.tax, default=0)
See more at Underscore
If True, this feature is considered deprecated, which impacts the dashboard, alerts, and warnings.
from chalk.features import features, feature

@features
class User:
    id: str
    name: str = feature(deprecated=True)
The type of the input feature, given by _TRich
.
from chalk.features import Primary, features, feature
@features
class User:
uid: Primary[int]
# Uses a default value of 0 when one cannot be computed.
num_purchases: int = 0
# Description of the name feature.
# :owner: fraud@company.com
# :tags: fraud, credit
name: str = feature(
max_staleness="10m",
etl_offline_to_online=True
)
score: int = feature(
version=2, default_version=2
)
Specify a feature that represents a one-to-one relationship.
This function allows you to explicitly specify a join condition between two @features classes. When there is only one way to join two classes, we recommend using the foreign-key definition instead of this has_one function. For example, if you have a User class and a Card class, and each user has one card, you can define the Card and User classes as follows:
@features
class User:
    id: str

@features
class Card:
    id: str
    user_id: User.id
    user: User
from chalk.features import features, has_one

@features
class Card:
    id: str
    user_id: str
    balance: float

@features
class User:
    id: str
    card: Card = has_one(
        lambda: User.id == Card.user_id
    )
Specify a feature that represents a one-to-many relationship.
The join condition between @features
classes.
This argument is callable to allow for forward
references to members of this class and the joined
class.
from chalk.features import DataFrame, features, has_many

@features
class Card:
    id: str
    user_id: str
    balance: float

@features
class User:
    id: str
    cards: DataFrame[Card] = has_many(
        lambda: User.id == Card.user_id
    )
The function after can be used with DataFrame to compute windowed features.
after filters a DataFrame relative to the current time in context, such that if the after filter is defined as now - {time_window}, the filter will include all features with timestamps t where now - {time_window} <= t <= now.
This time could be in the past if you're using an offline resolver. Using window functions ensures that you maintain point-in-time correctness.
The parameters to after take many keyword arguments describing the time relative to the present.
The feature to use for the filter. By default, index is the FeatureTime of the referenced feature class.
from chalk.features import DataFrame, features, after

@features
class Card:
    ...

@features
class User:
    cards: DataFrame[Card]

User.cards[after(hours_ago=1, minutes_ago=30)]
The function before can be used with DataFrame to compute windowed features.
before filters a DataFrame relative to the current time in context such that if the before filter is defined as now - {time_window}, the filter will include all features with timestamps t where t <= now - {time_window}.
This time could be in the past if you're using an offline resolver. Using window functions ensures that you maintain point-in-time correctness.
The parameters to before take many keyword arguments describing the time relative to the present.
The feature to use for the filter. By default, index is the FeatureTime of the referenced feature class.
from chalk.features import DataFrame, features, before

@features
class Card:
    ...

@features
class User:
    cards: DataFrame[Card]

User.cards[before(hours_ago=1, minutes_ago=30)]
Declare a windowed feature.
@features
class User:
    failed_logins: Windowed[int] = windowed("10m", "24h")
Create a windowed feature.
See more at Windowed
The size of the buckets for the window function. Buckets are specified as strings in the format "1d", "2h", "1h30m", etc. You may also choose to specify the buckets using the days, hours, and minutes parameters instead. The buckets parameter is helpful if you want to use multiple units to express the bucket size, like "1h30m".
Convenience parameter for specifying the buckets in days. Using this parameter is equivalent to specifying the buckets parameter with a string like "1d".
Convenience parameter for specifying the buckets in hours. Using this parameter is equivalent to specifying the buckets parameter with a string like "1h".
Convenience parameter for specifying the buckets in minutes. Using this parameter is equivalent to specifying the buckets parameter with a string like "1m".
You may also specify which person or group is responsible for a feature. The owner tag will be available in Chalk's web portal. Alerts that do not otherwise have an owner will be assigned to the owner of the monitored feature.
Add metadata to a feature for use in filtering, aggregations, and visualizations. For example, you can use tags to assign features to a team and find all features for a given team.
When a feature is expensive or slow to compute, you may wish to cache its value. Chalk uses the terminology "maximum staleness" to describe how recently a feature value needs to have been computed to be returned without re-running a resolver.
See more at Caching
Sets a maximum age for values eligible to be retrieved from the offline store, defined in relation to the query's current point-in-time.
Feature versions allow you to manage a feature as its definition changes over time.
The version
keyword argument allows you to specify the
maximum number of versions available for this feature.
See more at Versioning
When True, Chalk copies this feature into the online environment when it is computed in offline resolvers.
See more at Reverse ETL
If True and this feature does not meet the validation criteria, Chalk will not persist the feature value and will treat it as failed.
A list of Validations to apply to this feature.
See more at https://docs.chalk.ai/api-docs#Validation
The expression to compute the feature. This is an underscore expression, like _.transactions[_.amount].sum().
Configuration for aggregating data. Pass bucket_duration with a Duration to configure the bucket size for aggregation.
If True, each of the windows will use a bucket duration equal to its window duration.
See more at Materialized window aggregations
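A minimal sketch of a materialized window aggregation combining the expression and materialization parameters described above; the feature classes here are illustrative:
from chalk import windowed, Windowed, _
from chalk.features import features, DataFrame

@features
class Transaction:
    id: int
    user_id: "User.id"
    amount: float

@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    # Sum of transaction amounts per window, pre-aggregated into 1-day buckets.
    total_spend: Windowed[float] = windowed(
        "7d", "30d",
        expression=_.transactions[_.amount].sum(),
        materialization={"bucket_duration": "1d"},
    )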
Metadata for the windowed feature, parameterized by TPrim (the primitive type of the feature) and TRich (the decoded type of the feature, if decoder is provided).
from chalk import windowed, Windowed
from chalk.features import features

@features
class User:
    id: int
    email_count: Windowed[int] = windowed(days=range(1, 30))
    logins: Windowed[int] = windowed("10m", "1d", "30d")

User.email_count["7d"]
Create a windowed feature with grouping.
See more at Grouped materialized window aggregations
The size of the buckets for the window function. Buckets are specified as strings in the format "1d", "2h", "1h30m", etc. You may also choose to specify the buckets using the days, hours, and minutes parameters instead. The buckets parameter is helpful if you want to use multiple units to express the bucket size, like "1h30m".
The expression to compute the feature. This is an underscore expression, like _.transactions[_.amount].sum().
Configuration for aggregating data. Pass bucket_duration with a Duration to configure the bucket size for aggregation.
See more at Materialized window aggregations
Convenience parameter for specifying the buckets in days. Using this parameter is equivalent to specifying the buckets parameter with a string like "1d".
Convenience parameter for specifying the buckets in hours. Using this parameter is equivalent to specifying the buckets parameter with a string like "1h".
Convenience parameter for specifying the buckets in minutes. Using this parameter is equivalent to specifying the buckets parameter with a string like "1m".
You may also specify which person or group is responsible for a feature. The owner tag will be available in Chalk's web portal. Alerts that do not otherwise have an owner will be assigned to the owner of the monitored feature.
Add metadata to a feature for use in filtering, aggregations, and visualizations. For example, you can use tags to assign features to a team and find all features for a given team.
If True and this feature does not meet the validation criteria, Chalk will not persist the feature value and will treat it as failed.
The backing pyarrow.DataType
for the feature. This parameter can
be used to control the storage format of data. For example, if you
have a lot of data that could be represented as smaller data types,
you can use this parameter to save space.
import pyarrow as pa
from chalk import windowed, Windowed
from chalk.features import features

@features
class User:
    id: str
    email_count: Windowed[int] = windowed(
        "10m", "30m",
        dtype=pa.int16(),
    )
from chalk import group_by_windowed, DataFrame, _
from chalk.features import features

@features
class Email:
    id: int
    user_id: "User.id"
    category: str  # grouping key used below

@features
class User:
    id: int
    emails: DataFrame[Email]
    emails_by_category: DataFrame = group_by_windowed(
        "10m", "30m",
        expression=_.emails.group_by(_.category).count(),
    )
Marks a feature as the primary feature for a feature class.
Features named id on feature classes without an explicit primary feature are declared primary keys by default, and don't need to be marked with Primary.
If you have a primary key feature with a name other than id, you can use this marker to indicate the primary key.
from chalk.features import features
from chalk import Primary
@features
class User:
username: Primary[str]
Specify explicit data validation for a feature.
The feature()
function can also specify these validations,
but this class allows you to specify both strict and non-strict
validations at the same time.
from chalk.features import features, feature, Validation

@features
class User:
    fico_score: int = feature(
        validations=[
            Validation(min=300, max=850, strict=True),
            Validation(min=300, max=320, strict=False),
            Validation(min=840, max=850, strict=False),
        ]
    )

    # If only one set of validations were needed,
    # you can use the feature function instead:
    first_name: str = feature(
        min_length=2, max_length=64, strict=True
    )
The Vector class can be used as a type annotation to denote a Vector feature.
Instances of this class will be provided when working with raw vectors inside of resolvers. Generally, you do not need to construct instances of this class directly, as Chalk will automatically convert list-like features into Vector instances when working with a Vector annotation.
data: numpy.ndarray | list[float] | pyarrow.FixedSizeListScalar - the vector values.
from chalk.features import Vector, features
@features
class Document:
embedding: Vector[1536]
Define a nearest neighbor relationship for performing Vector similarity search.
metric: "l2" | "ip" | "cos" - the metric to use to compute distance between two vectors. L2 Norm ("l2"), Inner Product ("ip"), and Cosine ("cos") are supported. Defaults to "l2".
A nearest neighbor relationship filter.
Decorator to create an online resolver.
Environments are used to trigger behavior in different deployments such as staging, production, and local development. For example, you may wish to interact with a vendor via an API call in the production environment, and opt to return a constant value in a staging environment.
Environment can take one of three types:
- None (default) - candidate to run in every environment
- str - run only in this environment
- list[str] - run in any of the specified environments and no others
Read more at Environments
Allow you to scope requests within an environment. Both tags and environment need to match for a resolver to be a candidate to execute.
You might consider using tags, for example, to change out whether you want to use a sandbox environment for a vendor, or to bypass the vendor and return constant values in a staging environment.
Read more at Tags
You can schedule resolvers to run on a pre-determined schedule via the cron argument to resolver decorators.
Cron can sample all examples, a subset of all examples, or a custom provided set of examples.
Read more at Scheduling
Individual or team responsible for this resolver. The Chalk Dashboard will display this field, and alerts can be routed to owners.
Whether this resolver is bound by CPU or I/O. Chalk uses the resource hint to optimize resolver execution.
Whether this resolver should be invoked once during planning time to build a static computation graph. If True, all inputs will either be StaticOperators (for has-many and DataFrame relationships) or StaticExpressions (for individual features). The resolver must return a StaticOperator as output.
A ResolverProtocol
which can be called as a normal function! You can unit-test
resolvers as you would unit-test any other code.
Read more at Unit Tests
@online
def name_match(
name: User.full_name,
account_name: User.bank_account.title
) -> User.account_name_match_score:
if name.lower() == account_name.lower():
return 1.
return 0.
Decorator to create an offline resolver.
Environments are used to trigger behavior in different deployments such as staging, production, and local development. For example, you may wish to interact with a vendor via an API call in the production environment, and opt to return a constant value in a staging environment.
Environment can take one of three types:
- None (default) - candidate to run in every environment
- str - run only in this environment
- list[str] - run in any of the specified environments and no others
Read more at Environments
Allow you to scope requests within an environment. Both tags and environment need to match for a resolver to be a candidate to execute.
You might consider using tags, for example, to change out whether you want to use a sandbox environment for a vendor, or to bypass the vendor and return constant values in a staging environment.
Read more at Tags
You can schedule resolvers to run on a pre-determined schedule via the cron argument to resolver decorators.
Cron can sample all examples, a subset of all examples, or a custom provided set of examples.
Read more at Scheduling
Allows you to specify an individual or team who is responsible for this resolver. The Chalk Dashboard will display this field, and alerts can be routed to owners.
Whether this resolver is bound by CPU or I/O. Chalk uses the resource hint to optimize resolver execution.
Whether this resolver should be invoked once during planning time to build a static computation graph. If True, all inputs will either be StaticOperators (for has-many and DataFrame relationships) or StaticExpressions (for individual features). The resolver must return a StaticOperator as output.
A ResolverProtocol
which can be called as a normal function! You can unit-test
resolvers as you would unit-test any other code.
Read more at Unit Tests
@offline(cron="1h")
def get_fraud_score(
email: User.email,
name: User.name,
) -> User.fraud_score:
return socure.get_sigma_score(email, name)
Run an online or offline resolver on a schedule.
This class lets you add a filter or sample function to your cron schedule for a resolver. See the overloaded signatures for more information.
The period of the cron job. Can be either a crontab ("0 * * * *") or a Duration ("2h").
Optionally, a function to filter down the arguments to consider.
See Filtering examples for more information.
Explicitly provide the sample function for the cron job.
See Custom examples for more information.
Using a filter
def only_active_filter(v: User.active):
    return v

@online(cron=Cron(schedule="1d", filter=only_active_filter))
def score_user(d: User.signup_date) -> User.score:
    return ...

Using a sample function
def s() -> DataFrame[User.id]:
    return DataFrame.read_csv(...)

@offline(cron=Cron(schedule="1d", sample=s))
def fn(balance: User.account.balance) -> ...:
    ...
Individual or team responsible for this resolver. The Chalk Dashboard will display this field, and alerts can be routed to owners.
Environments are used to trigger behavior in different deployments such as staging, production, and local development. For example, you may wish to interact with a vendor via an API call in the production environment, and opt to return a constant value in a staging environment.
Environment can take one of three types:
- [`None`](https://docs.python.org/3/library/constants.html#None) (default) - candidate to run in every environment
- [`str`](https://docs.python.org/3/library/stdtypes.html#str) - run only in this environment
- `list[str]` - run in any of the specified environment and no others
Read more at Environments
Allow you to scope requests within an environment. Both tags and environment need to match for a resolver to be a candidate to execute.
You might consider using tags, for example, to change out whether you want to use a sandbox environment for a vendor, or to bypass the vendor and return constant values in a staging environment.
Read more at Tags
The arguments to pass to the decorated function. If one of the arguments is a DataFrame with a filter or projection applied, the resolver will only be called with the filtered or projected data. Read more at https://docs.chalk.ai/docs/unit-tests#data-frame-inputs
The result of calling the decorated function with args. Useful for unit-testing. Read more at Unit Tests
@online
def get_num_bedrooms(
rooms: Home.rooms[Room.name == 'bedroom']
) -> Home.num_bedrooms:
return len(rooms)
rooms = [
Room(id=1, name="bedroom"),
Room(id=2, name="kitchen"),
Room(id=3, name="bedroom"),
]
assert get_num_bedrooms(rooms) == 2
The type of machine to use.
You can optionally specify that resolvers need to run on a machine other than the default. Must be configured in your deployment.
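As a sketch, the hint might be passed to the resolver decorator; both the parameter name machine_type and the pool name "large" are assumptions and must match what is configured in your deployment:
from chalk import online
from chalk.features import features

@features
class Document:
    id: str
    text: str
    word_count: int

@online(machine_type="large")  # "large" is a hypothetical machine pool name
def count_words(text: Document.text) -> Document.word_count:
    return len(text.split())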
Chalk includes a DataFrame
class that models tabular data in much the same
way that pandas
does. However, there are some key differences that
allow the Chalk DataFrame
to increase type safety and performance.
Like pandas, Chalk's DataFrame is a two-dimensional data structure with rows and columns. You can perform operations like filtering, grouping, and aggregating on a DataFrame. However, there are two main differences:
- DataFrame is lazy and can be backed by multiple data sources, where a pandas.DataFrame executes eagerly in memory.
- DataFrame[...] can be used to represent a type of data with pre-defined filters.
Unlike pandas, the implementation of a Chalk DataFrame is lazy, and can be executed against many different backend sources of data. For example, in unit tests, a DataFrame uses an implementation backed by polars. But if your DataFrame was returned from a SQL source, filters and aggregations may be pushed down to the database for efficient execution.
Each column of a Chalk DataFrame is typed by a Feature type. For example, you might have a resolver returning a DataFrame containing user ids and names:
@features
class User:
id: int
name: str
email: str
@online
def get_users() -> DataFrame[User.id, User.name]:
return DataFrame([
User(id=1, name="Alice"),
User(id=2, name="Bob")
])
Construct a Chalk DataFrame.
The data. Can be an existing pandas.DataFrame, polars.DataFrame or polars.LazyFrame, a sequence of feature instances, or a dict mapping a feature to a sequence of values.
The strategy to use to handle missing values. A feature value is "missing" if it is an ellipsis (...), or it is None and the feature is not annotated as Optional[...]. The available strategies are:
- 'error': Raise a TypeError if any missing values are found. Do not attempt to replace missing values with the default value for the feature.
- 'default_or_error': If the feature has a default value, then replace missing values with the default value for the feature. Otherwise, raise a TypeError.
- 'default_or_allow': If the feature has a default value, then replace missing values with the default value for the feature. Otherwise, leave it as None. This is the default strategy.
- 'allow': Allow missing values to be stored in the DataFrame. This option may result in non-nullable features being assigned None values.
Row-wise construction
df = DataFrame([
User(id=1, first="Sam", last="Wu"),
User(id=2, first="Iris", last="Xi")
])
Column-wise construction
df = DataFrame({
User.id: [1, 2],
User.first: ["Sam", "Iris"],
User.last: ["Wu", "Xi"]
})
Construction from polars.DataFrame
import polars
df = DataFrame(polars.DataFrame({
"user.id": [1, 2],
"user.first": ["Sam", "Iris"],
"user.last": ["Wu", "Xi"]
}))
Aggregate the DataFrame
by the specified columns.
from chalk.features import DataFrame
from chalk import op

df = DataFrame(
    {
        User.id: [1, 1, 3],
        User.val: [1, 5, 10],
    }
).group_by(
    group={User.id: User.id},
    agg={User.val: op.median(User.val)}
)
╭─────────┬──────────╮
│ User.id │ User.val │
╞═════════╪══════════╡
│ 1 │ 3 │
├─────────┼──────────┤
│ 3 │ 10 │
╰─────────┴──────────╯
Compute a histogram with fixed width bins.
The column to compute the histogram on. If not supplied, the DataFrame
is assumed to contain a single column.
DataFrame({
Taco.price: list(range(100, 200)),
}).histogram_list(nbins=4, base=100)
[25, 25, 25, 25]
Group based on a time value (date or datetime).
The groups are defined by a time-based window, and optionally,
other columns in the DataFrame
. The "width" of the window is
defined by the period
parameter, and the spacing between the
windows is defined by the every
parameter. Note that if the
every
parameter is smaller than the period
parameter, then
the windows will overlap, and a single row may be assigned to
multiple groups.
As an example, consider the following DataFrame:
val: a b c d e f g h
─────────●─────────●─────────●─────────●───────▶
time: A B C D
┌─────────┐
1 │ a b │ 1: [a, b]
└────┬────┴────┐
2 ◀───▶│ b c │ 2: [b, c]
every└────┬────┴────┐
3 ◀────────▶│ c d e │ 3: [c, d, e]
period └────┬────┴────┐
4 │d e f│ 4: [d, e, f]
└────┬────┴────┐
5 │ f │ 5: [f]
└────┬────┴────┐
6 │ │
└────┬────┴────┐
7 │ g h │ 7: [g, h]
└────┬────┴────┐
8 │g h │ 8: [g, h]
└─────────┘
In the above example, the sixth time bucket is empty, and will not be included in the resulting DataFrame.
from chalk import DataFrame, op
from datetime import datetime

df = DataFrame(
    {
        User.id: [1, 1, 3],
        User.val: [1, 5, 10],
        User.ts: [datetime(2020, 1, 1), datetime(2020, 1, 1), datetime(2020, 1, 3)],
    },
).group_by_hopping(
    index=User.ts,
    group={User.id: User.id},
    agg={User.val: op.median(User.val)},
    period="1d",
)
╭─────────┬──────────┬──────────╮
│ User.id │ User.ts │ User.val │
╞═════════╪══════════╪══════════╡
│ 1 │ 2020-1-1 │ 3 │
├─────────┼──────────┼──────────┤
│ 3 │ 2020-1-3 │ 10 │
╰─────────┴──────────┴──────────╯
df = DataFrame([
User(id=1, first="Sam", last="Wu"),
User(id=2, first="Iris", last="Xi")
])
# Set the fraud score to 0 for all users
df.with_column(User.fraud_score, 0)
# Concatenation of first & last as full_name
df.with_column(
User.full_name, op.concat(User.first, User.last)
)
# Alias a column name
df.with_column(
User.first_name, User.first
)
df = DataFrame([
User(id=1, first="Sam", last="Wu"),
User(id=2, first="Iris", last="Xi")
])
# Set the fraud score to 0 for all users
df.with_columns({User.fraud_score: 0})
# Concatenation of first & last as full_name
df.with_columns({
User.full_name: op.concat(User.first, User.last)
})
# Alias a column name
df.with_columns({
User.first_name: User.first
})
Read a .csv file as a DataFrame
.
values = DataFrame.read_csv(
"s3://...",
columns={0: MyFeatures.id, 1: MyFeatures.name},
has_header=False,
)
Sort the DataFrame
by the given columns.
df = DataFrame({
User.a: [1, 2, 3],
User.b: [3, 2, 1],
})
df.sort(User.a)
a b
-----------
0 1 3
1 2 2
2 3 1
Get the underlying DataFrame as a polars.LazyFrame.
The underlying polars.LazyFrame.
Get the underlying DataFrame as a pyarrow.Table.
The underlying pyarrow.Table. This format is the canonical representation of the data in Chalk.
Get the underlying DataFrame as a pandas.DataFrame.
prefixed: whether to prefix the column names with the feature namespace (i.e. user.name if prefixed=True, name if prefixed=False).
The data formatted as a pandas.DataFrame.
Operations for aggregations in DataFrame.
The class methods on this class are used to create aggregations for use in DataFrame.group_by.
from chalk.features import DataFrame
df = DataFrame(
{
User.id: [1, 1, 3],
User.val: [0.5, 4, 10],
}
).group_by(
group={User.id: User.id},
agg={User.val: op.sum(User.val)}
)
╭─────────┬──────────╮
│ User.id │ User.val │
╞═════════╪══════════╡
│ 1 │ 4.5 │
├─────────┼──────────┤
│ 3 │ 10 │
╰─────────┴──────────╯
from chalk.features import DataFrame
df = DataFrame(
{
User.id: [1, 1, 3],
User.val: [0.5, 4, 10],
User.active: [True, True, False],
}
).group_by(
group={User.id: User.id},
agg={
User.val: op.product(User.val),
User.active: op.product(User.active),
}
)
╭─────────┬──────────┬─────────────╮
│ User.id │ User.val │ User.active │
╞═════════╪══════════╪═════════════╡
│ 1 │ 2 │ 1 │
├─────────┼──────────┼─────────────┤
│ 3 │ 10 │ 0 │
╰─────────┴──────────┴─────────────╯
from chalk.features import DataFrame
df = DataFrame(
{
User.id: [1, 1, 3],
User.val: [0.5, 4, 10],
}
).group_by(
group={User.id: User.id},
agg={User.val: op.max(User.val)}
)
╭─────────┬──────────╮
│ User.id │ User.val │
╞═════════╪══════════╡
│ 1 │ 4 │
├─────────┼──────────┤
│ 3 │ 10 │
╰─────────┴──────────╯
from chalk.features import DataFrame
df = DataFrame(
{
User.id: [1, 1, 3],
User.val: [0.5, 4, 10],
}
).group_by(
group={User.id: User.id},
agg={User.val: op.min(User.val)}
)
╭─────────┬──────────╮
│ User.id │ User.val │
╞═════════╪══════════╡
│ 1 │ 0.5 │
├─────────┼──────────┤
│ 3 │ 10 │
╰─────────┴──────────╯
from chalk.features import DataFrame
df = DataFrame(
{
User.id: [1, 1, 3],
User.val: [1, 5, 10],
}
).group_by(
group={User.id: User.id},
agg={User.val: op.median(User.val)}
)
╭─────────┬──────────╮
│ User.id │ User.val │
╞═════════╪══════════╡
│ 1 │ 3 │
├─────────┼──────────┤
│ 3 │ 10 │
╰─────────┴──────────╯
from chalk.features import DataFrame
df = DataFrame(
{
User.id: [1, 1, 3],
User.val: [1, 5, 10],
}
).group_by(
group={User.id: User.id},
agg={User.val: op.mean(User.val)}
)
╭─────────┬──────────╮
│ User.id │ User.val │
╞═════════╪══════════╡
│ 1 │ 3 │
├─────────┼──────────┤
│ 3 │ 6.5 │
╰─────────┴──────────╯
from chalk.features import DataFrame
df = DataFrame(
{
User.id: [1, 1, 3],
User.val: [1, 5, 10],
}
).group_by(
group={User.id: User.id},
agg={User.val: op.count(User.val)}
)
╭─────────┬──────────╮
│ User.id │ User.val │
╞═════════╪══════════╡
│ 1 │ 2 │
├─────────┼──────────┤
│ 3 │ 1 │
╰─────────┴──────────╯
Concatenate the string values of col and col2 in a DataFrame.
from chalk.features import DataFrame
DataFrame(
[
User(id=1, val='a'),
User(id=1, val='b'),
User(id=3, val='c'),
User(id=3, val='d'),
]
).group_by(
group={User.id: User.id},
agg={User.val: op.concat(User.val)},
)
╭─────────┬──────────╮
│ User.id │ User.val │
╞═════════╪══════════╡
│ 1 │ "ab" │
├─────────┼──────────┤
│ 3 │ "cd" │
╰─────────┴──────────╯
from chalk.features import DataFrame
DataFrame(
[
User(id=1, val=1),
User(id=1, val=3),
User(id=3, val=7),
User(id=3, val=5),
]
).sort(
User.val, descending=True,
).group_by(
group={User.id: User.id},
agg={User.val: op.last(User.val)},
)
╭─────────┬──────────╮
│ User.id │ User.val │
╞═════════╪══════════╡
│ 1 │ 1 │
├─────────┼──────────┤
│ 3 │ 5 │
╰─────────┴──────────╯
from chalk.features import DataFrame
DataFrame(
[
User(id=1, val=1),
User(id=1, val=3),
User(id=3, val=7),
User(id=3, val=5),
]
).sort(
User.val, descending=False
).group_by(
group={User.id: User.id},
agg={User.val: op.first(User.val)},
)
╭─────────┬──────────╮
│ User.id │ User.val │
╞═════════╪══════════╡
│ 1 │ 1 │
├─────────┼──────────┤
│ 3 │ 5 │
╰─────────┴──────────╯
Filter the aggregation to apply to only rows where all the filters in f are true. If no rows match the filter, the aggregation for the column will be null, and the resulting feature type must be a nullable type.
The aggregation, allowing you to continue to chain methods.
from chalk.features import DataFrame
df = DataFrame(
{
User.id: [1, 1, 3],
User.val: [0.5, 4, 10],
}
).group_by(
group={User.id: User.id},
agg={User.val: op.sum(User.val).where(User.val > 5)}
)
╭─────────┬──────────╮
│ User.id │ User.val │
╞═════════╪══════════╡
│ 1 │ null │
├─────────┼──────────┤
│ 3 │ 10 │
╰─────────┴──────────╯
Create a Snowflake data source. SQL-based data sources
created without arguments assume a configuration in your
Chalk Dashboard. Those created with the name=
keyword
argument will use the configuration for the integration
with the given name. And finally, those created with
explicit arguments will use those arguments to configure
the data source. See the overloaded signatures for more
details.
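A sketch of the three configuration styles described above:
from chalk.sql import SnowflakeSource

# 1. No arguments: use the Snowflake integration configured in the Chalk Dashboard.
snowflake = SnowflakeSource()

# 2. Named integration: use the configuration stored under this name.
risk_snowflake = SnowflakeSource(name="RISK")

# 3. Explicit arguments: configure the source directly. The exact keyword
#    arguments are listed in the overloaded signatures and are not shown here.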
Create a PostgreSQL data source. SQL-based data sources
created without arguments assume a configuration in your
Chalk Dashboard. Those created with the name=
keyword
argument will use the configuration for the integration
with the given name. And finally, those created with
explicit arguments will use those arguments to configure
the data source. See the overloaded signatures for more
details.
Create a MySQL data source. SQL-based data sources
created without arguments assume a configuration in your
Chalk Dashboard. Those created with the name=
keyword
argument will use the configuration for the integration
with the given name. And finally, those created with
explicit arguments will use those arguments to configure
the data source. See the overloaded signatures for more
details.
Create a DynamoDB data source. SQL-based data sources
created without arguments assume a configuration in your
Chalk Dashboard. Those created with the name=
keyword
argument will use the configuration for the integration
with the given name. And finally, those created with
explicit arguments will use those arguments to configure
the data source. See the overloaded signatures for more
details. DynamoDBSources can be queried via PartiQL SQL
resolvers.
You may override the ambient AWS credentials by providing either a client ID and secret, or a role ARN.
Create a BigQuery data source. SQL-based data sources
created without arguments assume a configuration in your
Chalk Dashboard. Those created with the name=
keyword
argument will use the configuration for the integration
with the given name. And finally, those created with
explicit arguments will use those arguments to configure
the data source. See the overloaded signatures for more
details.
Create a Redshift data source. SQL-based data sources
created without arguments assume a configuration in your
Chalk Dashboard. Those created with the name=
keyword
argument will use the configuration for the integration
with the given name. And finally, those created with
explicit arguments will use those arguments to configure
the data source. See the overloaded signatures for more
details.
Create a CloudSQL data source. SQL-based data sources
created without arguments assume a configuration in your
Chalk Dashboard. Those created with the name=
keyword
argument will use the configuration for the integration
with the given name. And finally, those created with
explicit arguments will use those arguments to configure
the data source. See the overloaded signatures for more
details.
Testing SQL source.
If you have only one SQLiteInMemorySource integration, there's no need to provide a distinguishing name.
The SQL source for use in Chalk resolvers.
source = SQLiteInMemorySource(name="RISK")
Create a Databricks data source. SQL-based data sources
created without arguments assume a configuration in your
Chalk Dashboard. Those created with the name=
keyword
argument will use the configuration for the integration
with the given name. And finally, those created with
explicit arguments will use those arguments to configure
the data source. See the overloaded signatures for more
details.
Create a Spanner data source. SQL-based data sources
created without arguments assume a configuration in your
Chalk Dashboard. Those created with the name=
keyword
argument will use the configuration for the integration
with the given name. And finally, those created with
explicit arguments will use those arguments to configure
the data source. See the overloaded signatures for more
details.
Incremental settings for Chalk SQL queries.
In "row"
mode:
incremental_column
MUST be set.
Returns the results represented by this query as a list (like .all()
), but modifies the query to
only return "new" results, by adding a clause that looks like:
"WHERE <incremental_column> >= <previous_latest_row_timestamp> - <lookback_period>"
In "group"
mode:
incremental_column
MUST be set.
Returns the results represented by this query as a list (like .all()
), but modifies the query to
only results from "groups" which have changed since the last run of the query.
This works by (1) parsing your query, (2) finding the "group keys", (3) selecting only changed groups. Concretely:
SELECT user_id, sum(amount) as sum_amount
FROM payments
GROUP BY user_id
would be rewritten like this:
SELECT user_id, sum(amount) as sum_amount
FROM payments
WHERE user_id in (
SELECT DISTINCT(user_id)
FROM payments WHERE created_at >= <previous_latest_row_timestamp> - <lookback_period>
)
GROUP BY user_id
In "parameter" mode:
incremental_column
WILL BE IGNORED.
This mode is for cases where you want full control of incrementalization. Chalk will not manipulate your query.
Chalk will include a query parameter named "chalk_incremental_timestamp"
. Depending on your SQL
dialect, you can use this value to incrementalize your query with :chalk_incremental_timestamp
or
%(chalk_incremental_timestamp)s
. This will incrementalize your query using the timestamp
of the latest row that has been ingested.
Chalk will also include another query parameter named "chalk_last_execution_timestamp"
that can be used instead.
This will incrementalize your query using the last time the query was executed.
incremental_timestamp:
If incremental_timestamp is "feature_time", we will incrementalize your query using the timestamp of the latest row that has been ingested. This is the default.
If incremental_timestamp is "resolver_execution_time", we will incrementalize your query using the last time the query was executed instead.
The timestamp to set as the lower bound
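As a sketch, an offline resolver might apply these settings through .incremental() on a query; the Payment feature class and the specific column and argument values are illustrative:
from chalk import offline
from chalk.features import features, DataFrame
from chalk.sql import PostgreSQLSource

source = PostgreSQLSource()

@features
class Payment:
    id: int
    user_id: int
    amount: float

@offline
def get_payments() -> DataFrame[Payment.id, Payment.user_id, Payment.amount]:
    return source.query_string(
        "SELECT id, user_id, amount FROM payments",
        fields={"id": Payment.id, "user_id": Payment.user_id, "amount": Payment.amount},
    ).incremental(
        incremental_column="updated_at",
        mode="row",
        lookback_period="1h",
    )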
Run a query from a SQL file.
This method allows you to query the SQL file within a Python resolver. However, Chalk can also infer resolvers from SQL files. See SQL file resolvers for more information.
The path to the SQL file, relative to the caller's file, or to the directory that your chalk.yaml file lives in.
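For instance, a resolver might run a SQL file that lives next to it (a sketch, assuming the argument is named path):
from chalk import offline
from chalk.features import DataFrame
from chalk.sql import PostgreSQLSource

source = PostgreSQLSource()

@offline
def get_users() -> DataFrame[User.id, User.name]:
    # users.sql sits in the same directory as this resolver.
    return source.query_sql_file(path="users.sql").all()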
Automatically ingest a table.
from chalk.sql import PostgreSQLSource
from chalk.features import features
PostgreSQLSource().with_table(
name="users",
features=User,
).with_table(
name="accounts",
features=Account,
# Override one of the column mappings.
column_to_feature={
"acct_id": Account.id,
},
)
Return at most one result or raise an exception.
Returns None
if the query selects no rows. Raises if
multiple object identities are returned, or if multiple
rows are returned for a query that returns only scalar
values as opposed to full identity-mapped entities.
A query that can be returned from a resolver.
Return the results represented by this Query as a DataFrame.
A query that can be returned from a resolver.
Operates like .all(), but tracks previous_latest_row_timestamp between query executions in order to limit the amount of data returned.
previous_latest_row_timestamp will be set to the start of the query execution, or, if you return a FeatureTime-mapped column, Chalk will update previous_latest_row_timestamp to the maximum observed FeatureTime value.
In "row"
mode:
incremental_column
MUST be set.
Returns the results represented by this query as a list (like .all()
), but modifies the query to
only return "new" results, by adding a clause that looks like:
WHERE <incremental_column> >= <previous_latest_row_timestamp> - <lookback_period>
In "group"
mode:
incremental_column
MUST be set.
Returns the results represented by this query as a list (like .all()
), but modifies the query to
only results from "groups" which have changed since the last run of the query.
This works by (1) parsing your query, (2) finding the "group keys", (3) selecting only changed groups. Concretely:
SELECT user_id, sum(amount) as sum_amount
FROM payments
GROUP BY user_id
would be rewritten like this:
SELECT user_id, sum(amount) as sum_amount
FROM payments
WHERE user_id in (
SELECT DISTINCT(user_id)
FROM payments WHERE created_at >= <previous_latest_row_timestamp> - <lookback_period>
)
GROUP BY user_id
In "parameter"
mode:
incremental_column
WILL BE IGNORED.
This mode is for cases where you want full control of incrementalization. Chalk will not manipulate your query.
Chalk will include a query parameter named "chalk_incremental_timestamp"
. Depending on your SQL
dialect, you can use this value to incrementalize your query with :chalk_incremental_timestamp
or
%(chalk_incremental_timestamp)s
.
Defaults to "row"
, which indicates that only rows newer than the last observed row should be
considered. When set to "group"
, Chalk will only ingest features from groups which are newer
than the last observation time. This requires that the query is grouped by a primary key.
This should reference a timestamp column in your underlying table, typically something
like "updated_at"
, "created_at"
, "event_time"
, etc.
A query that can be returned from a resolver.
Materialize the query.
Chalk queries are lazy, which allows Chalk to perform performance optimizations like push-down filters. Instead of calling execute, consider returning this query from a resolver as an intermediate feature, and processing that intermediate feature in a different resolver.
Note: this requires the usage of the fields={...} argument when used in conjunction with query_string or query_sql_file.
Return at most one result or raise an exception.
Returns None
if the query selects no rows. Raises if
multiple object identities are returned, or if multiple
rows are returned for a query that returns only scalar
values as opposed to full identity-mapped entities.
A query that can be returned from a resolver.
Operates like .all(), but tracks previous_latest_row_timestamp between query executions in order to limit the amount of data returned.
previous_latest_row_timestamp will be set to the start of the query execution, or, if you return a FeatureTime-mapped column, Chalk will update previous_latest_row_timestamp to the maximum observed FeatureTime value.
In "row"
mode:
incremental_column
MUST be set.
Returns the results represented by this query as a list (like .all()
), but modifies the query to
only return "new" results, by adding a clause that looks like:
WHERE <incremental_column> >= <previous_latest_row_timestamp> - <lookback_period>
In "group"
mode:
incremental_column
MUST be set.
Returns the results represented by this query as a list (like .all()
), but modifies the query to
only results from "groups" which have changed since the last run of the query.
This works by (1) parsing your query, (2) finding the "group keys", (3) selecting only changed groups. Concretely:
SELECT user_id, sum(amount) as sum_amount
FROM payments
GROUP BY user_id
would be rewritten like this:
SELECT user_id, sum(amount) as sum_amount
FROM payments
WHERE user_id in (
SELECT DISTINCT(user_id)
FROM payments WHERE created_at >= <previous_latest_row_timestamp> - <lookback_period>
)
GROUP BY user_id
In "parameter"
mode:
incremental_column
WILL BE IGNORED.
This mode is for cases where you want full control of incrementalization. Chalk will not manipulate your query.
Chalk will include a query parameter named "chalk_incremental_timestamp"
. Depending on your SQL
dialect, you can use this value to incrementalize your query with :chalk_incremental_timestamp
or
%(chalk_incremental_timestamp)s
.
This should reference a timestamp column in your underlying table, typically something like "updated_at", "created_at", "event_time", etc.
Defaults to "row", which indicates that only rows newer than the last observed row should be considered. When set to "group", Chalk will only ingest features from groups which are newer than the last observation time. This requires that the query is grouped by a primary key.
A query that can be returned from a resolver.
Generate a Chalk SQL file resolver from a filepath and a sql string.
This will generate a resolver in your web dashboard that can be queried,
but will not output a .chalk.sql
file.
The optional parameters are overrides for the comment key-value pairs at the top of the sql file resolver. Comment key-value pairs specify important resolver information such as the source, feature namespace to resolve, and other details. Note that these will override any values specified in the sql string. See Configuration for more information.
See SQL file resolvers for more information on SQL file resolvers.
Can either be a BaseSQLSource or a string. If a string is provided, it will be used to infer the source by first scanning for a source with the same name, then inferring the source if it is a type, e.g. snowflake if there is only one database of that type. Optional if source is specified in sql.
from chalk import make_sql_file_resolver
from chalk.features import features
@features
class User:
id: int
name: str
make_sql_file_resolver(
name="my_resolver",
sql="SELECT user_id as id, name FROM users",
source="snowflake",
resolves=User,
kind="offline",
)
Decorator to create a stream resolver.
This parameter is defined when the streaming resolver returns a windowed feature.
Tumbling windows are fixed-size, contiguous and non-overlapping time intervals. You can think of
tumbling windows as adjacently arranged bins of equal width.
Tumbling windows are most often used alongside max_staleness
to allow the features
to be sent to the online store and offline store after each window period.
Continuous windows, unlike tumbling window, are overlapping and exact. When you request the value of a continuous window feature, Chalk looks at all the messages received in the window and computes the value on-demand.
See more at Window modes
A callable that will interpret an input prior to the invocation of the resolver. Parse functions can serve many functions, including pre-parsing bytes, skipping unrelated messages, or supporting rekeying.
See more at Parsing
A mapping from input BaseModel
attribute to Chalk feature attribute to support continuous streaming re-keying.
This parameter is required for continuous resolvers.
Features that are included here do not have to be explicitly returned in the stream resolver:
the feature will automatically be set to the key value used for aggregation.
See more at Keys
An optional string specifying an input attribute as the timestamp used for windowed aggregations. See more at Custom event timestamping
A callable function! You can unit-test stream resolvers as you would unit-test any other code.
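A minimal stream resolver sketch, assuming a Kafka integration and a Pydantic model for the message body:
from pydantic import BaseModel
from chalk.features import features, Features
from chalk.streams import stream, KafkaSource

source = KafkaSource(name="user_events")  # assumes an integration named "user_events"

class UserUpdate(BaseModel):
    id: str
    email: str

@features
class User:
    id: str
    email: str

@stream(source=source)
def process_user_update(message: UserUpdate) -> Features[User.id, User.email]:
    return User(id=message.id, email=message.email)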
Decorator to create a sink. Read more at Sinks
Environments are used to trigger behavior in different deployments such as staging, production, and local development. For example, you may wish to interact with a vendor via an API call in the production environment, and opt to return a constant value in a staging environment.
Environment can take one of three types:
- None (default) - candidate to run in every environment
- str - run only in this environment
- list[str] - run in any of the specified environments and no others
Read more at Environments
Allow you to scope requests within an environment. Both tags and environment need to match for a resolver to be a candidate to execute.
You might consider using tags, for example, to change out whether you want to use a sandbox environment for a vendor, or to bypass the vendor and return constant values in a staging environment.
Read more at Tags
The individual or team responsible for this resolver. The Chalk Dashboard will display this field, and alerts can be routed to owners.
A callable function! You can unit-test sinks as you would unit test any other code. Read more at Unit Tests
@sink
def process_updates(
uid: User.id,
email: User.email,
phone: User.phone,
):
user_service.update(
uid=uid,
email=email,
phone=phone
)
process_updates(123, "sam@chalk.ai", "555-555-5555")
An S3 or GCS URI that points to the keystore file that should be used for brokers. You must configure the appropriate AWS or GCP integration in order for Chalk to be able to access these files.
An S3 or GCS URI that points to the certificate authority file that should be used to verify broker certificates. You must configure the appropriate AWS or GCP integration in order for Chalk to be able to access these files.
Protocol used to communicate with brokers. Valid values are "PLAINTEXT", "SSL", "SASL_PLAINTEXT", and "SASL_SSL". Defaults to "PLAINTEXT".
Authentication mechanism when security_protocol is configured for SASL_PLAINTEXT or SASL_SSL. Valid values are "PLAIN", "GSSAPI", "SCRAM-SHA-256", "SCRAM-SHA-512", "OAUTHBEARER". Defaults to "PLAIN".
The subscription id of your PubSub topic from which you want to consume messages. To enable permission for consuming this stream, ensure that the service account has the permissions 'pubsub.subscriptions.consume' and 'pubsub.subscriptions.get'.
Base class for all stream sources generated from @stream.
Identifier for the dead-letter queue (DLQ) for the stream. If not specified, failed messages will be dropped. Stream name for Kinesis, topic name for Kafka, subscription id for PubSub.
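A configuration sketch using the parameters above; the broker address, topic, and keyword names other than security_protocol and sasl_mechanism are assumptions:
from chalk.streams import KafkaSource

source = KafkaSource(
    bootstrap_server="kafka.mycompany.com:9092",  # assumed parameter name
    topic="user-events",                          # assumed parameter name
    security_protocol="SASL_SSL",
    sasl_mechanism="SCRAM-SHA-256",
)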
The ChalkClient
is the primary Python interface for interacting with Chalk.
You can use it to query data, trigger resolver runs, gather offline data, and more.
Create a ChalkClient
with the given credentials.
The client secret to use to authenticate. Can either be a service token secret or a user token secret.
The ID or name of the environment to use for this client. Not necessary if your client_id and client_secret are for a service token scoped to a single environment. If not present, the client will use the environment variable CHALK_ENVIRONMENT.
The API server to use for this client. Required if you are using a Chalk Dedicated deployment. If not present, the client will check for the presence of the environment variable CHALK_API_SERVER, and use that if found.
If specified, Chalk will route all requests from this client instance to the relevant branch. Some methods allow you to override this instance-level branch configuration by passing in a branch argument.
If True, the client will pick up the branch from the current git branch.
If specified, Chalk will route all requests from this client
instance to the relevant tagged deployment. This cannot be
used with the branch
argument.
If specified, Chalk will route all requests from this client instance to the relevant preview deployment.
The query server to use for this client. Required if you are using a standalone Chalk query engine deployment. If not present, the client will default to the value of api_server.
The default wait timeout, in seconds, to wait for long-running jobs to complete when accessing query results. Jobs will not time out if this timeout elapses. For no timeout, set to None. The default is no timeout.
The default wait timeout, in seconds, to wait for network requests to complete. If not specified, the default is no timeout.
If client_id
or client_secret
are not provided, there
is no ~/.chalk.yml
file with applicable credentials,
and the environment variables CHALK_CLIENT_ID
and
CHALK_CLIENT_SECRET
are not set.
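A construction sketch combining the parameters above (values are placeholders):
from chalk.client import ChalkClient

client = ChalkClient(
    client_id="client-id-from-dashboard",
    client_secret="client-secret-from-dashboard",
    environment="production",
)

# With no arguments, credentials are read from ~/.chalk.yml or from the
# CHALK_CLIENT_ID and CHALK_CLIENT_SECRET environment variables.
client = ChalkClient()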
Compute features values using online resolvers. See Chalk Clients for more information.
The features for which there are known values, mapped to those values. For example, {User.id: 1234}. Features can also be expressed as snakecased strings, e.g. {"user.id": 1234}.
Outputs are the features that you'd like to compute from the inputs. For example, [User.age, User.name, User.email].
If an empty sequence, the output will be set to all features on the namespace of the query. For example, if you pass as input {"user.id": 1234}, then the query is defined on the User namespace, and all features on the User namespace (excluding has-one and has-many relationships) will be used as outputs.
The time at which to evaluate the query. If not specified, the current time will be used. This parameter is complex in the context of online_query since the online store only stores the most recent value of an entity's features. If now is in the past, it is extremely likely that None will be returned for cache-only features.
This parameter is primarily provided to support:
If you are trying to perform an exploratory analysis of past feature values, prefer offline_query.
Maximum staleness overrides for any output features or intermediate features. See Caching for more information.
The environment under which to run the resolvers. API tokens can be scoped to an environment. If no environment is specified in the query, but the token supports only a single environment, then that environment will be taken as the scope for executing the request.
You can specify a correlation ID to be used in logs and web interfaces.
This should be globally unique, i.e. a uuid
or similar. Logs generated
during the execution of your query will be tagged with this correlation id.
The semantic name for the query you're making, for example, "loan_application_model". Typically, each query that you make from your application should have a name. Chalk will present metrics and dashboard functionality grouped by 'query_name'.
If your query name matches a NamedQuery, the query will automatically pull outputs and options specified in the matching NamedQuery.
If query_name
is specified, this specifies the version of the named query you're making.
This is only useful if you want your query to use a NamedQuery
with a specific name and a
specific version. If a query_name
has not been supplied, then this parameter is ignored.
Returns metadata about the query execution under OnlineQueryResult.meta. This could make the query slightly slower. For more information, see Chalk Clients.
If True, the output of each of the query plan stages will be stored. This option dramatically impacts the performance of the query, so it should only be used for debugging.
If specified, all required_resolver_tags must be present on a resolver for it to be considered eligible to execute. See Tags for more information.
from chalk.client import ChalkClient
result = ChalkClient().query(
input={
User.name: "Katherine Johnson"
},
output=[User.fico_score],
staleness={User.fico_score: "10m"},
)
result.get_feature_value(User.fico_score)
Execute multiple queries (represented by queries=
argument) in a single request. This is useful if the
queries are "rooted" in different @features
classes -- i.e. if you want to load features for User
and
Merchant
and there is no natural relationship object which is related to both of these classes, multi_query
allows you to submit two independent queries.
Returns a BulkOnlineQueryResponse
, which is functionally a list of query results. Each of these results
can be accessed by index. Individual results can be further checked for errors and converted
to pandas or polars DataFrames.
In contrast, query_bulk
executes a single query with multiple inputs/outputs.
The environment under which to run the resolvers. API tokens can be scoped to an environment. If no environment is specified in the query, but the token supports only a single environment, then that environment will be taken as the scope for executing the request.
An output containing results as a list[BulkOnlineQueryResult]
,
where each result contains a DataFrame
of the results of each
query or any errors.
from chalk.client import ChalkClient, OnlineQuery
queries = [
OnlineQuery(
input={User.name: 'Katherine Johnson'},
output=[User.fico_score],
),
OnlineQuery(
input={Merchant.name: 'Eight Sleep'},
output=[Merchant.address],
),
]
result = ChalkClient().multi_query(queries)
result[0].get_feature_value(User.fico_score)
Compute feature values for many rows of inputs using online resolvers. See Chalk Clients for more information on online query.
This method is similar to query
, except it takes in a list
of inputs, and produces one
output per row of inputs.
This method is appropriate if you want to fetch the same set of features for many different input primary keys.
This method contrasts with multi_query
, which executes multiple fully independent queries.
This endpoint is not available in all environments.
The time at which to evaluate the query. If not specified, the current time will be used.
The length of this list must be the same as the length of the values in input
.
Maximum staleness overrides for any output features or intermediate features. See Caching for more information.
The environment under which to run the resolvers. API tokens can be scoped to an environment. If no environment is specified in the query, but the token supports only a single environment, then that environment will be taken as the scope for executing the request.
An output containing results as a list[BulkOnlineQueryResult]
,
where each result contains a DataFrame
of the results of each query.
from chalk.client import ChalkClient
ChalkClient().query_bulk(
input={User.name: ["Katherine Johnson", "Eleanor Roosevelt"]},
output=[User.fico_score],
staleness={User.fico_score: "10m"},
)
Plan a query without executing it.
The features for which there are known values, mapped to those values.
For example, {User.id: 1234}
. Features can also be expressed as snakecased strings,
e.g. {"user.id": 1234}
Outputs are the features that you'd like to compute from the inputs.
For example, [User.age, User.name, User.email]
.
Maximum staleness overrides for any output features or intermediate features. See Caching for more information.
The environment under which to run the resolvers. API tokens can be scoped to an environment. If no environment is specified in the query, but the token supports only a single environment, then that environment will be taken as the scope for executing the request.
The semantic name for the query you're making, for example, "loan_application_model"
.
Typically, each query that you make from your application should have a name.
Chalk will present metrics and dashboard functionality grouped by 'query_name'.
If your query name matches a NamedQuery
, the query will automatically pull outputs
and options specified in the matching NamedQuery
.
If query_name
is specified, this specifies the version of the named query you're making.
This is only useful if you want your query to use a NamedQuery
with a specific name and a
specific version. If a query_name
has not been supplied, then this parameter is ignored.
The number of input rows that this plan will be run with. If unknown, specify None
.
The query plan, including the resolver execution order and the resolver execution plan for each resolver.
from chalk.client import ChalkClient
result = ChalkClient().plan_query(
input=[User.id],
output=[User.fico_score],
staleness={User.fico_score: "10m"},
)
result.rendered_plan
result.output_schema
Check whether the expected results of a query match Chalk's query outputs.
This function should be used in integration tests.
If you're using pytest
, pytest.fail
will be executed on an error.
Otherwise, an AssertionError
will be raised.
A feature set or a mapping of {feature: value}
of givens.
All values will be encoded to the json representation.
A feature set or a mapping of {feature: value}
of expected outputs.
For values where you do not care about the result, use an ...
for the
feature value (i.e. when an error is expected).
A list of the features that you expect to be read from the online store, e.g.
cache_hits=[Actor.name, Actor.num_appearances]
A map from the expected feature name to the expected errors for that feature, e.g.
expected_feature_errors={
User.id: [ChalkError(...), ChalkError(...)]
}
errors={
"user.id": [ChalkError(...), ChalkError(...)]
}
The time at which to evaluate the query. If not specified, the current time will be used.
This parameter is complex in the context of online_query
since the online store
only stores the most recent value of an entity's features. If now
is in the past,
it is extremely likely that None
will be returned for cache-only features.
This parameter is primarily provided to support a narrow set of specialized use cases. If you are trying to perform an exploratory analysis of past feature values, prefer offline_query instead.
Maximum staleness overrides for any output features or intermediate features. See Caching for more information.
The semantic name for the query you're making, for example, "loan_application_model"
.
Typically, each query that you make from your application should have a name.
Chalk will present metrics and dashboard functionality grouped by 'query_name'.
If your query name matches a NamedQuery
, the query will automatically pull outputs
and options specified in the matching NamedQuery
.
If query_name
is specified, this specifies the version of the named query you're making.
This is only useful if you want your query to use a NamedQuery
with a specific name and a
specific version. If a query_name
has not been supplied, then this parameter is ignored.
If specified, all required_resolver_tags must be present on a resolver for it to be considered eligible to execute. See Tags for more information.
The relative tolerance to allow for float equality.
If you specify both float_rel_tolerance
and float_abs_tolerance
,
the numbers will be considered equal if either tolerance is met.
Equivalent to:
abs(a - b) <= float_rel_tolerance * max(abs(a), abs(b))
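As an illustration only (not the client's internal implementation), combining the relative and absolute tolerances described above might look like the following sketch; floats_equal is a hypothetical helper name.
def floats_equal(a: float, b: float, rel_tol: float, abs_tol: float) -> bool:
    # "Equal if either tolerance is met": relative check OR absolute check.
    diff = abs(a - b)
    return diff <= rel_tol * max(abs(a), abs(b)) or diff <= abs_tol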
from chalk.client import ChalkClient
result = ChalkClient().check(
input={Actor.id: "nm0000001"},
assertions={Actor.num_movies: 40},
)
Chalk Feature Value Mismatch
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Kind ┃ Name ┃ Value ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Expect │ actor.id │ nm0000001 │
│ Actual │ actor.id │ nm0000001 │
│ Expect │ actor.num_appearanc… │ 40 │
│ Actual │ actor.num_appearanc… │ 41 │
└────────┴──────────────────────┴───────────┘
Get a Chalk Dataset
containing data from a previously created dataset.
If an offline query has been created with a dataset name, .get_dataset
will
return a Chalk Dataset
.
The Dataset
wraps a lazily-loading Chalk DataFrame
that enables us to analyze
our data without loading all of it directly into memory.
See Offline Queries for more information.
The name of the Dataset
to return.
You must have previously supplied this dataset name when creating the offline query.
Dataset names are unique for each environment.
If 'dataset_name' is provided, then 'job_id' should not be provided.
The environment under which to run the resolvers. API tokens can be scoped to an environment. If no environment is specified in the query, but the token supports only a single environment, then that environment will be taken as the scope for executing the request.
A UUID returned in the Dataset
object from an offline query.
Dataset ids are unique for each environment.
If 'dataset_id' is provided, then 'dataset_name' and 'revision_id' should not be provided.
The unique id of the DatasetRevision
to return.
If a previously-created dataset did not have a name, you can look it
up using its unique job id instead.
If 'revision_id' is provided, then 'dataset_name' and 'dataset_id' should not be provided.
from chalk.client import ChalkClient
uids = [1, 2, 3, 4]
at = datetime.now(timezone.utc)
X = ChalkClient().offline_query(
input={
User.id: uids,
},
input_times=[at] * len(uids),
output=[
User.id,
User.fullname,
User.email,
User.name_email_match_score,
],
dataset='my_dataset_name'
)
# Some time later, retrieve the dataset by name...
dataset = ChalkClient().get_dataset(
    dataset_name='my_dataset_name'
)
# ...or look it up by the job id returned from the offline query.
dataset = ChalkClient().get_dataset(
    job_id='00000000-0000-0000-0000-000000000000'
)
# If memory allows, materialize the results as a pandas DataFrame:
df: pd.DataFrame = dataset.get_data_as_pandas()
Compute feature values from the offline store or by running offline/online resolvers.
See Dataset
for more information.
The time at which the given inputs should be observed for point-in-time correctness. If given a list of
times, the list must match the length of the input
lists. Each element of input_time corresponds with the
feature values at the same index of the input
lists.
See Temporal Consistency for more information.
The environment under which to run the resolvers. API tokens can be scoped to an environment. If no environment is specified in the query, but the token supports only a single environment, then that environment will be taken as the scope for executing the request.
A unique name that if provided will be used to generate and
save a Dataset
constructed from the list of features computed
from the inputs.
If specified, Chalk will route your request to the relevant branch. If None, Chalk will route your request to a non-branch deployment. If not specified, Chalk will use the current client's branch info.
You can specify a correlation ID to be used in logs and web interfaces.
This should be globally unique, i.e. a uuid
or similar. Logs generated
during the execution of your query will be tagged with this correlation id.
The maximum number of samples to include in the DataFrame
.
If not specified, all samples will be returned.
If True, progress bars will be shown while the query is running.
Primarily intended for use in a Jupyter-like notebook environment.
This flag will also be propagated to the methods of the resulting
Dataset
.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
Used to control whether resolvers are allowed to run in order to compute feature values.
If True, all output features will be recomputed by resolvers. If False, all output features will be sampled from the offline store. If a list, all output features in recompute_features will be recomputed, and all other output features will be sampled from the offline store.
A list of features that will always be sampled, and thus always excluded from recompute. Should not overlap with any features used in "recompute_features" argument.
If specified, the query will only be run on data observed after this timestamp. Accepts strings in ISO 8601 format.
If specified, the query will only be run on data observed before this timestamp. Accepts strings in ISO 8601 format.
If True, the output of each of the query plan stages will be stored in S3/GCS. This will dramatically impact the performance of the query, so it should only be used for debugging. These files will be visible in the web dashboard's query detail view, and can be downloaded in full by clicking on a plan node in the query plan visualizer.
If specified, all required_resolver_tags must be present on a resolver for it to be considered eligible to execute. See Tags for more information.
A SQL query that will query your offline store and use the result as input. See Input for more information.
Override resource requests for processes with isolated resources, e.g., offline queries and cron jobs.
See ResourceRequests
for more information.
Boots a kubernetes job to run the queries in their own pods, separate from the engine and branch servers. This is useful for large datasets and jobs that require a long time to run.
from chalk.client import ChalkClient
uids = [1, 2, 3, 4]
at = datetime.now(tz=timezone.utc)
dataset = ChalkClient().offline_query(
input={
User.id: uids,
},
input_times=[at] * len(uids),
output=[
User.id,
User.fullname,
User.email,
User.name_email_match_score,
],
dataset_name='my_dataset'
)
df = dataset.get_data_as_pandas()
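For example, a sketch of recomputing only one output feature while sampling the rest from the offline store; it reuses uids and at from the example above, and the feature names are illustrative.
dataset = ChalkClient().offline_query(
    input={User.id: uids},
    input_times=[at] * len(uids),
    output=[
        User.id,
        User.email,
        User.name_email_match_score,
    ],
    # Recompute only this feature; sample the others from the offline store.
    recompute_features=[User.name_email_match_score],
    dataset_name='my_dataset_recomputed',
)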
Triggers a resolver to run. See Triggered Runs for more information.
The environment under which to run the resolvers. API tokens can be scoped to an environment. If no environment is specified in the query, but the token supports only a single environment, then that environment will be taken as the scope for executing the request.
If specified, the resolver will only ingest data observed before this timestamp. Accepts strings in ISO 8601 format.
If specified, the resolver will only ingest data observed after this timestamp. Accepts strings in ISO 8601 format.
from chalk.client import ChalkClient
ChalkClient().trigger_resolver_run(
resolver_fqn="mymodule.fn"
)
Retrieves the status of a resolver run. See Triggered Runs for more information.
The environment under which to run the resolvers. API tokens can be scoped to an environment. If no environment is specified in the query, but the token supports only a single environment, then that environment will be taken as the scope for executing the request.
from chalk.client import ChalkClient
ChalkClient().get_run_status(
run_id="3",
)
ResolverRunResponse(
id="3",
status=ResolverRunStatus.SUCCEEDED
)
Targets feature observation values for deletion and performs deletion online and offline.
An optional list of the feature names of the features that should be deleted for the targeted primary keys. Not specifying this and not specifying the "tags" field will result in all features being targeted for deletion for the specified primary keys. Note that this parameter and the "tags" parameter are mutually exclusive.
An optional list of tags that specify features that should be targeted for deletion. If a feature has a tag in this list, its observations for the primary keys you listed will be targeted for deletion. Not specifying this and not specifying the "features" field will result in all features being targeted for deletion for the specified primary keys. Note that this parameter and the "features" parameter are mutually exclusive.
Holds any errors (if any) that occurred during the delete request. Deletion of a feature may partially succeed.
from chalk.client import ChalkClient
ChalkClient().delete_features(
namespace="user",
features=["name", "email", "age"],
primary_keys=[1, 2, 3]
)
Performs a drop on features, which deletes all of their data (both online and offline). Once a feature has been reset in this manner, its type can be changed.
Holds any errors (if any) that occurred during the drop request. Dropping a feature may partially succeed.
from chalk.client import ChalkClient
ChalkClient().drop_features(
namespace="user",
features=["name", "email", "age"],
)
Upload data to Chalk for use in offline resolvers or to prime a cache.
The environment under which to run the resolvers. API tokens can be scoped to an environment. If no environment is specified in the query, but the token supports only a single environment, then that environment will be taken as the scope for executing the request.
The errors encountered from uploading features.
from chalk.client import ChalkClient
ChalkClient().upload_features(
input={
User.id: 1,
User.name: "Katherine Johnson"
}
)
Upload data to Chalk for use in offline resolvers or to prime a cache.
One of three types: pandas, polars, or chalk.DataFrame.
The environment under which to run the upload. API tokens can be scoped to an environment. If no environment is specified in the upload, but the token supports only a single environment, then that environment will be taken as the scope for executing the request.
The errors encountered from uploading features.
from chalk.client import ChalkClient
ChalkClient().multi_upload_features(
input=[
{
User.id: 1,
User.name: "Katherine Johnson"
},
{
User.id: 2,
User.name: "Eleanor Roosevelt"
}
]
)
Get the most recent feature values from the offline store.
See Offline Queries for more information.
A pandas.DataFrame
with columns equal to the names of the features in output,
and values representing the value of the most recent observation.
from chalk.client import ChalkClient
sample_df = ChalkClient().sample(
output=[
Account.id,
Account.title,
Account.user.full_name
],
max_samples=10
)
Create a new branch based off of a deployment from the server. By default, uses the latest live deployment.
The specific deployment ID to use for the branch. If not specified, the latest live deployment on the server will be used. You can see which deployments are available by clicking on the 'Deployments' tab on the project page in the Chalk dashboard.
A response object containing metadata about the branch.
from chalk.client import ChalkClient
client = ChalkClient()
client.create_branch("my-new-branch")
Point the ChalkClient
at the given branch.
If branch_name
is None, this points the client at the
active non-branch deployment.
If the branch does not exist or if branch deployments are not enabled for the current environment, this method raises an error.
from chalk.client import ChalkClient
client = ChalkClient()
client.create_branch("my-new-branch")
client.set_branch("my-new-branch")
client.set_branch(None)
Returns a BranchGraphSummary
object that contains the
state of the branch server: Which resolver/features are
defined, and the history of live notebook updates on the
server.
The branch to query. If not specified, the branch is
expected to be included in the constructor for ChalkClient
.
Sets the incremental cursor for a resolver or scheduled query.
The resolver. Can be a function or the string name of a function.
Exactly one of resolver
and scheduled_query
is required.
from chalk.client import ChalkClient
client = ChalkClient()
client.set_incremental_cursor(
resolver="my_resolver",
max_ingested_timestamp=datetime.now(),
)
Gets the incremental cursor for a resolver or scheduled query.
The resolver. Can be a function or the string name of a function.
Exactly one of resolver
and scheduled_query
is required.
An object containing the max_ingested_timestamp
and incremental_timestamp
.
from chalk.client import ChalkClient
client = ChalkClient()
client.get_incremental_cursor(resolver="my_resolver")
Tests a streaming resolver and its ability to parse and resolve messages. See Streams for more information.
The number of messages to digest from the stream source. Because messages may not be currently arriving on the stream, this action may time out.
A filepath from which test messages will be ingested. This file should be newline-delimited JSON, as follows:
{"message_key": "my-key", "message_body": {"field1": "value1", "field2": "value2"}}
{"message_key": "my-key", "message_body": {"field1": "value1", "field2": "value2"}}
Each line may optionally contain a timezone string as a value to the key "message_timestamp".
Alternatively, keys can be supplied in code along with the "test_message_bodies" argument. Both arguments must be the same length.
Message bodies can be supplied in code as strings, bytes, or Pydantic models along with the "test_message_keys" argument. Both arguments must be the same length.
A simple wrapper around a status and optional error message.
Inspecting StreamResolverTestResponse.features
will return the test results, if they exist.
Otherwise, check StreamResolverTestResponse.errors
and StreamResolverTestResponse.message
for errors.
from chalk.streams import stream, KafkaSource
from chalk.client import ChalkClient
from chalk.features import Features, features
import pydantic
# This code is an example of a simple streaming feature setup. Define the source
stream_source=KafkaSource(...)
# Define the features
@features(etl_offline_to_online=True, max_staleness="7d")
class StreamingFeature:
id: str
user_id: str
card_id: str
# Define the streaming message model
class StreamingMessage(pydantic.BaseModel):
card_id: str
user_id: str
# Define the mapping resolver
@stream(source=stream_source)
def our_stream_resolver(
m: StreamingMessage,
) -> Features[StreamingFeature.id, StreamingFeature.card_id, StreamingFeature.user_id]:
return StreamingFeature(
id=f"{m.card_id}-{m.user_id}",
card_id=m.card_id,
user_id=m.user_id,
)
# Once you have done a `chalk apply`, you can test the streaming resolver with custom messages as follows
client = ChalkClient()
keys = ["my_key"] * 10
messages = [StreamingMessage(card_id="1", user_id=str(i)).json() for i in range(10)]
resp = client.test_streaming_resolver(
resolver="our_stream_resolver",
message_keys=keys,
message_bodies=messages,
)
print(resp.features)
The value of the requested feature. If an error was encountered in resolving this feature, this field will be empty.
The ChalkError
describes an error from running a resolver
or from a feature that can't be validated.
The category of the error, given in the type field for the error codes. This will be one of "REQUEST", "NETWORK", or "FIELD".
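For example, a minimal sketch of inspecting errors on a query response, assuming the User features from earlier examples:
from chalk.client import ChalkClient

result = ChalkClient().query(
    input={User.id: 1234},
    output=[User.fico_score],
)
if result.errors:
    for error in result.errors:
        # Each ChalkError carries a code, a category, and a human-readable message.
        print(error.category, error.code, error.message)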
Class wrapper around revisions for Datasets.
Loads a pl.DataFrame
containing the output. Use .to_polars_lazyframe()
if you want
a LazyFrame
instead, which allows local filtering of datasets that are larger than memory.
Whether to return the primary key feature in a column
named "__chalk__.__id__"
in the resulting pl.LazyFrame
.
Whether to return the input-time feature in a column
named "__chalk__.CHALK_TS"
in the resulting pl.LazyFrame
.
If set to a non-empty str
, used as the input-time column name.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
A polars.DataFrame
materializing query output data.
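A brief usage sketch, assuming the dataset from an earlier offline_query call exposes its revisions as dataset.revisions and that the flags described above are named output_id and output_ts (treat these names as assumptions):
revision = dataset.revisions[-1]  # the latest DatasetRevision
df = revision.to_polars(
    output_id=True,  # assumed flag name: include "__chalk__.__id__"
    output_ts=True,  # assumed flag name: include "__chalk__.CHALK_TS"
)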
Loads a pl.LazyFrame
containing the output. This method is appropriate for working with larger-than-memory datasets.
Use .to_polars()
if you want a DataFrame
instead.
Whether to return the primary key feature in a column
named "__chalk__.__id__"
in the resulting pl.LazyFrame
.
Whether to return the input-time feature in a column
named "__chalk__.CHALK_TS"
in the resulting pl.LazyFrame
.
If set to a non-empty str
, used as the input-time column name.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
A pl.LazyFrame
materializing query output data.
Loads a pl.LazyFrame
containing the output.
Whether to return the primary key feature in a column
named "__chalk__.__id__"
in the resulting pl.LazyFrame
.
Whether to return the input-time feature in a column
named "__chalk__.CHALK_TS"
in the resulting pl.LazyFrame
.
If set to a non-empty str
, used as the input-time column name.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
A pl.LazyFrame
materializing query output data.
Loads a pd.DataFrame
containing the output.
Whether to return the primary key feature in a column
named "__chalk__.__id__"
in the resulting pd.DataFrame
.
Whether to return the input-time feature in a column
named "__chalk__.CHALK_TS"
in the resulting pd.DataFrame
.
If set to a non-empty str
, used as the input-time column name.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
A pd.DataFrame
materializing query output data.
Loads a Chalk DataFrame
containing the output.
Whether to return the primary key feature in a column
named "__chalk__.__id__"
in the resulting DataFrame
.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
Returns a list of the output URIs for the revision. Data will be stored in Parquet format. The URIs should be considered temporary, and will expire after a server-defined time period.
Returns an object that loads the summary statistics of a dataset revision.
The dataframe can be retrieved by calling to_polars()
or to_pandas()
on the return object.
Data will be stored in Parquet format. The URIs should be considered temporary, and will expire after a server-defined time period.
Returns an object that loads a preview of a dataset revision.
The dataframe can be retrieved by calling to_polars()
or to_pandas()
on the return object.
Data will be stored in Parquet format. The URIs should be considered temporary, and will expire after a server-defined time period.
Waits for an offline query job to complete. Raises if the query is unsuccessful, otherwise returns itself on success.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
Downloads output files pertaining to the revision to given path.
Datasets are stored in Chalk as sharded Parquet files. With this method, you can download those raw files into a directory for processing with other tools.
Whether to return the primary key feature in a column
named "__chalk__.__id__"
in the resulting DataFrame
.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
Loads a pl.LazyFrame
containing the inputs.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
A pl.LazyFrame
materializing query input data.
Waits for the revision job to complete.
ChalkClient.offline_query
returns a DatasetRevision
instance immediately after
submitting the revision job. This method can be used to wait for the
revision job to complete.
Once the revision job is complete, the status
attribute of the
DatasetRevision
instance will be updated to reflect the status of the
revision job.
If the revision job was successful, you can then use methods such as
get_data_as_pandas()
without having to wait for the revision job to
complete.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
Downloads the resolver replay data for the given resolver in the revision, provided the revision had store_plan_stages enabled.
The replay data is functionally similar to viewing the intermediate results on the plan explorer.
If the resolver appears in only one stage of the plan, the resolver's replay data is returned directly. If the resolver instead appears in multiple stages of the plan, a mapping of the operation's ID to the replay data will be returned. If the resolver does not appear in the plan, an exception will be thrown.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
Wrapper around Offline Query results.
Datasets are obtained by invoking ChalkClient.offline_query()
.
Dataset
instances store important metadata and enable the retrieval of
offline query outputs.
from chalk.client import ChalkClient, Dataset
uids = [1, 2, 3, 4]
at = datetime.now(tz=timezone.utc)
dataset: Dataset = ChalkClient().offline_query(
input={
User.id: uids,
},
input_times=[at] * len(uids),
output=[
User.id,
User.fullname,
User.email,
User.name_email_match_score,
],
dataset_name='my_dataset'
)
df = dataset.get_data_as_pandas()
dataset.recompute(features=[User.fraud_score], branch="feature/testing")
A list of all DatasetRevision
instances belonging to this dataset.
Loads a pl.DataFrame
containing the output. Use .to_polars_lazyframe()
if you want
a LazyFrame
instead, which allows local filtering of datasets that are larger than memory.
A pl.DataFrame
materializing query output data.
Loads a pl.LazyFrame
containing the output.
A pl.LazyFrame
materializing query output data.
Loads a pd.DataFrame
containing the output.
A pd.DataFrame
materializing query output data.
Loads a pd.DataFrame
containing the output of the most recent revision.
Whether to return the primary key feature in a column
named "__chalk__.__id__"
in the resulting pd.DataFrame
.
Whether to return the input-time feature in a column
named "__chalk__.CHALK_TS"
in the resulting pd.DataFrame
.
If set to a non-empty str
, used as the input-time column name.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
A pd.DataFrame
materializing query output data.
Returns a list of the output URIs for the revision. Data will be stored in Parquet format. The URIs should be considered temporary, and will expire after a server-defined time period.
Whether to return the primary key feature in a column
named "__chalk__.__id__"
in the resulting pd.DataFrame
.
Whether to return the input-time feature in a column
named "__chalk__.CHALK_TS"
in the resulting pd.DataFrame
.
If set to a non-empty str
, used as the input-time column name.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
Waits for an offline query job to complete. Returns a list of errors if unsuccessful, or None if successful.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
Downloads output files pertaining to the revision to the given path.
Datasets are stored in Chalk as sharded Parquet files. With this method, you can download those raw files into a directory for processing with other tools.
An executor to use to download the data in parallel. If not specified, the default executor will be used.
How long to wait, in seconds, for job completion before raising a TimeoutError
.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
from chalk.client import ChalkClient, Dataset
from datetime import datetime, timezone
uids = [1, 2, 3, 4]
at = datetime.now(tz=timezone.utc)
dataset = ChalkClient().offline_query(
input={User.id: uids},
input_times=[at] * len(uids),
output=[
User.id,
User.fullname,
User.email,
User.name_email_match_score,
],
dataset_name='my_dataset',
)
dataset.download_data('my_directory')
Returns an object that loads the summary statistics of a dataset revision.
The dataframe can be retrieved by calling to_polars()
or to_pandas()
on the return object.
Data will be stored in Parquet format. The URIs should be considered temporary, and will expire after a server-defined time period.
Returns an object that loads a preview of a dataset revision.
The dataframe can be retrieved by calling to_polars()
or to_pandas()
on the return object.
Data will be stored in Parquet format. The URIs should be considered temporary, and will expire after a server-defined time period.
Loads a pl.LazyFrame
containing the inputs that were used to create the dataset.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
A pl.LazyFrame
materializing query input data.
Creates a new revision of this Dataset
by recomputing the specified features.
Carries out the new computation on the branch specified when constructing the client.
A list of specific features to recompute. Features that don't exist in the dataset will be added. Features that already exist in the dataset will be recomputed. If not provided, all the existing features in the dataset will be recomputed.
If specified, Chalk will route your request to the relevant branch. If None, Chalk will route your request to a non-branch deployment. If not specified, Chalk will use the current client's branch info.
If True, progress bars will be shown while recomputation is running.
This flag will also be propagated to the methods of the resulting
Dataset
.
If True, the output of each of the query plan stages will be stored in S3/GCS. This will dramatically impact the performance of the query, so it should only be used for debugging. These files will be visible in the web dashboard's query detail view, and can be downloaded in full by clicking on a plan node in the query plan visualizer.
You can specify a correlation ID to be used in logs and web interfaces.
This should be globally unique, i.e. a uuid
or similar. Logs generated
during the execution of your query will be tagged with this correlation id.
If specified, all required_resolver_tags must be present on a resolver for it to be considered eligible to execute. See Tags for more information.
Boots a kubernetes job to run the queries in their own pods, separate from the engine and branch servers. This is useful for large datasets and jobs that require a long time to run. This must be specified as True to run this job asynchronously, even if the previous revision was run asynchronously.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
If no branch was provided to the Chalk Client.
from chalk.client import ChalkClient
dataset = ChalkClient(branch="data_science").offline_query(...)
df = dataset.get_data_as_polars()
# make changes to resolvers in your project
dataset.recompute()
new_df = dataset.get_data_as_polars() # receive newly computed data
Downloads the resolver replay data for the given resolver in the latest revision of the dataset.
The replay data is functionally similar to viewing the intermediate results on the plan explorer.
If the resolver appears in only one stage of the plan, the resolver's replay data is returned directly. If the resolver instead appears in multiple stages of the plan, a mapping of the operation's ID to the replay data will be returned. If the resolver does not appear in the plan, an exception will be thrown.
How long to wait, in seconds, for job completion before raising a TimeoutError.
Jobs will continue to run in the background if they take longer than this timeout.
For no timeout, set to None
. If no timeout is specified, the client's default
timeout is used.
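A rough sketch of replaying a resolver from a revision that was run with store_plan_stages=True; the method name resolver_replay and the get_fico_score resolver are assumptions used only for illustration.
dataset = ChalkClient().offline_query(
    input={User.id: [1, 2, 3]},
    output=[User.fico_score],
    store_plan_stages=True,  # required for replay data to exist
    dataset_name='replay_debugging',
)
replay = dataset.resolver_replay(get_fico_score)  # pass the resolver function itself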
The category of an error.
For more detailed error information, see ErrorCode
Request errors are raised before execution of your resolver code. They may occur due to invalid feature names in the input or a request that cannot be satisfied by the resolvers you have defined.
Field errors are raised while running a feature resolver for a particular field. For this type of error, you'll find a feature and resolver attribute in the error type. When a feature resolver crashes, you will receive a null value in the response. To differentiate between a resolver returning a null value and a failure in the resolver, you need to check the error schema.
The detailed error code.
For a simpler category of error, see ErrorCodeCategory
.
Override resource requests for processes with isolated resources, e.g., offline queries and cron jobs. Note that making these too large could prevent your job from being scheduled, so please test before using these in a recurring pipeline.
CPU requests: Increasing this will make some Chalk operations that are parallel and CPU-bound faster. Default unit is physical CPU cores, i.e. "8" means 8 CPU cores, "0.5" means half of a CPU core. An alternative unit is "millicore", which is one-thousandth of a CPU core, i.e. 500m is half of a CPU core.
Memory requests: you can use these to give your pod more memory, i.e. to prevent especially large jobs from OOMing. Default unit is bytes, i.e. 1000000000 is 1 gigabyte of memory. You can also specify a suffix such as K, M, or G for kilobytes, megabytes, and gigabytes, respectively. It's also possible to use the power of two equivalents, such as Ki, Mi, and Gi.
Chalk can use this for spilling intermediate state of some large computations, i.e. joins, aggregations, and sorting. Default unit is bytes, i.e. 1000000000 is 1 gigabyte of memory. You can also specify a suffix such as K, M, or G for kilobytes, megabytes, and gigabytes, respectively. It's also possible to use the power of two equivalents, such as Ki, Mi, and Gi.
Ephemeral storage for miscellaneous file system access. Should probably not be below 1Gi to ensure there's enough space for the Docker image, etc. Should also not be too high or else the pod will not be scheduled.
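A hedged sketch of passing resource overrides to an offline query; the import path, the resources parameter name, and the cpu/memory field names are assumptions inferred from the descriptions above.
from chalk.client import ChalkClient
from chalk.client import ResourceRequests  # import path assumed

dataset = ChalkClient().offline_query(
    input={User.id: [1, 2, 3]},
    output=[User.fico_score],
    resources=ResourceRequests(cpu="4", memory="16Gi"),  # field names assumed
    dataset_name='large_backfill',
)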
Convenience method for accessing feature result from the data response.
The FeatureResult
for the feature, if it exists.
from chalk.client import ChalkClient
data = ChalkClient().query(...)
data.get_feature(User.name).ts
datetime.datetime(2023, 2, 5, 23, 25, 26, 427605)
data.get_feature("user.name").meta.cache_hit
False
Create a named query.
Named queries are aliases for specific queries that can be used by API clients.
A name for the named query—this can be versioned with the version parameter, but must otherwise be unique. The name of the named query shows up in the dashboard and is used to specify the outputs for a query.
A string specifying the version of the named query: version is not required, but if specified it must be a valid "semantic version".
The features which will be provided by callers of this query.
For example, [User.id]
. Features can also be expressed as snakecased strings,
e.g. ["user.id"]
.
Outputs are the features that you'd like to compute from the inputs.
For example, [User.age, User.name, User.email]
.
If an empty sequence, the output will be set to all features on the namespace
of the query. For example, if you pass as input {"user.id": 1234}
, then the query
is defined on the User
namespace, and all features on the User
namespace
(excluding has-one and has-many relationships) will be used as outputs.
The owner of the query. This should be a Slack username or email address. This is used to notify the owner in case of incidents.
Maximum staleness overrides for any output features or intermediate features. See Caching for more information.
from chalk import NamedQuery
# this query's name and version can be used to specify query outputs in an API request.
NamedQuery(
name="fraud_model",
version="1.0.0",
input=[User.id],
output=[User.age, User.fraud_score, User.credit_report.fico],
)
Context in which to execute a query.
Raised when constructing a ChalkClient
without valid credentials.
When this exception is raised, no explicit client_id
and client_secret
were provided, there was no ~/.chalk.yml
file with applicable credentials,
and the environment variables CHALK_CLIENT_ID
and CHALK_CLIENT_SECRET
were not set.
You may need to run chalk login
from your command line, or check that your
working directory is set to the root of your project.
Duration is used to describe time periods in natural language. To specify using natural language, write the count of the unit you would like, followed by the representation of the unit.
Chalk supports the following units:
| Signifier | Meaning |
| --- | --- |
| w | Weeks |
| d | Days |
| h | Hours |
| m | Minutes |
| s | Seconds |
| ms | Milliseconds |
As well as the special keywords "infinity"
and "all"
.
Examples:
| Signifier | Meaning |
| --- | --- |
| "10h" | 10 hours |
| "1w 2m" | 1 week and 2 minutes |
| "1h 10m 2s" | 1 hour, 10 minutes, and 2 seconds |
| "infinity" | Unbounded time duration |
Environments are used to trigger behavior in different deployments such as staging, production, and local development. For example, you may wish to interact with a vendor via an API call in the production environment, and opt to return a constant value in a staging environment.
Environments
can take one of three types:
None (default) - a candidate to run in every environment
str - run only in this environment
list[str] - run in any of the specified environments and no others
See more at Environments
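For example, a minimal sketch of environment-scoped resolvers; call_vendor_api is a hypothetical external call.
from chalk import online

@online(environment="production")
def get_fraud_score_prod(uid: User.id) -> User.fraud_score:
    return call_vendor_api(uid)  # hypothetical vendor call

@online(environment=["staging", "dev"])
def get_fraud_score_nonprod(uid: User.id) -> User.fraud_score:
    return 0.0  # constant value outside of production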
Tags allow you to scope requests within an environment. Both tags and environment need to match for a resolver to be a candidate to execute.
Like Environments, tags control when resolvers run based on the Online Context or Training Context matching the tags provided to the resolver decorator. Resolvers optionally take a keyword argument named tags that can take one of three types:
None (default) - The resolver will be a candidate to run for every set of tags.
str - The resolver will run only if this tag is provided.
list[str] - The resolver will run only if all of the specified tags match.
See more at Tags
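For example, a sketch of a resolver that is only a candidate to run when the request carries a matching tag:
from chalk import online

@online(tags="risk")
def get_fraud_score_risk(uid: User.id) -> User.fraud_score:
    return 0.0  # illustrative constant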
Get the tags for a feature, feature class, or resolver.
If the supplied variable is not a feature, feature class, or resolver.
Feature tags
@features(tags="group:risk")
class User:
id: str
# :tags: pii
email: str
tags(User.id)
['group:risk']
Feature class tags
tags(User)
['group:risk']
Feature + feature class tags
tags(User.email)
['pii', 'group:risk']
Get the description of a feature, feature class, or resolver.
If the supplied variable is not a feature, feature class, or resolver.
@features
class RocketShip:
# Comments above a feature become
# descriptions for the feature!
software_version: str
description(RocketShip.software_version)
'Comments above a feature become descriptions for the feature!'
Determine whether a feature is a feature time.
See Time for more details on FeatureTime
.
True
if the feature is a FeatureTime
and False
otherwise.
from chalk.features import features
@features
class User:
id: str
updated_at: datetime = feature_time()
assert is_feature_time(User.updated_at) is True
assert is_feature_time(User.id) is False
Any class decorated by @dataclass
.
There isn't a base class for dataclass, so we use this TypeAlias to indicate any class decorated with @dataclass.
The base type for Chalk exceptions.
This exception makes error handling easier, as you can look only for this exception class.
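A minimal sketch of catch-all error handling; the import path for the exception is assumed to be chalk.client, and handle_chalk_error is a hypothetical application-level handler.
from chalk.client import ChalkBaseException, ChalkClient

try:
    result = ChalkClient().query(
        input={"user.id": 1234},
        output=["user.fico_score"],
    )
except ChalkBaseException as e:
    handle_chalk_error(e)  # all Chalk-raised errors derive from this base class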
Use chalk.functions
to apply common conversions to your features.
Create a conditional expression, roughly equivalent to
if condition:
return if_true
else:
return if_false
Unlike a Python if/else, all three inputs (condition, if_true, if_false)
are evaluated
in parallel for all rows, and then the correct side is selected based on the result of
the condition expression.
from chalk import _
from chalk.features import features
@features
class Transaction:
id: int
amount: int
risk_score: float = _.if_then_else(
_.amount > 10_000,
_.amount * 0.1,
_.amount * 0.05,
)
Build a conditional expression.
import chalk.functions as F
from chalk.features import _, features
@features
class User:
id: str
age: float
age_group: str = (
F.when(_.age < 1)
.then("baby")
.when(_.age < 3)
.then("toddler")
.when(_.age < 13)
.then("child")
.when(_.age < 18)
.then("teen")
.otherwise(F.cast(F.cast(F.floor(_.age / 10), int), str) + "0s")
)
Evaluates if the string matches the pattern.
Patterns can contain regular characters as well as wildcards. Wildcard characters can be escaped using the single character specified for the escape parameter. Matching is case-sensitive.
Note: The wildcard %
represents 0, 1 or multiple characters
and the wildcard _
represents exactly one character.
For example, the pattern John%
will match any string that starts
with John
, such as John
, JohnDoe
, JohnSmith
, etc.
The pattern John_
will match any string that starts with John
and is followed by exactly one character, such as JohnD
, JohnS
, etc.
but not John
, JohnDoe
, JohnSmith
, etc.
import chalk.functions as F
from chalk.features import _, features
@features
class User:
id: str
name: str
is_john: bool = F.like(_.name, "John%")
Finds the first occurrence of the regular expression pattern in the string and returns the capturing group number group.
import chalk.functions as F
from chalk.features import _, features
@features
class HiddenNumber:
id: str
hidden_number: str = "O0OOO"
number: str = F.regexp_extract(_.hidden_number, r"([0-9]+)", 1)
Finds all occurrences of the regular expression pattern in string and returns the capturing group number group.
import chalk.functions as F
from chalk.features import _, features
@features
class Time:
id: str
time: str = "1y 342d 20h 60m 6s"
processed_time: list[str] = F.regexp_extract_all(_.time, "([0-9]+)([ydhms])", 2)
Evaluates the regular expression pattern and determines if it is contained within string.
This function is similar to the like
function, except that the pattern only needs to be
contained within string, rather than needing to match all the string.
In other words, this performs a contains operation rather than a match operation.
You can match the entire string by anchoring the pattern using ^
and $
.
import chalk.functions as F
from chalk.features import _, features
@features
class User:
id: str
name: str
is_john: bool = F.regexp_like(_.name, "^John.*$")
Extract a scalar from a JSON feature using a JSONPath expression. The value of the referenced path must be a JSON scalar (boolean, number, string).
import chalk.functions as F
from chalk import JSON
from chalk.features import _, features
@features
class User:
id: str
profile: JSON
favorite_color: str = F.json_value(_.profile, "$.prefs.color")
Extract the day of the month from a date.
The supported types for x are date and datetime.
Ranges from 1 to 31 inclusive.
from datetime import date
import chalk.functions as F
from chalk.features import _, features
@features
class Transaction:
id: str
date: date
day: int = F.day_of_month(_.date)
Extract the day of the week from a date.
from datetime import date
import chalk.functions as F
from chalk.features import _, features
@features
class Transaction:
id: str
date: date
day: int = F.day_of_week(_.date)
Compute the total number of seconds covered in a duration.
from datetime import date
import chalk.functions as F
from chalk.features import _, features
@features
class Transaction:
id: str
signup: date
last_login: date
signup_to_last_login_days: float = F.total_seconds(_.last_login - _.signup) / (60 * 60 * 24)
Extract a single-column DataFrame
into a list of values for that column.
from datetime import datetime
import chalk.functions as F
from chalk import DataFrame
from chalk.features import _, features
@features
class Merchant:
id: str
events: "DataFrame[FraudEvent]"
fraud_codes: list[str] = F.array_agg(_.events[_.is_fraud == True, _.tag])
@features
class FraudEvent:
id: int
tag: str
is_fraud: bool
mer_id: Merchant.id
Returns the first n items from a dataframe or has-many
from datetime import datetime
import chalk.functions as F
from chalk import windowed, DataFrame, Windowed
from chalk.features import _, features, Primary
@features
class Merchant:
id: str
@features
class ConfirmedFraud:
id: int
trn_dt: datetime
is_fraud: int
mer_id: Merchant.id
@features
class MerchantFraud:
mer_id: Primary[Merchant.id]
merchant: Merchant
confirmed_fraud: DataFrame[ConfirmedFraud] = dataframe(
lambda: ConfirmedFraud.mer_id == MerchantFraud.mer_id,
)
first_five_merchant_window_fraud: Windowed[list[int]] = windowed(
"1d",
"30d",
expression=F.head(_.confirmed_fraud[_.trn_dt > _.chalk_window, _.id, _.is_fraud == 1], 5)
)
Returns a subset of the original array
Starting index of the slice (0-indexed). If negative, slice starts from the end of the array
from datetime import datetime
import chalk.functions as F
from chalk.features import _, features
@features
class Wordle:
id: str
words: list[str] = ["crane", "kayak", "plots", "fight", "exact", "zebra", "hello", "world"]
three_most_recent_words: list[str] = F.slice(_.words, -3, 3) # computes ["zebra", "hello", "world"]
Runs a sagemaker prediction on the specified endpoint, passing in the serialized bytes as a feature.
The content type of the input data. If not specified, the content type will be inferred from the endpoint.
An optional argument which specifies the target model for the prediction. This should only be used for multi-model SageMaker endpoints.
An optional argument which specifies the target variant for the prediction. This should only be used for multi-variant SageMaker endpoints.
import chalk.functions as F
from chalk.features import _, features
@features
class User:
id: str
encoded_sagemaker_data: bytes
prediction: float = F.sagemaker_predict(
_.encoded_sagemaker_data,
endpoint="prediction-model_1.0.1_2024-09-16",
target_model="model_v2.tar.gz",
target_variant="blue"
)
Create an offline query which runs on a schedule.
Scheduled queries do not produce datasets, but persist their results in the online and/or offline feature stores.
By default, scheduled queries use incrementalization to only ingest data that has been updated since the last run.
A unique name for the scheduled query. The name of the scheduled query will show up in the dashboard and will be used to set the incrementalization metadata.
A cron schedule or a Duration
object representing the interval at which
the query should run.
The features that this query will compute. Namespaces are exploded into all features in the namespace.
If set to None, Chalk will incrementalize resolvers in the query's root namespaces.
If set to a list of resolvers, this set will be used for incrementalization.
Incremental resolvers must return a feature time in their output, and must return a DataFrame
.
Most commonly, this will be the name of a SQL file resolver. Chalk will ingest all new data
from these resolvers and propagate changes to values in the root namespace.
A scheduled query object.
from chalk.features import ScheduledQuery
# this scheduled query will automatically run every 5 minutes after `chalk apply`
ScheduledQuery(
name="ingest_users",
schedule="*/5 * * * *",
output=[User],
store_online=True,
store_offline=True,
)
from chalk.monitoring import Chart, Series
Chart(name="Request count").with_trigger(
Series
.feature_null_ratio_metric()
.where(feature=User.fico_score) > 0.2,
)
Change the window period for a Chart
.
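A one-line sketch, assuming the method is named with_window_period:
from chalk.monitoring import Chart

Chart(name="Request count").with_window_period("30m")  # method name assumed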
Triggers are applied when a certain series is above or below a given value. The expression specifies the series, the operand, and the value, as in the example above.
A description of your Trigger
. Descriptions
provided here will be included in the alert message in
Slack or PagerDuty.
For Slack alerts, you can use the mrkdwn syntax described here: https://api.slack.com/reference/surfaces/formatting#basics
Class describing a series of data in two dimensions, as in a line chart. Series should be instantiated with one of the classmethods that specifies the metric to be tracked.
Creates a Series
of metric kind FeatureStaleness
.
The time window to calculate the metric over.
A new FeatureStalenessSeries
instance that inherits from the Series
class.
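For example, a sketch of alerting on staleness for a single feature; the classmethod name feature_staleness_metric, its window_period argument, and the numeric threshold are assumptions based on the naming pattern of feature_null_ratio_metric above.
from chalk.monitoring import Chart, Series

Chart(name="fico_score staleness").with_trigger(
    Series
    .feature_staleness_metric(window_period="10m")  # name and argument assumed
    .where(feature=User.fico_score) > 60,  # threshold is illustrative
)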
Creates a Series
of metric kind FeatureValue
.
The time window to calculate the metric over.
A new FeatureValueSeries
instance that inherits from the Series
class.
Creates a Series
of metric kind ResolverLatency
.
The time window to calculate the metric over.
A new ResolverLatencySeries
instance that inherits from the Series
class.
Creates a Series
of metric kind QueryLatency
.
The time window to calculate the metric over.
A new QueryLatencySeries
instance that inherits from the Series
class.
Creates a Series
of metric kind CronLatency
.
The time window to calculate the metric over.
A new CronLatencySeries
instance that inherits from the Series
class.
Creates a Series
of metric kind StreamMessageLatency
.
The time window to calculate the metric over.
A new StreamMessageLatencySeries
instance that inherits from the Series
class.
Creates a Series
of metric kind StreamWindowLatency
.
The time window to calculate the metric over.
A new StreamWindowLatencySeries
instance that inherits from the Series
class.
Creates a Series
of metric kind StreamLag
.
The time window to calculate the metric over.
A new StreamLagSeries
instance that inherits from the Series
class.
Given two DataFrame
s, left
and right
, check that left == right, and raise an error if they are not equal.
If False
, allows the assert/test to succeed if the required columns are present,
irrespective of the order in which they appear.
If False
, allows the assert/test to succeed if the required rows are present,
irrespective of the order in which they appear; as this requires
sorting, you cannot set on frames that contain un-sortable columns.
If left
does not equal right
If chalkpy[runtime]
is not installed.
from datetime import datetime, timedelta, timezone
from chalk import online, DataFrame
from chalk.features import after
# freeze_time is assumed to come from a time-freezing test utility
# (e.g. the freezegun package); substitute your project's equivalent.

@online
def get_average_spend_30d(
spend: User.cards[after(days_ago=30)],
) -> User.average_spend_30d:
return spend.mean()
with freeze_time(datetime(2021, 1, 1, tzinfo=timezone.utc)):
now = datetime.now(tz=timezone.utc)
get_average_spend_30d(
spend=DataFrame([
Card(spend=10, ts=now - timedelta(days=31)),
Card(spend=20, ts=now - timedelta(days=29)),
Card(spend=30, ts=now - timedelta(days=28)),
])
)