Updates
Improvements to Chalk are published here! See our public roadmap for upcoming changes.
The Offline Query API has been enhanced with a new recompute_features parameter, giving users control over which features are sampled from the offline store and which are recomputed. Passing False maintains current behavior, returning only samples from the offline store. Passing True ignores the offline store and executes @online and @offline resolvers to produce the requested output. Passing a list of features recomputes those features by running @online and @offline resolvers, while all other feature values - including those needed to recompute the requested features - are sampled from the offline store.
The ‘recompute’ capability is also exposed on Dataset. When passed a list of features to recompute, a new Dataset Revision will be generated, and the existing dataset will be used as input to recompute the requested features.
Chalk has introduced a new workflow when [working with branches](/docs/branches), allowing full iterations to take place directly in any IPython notebook. When a user creates a Chalk Client with a branch in a notebook, subsequent features and resolvers defined in the notebook will be deployed to that branch. When combined with Recompute Dataset and the enhancements to Offline Query, users have a new development loop available for feature exploration and development, sketched below:
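A minimal sketch of that loop in a notebook, assuming an illustrative branch name and feature class:
from chalk.client import ChalkClient
from chalk.features import features

# Features and resolvers defined in cells after this one are deployed
# to the "nb-exploration" branch.
client = ChalkClient(branch="nb-exploration")

@features
class User:
    id: int
    # Defined (and iterated on) in the notebook; deployed to the branch.
    fraud_score: float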
Deployments now offer the ability to view their source code. By clicking the “View Source” button on the Deployment Detail page, users can view all files included in the deployed code.
Users can now “redeploy” any historical deployment with a UI button on the deployment details page. This enables useful workflows, including rollbacks.
The “download source” button downloads a tarball containing the deployed source to your local machine.
When writing resolvers, incorrect typing can be difficult to track down. Now, if a resolver instantiates a feature of an incorrect type, the resolver error message will include the primary key value(s) of the query itself.
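For instance, with an illustrative resolver that returns the wrong type:
from chalk import online

@online
def get_email(uid: User.id) -> User.email:
    # Bug: User.email expects a str. The error message for this failure
    # now includes the offending User.id value(s) from the query.
    return 42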
The Online Query API can now be used to query DataFrame-typed features. For instance, you can query all of a user’s transaction-level features in a single query:
chalk query --in user.id --out user.transactions
{
  "columns": ["transaction.id", "transaction.user_id", ...],
  "values": [[1, 2, 3, ...], ["user_1", "user_2", "user_3", ...]]
}
More functionality will be added to Online and Offline query APIs to support more advanced query patterns.
When deploying with chalk apply, a new flag --branch <branch_name> has been introduced which creates a branch deployment.
Users can interact with their branch deployment under a consistent name by passing the branch name to query, upload_features, etc. Chalk clients can also be scoped to a branch by passing the branch in the constructor, as sketched below.
Branch deployments are many times faster than other flavors of chalk apply, frequently taking only a few seconds from beginning to end.
Branch deployments replace preview deploys, which have been deprecated.
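A sketch of a branch-scoped client (the branch name and features are illustrative):
from chalk.client import ChalkClient

# All requests made through this client target the branch deployment.
client = ChalkClient(branch="my-branch")
client.query(input={User.id: 1}, output=[User.fraud_score])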
Deployments via chalk apply are now up to 50% faster in certain cases. If your project’s pip dependencies haven’t changed, new deployments will build and become active significantly faster than before.
Introduces a new offline_ttl property, set through the feature(...) function. Now you can control how long data is valid in the offline store: any feature value older than the TTL will not be returned in an offline query.
from datetime import datetime, timedelta
from chalk.features import feature, feature_time, features

@features
class MaxOfflineTTLFeatures:
    id: int
    ts: datetime = feature_time()

    no_offline_ttl_feature: int = feature(offline_ttl=timedelta(0))
    one_day_offline_ttl_feature: int = feature(offline_ttl=timedelta(days=1))
    infinite_ttl_feature: int
Adds the strict property to the feature(...) function, indicating that any failed validation will throw an error. Invalid features will never be written to the online or offline store if strict is True. Also introduces the validations array to allow differentiated strict and soft validations on the same feature.
from chalk.features import Validation, feature, features

@features
class ClassWithValidations:
    id: int
    name: int = feature(max=100, min=0, strict=True)
    feature_with_two_validations: int = feature(
        validations=[
            Validation(min=70, max=100),
            Validation(min=0, max=100, strict=True),
        ]
    )
The Dataset class is now live! Using the new ChalkClient.offline_query method, we can inspect important metadata about the query and retrieve its output data in a variety of ways. Simply attach a dataset_name to the query to persist the results.
from datetime import datetime

import pandas as pd

from chalk.client import ChalkClient, Dataset

uids = [1, 2, 3, 4]
at = datetime.now()
dataset: Dataset = ChalkClient().offline_query(
    input={
        User.id: uids,
        User.ts: [at] * len(uids),
    },
    output=[
        User.id,
        User.fullname,
        User.email,
        User.name_email_match_score,
    ],
    dataset_name='my_dataset',
)
pandas_df: pd.DataFrame = dataset.data_as_pandas
Check out the documentation here.
Chalk now provides access to build and boot logs through the Deployments page in the dashboard.
Computing features associated with third-party services can be unpredictably slow. Chalk helps you manage such uncertainty by letting you specify a resolver timeout duration. Now you can set timeouts for resolvers!
from chalk import online

@online(timeout="200ms")
def resolve_australian_credit_score(driver_id: User.driver_id_aus) -> User.credit_score_aus:
    return experian_client.get_score(driver_id)
SQL-integrated resolvers can be written entirely in SQL files: no Python required! If you have a SQL source like the following:
from chalk.sql import PostgreSQLSource

pg = PostgreSQLSource(name='PG')
You can define a resolver in a .chalk.sql file, with comments that detail important metadata. Chalk will process it upon chalk apply as it would any other Python resolver.
-- type: online
-- resolves: user
-- source: PG
-- count: 1
select email, full_name from user_table where id=${user.id}
Check out the documentation here.
Logging on your dashboard has been improved. You can now scroll through more logs, and the formatting is cleaner and easier to use. This view is available for resolvers and resolver runs.
Online Query Response objects now support pretty-printing in any IPython environment.
chalkpy has always supported running in Docker images using M1’s native arm64 architecture, and now chalkpy==1.12.0 supports most functionality on M1 Macs when run with AMD64 (64-bit Linux) architecture Docker images. This is helpful when testing images built for Linux servers that include chalkpy.
Chalk has lots of documentation, and finding content can be difficult. We’ve added docs search! Try it out by typing cmd-K, or by clicking the search button at the top of the table of contents.
This update makes several improvements to feature discovery.
Tags and owners are now parsed from the comments preceding the feature definition.
@features
class RocketShip:
    # :tags: team:identity, priority:high
    # :owner: katherine.johnson@nasa.gov
    velocity: float
    ...
Prior to this update, owners and tags needed to be set in the feature(...) function:
@features
class RocketShip:
    velocity: float = feature(
        tags=["team:identity", "priority:high"],
        owner="katherine.johnson@nasa.gov",
    )
    ...
Feel free to choose either mechanism!
It’s natural to name the primary feature of a feature set id. So why do you always have to specify it? Until now, you needed to write:
@features
class User:
    id: str = feature(primary=True)
    ...
Now you don’t have to! If you have a feature class that does not have a feature with the primary field set, but has a feature called id, it will be assigned primary automatically:
@features
class User:
    id: str
    ...
The functionality from before sticks around: if you use a field as a primary key with a name other than id, you can keep using it as your primary feature:
@features
class User:
    user_id: str = feature(primary=True)

    # Not really the primary key!
    id: str
The Chalk DataFrame now supports boolean expressions! The Chalk team has worked hard to let you express your DataFrame transformations in natural, idiomatic Python:
DataFrame[
    User.first_name == "Eleanor" or (
        User.email == "eleanor@whitehouse.gov" and
        User.email_status not in {"deactivated", "unverified"}
    ) and User.birthdate is not None
]
Python experts will note that or, and, is, is not, not in, and not aren’t overloadable. So how did we do this? The answer is AST parsing! A more detailed blog post to follow.
This update makes several improvements to feature discovery.
Descriptions are now parsed from the comments preceding the feature definition. For example, we can document the feature User.fraud_score with a comment above the attribute definition:
@features
class User:
    # 0 to 100 score indicating an identity match.
    # Low scores indicate safer users
    fraud_score: float
    ...
Prior to this update, descriptions needed to be set in the feature(...) function:
@features
class UserFeatures:
    fraud_score: float = feature(description="""
        0 to 100 score indicating an identity match.
        Low scores indicate safer users
    """)
    ...
The description passed to feature(...) takes precedence over the implicit comment description.
You can now set attributes for all features in a namespace! Here, we assign the tag group:risk and the owner ravi@chalk.ai to all features on the feature class. Owners specified at the feature level take precedence (so the owner of User.email is the default ravi@chalk.ai, whereas the owner of User.flaky_api_result is devops@chalk.ai). Tags aggregate, so email has the tags pii and group:risk.
@features(tags="group:risk", owner="ravi@chalk.ai")
class User:
    email: str = feature(tags="pii")
    flaky_api_result: str = feature(owner="devops@chalk.ai")
You can configure Chalk to post messages to your Slack workspace! You can find the Slack integration tab in the settings page of your dashboard.
Slack can be used as an alert channel or for build notifications.
Chalk’s pip package now supports Python 3.8! With this change, you can use the Chalk package to run online and offline queries in a Python environment with version >= 3.8. Note that your features will still be computed on a runtime with Python version 3.10.
Chalk injects environment variables to support data integrations. But what happens when you have two data sources of the same kind? Historically, our recommendation was to create one set of environment variables through an official data source integration, and to create a second, prefixed set of environment variables yourself using the generic environment variable support.
With the release of named integrations, you can connect to as many data sources of the same kind as you need! Provide a name at the time of configuring your data source, and reference it in the code directly, as sketched below. Named integrations inject environment variables with the standard names prefixed by the integration name (e.g. RISK_PGPORT). The first integration of a given kind will also create the un-prefixed environment variable (i.e. both PGPORT and RISK_PGPORT).
Chalk is excited to announce the availability of our SOC 2 Type 1 report from Prescient Assurance. Chalk has instituted rigorous controls to ensure the security of customer data and earn the trust of our customers, but we’re always looking for more ways to improve our security posture, and to communicate these steps to our customers. This report is one step along our ongoing path of trust and security.
If you’re interested in reviewing this report, please contact support@chalk.ai to request a copy.
You can now convert Chalk’s DataFrame to a pandas.DataFrame and back! Use the methods chalk_df.to_pandas() and DataFrame.from_pandas(pandas_df).
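A minimal round-trip sketch (assumes chalk_df is an existing Chalk DataFrame, with from_pandas shown as a classmethod per the method names above):
import pandas as pd
from chalk.features import DataFrame

pandas_df: pd.DataFrame = chalk_df.to_pandas()  # Chalk -> pandas
chalk_df2 = DataFrame.from_pandas(pandas_df)    # pandas -> Chalk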
The 1.4.1 release of the CLI added a --sample parameter to chalk migrate. This flag allows migrations to be run targeting specific sample sets.
Added sparklines to the feature and resolver tables, showing a quick summary of request counts over the past 24 hours. Also added statuses to the feature and resolver tables, showing any failing checks related to a feature or resolver.