Offline Queries
Point-in-time queries for training data sets.
For model training, you need to pull past feature values as they existed at the point in time when each observation was made.
Chalk performs point-in-time lookups on your training data so that you can train your models knowing that they won’t receive information about the future, even across complex relationships.
Say you have features Business.sales and Business.cogs that represent the sales and cost of goods sold for a business, in millions of dollars:
from chalk.features import features, FeatureTime

@features
class Business:
    id: int
    sales: float
    cogs: float
    ts: FeatureTime
Over time, you’ve been issuing loans to businesses, and you want to review your loan book to see if you could have made better decisions about which businesses were creditworthy.
For example, maybe you gave loans to the business with id=123 at times t1 and t2.
In training, you want to know the observed sales and COGS for the business at the time you made each loan, without knowing the future values of those features. For example, it would be unfair to allow ourselves to see the impending sales drop from $1.3M to $1M when considering what our new model would have done at time t1.
You can use Chalk’s Python Client to sample the values of Business.sales and Business.cogs at the time of the loans:
from datetime import datetime, timedelta

from chalk.client import ChalkClient

t1 = datetime.now() - timedelta(days=365)
t2 = datetime.now() - timedelta(days=30)

dataset = ChalkClient().offline_query(
    input={
        # Sample the single business with id=123
        Business.id: [123, 123],
        # at two times: t1 and t2.
        Business.ts: [t1, t2],
    },
    # Sample all features of Business.
    # Alternatively, sample only the features you need:
    #   output=[Business.sales, Business.cogs]
    output=[Business],
)
Running this query will result in a Dataset with the following values:
Feature | Value at t1 | Value at t2
---|---|---
Business.sales | 1.3 | 1.0
Business.cogs | 0.5 | 0.4
Each of these values occurred at or before the sample time and is valid to use in training.
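To work with these values, you can load the resulting Dataset into a DataFrame. A minimal sketch, assuming your chalkpy version exposes Dataset.to_pandas() and fully qualified column names such as business.sales:

df = dataset.to_pandas()  # assumed API; confirm in your chalkpy version

# Each row corresponds to one (id, ts) pair from the input.
# Column names are assumed to be fully qualified feature names.
print(df[["business.sales", "business.cogs"]])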
Temporal consistency is especially difficult when you want to build new features. Continuing the example from the previous section, imagine you’ve observed Business.sales and Business.cogs many times in the past, and for each of the businesses that you track, these values have changed over time. Now, you want to compute a new feature, Business.gross_profit, which is the difference between Business.sales and Business.cogs. You can do this by writing a function get_rev that takes Business.sales and Business.cogs as arguments and returns Business.gross_profit:
from chalk import online
from chalk.features import features

@features
class Business:
    id: int
    sales: float
    cogs: float
    gross_profit: float

@online
def get_rev(
    sales: Business.sales,
    cogs: Business.cogs,
) -> Business.gross_profit:
    return sales - cogs
If you deploy this resolver with chalk apply, you’ll start calculating Business.gross_profit correctly on an ongoing basis. However, you won’t yet have values of Business.gross_profit in the past. You may want to pretend that the resolver get_rev was always computing the gross profit of the businesses you track, so that you can train your models on the historical values of Business.gross_profit. To do that, you can run a migration against all of your data, or against only the samples you want to observe, as in the sketch below.
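One way to run that backfill from the Python client is to re-run the resolver over historical samples. This is a sketch, not the only migration path; the recompute_features argument is an assumption to confirm against your chalkpy version:

# Recompute Business.gross_profit at the historical sample times
# by re-running get_rev against point-in-time inputs.
# `recompute_features` is an assumed argument; verify that your
# chalkpy version supports it.
dataset = ChalkClient().offline_query(
    input={
        Business.id: [123, 123],
        Business.ts: [t1, t2],
    },
    output=[Business.gross_profit],
    recompute_features=[Business.gross_profit],
)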
For example, suppose you had observed the Business.sales and Business.cogs features at various points in the past and wanted to compute Business.gross_profit at the times t1 and t2. Chalk would pull the latest value of each feature that occurred before the sample time.
Then, Chalk would run the get_rev resolver with the sampled values:
Sample | Business.sales | Business.cogs | Function call | Business.gross_profit
---|---|---|---|---
id=123 @ t1 | 1.3 | 0.5 | get_rev(1.3, 0.5) | 0.8
id=123 @ t2 | 1.0 | 0.4 | get_rev(1.0, 0.4) | 0.6
The resulting values for Business.gross_profit would be stored as having occurred at the latest observed time of all the sample inputs. For the sample at t1, the observed-at time for Business.gross_profit would be the time at which Business.sales was seen to be 1.3. At t2, the observed-at time for Business.gross_profit would be the time at which Business.cogs was seen to be 0.4, as Business.cogs was observed more recently than Business.sales.
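The observed-at rule itself is just a max over the input observation times. Here is a purely illustrative Python sketch, not part of the Chalk API, with hypothetical timestamps:

from datetime import datetime

# Hypothetical observation times for the t1 sample.
sales_observed_at = datetime(2023, 1, 10)  # when sales=1.3 was seen
cogs_observed_at = datetime(2023, 1, 4)    # when cogs=0.5 was seen

# The derived feature inherits the latest of its inputs' observed times.
gross_profit_observed_at = max(sales_observed_at, cogs_observed_at)
assert gross_profit_observed_at == sales_observed_at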
As you start nesting more resolvers, or using has-many relationships, this can become even more complex and error-prone without a framework managing the temporal consistency of your data.
You can also backfill time-aware data into Chalk. For example, you may have events tables that track how data has changed over time. To backfill them, use a FeatureTime feature to specify the time at which each row was observed.
from chalk.features import FeatureTime, features

@features
class Business:
    id: int
    sales: float
    cogs: float
    gross_profit: float
    ts: FeatureTime
If you provide the ts feature of Business when you ingest data, Chalk will use that value to determine the time at which the data was observed.
from chalk import offline
from chalk.features import DataFrame
from chalk.sql import SnowflakeSource

db = SnowflakeSource()

@offline(cron='1h')
def get_historical() -> DataFrame[Business]:
    return db.query_string(
        """
        SELECT
            id,
            sales,
            cogs,
            gross_profit,
            ts
        FROM
            business
        """
    ).incremental()
Every hour, Chalk will run the get_historical resolver to check for new data. If it finds new data, it will use the ts column to determine the time at which the data was observed.
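If you need to explicitly name the timestamp column that incremental ingestion filters on, you can pass it to incremental(). A sketch: both the incremental_column parameter and the resolver name below are assumptions to verify against your chalkpy version:

@offline(cron='1h')
def get_historical_explicit() -> DataFrame[Business]:
    # `incremental_column` is an assumed parameter naming the column
    # used to filter for new rows since the last run.
    return db.query_string(
        "SELECT id, sales, cogs, gross_profit, ts FROM business"
    ).incremental(incremental_column="ts")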
Features in the offline store can optionally have a ttl (time to live) applied. If a feature has a ttl, it will never be returned at any time later than its FeatureTime plus the ttl. For example, suppose you don’t want to return credit scores that were observed more than a year ago. In that case, the following feature class will return None instead of the last observed credit score whenever that score is older than one year:
from datetime import timedelta

from chalk.features import feature, features

@features
class User:
    id: str
    # timedelta has no `years` argument, so express one year as days.
    credit_score: int = feature(offline_ttl=timedelta(days=365))
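Querying the offline store then yields None once the last observation has aged past the ttl. A brief sketch; the id value is hypothetical:

from chalk.client import ChalkClient

# If User.credit_score was last observed more than 365 days before
# the query time, the offline store returns None for this feature
# rather than the stale value.
dataset = ChalkClient().offline_query(
    input={User.id: ["u_123"]},  # hypothetical id
    output=[User.credit_score],
)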