Time
Point-in-time queries for training data sets.
Temporal consistency is crucial for ensuring your model’s training performance is reflective of production. When creating datasets for model training, your system must consistently retrieve data as it would appear at a specific point in time. If your model trains on data that was retrieved after the point where it would have made a prediction, it will not perform consistently in production.
Chalk performs point-in-time lookups on your training data so you can train your models knowing they won’t receive data from the future, even across complex relationships.
In this scenario, we will consider a machine learning model for determining whether to issue loans to businesses based on their financial performance.
For each business, we track sales and COGS (cost of goods sold) over time in millions of dollars. We define a Business
feature class with sales
and cogs
features:
from chalk.features import features, FeatureTime
@features
class Business:
id: int
sales: float
cogs: float
ts: FeatureTime
After collecting historical data on the performance of businesses we previously considered, we want to retrain our model.
For a business with id=123
, we gave loans at t1
and t2
:
In training, you want to know the observed gross profit and COGS for the business at the time you made the loan without knowing the future values of those features.
For example, at t1, when we issued the loan, we had observed sales of $1.3M. It would be inconsistent for this model to
train with the $1M data point because this data would not have existed at t1
.
At t1
and t2
, the temporally consistent values of sales
and cogs
would be as follows:
Feature | Value at t1 | Value at t2 |
---|---|---|
Business.sales | 1.3 | 1 |
Business.cogs | 0.5 | 0.4 |
Each of these values occurred at or before the sample time and is valid to use in training.
Chalk allows you to control the time your query considers to be “now”, which is covered in greater detail in our time documentation.
To set the query’s “now” time, pass input_times
to
your offline queries:
from chalk.client import ChalkClient
t1 = datetime.now() - timedelta(days=365)
t2 = datetime.now() - timedelta(days=30)
dataset = ChalkClient().offline_query(
input={
# Sample business 123 twice because we have two input_times
Business.id: [123, 123],
},
# Each element of `input_times` corresponds to the element
# with the same index in `input`
input_times=[t1, t2],
# Sample all features of business.
# Alternatively, sample only the features you need:
# output=[Business.sales, Business.cogs]
output=[Business],
)
This query will return a Dataset with temporally consistent values for each input time.
Temporal consistency is especially difficult when you want to build new features. Building on the first scenario,
imagine you’ve observed Business.sales
and Business.cogs
many times in the past, and for each of the businesses that
you track, these values have changed over time.
Now, you want to compute a new feature, Business.gross_profit
, which is the difference between Business.sales
and
Business.cogs
. We can add an underscore expression to compute this new feature:
from chalk.features import features, FeatureTime
from chalk.features import _, features, FeatureTime
@features
class Business:
id: int
sales: float
cogs: float
ts: FeatureTime
gross_profit: float = _.sales - _.cogs
With the same historical data from the previous scenario, we can determine the correct inputs for computing
Business.gross_profit
:
Business.gross_profit
at t1
and t2
would be computed with the following values:
Query time | Business.sales | Business.cogs | Business.gross_profit |
---|---|---|---|
t1 | 1.3 | 0.5 | 0.8 |
t2 | 1.0 | 0.4 | 0.6 |
In your code, you can compute historical values for updated features by running an offline query with
recompute_features=True
. You may also consider one of our
other backfill options for other use cases.
As you start layering more resolvers or using has-many relationships, temporal consistency becomes even more difficult to reason about. Chalk manages the complexity under the hood so that you can write any query and expect temporally consistent results.