For model training, you often need to retrieve feature values as they were observed at a specific point in the past, such as the moment a decision was made.

Chalk performs point-in-time lookups on your training data so that you can train your models knowing that they won’t receive information about the future, even across complex relationships.

Sampling Historical Values

Say you have features Business.sales and Business.cogs that represent the sales and cost of goods sold for a business, in millions of dollars:

from chalk.features import features, FeatureTime

@features
class Business:
    id: int
    sales: float
    cogs: float
    ts: FeatureTime

Over time, you’ve been issuing loans to businesses, and you want to review your loan book to see if you could have made better decisions about which businesses were credit-worthy.

For example, maybe you gave loans to the business with id=123 at two times, t1 and t2.


In training, you want to know the observed sales and COGS for the business at the time you made the loan, without knowing the future values of those features.

For example, it would be unfair to allow yourself to see the impending drop in sales from $1.3M to $1M when considering what your new model would have done at time t1.

You can use Chalk’s Python Client to sample the values of Business.sales and Business.cogs at the time of the loans:

from datetime import datetime, timedelta

from chalk.client import ChalkClient

t1 = datetime.now() - timedelta(days=365)
t2 = datetime.now() - timedelta(days=30)

dataset = ChalkClient().offline_query(
    input={
        # Sample a single business with id=123
        Business.id: [123, 123],
        # Sample the same business at two times: t1 and t2
        Business.ts: [t1, t2],
    },
    # Sample all features of business.
    # Alternatively, sample only the features you need:
    #   output=[Business.sales, Business.cogs]
    output=[Business],
)

Running this query will result in a Dataset with the following values:

Feature          Value at t1 ($M)   Value at t2 ($M)
Business.sales   1.3                1
Business.cogs    0.5                0.4

Each of these values occurred at or before the sample time and is valid to use in training.
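
To make the "at or before the sample time" rule concrete, here is a minimal pure-Python sketch of a point-in-time lookup. It is not Chalk's implementation. The helper name value_as_of, the observation dates, and the sample times are all hypothetical and chosen only to mirror the example above.

```python
from bisect import bisect_right
from datetime import datetime


def value_as_of(observations, sample_time):
    """Return the latest value observed at or before sample_time, else None.

    observations: list of (observed_at, value) pairs sorted by observed_at.
    """
    times = [observed_at for observed_at, _ in observations]
    # bisect_right counts the observations with observed_at <= sample_time.
    idx = bisect_right(times, sample_time)
    return observations[idx - 1][1] if idx else None


# Hypothetical sales history for business 123, in millions of dollars.
sales_history = [
    (datetime(2023, 1, 1), 1.3),
    (datetime(2023, 9, 1), 1.0),
]

t1 = datetime(2023, 3, 1)   # falls after the first observation only
t2 = datetime(2023, 12, 1)  # falls after both observations

sales_at_t1 = value_as_of(sales_history, t1)
sales_at_t2 = value_as_of(sales_history, t2)
```

A sample taken before the first observation returns None, because no value had been observed yet at that time.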


Temporal consistency is especially difficult when you want to build new features. Continuing the example from the previous section, imagine you’ve observed Business.sales and Business.cogs many times in the past, and for each of the businesses that you track, these values have changed over time.

Now, you want to compute a new feature, Business.gross_profit, which is the difference between Business.sales and Business.cogs. You can do this by writing a function get_rev that takes Business.sales and Business.cogs as arguments and returns Business.gross_profit:

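from chalk import online
from chalk.features import features

@features
class Business:
    sales: float
    cogs: float
    gross_profit: float

@online
def get_rev(
    sales: Business.sales,
    cogs: Business.cogs,
) -> Business.gross_profit:
    # Gross profit is sales minus cost of goods sold.
    return sales - cogs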

If you deploy this resolver with chalk apply, you’ll start calculating Business.gross_profit correctly on an ongoing basis. However, you won’t yet have values of Business.gross_profit in the past. You may want to pretend that the resolver get_rev was always computing the gross profit of the businesses you track, so that you can train your models on the historical values of Business.gross_profit.

To do that, you can run a backfill against all of your data, or against only the samples you want to observe. For example, to compute Business.gross_profit at the times t1 and t2, Chalk would pull the latest value of each feature that occurred at or before the sample time:

At t1: (Business.sales, Business.cogs) = (1.3, 0.5)
At t2: (Business.sales, Business.cogs) = (1, 0.4)

Then, Chalk would run the get_rev resolver with the sampled values:

Sample        Business.sales   Business.cogs   Function call       Business.gross_profit
id=123 @ t1   1.3              0.5             get_rev(1.3, 0.5)   0.8
id=123 @ t2   1.0              0.4             get_rev(1.0, 0.4)   0.6

The resulting values for Business.gross_profit would be stored as having occurred at the latest observed time of all the sample inputs.

For the sample at t1, the observed-at time for Business.gross_profit would be the time at which Business.sales was seen to be 1.3, the most recently observed input at that point.

At t2, the observed-at time for Business.gross_profit would be the time at which Business.cogs was seen to be 0.4, as Business.cogs was observed more recently than Business.sales.
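
The sampling and timestamping rules described above can be sketched in plain Python. This is an illustration under stated assumptions, not how Chalk runs backfills internally. The helpers latest_at_or_before and backfill_gross_profit are hypothetical, get_rev is written here as a bare function without Chalk decorators, and the observation dates are invented so that sales is the more recent input at t1 and cogs is the more recent input at t2.

```python
from datetime import datetime


def latest_at_or_before(observations, sample_time):
    """Return the (observed_at, value) pair with the greatest observed_at <= sample_time."""
    eligible = [obs for obs in observations if obs[0] <= sample_time]
    return max(eligible, key=lambda obs: obs[0]) if eligible else None


def get_rev(sales, cogs):
    # Same arithmetic as the resolver: gross profit is sales minus COGS.
    return sales - cogs


def backfill_gross_profit(sales_obs, cogs_obs, sample_time):
    """Return (observed_at, gross_profit) for one sample, or None if an input is missing."""
    sales = latest_at_or_before(sales_obs, sample_time)
    cogs = latest_at_or_before(cogs_obs, sample_time)
    if sales is None or cogs is None:
        return None
    # The derived value is stamped with the latest observed time among its inputs.
    observed_at = max(sales[0], cogs[0])
    return observed_at, get_rev(sales[1], cogs[1])


# Hypothetical observation histories, in millions of dollars.
sales_obs = [(datetime(2023, 2, 1), 1.3), (datetime(2023, 8, 1), 1.0)]
cogs_obs = [(datetime(2023, 1, 1), 0.5), (datetime(2023, 10, 1), 0.4)]

t1 = datetime(2023, 3, 1)
t2 = datetime(2023, 12, 1)

result_t1 = backfill_gross_profit(sales_obs, cogs_obs, t1)  # stamped with the sales observation time
result_t2 = backfill_gross_profit(sales_obs, cogs_obs, t2)  # stamped with the cogs observation time
```

Stamping the derived value with the latest input time means it never appears to have existed before all of its inputs did.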

As you start nesting more resolvers, or using has-many relationships, this can become even more complex and error-prone without a framework managing the temporal consistency of your data.

Back-filling time-aware data

You can also backfill time-aware data into Chalk. For example, you may have events tables that track data changes over time. To do so, you can use feature time to specify the time at which the data was observed.

from chalk.features import FeatureTime, features

@features
class Business:
    sales: float
    cogs: float
    gross_profit: float
    ts: FeatureTime

If you provide the ts feature of Business when you ingest data, Chalk will use that value to determine the time at which the data was observed.

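from chalk import offline
from chalk.sql import SnowflakeSource

db = SnowflakeSource()

# Run on an hourly schedule.
@offline(cron="1h")
def get_historical() -> Business:
    ...  # Query db for business rows, including the ts column (query omitted).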

Every hour, Chalk will run the get_historical resolver to check for new data. If it finds new data, it will use the ts column to determine the time at which the data was observed.
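
As a rough illustration of why the ts column matters, the sketch below records each ingested row under its own ts rather than under the time it was loaded. This is plain Python, not Chalk's ingestion code. The ingest helper, the in-memory store, the row shape, and the fallback to the ingestion time when ts is absent are all assumptions made for the example.

```python
from datetime import datetime


def ingest(store, rows, ingested_at):
    """Append each row's sales value to store[id], timestamped by the row's ts.

    Assumption: rows without a ts fall back to ingested_at.
    """
    for row in rows:
        observed_at = row.get("ts") or ingested_at
        store.setdefault(row["id"], []).append((observed_at, row["sales"]))
    # Keep each history ordered by observation time so point-in-time lookups stay simple.
    for history in store.values():
        history.sort(key=lambda obs: obs[0])


# Hypothetical rows from an events table, loaded long after they occurred.
rows = [
    {"id": 123, "sales": 1.0, "ts": datetime(2023, 8, 1)},
    {"id": 123, "sales": 1.3, "ts": datetime(2023, 2, 1)},
]

store = {}
ingest(store, rows, ingested_at=datetime(2024, 1, 1))
```

Because each value keeps its original observation time, a later sample at a 2023 timestamp can still see it, even though the rows were loaded in 2024.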

Enforcing a TTL

Features in the offline store can optionally have a ttl (time to live) applied. If a feature has a ttl, its value will never be returned for any sample time later than FeatureTime + ttl. For example, perhaps you don’t want to return credit scores that were observed more than a year ago. In that case, the following feature class will return None instead of the last observed credit score if that score is more than one year old.

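from datetime import timedelta
from chalk.features import feature, features
@features
class User:
    id: str
    credit_score: int = feature(offline_ttl=timedelta(days=365))  # expire after one year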
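
The ttl rule can be sketched as a small extension of a point-in-time lookup. This is a hypothetical pure-Python helper, not Chalk's implementation. The name value_with_ttl, the credit score history, and the sample times are invented for illustration.

```python
from datetime import datetime, timedelta


def value_with_ttl(observations, sample_time, ttl):
    """Latest value observed at or before sample_time, or None if it is older than ttl."""
    eligible = [obs for obs in observations if obs[0] <= sample_time]
    if not eligible:
        return None
    observed_at, value = max(eligible, key=lambda obs: obs[0])
    # The value is valid only through observed_at + ttl.
    if sample_time > observed_at + ttl:
        return None
    return value


# Hypothetical credit score history for one user.
credit_scores = [(datetime(2021, 6, 1), 710)]
ttl = timedelta(days=365)

fresh = value_with_ttl(credit_scores, datetime(2022, 1, 1), ttl)  # within the ttl window
stale = value_with_ttl(credit_scores, datetime(2023, 1, 1), ttl)  # past the ttl window
```

Here the first sample falls inside the one-year window and gets the observed score, while the second falls outside it and gets None.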