Temporal Consistency

For model training, you often need to pull past observations of feature values as they were at the point in time when each observation was made.

Chalk performs point-in-time lookups on your training data so that you can train your models knowing that they won’t receive information about the future, even across complex relationships.

Sampling Historical Values

Say you have features Business.sales and Business.cogs that represent the sales and cost of goods sold for a business, in millions of dollars:

from chalk.features import features, FeatureTime

@features
class Business:
    id: int
    sales: float
    cogs: float
    ts: FeatureTime

Over time, you’ve been issuing loans to businesses, and you want to review your loan book to see if you could have made better decisions about which businesses were credit-worthy.

For example, maybe you gave loans to the business with id=123 at times t1 and t2:

[Figure: a timeline of observations for the business with id=123. Business.sales is observed at 1.3 and later at 1; Business.cogs is observed at 0.5, 0.3, and 0.4. The sample times t1 and t2 are marked on the time axis.]

In training, you want to know the observed sales and COGS for the business at the time you made the loan, without knowing the future values of those features.

For example, it would be unfair to allow ourselves to see the impending sales drop from $1.3M to $1M when considering what our new model would have done at time t1.

You can use Chalk’s Python Client to sample the values of Business.sales and Business.cogs at the time of the loans:

from datetime import datetime, timedelta

from chalk.client import ChalkClient

t1 = datetime.now() - timedelta(days=365)
t2 = datetime.now() - timedelta(days=30)

dataset = ChalkClient().offline_query(
    input={
        # Sample a single business with id=123
        Business.id: [123, 123],
        # Sample the same business at two times: t1 and t2
        Business.ts: [t1, t2],
    },
    # Sample all features of Business.
    # Alternatively, sample only the features you need:
    #   output=[Business.sales, Business.cogs]
    output=[Business],
)

Running this query will result in a Dataset with the following values:

Feature           Value at t1    Value at t2
Business.sales    1.3            1.0
Business.cogs     0.5            0.4

Each of these values occurred at or before the sample time and is valid to use in training.
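
From here, you can hand the sampled values to your training pipeline. A minimal sketch, assuming the returned Dataset exposes a to_pandas() helper and that columns are named with fully qualified feature names (check the Python client reference for the exact accessors in your client version):

# Materialize the point-in-time samples as a pandas DataFrame.
df = dataset.to_pandas()

# Each row corresponds to one (Business.id, Business.ts) pair from the input,
# with feature values as of that timestamp, safe to use as training inputs.
X = df[["business.sales", "business.cogs"]]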


Migrations

Temporal consistency is especially difficult when you want to build new features. Continuing the example from the previous section, imagine you’ve observed Business.sales and Business.cogs many times in the past, and for each of the businesses that you track, these values have changed over time.

Now, you want to compute a new feature, Business.gross_profit, which is the difference between Business.sales and Business.cogs. You can do this by writing a function get_rev that takes Business.sales and Business.cogs as arguments and returns Business.gross_profit:

from chalk import online
from chalk.features import features

@features
class Business:
    id: int
    sales: float
    cogs: float
    gross_profit: float

@online
def get_rev(
    sales: Business.sales,
    cogs: Business.cogs,
) -> Business.gross_profit:
    return sales - cogs

If you deploy this resolver with chalk apply, you’ll start calculating Business.gross_profit correctly on an ongoing basis. However, you won’t yet have values of Business.gross_profit in the past. You may want to pretend that the resolver get_rev was always computing the gross profit of the businesses you track, so that you can train your models on the historical values of Business.gross_profit.

To do that, you can run a migration against all of your data, or against only the samples you want to observe. For example, if you had observed the Business.sales and Business.cogs features shown below and wanted to compute Business.gross_profit at the times t1 and t2, Chalk would pull the latest value of each feature observed at or before each sample time:

[Figure: the same timeline of Business.sales (1.3, then 1) and Business.cogs (0.5, 0.3, then 0.4) observations for id=123. At the sample times t1 and t2, Chalk selects the latest prior value of each feature, yielding the input pairs (1.3, 0.5) and (1, 0.4).]

Then, Chalk would run the get_rev resolver with the sampled values:

Sample         Business.sales    Business.cogs    Function call        Business.gross_profit
id=123 @ t1    1.3               0.5              get_rev(1.3, 0.5)    0.8
id=123 @ t2    1.0               0.4              get_rev(1.0, 0.4)    0.6

The resulting values for Business.gross_profit would be stored as having occurred at the latest observed time of all the sample inputs.

For the sample at t1, the observed-at time for Business.gross_profit would be the time at which Business.sales was observed to be 1.3 (the later of the two input observations).

At t2, the observed-at time for Business.gross_profit would be the time at which Business.cogs was observed to be 0.4, as Business.cogs was observed more recently than Business.sales.

As you start nesting more resolvers, or using has-many relationships, this can become even more complex and error-prone without a framework managing the temporal consistency of your data.
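
To make the bookkeeping concrete, here is a rough, framework-free sketch of the point-in-time selection described above for a single pair of features. The helper names are hypothetical and exist only to illustrate the logic Chalk manages for you:

from datetime import datetime
from typing import Optional

# Hypothetical illustration: each observation is an (observed_at, value) pair.
Observation = tuple[datetime, float]

def as_of(observations: list[Observation], sample_time: datetime) -> Optional[Observation]:
    """Return the latest observation at or before sample_time, if any."""
    eligible = [o for o in observations if o[0] <= sample_time]
    return max(eligible, key=lambda o: o[0]) if eligible else None

def backfill_gross_profit(
    sales_obs: list[Observation],
    cogs_obs: list[Observation],
    sample_time: datetime,
) -> Optional[Observation]:
    sales = as_of(sales_obs, sample_time)
    cogs = as_of(cogs_obs, sample_time)
    if sales is None or cogs is None:
        return None
    # The backfilled value is stamped with the latest observed time of its
    # inputs, matching the behavior described above.
    return max(sales[0], cogs[0]), sales[1] - cogs[1]

For the t1 sample above, backfill_gross_profit would return the observed-at time of the 1.3 sales observation together with the value 0.8.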


Back-filling time-aware data

You can also backfill time-aware data into Chalk. For example, you may have events tables that track data changes over time. To do so, you can use a FeatureTime feature to specify the time at which each row was observed.

from chalk.features import FeatureTime, features

@features
class Business:
    id: int
    sales: float
    cogs: float
    gross_profit: float
    ts: FeatureTime

If you provide the ts feature of Business when you ingest data, Chalk will use that value to determine the time at which the data was observed.

from chalk import offline
from chalk.features import DataFrame
from chalk.sql import SnowflakeSource

db = SnowflakeSource()

@offline(cron='1h')
def get_historical() -> DataFrame[Business]:
    return db.query_string(
        """
        SELECT
            id,
            sales,
            cogs,
            gross_profit,
            ts
        FROM
            business
        """
    ).incremental()

Every hour, Chalk will run the get_historical resolver to check for new data. If it finds new data, it will use the ts column to determine the time at which the data was observed.
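
The same idea applies to Python resolvers: if the rows a resolver returns include ts, Chalk records those values as of the provided times rather than the ingestion time. The sketch below is illustrative only; it assumes the Business class defined above, that chalk.features.DataFrame can be built from a list of Business rows, and that the hard-coded values stand in for whatever historical source you actually read from:

from datetime import datetime, timezone

from chalk import offline
from chalk.features import DataFrame

# Sketch: backfill two historical observations for the business with id=123,
# each stamped with an explicit observation time via the ts field.
@offline
def backfill_business_history() -> DataFrame[Business]:
    return DataFrame([
        Business(id=123, sales=1.3, cogs=0.5,
                 ts=datetime(2022, 1, 1, tzinfo=timezone.utc)),
        Business(id=123, sales=1.0, cogs=0.4,
                 ts=datetime(2022, 11, 1, tzinfo=timezone.utc)),
    ])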

Enforcing a TTL

Features in the offline store can optionally have a ttl (time to live) applied. If a feature has a ttl, it will never be returned at any time later than its FeatureTime plus the ttl. For example, suppose you don’t want to return credit scores that were observed more than a year ago. In that case, the following feature class will return None instead of the last observed credit score whenever that score is older than one year.

from datetime import timedelta

from chalk.features import feature, features

@features
class User:
    id: str
    # timedelta has no "years" argument, so a one-year TTL is expressed in days
    credit_score: int = feature(offline_ttl=timedelta(days=365))
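
To see the TTL in action, you could sample User.credit_score more than a year after its last observation using the same offline_query pattern shown earlier. This sketch assumes User also declares a ts: FeatureTime field, as Business does above; the id and timestamp are illustrative:

from datetime import datetime, timezone

from chalk.client import ChalkClient

# If the user's credit_score was last observed more than a year before the
# sample time, the offline_ttl causes None to be returned instead of the
# stale value.
dataset = ChalkClient().offline_query(
    input={
        User.id: ["u_123"],
        User.ts: [datetime(2024, 6, 1, tzinfo=timezone.utc)],
    },
    output=[User.credit_score],
)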