Time
Manage feature timestamps and point-in-time query timestamps.
Chalk uses two main timestamps you should be aware of as you build your Chalk project:
- Feature time (FeatureTime): The time at which a feature was observed, which is used in query time filters and aggregation time windows.
- Query time (Now): The time a query should assume is “now” when retrieving data. Features whose feature time comes after a given query time will never be returned for those queries, in online or offline contexts.
Feature time is returned in the __observed_at__ column in Chalk query results. Query time is returned in the __ts__ column.
Feature time is the time at which a feature was observed. By default, Chalk sets a feature’s time to the feature’s resolver execution time. The feature time can be overridden for a feature class, accessed from resolver parameters, and requested in query inputs and outputs.
You may have multiple timestamps associated with your data. It’s important to set the feature time to the value that most closely represents when your system would have accessed the data in production.
For example, in an asynchronous streaming system, you may have one timestamp for when an event was added to your task queue and another timestamp for when the event was removed from the queue and processed. We recommend using the latter timestamp as your feature time to make your training most closely resemble production. If you use the timestamp for when an event was added to your task queue, you would be training your system with data it would not have been able to access in production.
Each individual feature has its own feature time, which is used to retrieve point-in-time correct data for temporal consistency.
To access the latest time associated with any feature in a feature class, use the special FeatureTime feature throughout Chalk.
By default, if it is present, Chalk will treat a feature named ts with datetime.datetime type as the FeatureTime value. Otherwise, you can use the FeatureTime type annotation to set a different name:
from chalk.features import features, FeatureTime

@features
class User:
    id: int
    name: str
    timestamp: FeatureTime  # datetime.datetime under the hood
Using your FeatureTime feature, you can access and override feature time for the whole feature class.
You can access the FeatureTime as a resolver input. In this example, ts will be set to the maximum feature time for all features passed as resolver parameters:
from chalk import offline

@offline
def fn(name: User.name, ts: User.timestamp) -> ...:
    ...  # resolver body omitted
You can directly set the FeatureTime value by returning it from a resolver:
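from datetime import datetime, timezone

@offline
def fn(...) -> Features[User.name, User.timestamp]:
    return User(
        name="Maryam Mirzakhani",
        # Set the feature time through the class's FeatureTime feature (timestamp)
        timestamp=datetime(2014, 8, 12, tzinfo=timezone.utc),
    )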
You can include the FeatureTime feature in query output. Its value will be set to the maximum timestamp across all features in its feature class.
Has-many features create DataFrames. These DataFrames can be filtered with before and after.
Regardless of which time filters you use, Chalk will never return features where the feature time is strictly greater than the current query time (Now), in order to maintain temporal consistency.
To compute the number of transfers a user made in the last seven days, use after:
from chalk.features import after, ...

@online
def fn(transfers: User.transfers[after(days_ago=7)]) -> ...:
    return transfers.count()
To compute the number of transfers a user made more than seven days ago, use before:
from chalk.features import before, ...

@online
def fn(transfers: User.transfers[before(days_ago=7)]) -> ...:
    return transfers.count()
Combine before and after to retrieve transfers made one to two weeks ago:
from chalk.features import before, after, ...

@online
def fn(transfers: User.transfers[after(days_ago=14), before(days_ago=7)]) -> ...:
    return transfers.count()
All of these examples can be used in combination with other DataFrame projections and filters. You may also find windowed aggregations useful.
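The window semantics described above can be sketched in plain Python, independent of Chalk. The helper filter_by_window is hypothetical and not part of the Chalk API, and treating the window bounds as exclusive is an assumption. The only rule taken from the text is that observations later than the query time are never returned.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def filter_by_window(
    feature_times: List[datetime],
    now: datetime,
    after_days_ago: Optional[int] = None,
    before_days_ago: Optional[int] = None,
) -> List[datetime]:
    """Keep the feature times that fall inside the requested window.

    Hypothetical helper for illustration. Bounds are exclusive here,
    which is an assumption rather than documented Chalk behavior.
    """
    kept = []
    for ts in feature_times:
        if ts > now:
            # Never return data observed after the query time.
            continue
        if after_days_ago is not None and ts <= now - timedelta(days=after_days_ago):
            continue  # too old for the "after" bound
        if before_days_ago is not None and ts >= now - timedelta(days=before_days_ago):
            continue  # too recent for the "before" bound
        kept.append(ts)
    return kept


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
# One observation in the future, then 1, 3, 10, and 20 days in the past.
transfer_times = [now - timedelta(days=d) for d in (-1, 1, 3, 10, 20)]

last_week = filter_by_window(transfer_times, now, after_days_ago=7)
older_than_a_week = filter_by_window(transfer_times, now, before_days_ago=7)
one_to_two_weeks_ago = filter_by_window(
    transfer_times, now, after_days_ago=14, before_days_ago=7
)
```

The future-dated observation is dropped by every call, whichever filters are passed.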
Features with overridden observation timestamps are treated specially when inserted into the online store. In particular, Chalk will always check for existing “newer” feature values in the online store before inserting historically dated feature values. This means that you can safely ingest large quantities of backdated features without accidentally ingesting stale data into the online store.
Additionally, once features are inserted into the online store, Chalk tracks the source observation timestamps when these feature values are returned as part of online queries. Chalk uses these source timestamps to compute the “feature staleness” metric. Staleness in this context is defined as “query time - observation time”.
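As a rough illustration of the two behaviors above, the sketch below uses an in-memory dictionary as a stand-in for an online store. The names OnlineStore, upsert_if_newer, and staleness are hypothetical, and letting a write with an equal timestamp overwrite the stored value is an assumption. This is not how Chalk implements its store.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Tuple

# Hypothetical stand-in for an online store:
# feature name -> (value, observation time).
OnlineStore = Dict[str, Tuple[object, datetime]]


def upsert_if_newer(
    store: OnlineStore, name: str, value: object, observed_at: datetime
) -> bool:
    """Write the value only if no newer observation is already stored."""
    existing = store.get(name)
    if existing is not None and existing[1] > observed_at:
        return False  # a newer value exists, so skip the backdated write
    store[name] = (value, observed_at)
    return True


def staleness(store: OnlineStore, name: str, query_time: datetime) -> timedelta:
    """Staleness as defined above: query time minus observation time."""
    return query_time - store[name][1]


store: OnlineStore = {}
t0 = datetime(2022, 3, 1, tzinfo=timezone.utc)

wrote_current = upsert_if_newer(store, "user.age", 8, t0)
# A backdated value arrives later and must not replace the newer one.
wrote_backdated = upsert_if_newer(store, "user.age", 7, t0 - timedelta(days=28))
age_staleness = staleness(store, "user.age", t0 + timedelta(hours=6))
```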
Features with overridden observation timestamps are inserted into the offline store with the timestamp that you specify. The observation timestamp works like an “effective as of” timestamp, so if you insert something like this:
| id | feature | value | timestamp            |
|----|---------|-------|----------------------|
| 1  | age     | 7     | 2022-02-01T00:00:00Z |
into an offline store that already contained these observations:
| id | feature | value | timestamp            |
|----|---------|-------|----------------------|
| 1  | age     | 6     | 2022-01-01T00:00:00Z |
| 1  | age     | 8     | 2022-03-01T00:00:00Z |
then the observation will be interleaved “in between” the existing observations, and you would see the following query results:
id, age, <= 2022-02-01
output: 7
id, age, <= 2022-03-02
output: 8
id, age, <= 2022-01-02
output: 6
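The “effective as of” behavior can be reproduced with a sorted list and a binary search. This is a minimal sketch using the observations from the tables above. The helpers parse, insert_observation, and value_as_of are hypothetical names and say nothing about how the offline store is implemented.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from typing import List, Optional, Tuple

Observation = Tuple[datetime, int]  # (observation time, value)


def parse(day: str) -> datetime:
    """Parse a YYYY-MM-DD string as midnight UTC."""
    return datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)


def insert_observation(obs: List[Observation], observed_at: datetime, value: int) -> None:
    """Insert an observation while keeping the list sorted by time."""
    times = [t for t, _ in obs]
    obs.insert(bisect_right(times, observed_at), (observed_at, value))


def value_as_of(obs: List[Observation], query_time: datetime) -> Optional[int]:
    """Return the latest value observed at or before query_time, if any."""
    times = [t for t, _ in obs]
    idx = bisect_right(times, query_time)
    return obs[idx - 1][1] if idx > 0 else None


# Existing observations for age, sorted by time.
observations: List[Observation] = [(parse("2022-01-01"), 6), (parse("2022-03-01"), 8)]

# The backdated observation lands between the two existing ones.
insert_observation(observations, parse("2022-02-01"), 7)

as_of_feb = value_as_of(observations, parse("2022-02-01"))  # 7
as_of_mar = value_as_of(observations, parse("2022-03-02"))  # 8
as_of_jan = value_as_of(observations, parse("2022-01-02"))  # 6
```

A query time earlier than every observation returns None in this sketch, since no value was yet effective.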
Features in the offline store have an optional TTL (time to live). When a feature has a TTL value, it will never be returned at any time later than FeatureTime + TTL. For example, you may not want to consider credit scores which were retrieved more than a year ago. Setting offline_ttl will make credit_score return None if the last observed credit score is more than one year old relative to the current query time.
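from datetime import timedelta
from chalk.features import feature, features

@features
class User:
    id: str
    # timedelta has no "years" argument, so one year is written as 365 days
    credit_score: int = feature(offline_ttl=timedelta(days=365))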
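The TTL rule amounts to a single comparison against FeatureTime + TTL. The sketch below shows that comparison in plain Python. The function apply_ttl is a hypothetical helper, and the credit score value is made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def apply_ttl(
    value: int,
    feature_time: datetime,
    query_time: datetime,
    ttl: Optional[timedelta],
) -> Optional[int]:
    """Return the value only if it is visible and unexpired at query_time."""
    if feature_time > query_time:
        return None  # observed after the query time, never returned
    if ttl is not None and query_time > feature_time + ttl:
        return None  # older than FeatureTime + TTL
    return value


observed_at = datetime(2023, 1, 1, tzinfo=timezone.utc)
one_year = timedelta(days=365)

# Five months after the observation: still within the TTL.
fresh = apply_ttl(720, observed_at, datetime(2023, 6, 1, tzinfo=timezone.utc), one_year)
# Seventeen months after the observation: expired.
expired = apply_ttl(720, observed_at, datetime(2024, 6, 1, tzinfo=timezone.utc), one_year)
# No TTL configured: the old value is still returned.
no_ttl = apply_ttl(720, observed_at, datetime(2024, 6, 1, tzinfo=timezone.utc), None)
```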
Query time is the time treated as “now” within a query context. For online queries, Now is equal to datetime.now().
For offline queries, you can pass one or more timestamps that will be used as the query time for each input row.
In training, you will likely want to retrieve data as if you are at a point in the past to create the most accurate predictions. We cover this idea in greater detail in our temporal consistency documentation.
To set the query’s “now” time, pass input_times as either a single timestamp or as a list corresponding to the Now times to use for each entry in input:
from datetime import datetime, timedelta, timezone
from chalk.client import ChalkClient

ChalkClient().offline_query(
    # Pass id 1 multiple times because we want to
    # request it with multiple input_times
    input={User.id: [1, 1, 1]},
    input_times=[
        datetime.now(tz=timezone.utc) - timedelta(days=365 * 10),
        datetime.now(tz=timezone.utc) - timedelta(days=365),
        datetime.now(tz=timezone.utc) - timedelta(days=0),
    ],
    output=[User.age_in_years],
)
## Output:
# | id | age_in_years |
# | 1 | <age> - 10 |
# | 1 | <age> - 1 |
# | 1 | <age> - 0 |
To access the query time in your resolvers, you can reference a special feature called Now, which is a datetime.datetime object.
You can pass Now in Python resolvers:
from chalk import Now, online

@online
def get_age_in_years(birthday: User.birthday, now: Now) -> User.age_in_years:
    # timedelta has no "years" attribute, so approximate a year as 365 days
    return (now - birthday).days // 365
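Subtracting two datetimes yields a timedelta, which counts days rather than calendar years. If an exact age matters, one option is to compare calendar fields directly. The sketch below is plain Python with a hypothetical helper name, and it takes the query time as an argument so that it stays point-in-time correct.

```python
from datetime import datetime, timezone


def age_in_years(birthday: datetime, now: datetime) -> int:
    """Calendar-exact age: subtract one if this year's birthday has not happened yet."""
    had_birthday = (now.month, now.day) >= (birthday.month, birthday.day)
    return now.year - birthday.year - (0 if had_birthday else 1)


birthday = datetime(1990, 6, 15, tzinfo=timezone.utc)
day_before = age_in_years(birthday, datetime(2024, 6, 14, tzinfo=timezone.utc))
on_birthday = age_in_years(birthday, datetime(2024, 6, 15, tzinfo=timezone.utc))
```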
Now can be used in DataFrame resolvers as well in order to compute bulk values:
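import polars as pl
from chalk.features import DataFrame

@online
def batch_get_age_in_years(df: DataFrame[User.id, User.birthday, Now]) -> DataFrame[User.id, User.age_in_years]:
    return (
        df.to_polars()
        .select(
            pl.col(str(User.id)),
            # Subtract birthday from Now, then approximate a year as 365 days
            ((pl.col(str(Now)) - pl.col(str(User.birthday))).dt.total_days() // 365)
            .alias(str(User.age_in_years)),
        )
    )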
You can also reference ${now} in SQL file resolvers. If Now is used in a resolver to compute a has-many join, then the Now feature must be passed as input.
-- source: sql_file_resolver_temp_db
-- resolves: tv_episode
select id,
name,
season_no,
episode_no,
show_name,
air_date
from tv_episodes
where air_date < ${now}
and id = ${tv_episode.id}
Chalk Datasets return the query time in the __ts__ column.
When converting Chalk Datasets to Polars or Pandas DataFrames, you may want to include the query time column. To do so, pass output_ts to your to_polars or to_pandas calls.
You may pass a column name to output_ts to set the name of the query time column. If you pass True, the query time column name will be __chalk__.CHALK_TS.
Be careful not to mix up __ts__ and ts. __ts__ represents the query time, or the time the query treats as "now" during query execution. ts is a common name for the feature representing FeatureTime, the time at which a feature was observed.
Chalk stores naive datetime objects as UTC. Chalk also assumes UTC when it retrieves naive datetimes from data stores.
We recommend that you include timezone information on all datetime objects you work with to avoid ambiguity.
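One way to follow this recommendation is to normalize every datetime before it reaches a resolver or query. The sketch below uses only the standard library. The helper ensure_utc is a hypothetical name, and labeling naive values as UTC mirrors the assumption described above rather than any Chalk API.

```python
from datetime import datetime, timedelta, timezone


def ensure_utc(dt: datetime) -> datetime:
    """Return a timezone-aware datetime expressed in UTC."""
    if dt.tzinfo is None or dt.tzinfo.utcoffset(dt) is None:
        # Naive value: label it as UTC without shifting the clock time.
        return dt.replace(tzinfo=timezone.utc)
    # Aware value: convert the same instant to UTC.
    return dt.astimezone(timezone.utc)


naive = datetime(2022, 2, 1, 12, 0)
utc_minus_five = datetime(2022, 2, 1, 7, 0, tzinfo=timezone(timedelta(hours=-5)))

from_naive = ensure_utc(naive)
from_aware = ensure_utc(utc_minus_five)
```

Both results represent 12:00 UTC on 2022-02-01, so they compare equal and can be safely used together as feature times or query times.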