Describe and fetch rows of features.
A Chalk `DataFrame` is a 2-dimensional data structure similar to `pandas.DataFrame`, but with richer underlying optimizations. In a Chalk `DataFrame`, the column headers are instances of `chalk.feature`.

The Chalk `DataFrame` can be used to describe rows of data and as a wrapper for returning that data. It is parameterized by the feature values it contains. For example:

```python
DataFrame[User.name, User.email]
```
The `DataFrame` type is commutative on its type parameters. This property allows you to write the type parameters of a `DataFrame` in any order:

```python
DataFrame[User.name, User.email] == DataFrame[User.email, User.name]
```
You’re likely to use the Chalk `DataFrame` primarily as a type, or to handle one as the result of a SQL query. However, you can also construct a `DataFrame` directly.

Much like a `pandas.DataFrame`, a `chalk.DataFrame` can be constructed from a dictionary. The keys of the dictionary are the headers, and the values are the rows under each header. The keys can be specified either as strings (`"taco.price"`) or as their Python types (`Taco.price`). You can convert from a feature’s Python definition to a string with a cast: `str(Taco.price) == "taco.price"`.

```python
chalk_df = DataFrame(
    {
        Taco.id: ["t_1", "t_2", "t_3"],
        "taco.price": [1, 2, 10],
        str(Taco.contains_meat): [True, True, False],
    }
)
```
In the above example, we use all forms of referencing a feature’s name. You can create the same `chalk.DataFrame` by passing a list of feature classes:

```python
chalk_df = DataFrame([
    Taco(id="t_1", price=1, contains_meat=True),
    Taco(id="t_2", price=2, contains_meat=True),
    Taco(id="t_3", price=10, contains_meat=False),
])
```
You can also convert to and from a `pandas.DataFrame`:

```python
import pandas as pd

DataFrame(
    pd.DataFrame({
        "taco.id": ["t_1", "t_2", "t_3"],  # Taco.id or "taco.id" are accepted
        "taco.price": [1, 2, 10],
    })
)
```
The Chalk `DataFrame` can be converted to a `pandas.DataFrame` via the method `.to_pandas()`:

```python
df = DataFrame(...)
df.to_pandas()
```
Chalk’s SQL integrations output the type `DataFrame`. For more information, see the SQL Integration section.

```python
pg = PostgreSQLSource()

@offline
def fn() -> DataFrame[Login.ts, Login.user_id, Login.status]:
    return pg.query(
        Login(
            status=LoginHistorySQL.status,
            ts=LoginHistorySQL.created_at,
            user_id=LoginHistorySQL.user_id,
        )
    ).all()
```
Method | Description |
---|---|
`DataFrame.count()` | The number of rows in a `DataFrame` |
`DataFrame.mean()` | The average of each column in the `DataFrame` † |
`DataFrame.sum()` | The sum of each column in the `DataFrame` † |
`DataFrame.max()` | The maximum value of each column in the `DataFrame` † |
`DataFrame.min()` | The minimum value of each column in the `DataFrame` † |
† After this operation, the `DataFrame` contains a single row, holding one aggregated value for each column. If the `DataFrame` contained only a single column to start, then the value returned is a scalar.
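These aggregation semantics closely mirror column-wise aggregation in pandas. As a rough analogue (this is pandas, not Chalk's API, and the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"taco.price": [1, 2, 10], "taco.rating": [4, 5, 3]})

# Aggregating yields one value per column...
means = df.mean()  # a Series with one entry per column

# ...and aggregating a single-column frame reduces to a scalar.
total = df[["taco.price"]].sum().item()
```
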
You can select columns out of a `DataFrame` from the set of columns already present to produce a new `DataFrame` scoped down to those columns. For example:

```python
df: DataFrame[User.name, User.email, User.age] = ...
projected = df[User.email, User.age]
# type(projected) == DataFrame[User.email, User.age]
```
In addition to selecting columns, you can filter the rows of a `DataFrame`. The example below restricts the rows of a `DataFrame` to only rows where the `User` is at least 21 years old:

```python
df: DataFrame[User.name, User.email, User.age]
f = df[User.age >= 21]
# type(f) == DataFrame[User.name, User.email, User.age]
```
Filtering a `DataFrame` keeps all the existing columns, but drops rows where the predicate is not met.
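This keep-columns, drop-rows behavior is the same as boolean-mask filtering in pandas. A sketch for comparison (pandas, not Chalk's API, with hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({
    "user.name": ["Ann", "Bob"],
    "user.email": ["ann@example.com", "bob@example.com"],
    "user.age": [25, 19],
})

# Every column survives; only the rows failing the predicate are dropped.
adults = df[df["user.age"] >= 21]
```
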
The Chalk `DataFrame` supports the standard comparison operations (`<`, `<=`, `>`, `>=`, `in`, `is`, `is not`, `==`, and `!=`) for features and their matching scalar values. You can compose these filters with the boolean operations `and`, `or`, and `not`.
Operation(s) | Example | Comment |
---|---|---|
`>`, `<`, `>=`, `<=` | `User.age > 21` | |
`==`, `!=` | `User.age == 21` | |
`is`, `is not` | `User.age is None` | AST-only feature.† |
`in`, `not in` | `User.age in {17, 18}` | AST-only feature.† |
`or`, `and` | `(User.state == "AL" and User.age >= 17)` | AST-only feature.† |
`not` | `not (User.state == "AL" and User.age >= 17)` | AST-only feature.† |
† The marked operations must be provided as direct arguments. They cannot first be assigned to a variable and later provided as an argument. For example, `DataFrame[User.age in {17, 18}]` is valid, but defining `x = User.age in {17, 18}` and then running `DataFrame[x]` is not valid. The reason for this is that Python does not allow overriding these operations. However, Chalk aims to support natural Python syntax, so it parses the AST to recover these operations, with the restriction that they must be provided directly as arguments.
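The restriction comes from how these operators behave at runtime: `in`, for instance, always coerces its result to a plain `bool` before any library code can capture it, while the parsed source still records the full comparison. A minimal illustration of the general idea using the stdlib `ast` module (this is not Chalk's implementation, just a sketch of the technique):

```python
import ast

# Evaluating `User.age in {17, 18}` would collapse to a bool, but
# parsing the source preserves the operator and both operands.
tree = ast.parse("User.age in {17, 18}", mode="eval")
comparison = tree.body  # an ast.Compare node

is_in = isinstance(comparison.ops[0], ast.In)
left = ast.unparse(comparison.left)              # the feature reference
right = ast.unparse(comparison.comparators[0])   # the candidate set
```

Because the expression is recovered from the syntax tree rather than evaluated, it must appear literally at the call site, which is exactly the "direct argument" restriction described above.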
To filter on multiple columns, separate your filters by a comma:

```python
df[User.age > 21, User.email == "joe@chalk.ai"]
```

Alternatively, use the `and` keyword between filters:

```python
df[User.age > 21 and User.email == "joe@chalk.ai"]
```

Similarly, you can perform `or` queries with Python’s `or` keyword:

```python
df[User.age > 21 or User.email == "joe@chalk.ai"]
```
All of these can be composed:

```python
df[
    User.age > 21 and (
        User.email == "joe@chalk.ai" or
        User.full_name is None
    )
]
```
Filtering may be combined with projection. For example, you can select all the values of `User.email` where the user is at least 21 years old as follows:

```python
df: DataFrame[User.name, User.email, User.age]
projected = df[User.email, User.age > 21]
# type(projected) == DataFrame[User.email]
```

If you’d like to filter on a value and also return the value, you need to explicitly select it. You can amend the above example to include the `User.age` feature like so:

```python
df: DataFrame[User.name, User.email, User.age]
projected = df[User.email, User.age, User.age > 21]
# type(projected) == DataFrame[User.email, User.age]
```
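The combined filter-and-project pattern has a close pandas analogue in `DataFrame.loc`, sketched here for comparison (pandas, not Chalk's API, with hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({
    "user.name": ["Ann", "Bob", "Cat"],
    "user.email": ["ann@example.com", "bob@example.com", "cat@example.com"],
    "user.age": [25, 19, 30],
})

# Filter rows and project columns in one step. Only the listed columns
# survive, so a filtered-on column must be selected explicitly.
projected = df.loc[df["user.age"] > 21, ["user.email", "user.age"]]
```
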
Chalk’s `DataFrame` is designed to support efficient operations across a variety of underlying data sources. `DataFrame` vectorizes scalar operations on data that is kept resident in memory. Chalk calls operations on data that is loaded into memory “strict execution”. In coordination with Chalk’s Execution Engine, `DataFrame` is capable of pushing down filtering, projection, and aggregation operations to underlying data sources. Chalk calls this style of execution “lazy execution”.
Suppose that we use a `DataFrame` to query a SQL source:

```python
@online
def get_return_count(
    transfers: User.transfers[Transfer.status == "returned", after(days_ago=60)]
) -> User.returned_transactions_last_60:
    return transfers.count()
```

This ultimately results in (approximately) the following SQL query being executed:

```sql
SELECT COUNT(*) FROM transfers
WHERE status = 'returned' AND
  transfers.ts > current_date - interval '60' day;
```
This “push down” mechanism in lazy computation helps computationally expensive operations execute quickly: Chalk only needs to load a single integer from the underlying data source, instead of potentially all `Transfer` rows.
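To make the cost difference concrete, here is an illustrative sketch using `sqlite3` (not Chalk) that contrasts aggregating inside the database against shipping every row to Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transfers (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO transfers VALUES (?, ?)",
    [(i, "returned" if i % 3 == 0 else "settled") for i in range(9)],
)

# Pushed down: the database does the work and returns one integer.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM transfers WHERE status = 'returned'"
).fetchone()

# Without push-down, every row crosses the wire and Python counts them.
rows = conn.execute("SELECT * FROM transfers").fetchall()
naive_count = sum(1 for _, status in rows if status == "returned")
```

Both paths compute the same answer, but the pushed-down query transfers a single value regardless of table size, while the naive path scales with the number of rows.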