Queries
Querying Chalk with SQL.
Chalk is testing a direct SQL query interface for your feature values and
data sources in your environment. In the dashboard, under Data Explorer,
you can find several interfaces that you can use to query your data. The
Online Query Explorer and Offline Query Explorer enable you to execute
online queries and offline queries,
respectively, from the dashboard. In addition, you can use the SQL Query
Interface to query against previously computed feature values from your
offline store as well as data sources in your environment.
You can also query programmatically through the gRPC client's run_sql() endpoint.
Chalk’s SQL interface uses a standard SQL dialect, and all queries are processed through a parser that expects familiar SQL syntax.
Currently, Chalk SQL only supports making SELECT queries, with the following restrictions:
- SELECT list: If a column name contains a . delimiter, you may reference it with double quotes, e.g. SELECT "column.with.dot" FROM table.
- FROM clause:
  - read_parquet(path: str): Reads a Parquet file from the given path and returns a table. The path must be a literal string (see the example after this list).
  - query_values_from_operation_ids(operation_id: str, ...): Returns a table consisting of the values queried in the given operation id(s). The operation id(s) provided must correspond to queries that have the same output schema.
- WHERE clause: JOINs are not possible yet. Filters may be specified in the WHERE clause, but not all expressions are guaranteed to be pushed down to the underlying data source.
- GROUP BY clause: Aggregations may be computed with a GROUP BY clause.
- LIMIT clause: The number of rows returned may be capped with a LIMIT clause.
- ORDER BY clause: Results may be sorted with an ORDER BY clause.
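For example, a query that reads a Parquet file and selects columns whose names contain a . delimiter might look like the following; the file path and column names are placeholders, not real locations:

-- The path and column names are placeholders; point read_parquet at a Parquet
-- file your environment can read.
SELECT "user.id", "user.name"
FROM read_parquet('s3://your-bucket/exports/users.parquet')
LIMIT 10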
The SQL Interface exposes several schema providers, all of which live under the chalk catalog:
chalk.datasources: directly query the data sources configured in your environment (e.g. Postgres, Snowflake, BigQuery). Table names are the names of tables within the data sources themselves.
Schema: Exactly that of the table in the data source.
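As a sketch, if you had a Postgres data source named pg with a users table containing id and email columns (all of these names are illustrative), a query against it might look like the following, quoting the dotted table path as in the other examples on this page:

-- "pg", "users", and the column names are illustrative; substitute your own
-- data source, table, and columns.
SELECT id, email
FROM "chalk.datasources.pg.users"
LIMIT 10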
chalk.historical_values: query historically observed feature values from your offline store with table names of the form feature_namespace.feature_name, or query the freshest value for each primary key in a feature namespace with table names of the form feature_namespace.
Schema for feature_namespace.feature_name:
- observation_id, string, non-nullable: The unique identifier for the observation of the feature value.
- pkey, (type of the feature namespace's primary key), non-nullable: The value of the primary feature associated with this observation.
- value, (type of the feature): The observed value of the feature.
- value_json, string, DEPRECATED: The observed value of the feature, as JSON, for backwards compatibility.
- observed_at, timestamp, non-nullable: The time at which the feature value was observed.
- inserted_at, timestamp, non-nullable: The time at which the feature value was inserted into the offline store.
- resolver_fqn, string, nullable: The resolver that computed this feature.
- deployment_id, string, nullable: The deployment id where the observation was made.
Schema for feature_namespace:
- __pkey__, (type of the feature namespace's primary key), non-nullable: The value of the primary feature associated with this row.
- <feature-name>, (type of the feature), for each feature: The freshest observed value of the feature for the primary key.
- <feature-name>.__observed_at__, timestamp, for each feature: The time at which the freshest value was observed.
chalk.query_log: A schema provider consisting of one table, data, which is the query log table.
This table provides a list of query metadata for each operation id, including which tables in the offline store
correspond to values queried in that operation id.
Schema for data:
- operation_id, string, non-nullable: The unique identifier for the operation.
- environment_id, string, non-nullable: The environment id where the operation was run.
- deployment_id, string: The deployment id where the operation was run.
- operation_kind, string: The kind of operation that was run (inference, streaming, etc.).
- query_timestamp, timestamp: The value of now() that was used by the query.
- execution_started_at, timestamp: The time at which the operation started executing. Null if there were errors during planning.
- execution_finished_at, timestamp: The time at which the operation finished executing. Null if there were errors during execution.
- query_status, string: The status of the query (success, error, running, etc.).
- query_name, string: The query name for named queries, null otherwise.
- query_name_version, string: The version of the query for named queries, null otherwise.
- branch_name, string: The branch the query was run on, null if the query was not run on a branch.
- correlation_id, string: The correlation id for the query, null if the query was not run with a correlation id.
- value_tables, array of strings: The list of tables in the offline store that correspond to result values from this query.
chalk.resolvers: Trigger a resolver run and return the outputs of a specific resolver, with additional filtering based on the SQL query provided. A table exists for every eligible resolver. To run a resolver this way, it must not take any inputs and must return a DataFrame consisting of only scalar outputs.
Schema: Exactly that of the resolver.
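As a sketch, you could query chalk.query_log to find recent successful inference operations and the offline-store tables that hold their result values; the filters below are illustrative, and the table path follows the dotted-name quoting used elsewhere on this page:

-- Find recent successful inference operations and the offline-store tables
-- that correspond to their result values.
SELECT operation_id, query_name, query_status, value_tables
FROM "chalk.query_log.data"
WHERE operation_kind = 'inference' AND query_status = 'success'
ORDER BY query_timestamp DESC
LIMIT 20

The returned operation id(s) can then be passed to query_values_from_operation_ids in a FROM clause to retrieve the values themselves.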
Chalk also offers a chalk.information_schema that you can query (similar to many other database offerings), and queries such as SHOW TABLES and SHOW SCHEMAS can also be used.
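For example, you can list what is available before writing a query (run each statement separately):

SHOW SCHEMAS
SHOW TABLES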
If you had the following feature class:
from chalk.features import features

@features
class User:
    id: int
    name: str

You could query observed feature values for User.name with user.id = 1 by running the following query:
SELECT value FROM "chalk.historical_values.user.name" WHERE pkey = 1
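Similarly, as a sketch using the feature_namespace table form described above, you could query the freshest values for the user with id 1, quoting the column name that contains a . delimiter:

SELECT __pkey__, name, "name.__observed_at__"
FROM "chalk.historical_values.user"
WHERE __pkey__ = 1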