Resource Configuration

Overview

The Settings page of the dashboard exposes many useful configurations that you can use to tune your Chalk environments appropriately to handle your specific workload and latency requirements.

Kubernetes

Under Settings > Kubernetes, you can view all the pods in your Kubernetes cluster under the namespace for your environment, along with their phase and creation timestamp. Clicking on a pod name opens the Kubernetes Pod View, which includes data about the pod, such as its resource request, IP address, and resource group, as well as data about the node on which the pod is scheduled. In the Pod View, you can also see the logs for the pod, which can be useful for debugging errors on branch deployments or job pods.

Nodepools

Under Settings > Nodepools, you can view and configure the nodepools used to run your Chalk environment. A NodePool defines constraints on the Kubernetes nodes that can be provisioned in your cluster. Notably, you can select the machine families that you would like to use in your deployment, as different machine families are optimized for different workloads. You can also set a CPU limit on your NodePool. This page allows you to provision the resources and machine families best suited to your workload and usage requirements.

When configuring a nodepool, you can also choose whether to isolate it and whether to restrict it to Chalk workloads only. Isolating a nodepool means that workloads cannot be scheduled on it unless they are specifically configured to do so under Settings > Resources, which is useful for minimizing interference or reserving certain instance types for specific workloads. Restricting a nodepool to Chalk workloads only is useful when your environment is deployed on a cluster where Chalk runs alongside other unrelated workloads, as it ensures that you are not billed for nodes allocated for non-Chalk workloads.

Once you have configured your Nodepools, under Settings > Resources, you can select the Nodepool to use for each service in each resource group in your environment, as well as the Pod Disruption Budget.

Resources

Under Settings > Resources, you will find the Cloud Resource Configuration page, which enables you to set the default resource requests, autoscaling, and limits for each service in each resource group in your Chalk environment.

All Chalk environments will have a Default resource group. If you have a specific use case which would require more resource groups, please reach out to the Chalk Team.

In a resource group, you can specify the default resource requests, instance types, and nodepools for each service. The resource requests for the Query Server, gRPC Query Server, and the Branch Server will be used whenever your configuration is applied to spin up the appropriate resources in your cluster. The resource request for Async Offline Query can be overridden at query time using the ChalkClient.offline_query(resources=...) parameter, but will default to the values set in the Resource Configuration page. For most use cases, setting the Requests configurations will be sufficient. However, you can also set Limits for each service if you would like to cap the maximum resources that can be allocated to a service. The Instance Type selector enables you to specify the machine family you would like to use for your deployment. For example, this would map to EC2 Instance Types for AWS deployments or Compute Engine Machine Families for GCP deployments.
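These form fields also appear in the JSON tab of the Cloud Resource Configuration page (described below). As a minimal sketch, assuming limits use the same cpu and memory fields as requests, the request and limit entries for the Query Server (engine) might look like the following, with illustrative values:

"engine": {
    "limit": {
        "cpu": "8",
        "memory": "16Gi"
    },
    "request": {
        "cpu": "4",
        "memory": "8Gi"
    }
}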

How are Pods and Nodes Provisioned?

In the Settings page of the dashboard, you can configure the nodepools and resource configurations that are used together to provision resources in your cluster. Whenever a new pod needs to be spun up, whether triggered manually, by a cron schedule, or by autoscaling, Chalk creates the pod with resource requests, node selectors, affinities, and anti-affinities based on the Resource Configuration set for that service. The Kubernetes Scheduler then attempts to place the pod on an existing node. If no existing node satisfies the pod's constraints, Karpenter analyzes the pod and the other nodes in the cluster and provisions or deprovisions nodes so that the new pod can be scheduled.
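As a rough illustration of what the scheduler and Karpenter evaluate, the scheduling-relevant portion of a pod manifest looks roughly like the generic Kubernetes JSON fragment below. This is not the exact manifest Chalk emits; the pod and image names are placeholders, and the node selector shown is the GKE nodepool label that also appears in the JSON example later on this page (AWS clusters use Karpenter's own node labels instead).

{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "example-engine-pod"
    },
    "spec": {
        "nodeSelector": {
            "cloud.google.com/gke-nodepool": "your-default-nodepool"
        },
        "containers": [
            {
                "name": "engine",
                "image": "example-engine-image",
                "resources": {
                    "requests": {
                        "cpu": "5",
                        "memory": "10Gi"
                    }
                }
            }
        ]
    }
}

If no node can satisfy the node selector and resource requests, the pod remains Pending until Karpenter or NAP provisions a matching node.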

For AWS-hosted clusters, we use Karpenter NodePools, and for GCP-hosted clusters, we use GKE Node Auto-Provisioning (NAP). Both Karpenter and GKE NAP attempt to optimize resource provisioning based on the provided nodepool constraints. You can manage these nodepool constraints in a few ways:

  1. Edit your existing nodepools: Under Settings > Nodepools, you can edit the nodepool constraints to restrict or widen the nodes that Karpenter can produce.
  2. Create new nodepools: Under Settings > Nodepools, you can create new nodepools with more restrictive node constraints, mark them as “isolated” to prevent services from scheduling on them by default, and configure one or more services to run only on those nodepools.
  3. Specify Nodes in Service Requests: Under Settings > Resources, you can make services request specific kinds of machines under the Service Isolation pane.

While customers on both GCP and AWS can create nodepools through the Nodepools tab in the dashboard, the nodepools you can create differ. Typically, you define Karpenter nodepools to provision a range of node types, whereas NAP in GCP creates nodepools in response to pods that need to be scheduled, where each nodepool provisions exactly one kind of node and scales by adjusting the number of node instances. Chalk enables customers to create nodepools regardless of cloud provider; however, customers on GCP must choose between allowing NAP to make its own decisions about the types of nodes to provision and defining their own set of nodepools that force services to run on specific instance types.

When choosing how to constrain pod resource requests and nodepools, it is possible to request more resources than an instance type has available. The Chalk dashboard provides warnings when the requested resources are impossible to schedule on any available node, but it can be difficult to estimate the amount of overhead on each Kubernetes node. Typically, a resource request that aims for about 75% of stated node capacity is reasonable; for example, on a node with 16 vCPUs and 64 GiB of memory, requests of roughly 12 vCPUs and 48 GiB leave headroom for system overhead and daemon pods.

Autoscaling

Chalk supports autoscaling of resources in your environment in order to optimize resource usage.

For the branch server, you can set the Auto Shutdown Period, which will determine the duration of inactivity (no queries or deploys) after which the branch server pod will be automatically shut down.

For the Query Server and gRPC Query Server, which serve your production traffic, you can configure KEDA-based autoscaling. Autoscaling for either of these services is constrained by the Nodepool configurations for your environment. Autoscaling can be triggered based on CPU utilization or on a schedule.

CPU-Based Autoscaling

To enable CPU-based autoscaling, you can set the Target CPU % in the Cloud Resource Configuration. Chalk will then monitor the CPU utilization of the service and scale the number of instances up or down based on the target CPU percentage. For CPU-based autoscaling to work, you must set Min Instances, Max Instances, and Target CPU % for the service.
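Because this scaling is KEDA-based, the form fields correspond roughly to a KEDA ScaledObject with a CPU trigger. The sketch below is illustrative only; Chalk manages the actual object for you, and the metadata names here are placeholders. Min Instances, Max Instances, and Target CPU % map approximately to minReplicaCount, maxReplicaCount, and the trigger's target value.

{
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {
        "name": "engine-cpu-scaler"
    },
    "spec": {
        "scaleTargetRef": {
            "name": "engine"
        },
        "minReplicaCount": 2,
        "maxReplicaCount": 10,
        "triggers": [
            {
                "type": "cpu",
                "metricType": "Utilization",
                "metadata": {
                    "value": "70"
                }
            }
        ]
    }
}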

Scheduled Autoscaling

To enable scheduled autoscaling, you can modify the spec config in the JSON tab of the Cloud Resource Configuration page. Scheduled scaling requires KEDA to be enabled in your environment. If you are unsure whether KEDA is enabled, please reach out to the Chalk team.

In the JSON tab, you can view the same configurations from the form fields for each of your services. Depending on whether you want to configure scheduled autoscaling for your Query Server or your gRPC Query Server, you can edit the spec for engine or engine-grpc respectively.

Let’s say that each day, you want to scale up your Query Server to somewhere between 50 and 100 replicas from 8:45 AM to 10:15 AM on weekdays to handle a daily traffic spike at 9 AM ET. The following configuration will enable such scaling on the cron schedule, while returning to the default configurations outside of that time window.

"engine": {
    "autoscaler_config": {
        "default_timezone": "America/New_York",
        "max_replica_count": 100,
        "min_replica_count": 50,
        "triggers": [
            {
                "desired_replicas": 100,
                "end": "15 10 * * 1-5",
                "start": "45 8 * * 1-5"
            }
        ]
    },
    "deploy_engine_proxy": false,
    "enable_engine_proxy": false,
    "limit": {},
    "min_instances": 20,
    "node_selector": {
        "cloud.google.com/gke-nodepool": "your-default-nodepool"
    },
    "request": {
        "cpu": "5",
        "memory": "10Gi"
    },
    "worker_count": 8
}

Shared Resources

Under Settings > Shared Resources, environment admins can view and manage shared resources across different services in your environment, including the Metrics Database configuration, Gateway configuration, and Background Persistence configurations. You can adjust these configurations here if, for example, you need to scale up your Metrics Database to handle larger volumes of data, or if you drastically increase the volume of data being written to the online or offline stores.

To enable autoscaling of Background Persistence workers to handle variations in the volume of data being persisted online and offline, you can set the Horizontal Pod Autoscaling (HPA) Settings in the Background Persistence Configuration tab. Enabling autoscaling requires setting:

  • Max replicas
  • Min replicas
  • Target average value (the approximate target undelivered message count in queue per replica)
  • Pubsub Subscription ID

Connections

Under Settings > Connections, you can view all the connections that your environment has to external resources, such as your online and offline stores, and the branch server. This is a good place to check the health of the different services in your environment, as well as to verify the types of connections configured.

Environment Variables

Under Settings > Environment > Variables, you can view and edit global environment variables for your environment. Please reach out to the Chalk team if you have any questions about which environment variables to set for your use case.

User Permissions and RBAC

Under Settings > User Permissions, you can view the roles associated with each user, as well as whether those roles are granted directly or via SCIM. When adding new users to your Chalk environment, you can assign them roles that will determine their permissions in the environment. The available roles in order of increasing permissions are:

  • Viewer: Read the web portal and create new alerts.
  • Data Scientist: Run queries, branch deploy, + everything that a Viewer can do.
  • Developer: Run queries, run migrations + everything that a Data Scientist can do.
  • Admin: Create deployments, service tokens, and secrets + everything that a Developer can do.
  • Owner: Manage team members + everything that an Admin can do.

Customers with Enterprise Features can also configure datasource- and feature-level RBAC (Role Based Access Control). Under Settings > Access Tokens, you can create and manage service tokens that can be used for RBAC. At the datasource level, you can restrict a token to only access data sources with matching tags when resolving features. At the feature level, you can restrict a token's access to tagged features in one of two ways: block the token from returning tagged features in query results while still allowing their values to be used in the computation of other features, or block the token from accessing tagged features entirely.
