Optimizing Databricks Costs: The Complexities of Manual Cluster Configuration
December 10, 2024
Cost optimization in Databricks can seem deceptively straightforward – scale resources up when needed, scale them down when idle. But the reality is far more complicated. The dynamic nature of cloud infrastructure, data variability, and real-time instance pricing make manual cost management a challenge, even for the most experienced teams.
The problem becomes exponentially harder when you scale to thousands of jobs per day, each with different resource needs and SLAs. This post focuses on three concrete examples that highlight why it is so difficult to manually configure clusters for cost optimization in Databricks: the volatility of spot instances, the vast selection of instance types and storage configurations, and the pitfalls of autoscaling.
Spot Instance Pricing and Availability: A Moving Target
Spot instances offer the allure of significant cost savings compared to on-demand instances, but their ephemeral nature introduces substantial complexity when configuring clusters. Spot instances can be preempted (terminated by the cloud provider) when demand for on-demand instances spikes, and their availability and pricing fluctuate throughout the day and across availability zones.
Dynamic Pricing and Loss Probability
Let’s consider a scenario where you run a nightly ETL job that processes data for 4 hours. You decide to use spot instances to save costs, and you check that the spot instance prices are currently 70% lower than on-demand prices. However, the pricing for spot instances can vary dramatically based on market conditions. The spot market is highly dynamic, meaning that even within a few hours, the price of the instance type you are using may rise significantly, negating the cost advantage you initially planned for.
Additionally, each spot instance has a different loss probability – the likelihood that the instance will be preempted. This probability also changes over time, depending on factors like regional demand and the specific instance type. For example, an m5.large instance on weekends might have a 5% chance of being preempted, while the same instance on weekdays could have a 20% chance of being preempted. Manually tracking these price changes and probabilities over time, across multiple availability zones, is extremely difficult.
If you incorrectly configure the cluster with instances that become too expensive or are likely to be terminated, your ETL job might fail or significantly increase in cost. Without automation to handle real-time decision-making on instance types, it becomes nearly impossible to manually optimize costs when using spot instances.
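To make the trade-off concrete, here is a minimal sketch of the expected-cost comparison an engineer would have to redo continuously. The spot price, preemption probability, and recovery behavior are hypothetical placeholders, and the model is deliberately simplified (a single preemption is assumed to force a full on-demand rerun):

```python
# Hypothetical inputs for a 4-hour nightly ETL job on 10 workers.
ON_DEMAND_RATE = 0.096   # $/hr for m5.large on-demand in us-west-2 (list price; verify for your region)
SPOT_RATE = 0.029        # $/hr observed spot price (placeholder; it fluctuates constantly)
WORKERS = 10
JOB_HOURS = 4
P_PREEMPT = 0.20         # assumed probability that spot capacity is reclaimed mid-run

on_demand_cost = ON_DEMAND_RATE * WORKERS * JOB_HOURS

# Simplified model: if preempted, assume half the run was already paid for on spot
# and the whole job is rerun on on-demand to protect the SLA.
spot_happy_path = SPOT_RATE * WORKERS * JOB_HOURS
spot_preempted = SPOT_RATE * WORKERS * (JOB_HOURS / 2) + on_demand_cost
expected_spot_cost = (1 - P_PREEMPT) * spot_happy_path + P_PREEMPT * spot_preempted

print(f"on-demand cost:     ${on_demand_cost:.2f}")
print(f"expected spot cost: ${expected_spot_cost:.2f}")
```

Even in this toy model, the answer hinges on two inputs – the spot price and the preemption probability – that drift hour by hour, which is exactly what makes the manual version of this calculation untenable.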
Dynamic Availability Across Availability Zones
To complicate matters further, the availability of spot instances is not uniform across availability zones. For instance, in us-west-2a, there may be plenty of r5.large instances available at low prices, but in us-west-2b, there might be a shortage, driving up the price or making them unavailable. Manually selecting which availability zone to use based on spot instance pricing and availability in real-time is labor-intensive and prone to errors.
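Checking this by hand means repeatedly pulling spot price history for each zone. A rough sketch of that lookup using boto3 (the instance type and region are illustrative, and price alone says nothing about how likely the capacity is to be reclaimed):

```python
from datetime import datetime, timedelta, timezone

import boto3

# Illustrative lookup: latest spot price per availability zone for r5.large in us-west-2.
ec2 = boto3.client("ec2", region_name="us-west-2")

resp = ec2.describe_spot_price_history(
    InstanceTypes=["r5.large"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
)

# Keep only the most recent quote per zone.
latest = {}
for entry in resp["SpotPriceHistory"]:
    zone = entry["AvailabilityZone"]
    if zone not in latest or entry["Timestamp"] > latest[zone][0]:
        latest[zone] = (entry["Timestamp"], float(entry["SpotPrice"]))

for zone, (_, price) in sorted(latest.items()):
    print(f"{zone}: ${price:.4f}/hr")
```

Running this once tells you very little; the numbers that come back an hour later may point to a different zone entirely.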
In reality, spot instance pricing and availability are so volatile that manually configuring clusters to take advantage of them without overspending or risking job failures is nearly impossible at scale.
Choosing the Right Instance Type: A Matrix of Trade-offs
Databricks runs on top of cloud infrastructure that offers a broad array of instance types, each with its own cost, performance characteristics, and storage options. Picking the right instance type involves balancing compute, memory, network performance, and storage requirements – all of which can vary from one workload to another.
Example: Compute vs. Memory Optimization
Let’s say you are processing data with Spark and your workload is heavily compute-bound – the operations you are performing (e.g., joins, aggregations) require a lot of CPU power but relatively little memory. In this case, a compute-optimized instance type like c5.2xlarge (8 vCPUs, 16 GiB memory, EBS-only storage) might be appropriate. But if your workload is memory-bound, such as processing large in-memory datasets, you might need a memory-optimized instance like r5.2xlarge (8 vCPUs, 64 GiB memory, EBS-only storage).
However, workloads often fluctuate in their demands. The same job might start as compute-heavy but become memory-intensive halfway through. If you have misconfigured your cluster to use only compute-optimized instances, you’ll run into severe memory bottlenecks, causing job failures or increased execution time – both of which drive up costs.
Choosing the right instance type manually requires a deep understanding of the workload’s characteristics, which often aren’t known ahead of time, especially for dynamic or new jobs. The wrong instance type can lead to inefficiencies and dramatically higher costs.
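For context, the instance decision ultimately lands in a single field of the cluster spec. A hedged sketch of what that looks like through the Databricks Clusters API on AWS (the runtime version, cluster size, and the up-front workload-profiling step are assumptions, not a prescribed workflow):

```python
# Assumed: some profiling step has already classified the job as compute- or
# memory-bound. In practice that is exactly the information that is rarely
# known up front, especially for new or changing jobs.
workload_profile = "memory_bound"   # hypothetical label

node_type = "r5.2xlarge" if workload_profile == "memory_bound" else "c5.2xlarge"

new_cluster = {
    "spark_version": "14.3.x-scala2.12",        # example runtime; use one available in your workspace
    "node_type_id": node_type,
    "num_workers": 8,                           # placeholder size
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",   # fall back to on-demand if spot capacity dries up
        "first_on_demand": 1,                   # keep the driver on an on-demand instance
        "zone_id": "auto",                      # let Databricks pick the zone
    },
}
# This dict is the cluster definition portion of a Jobs or Clusters API payload.
```

The hard part is not writing this dictionary; it is knowing, before the job runs, which values belong in it.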
Example: Storage Decisions – NVMe vs. EBS
Beyond compute and memory, storage decisions play a pivotal role in cluster configuration. For instance, some instance types (like i3 instances) come with local NVMe SSD storage that is extremely fast and ideal for workloads with high I/O demands, such as shuffling large datasets in Spark. However, these instances tend to be more expensive and have limited disk capacity.
On the other hand, you can opt for instance types that don’t have local NVMe storage and instead use Elastic Block Store (EBS) volumes, which offer more flexibility in terms of capacity but may introduce latency during data access. Should you choose an instance type with an attached NVMe disk for better performance, or should you go with a more cost-efficient instance that uses EBS?
The decision is further complicated by the fact that EBS volumes come in different performance tiers, from general-purpose SSDs (gp2/gp3) to throughput-optimized and cold HDDs (st1 and sc1). An st1 volume typically costs less per GB than gp2 and delivers high sequential throughput, but it handles small random I/O poorly and has a large minimum volume size, so choosing a tier means matching both price and performance profile to the workload’s access pattern.
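In Databricks, these storage choices surface as a handful of cluster-spec fields. A minimal sketch assuming the AWS Clusters API attributes (instance types, volume counts, and sizes are placeholders):

```python
# Option A: i3-family node with local NVMe SSDs - fast shuffle I/O, but capacity
# is fixed by the instance type and the hourly price is higher.
nvme_cluster = {
    "node_type_id": "i3.2xlarge",
    "num_workers": 8,
}

# Option B: cheaper general-purpose node plus attached EBS volumes - flexible
# capacity, but remote storage can add latency on I/O-heavy stages.
ebs_cluster = {
    "node_type_id": "m5.2xlarge",
    "num_workers": 8,
    "aws_attributes": {
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",   # or "THROUGHPUT_OPTIMIZED_HDD" for st1-style volumes
        "ebs_volume_count": 2,                      # placeholder
        "ebs_volume_size": 200,                     # GiB per volume, placeholder
    },
}
```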
Making these decisions manually for each job or workflow introduces significant complexity. Even if you get it right, it might not stay optimized as the workload changes over time.
Autoscaling Complexities: Over- or Under-provisioning Resources
Databricks provides built-in autoscaling functionality to dynamically adjust cluster size based on the workload, but configuring autoscaling parameters manually is still complex. There is a fine line between provisioning enough resources to handle peak loads and over-provisioning during idle periods.
Example: Sizing for Variable Workloads
Consider an hourly batch job that processes different amounts of data at different times of the day. In the early hours of the morning, data loads are light, but by midday, they peak significantly. If you set your cluster size based on the peak load, you will have over-provisioned resources during the low-traffic hours, wasting compute resources and driving up costs.
On the other hand, if you configure the cluster for the minimum workload, it will autoscale up when the data load increases. But autoscaling introduces its own overhead – scaling a cluster up can take time, during which your jobs may be delayed. If you don’t account for this, your job SLAs may be impacted.
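The knobs themselves are simple; choosing their values is not. A sketch of the autoscaling portion of a cluster spec, with placeholder bounds:

```python
# The autoscaling knobs are just two numbers; picking them well is the hard part.
autoscaling_cluster = {
    "node_type_id": "m5.2xlarge",
    "autoscale": {
        "min_workers": 2,    # floor sized for the quiet early-morning runs (placeholder)
        "max_workers": 16,   # ceiling sized for the midday peak (placeholder)
    },
}
# Too small a ceiling risks missing the midday SLA while the cluster scales up;
# too large a ceiling lets a skewed or poorly scaling job consume far more
# compute than the data volume justifies.
```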
Additionally, Spark workloads don’t always scale linearly. You might configure autoscaling to add more nodes when a cluster reaches a certain load threshold, but this doesn’t necessarily translate to improved performance. A job that runs well on 8 nodes might see diminishing returns when scaled to 16 or 32 nodes due to shuffle or data transfer bottlenecks.
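To see why more nodes is not automatically faster, here is a toy runtime model in which compute time shrinks with the node count while shuffle overhead grows with it. The constants are invented purely to illustrate the shape of the curve:

```python
# Toy model (illustrative constants only):
#   runtime(n) = partition-local work / n + per-node shuffle/coordination overhead * n
PARALLEL_WORK = 960.0     # node-minutes of partition-local work
SHUFFLE_OVERHEAD = 1.5    # extra minutes of shuffle/coordination cost per node

def runtime_minutes(nodes: int) -> float:
    return PARALLEL_WORK / nodes + SHUFFLE_OVERHEAD * nodes

for n in (8, 16, 32):
    r = runtime_minutes(n)
    print(f"{n:2d} nodes: ~{r:5.1f} min runtime, ~{r * n:6.0f} node-minutes billed")
```

In this made-up example, doubling from 8 to 16 nodes saves about 48 minutes, but doubling again from 16 to 32 saves only about 6 minutes while nearly doubling the node-minutes you pay for.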
Manually configuring the scaling thresholds – deciding how many nodes to add and when – is difficult without a deep understanding of the specific job and how it interacts with the cluster resources. Mistakes in autoscaling configurations can lead to both performance inefficiencies and unnecessary costs.
Conclusion: Manual Configuration is Not Scalable
As these examples illustrate, manually configuring Databricks clusters for cost optimization is a difficult, dynamic, and error-prone task. Spot instance pricing and availability fluctuate throughout the day, making it hard to know when and where to deploy instances. The vast array of instance types, each with different CPU, memory, and storage configurations, further complicates the decision-making process. Finally, autoscaling introduces additional challenges as workloads fluctuate in unpredictable ways.
In environments running thousands of jobs per day, the only viable solution is automation. Automating cluster configuration based on workload characteristics, real-time spot instance availability, and predictive autoscaling models is key to achieving both performance and cost efficiency at scale.