How to Use EBS in Databricks: A Technical Guide
December 27, 2024
Introduction
Elastic Block Store (EBS) is a key storage solution in cloud-based architectures, and it plays a crucial role in optimizing Databricks workloads. Understanding when and how to leverage EBS can significantly impact performance and cost efficiency. In this guide, we will explore what EBS is, how it integrates with Databricks, its configurable parameters, and best practices for using it effectively.
What is EBS?
Elastic Block Store (EBS) is a block storage service provided by AWS that offers high-performance storage for compute instances. Unlike ephemeral storage, which is tied to the lifecycle of an instance, EBS volumes persist beyond instance termination. This makes EBS particularly useful for workloads that require durability and scalability.
Many users find AWS EBS confusing, particularly when dealing with dynamic storage allocation and performance trade-offs. A discussion on Reddit by u/KarlData highlights common concerns about understanding EBS volume scaling, performance limitations, and billing implications. This guide aims to clarify these concepts and offer best practices tailored for Databricks users.
Key characteristics of EBS:
- Persistent storage that remains available even after instance termination, including spot terminations.
- High throughput and low-latency block-level storage.
- Various volume types optimized for performance and cost.
- Ability to scale and attach volumes dynamically to instances (see the sketch below).
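To make the last point concrete, here is a minimal boto3 sketch of creating a gp3 volume and attaching it to a running EC2 instance. The region, availability zone, and instance ID are placeholders; within Databricks, the platform performs these steps for you.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# Create a 100 GiB gp3 volume at the baseline 3,000 IOPS / 125 MB/s.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=100,            # GiB
    VolumeType="gp3",
    Iops=3000,
    Throughput=125,      # MB/s
)

# Wait until the volume is available, then attach it to an instance.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # placeholder instance ID
    Device="/dev/sdf",
)
```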
How is EBS Implemented in Databricks?
Databricks on AWS provides two storage options: instance-attached storage (ephemeral disks) and EBS volumes. In Databricks, EBS volumes are typically used for:
- Persistent storage of notebooks and logs.
- Additional storage when ephemeral disks are insufficient.
- Scaling storage independently from compute resources.
A post on Reddit by u/MLengnr shows that many users are uncertain about how EBS interacts with Databricks job clusters. Users often ask if job clusters can leverage EBS effectively and how to configure storage scaling properly. This guide will address these questions by explaining best practices for integrating EBS into Databricks job clusters.
By default, Databricks workers use ephemeral storage, but when additional storage is needed, EBS volumes can be attached dynamically. The platform manages EBS provisioning behind the scenes, automatically attaching and detaching volumes as needed.
Configurable Parameters of EBS in Databricks
When using EBS in Databricks, there are several configurable parameters to optimize performance and cost (a configuration sketch follows this list):
- Volume Type – EBS offers different types such as GP3 (General Purpose SSD), IO1 (Provisioned IOPS SSD), and ST1 (Throughput Optimized HDD). Choosing the right type depends on workload needs. A detailed breakdown of these choices can be found in this Medium article by DevOpsLearning, which compares GP2 vs. GP3 and IO1 vs. IO2 for different workloads.
- Number of Volumes – Each instance can have multiple volumes attached, up to a maximum of 10. This option is available only for supported node types; legacy node types do not support custom EBS volume configurations. For node types without an instance store, at least one EBS volume must be specified; otherwise, cluster creation will fail.
- Size – The allocated size of the EBS volume, typically starting from 32 GiB.
- IOPS and Throughput – IOPS (Input/Output Operations Per Second) and throughput can be fine-tuned for workloads requiring high-performance storage.
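As a rough illustration, the sketch below creates a cluster with explicit EBS settings through the Databricks Clusters API. The aws_attributes fields (ebs_volume_type, ebs_volume_count, ebs_volume_size, ebs_volume_iops, ebs_volume_throughput) follow the API's naming; the node type, runtime version, and credentials are example values, not recommendations.

```python
import os
import requests

HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token

cluster_spec = {
    "cluster_name": "ebs-example",
    "spark_version": "15.4.x-scala2.12",  # example Databricks runtime
    "node_type_id": "r6i.4xlarge",        # no instance store, so EBS is required
    "num_workers": 2,
    "aws_attributes": {
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,         # up to 10 volumes per instance
        "ebs_volume_size": 100,        # GiB per volume
        "ebs_volume_iops": 3000,       # optional fine-tuning
        "ebs_volume_throughput": 125,  # optional, MB/s
    },
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```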
What is Databricks’ Autoscaling Local Storage Feature?
Databricks provides an autoscaling local storage feature, which automatically provisions additional EBS volumes when a job runs out of local disk space. Key aspects of this feature include:
- Automatic scaling of storage to prevent job failures due to space constraints.
- Dynamic detachment of unused EBS volumes to minimize cost.
- Seamless integration with Databricks clusters, requiring minimal manual configuration.
This feature ensures that workloads that temporarily need more storage do not fail due to disk space limitations while keeping costs optimized by releasing storage when it is no longer needed.
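In the cluster spec, this behavior is toggled with the enable_elastic_disk flag rather than fixed ebs_volume_* settings. A minimal sketch, reusing the API call pattern from the previous example:

```python
cluster_spec = {
    "cluster_name": "autoscaling-storage-example",
    "spark_version": "15.4.x-scala2.12",  # example Databricks runtime
    "node_type_id": "r6i.4xlarge",
    "num_workers": 2,
    # Autoscaling local storage: Databricks attaches additional EBS volumes
    # when a worker runs low on disk and detaches them when no longer needed.
    "enable_elastic_disk": True,
}
```

With this flag set, you generally let Databricks decide volume provisioning rather than pinning sizes and counts yourself.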
When to Use EBS in Databricks?
EBS should be used strategically in Databricks to optimize performance and costs. Here are some common scenarios:
- Workers without Local Disk (non-“d” instances): Some EC2 instance types (e.g., r6i) do not include ephemeral disks; only their “d” counterparts (e.g., r6id) do. In such cases, EBS is required for temporary or persistent storage.
- High Storage-to-Compute Ratio: when large datasets must be processed with relatively little compute, EBS lets you scale storage without adding instances.
- Checkpoints and State Persistence: Workloads that require data persistence between cluster restarts benefit from EBS.
What are Key Performance Benefits of EBS?
- Reduces Job Failures: Autoscaling or static EBS allocation ensures jobs do not fail due to storage limits.
- Improves Job Duration: Faster IOPS and throughput enable quicker data access and processing.
- Cost Reduction: EBS-backed instances can be cheaper than their local-disk counterparts, as described below.
EBS Pricing in Databricks
EBS pricing within Databricks on AWS is determined by the type and amount of storage provisioned, with costs accruing separately from compute instances. This means storage expenses can continue even after clusters are terminated, unless the storage is explicitly deleted. Let’s review a breakdown of the primary EBS volume types and their associated costs:
- General Purpose SSD (gp3):
  - Storage Cost: $0.08 per GB-month.
  - Included Performance: 3,000 IOPS and 125 MB/s throughput at no additional charge.
  - Additional Performance Costs:
    - IOPS: $0.005 per provisioned IOPS-month beyond the included 3,000 IOPS.
    - Throughput: $0.04 per provisioned MB/s-month beyond the included 125 MB/s.
- General Purpose SSD (gp2):
  - Storage Cost: $0.10 per GB-month.
  - Performance: IOPS are included in the storage cost, with performance scaling linearly with volume size.
- Provisioned IOPS SSD (io2):
  - Storage Cost: $0.125 per GB-month.
  - IOPS Cost:
    - $0.065 per provisioned IOPS-month for up to 32,000 IOPS.
    - $0.046 per provisioned IOPS-month for IOPS between 32,001 and 64,000.
    - $0.032 per provisioned IOPS-month for IOPS exceeding 64,000.
- Throughput Optimized HDD (st1):
  - Storage Cost: $0.045 per GB-month.
  - Performance: Designed for frequently accessed, throughput-intensive workloads.
- Cold HDD (sc1):
  - Storage Cost: $0.015 per GB-month.
  - Performance: Suited for less frequently accessed workloads with lower performance requirements.
It’s important to note that these costs are based on AWS’s pricing and are subject to change. Additionally, Databricks itself does not add extra charges for EBS volumes; however, efficient management of storage resources is crucial to avoid unnecessary expenses.
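To see how these rates combine, here is a back-of-the-envelope calculation for a single gp3 volume using the list prices quoted above (the volume size and provisioned performance are arbitrary example values):

```python
# Monthly cost of one gp3 volume at the AWS list prices quoted above.
size_gb = 500
provisioned_iops = 6_000       # 3,000 IOPS included at no charge
provisioned_throughput = 250   # MB/s; 125 MB/s included at no charge

storage = size_gb * 0.08                                  # $0.08 per GB-month
extra_iops = max(0, provisioned_iops - 3_000) * 0.005     # $0.005 per IOPS-month
extra_tput = max(0, provisioned_throughput - 125) * 0.04  # $0.04 per MB/s-month

total = storage + extra_iops + extra_tput
print(f"gp3 monthly cost: ${total:.2f}")  # $40.00 + $15.00 + $5.00 = $60.00
```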
How Can EBS Improve Job Costs?
One of the most significant advantages of EBS is its potential for cost optimization, particularly for workloads with low storage consumption. Many instance families come in variants with and without local disk storage (the “d” suffix, e.g., r6id vs. r6i). In Databricks, both variants cost the same, while on AWS the instance price is higher for the variants with local disks.
| Instance Type | Cost Per Hour | Storage Type |
| --- | --- | --- |
| r6id.4xlarge | $1.50 | Local NVMe SSD |
| r6i.4xlarge + EBS | $1.30 | EBS GP3 |
As you can see, instances with built-in ephemeral storage (e.g., r6id) are more expensive than their EBS-reliant counterparts (e.g., r6i). Using an r6i instance with EBS storage is often cheaper than an r6id instance, without compromising performance.
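You can sanity-check this comparison by converting EBS’s per-GB-month price into an hourly figure. A small sketch, assuming a 500 GiB gp3 volume and roughly 730 hours per month:

```python
HOURS_PER_MONTH = 730     # approximate hours in a month
GP3_PER_GB_MONTH = 0.08   # gp3 storage price quoted above

def ebs_hourly_cost(size_gb: int) -> float:
    """Hourly cost of a gp3 volume at baseline IOPS/throughput."""
    return size_gb * GP3_PER_GB_MONTH / HOURS_PER_MONTH

# A 500 GiB volume adds about $0.055/hour, so r6i plus EBS still lands
# well below the r6id price in the table above.
print(f"500 GiB gp3: ${ebs_hourly_cost(500):.3f}/hour")
```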
Conclusion
Using EBS in Databricks can significantly improve both performance and cost efficiency. By referencing common user concerns and best practices, this guide aims to provide clarity on optimal EBS usage. If you’re looking to further optimize Databricks costs and performance, expert consultation can provide tailored strategies and automation to ensure seamless, cost-effective operations.
Would you like to explore more advanced optimizations for your Databricks workloads? Contact us today to learn how we can help reduce costs and improve performance for your data pipelines.