Key AWS EBS Metrics to Monitor

Most applications are backed by compute and storage, whether on-premises or in the cloud. Cloud providers such as AWS offer a variety of storage options. A popular one is Amazon Elastic Block Store (EBS), a high-performance, scalable managed block storage service whose volumes attach to Amazon EC2 (Elastic Compute Cloud) instances.

A block storage volume is a raw, unformatted block device that can be attached to an instance. Because application performance depends on both compute and storage performance, collecting and monitoring metrics is crucial for debugging performance issues in the stack.

In this article, we will look at the types of EBS volumes available and the key metrics to monitor. Fortunately, AWS collects the important volume health metrics and exposes them through CloudWatch.

AWS EBS

Before delving into the metrics, we should first learn more about EBS. As discussed above, EBS volumes are unformatted block devices attached to EC2 instances. Once attached, a volume is formatted and mounted as a file system where you can persist files.

EBS volumes are created in, and available only within, a specific availability zone. Previous-generation volumes use traditional magnetic disks for storage, while the current generation adds solid-state drives (SSD), which perform better for random reads and writes.

There are four types of volumes available:

General Purpose SSD volumes (gp2 and gp3) provide a good price-to-performance ratio and are generally suitable for transactional workloads. You can use these disks for use cases that involve boot volumes, single instance databases, and dev/test environments.
Provisioned IOPS SSD volumes (io1 and io2) are ideal for use cases where you need to cater to I/O-intensive workloads sensitive to performance and consistency. The IOPS provided by this type of volume is consistent.
Throughput Optimized HDD volumes (st1) use traditional, low-cost magnetic storage. Their performance is defined according to throughput (instead of IOPS), and they’re a better fit for large sequential workloads like ETLs or log processing.
Cold HDD volumes (sc1) are similar to Throughput Optimized HDD volumes. They are the cheapest of the four options and are suited to infrequently accessed data.

The role of CloudWatch

Amazon CloudWatch is a managed service that allows developers to monitor various resources and applications running in the AWS Cloud.

CloudWatch lets you collect and track metrics, set alarms, and react to changes in AWS resources. CloudWatch can highlight system-wide resource utilization, operational health, and performance issues.

CloudWatch provides several pre-defined metrics for multiple AWS resources out of the box. To monitor the performance of EBS volumes, use the CloudWatch metrics discussed in the next section.

Key metrics

Several metrics are exposed through CloudWatch. Most importantly, we need to observe the I/O characteristics of the volume. EBS volumes come with various performance options, and the performance you actually see depends on factors such as instance size, operating system, and kernel settings.

IOPS

Input/output operations per second (IOPS) is, as its name suggests, a unit that measures the number of I/O operations a storage device can perform in a second. The size of each I/O operation is measured in KiB, and the maximum amount of data that counts as a single I/O depends on the type of drive being used. For instance, the I/O size is capped at 256 KiB for SSD volumes and at 1,024 KiB for HDD volumes. This is because SSD volumes are more efficient at handling small or random I/O operations than HDD volumes.

To optimize performance, EBS tries to combine small, physically contiguous (sequential) I/O operations into one larger I/O operation, up to the maximum I/O size. Likewise, EBS splits I/O operations larger than the maximum size into smaller I/O operations.

Volume type	Maximum I/O size	I/O operations from your application	Number of IOPS	Notes
SSD	256 KiB	1 x 1,024 KiB I/O operation	4	EBS splits the 1,024 KiB I/O operation into four 256 KiB operations.
		8 x sequential 32 KiB I/O operations	1	EBS combines the 8 sequential 32 KiB I/O operations into a single 256 KiB operation.
		8 random 32 KiB I/O operations	8	Each random I/O operation is counted separately.
HDD	1,024 KiB	1 x 1024 KiB I/O operation	1	Single I/O operation
		8 x sequential 128 KiB I/O operations	1	EBS combines the 8 sequential 128 KiB I/O operations into a single 1,024 KiB I/O operation.
		8 random 32 KiB I/O operations	8	Each random I/O operation is counted separately.
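
The counting rules in the table can be sketched in a few lines of Python. This is a simplified model for illustration only, not how EBS is implemented: the function name `count_iops` and the `sequential` flag are hypothetical, and the model assumes that sequential requests merge perfectly up to the maximum I/O size.

```python
from math import ceil

# Maximum I/O sizes quoted in the article, in KiB.
MAX_IO_KIB = {"ssd": 256, "hdd": 1024}


def count_iops(volume_kind: str, op_sizes_kib: list, sequential: bool) -> int:
    """Estimate how many I/O operations are counted for a batch of requests.

    Simplified model (an assumption, not the EBS implementation):
    - sequential requests are merged into chunks of the maximum I/O size;
    - random requests are counted one by one;
    - any request larger than the maximum I/O size is split.
    """
    max_io = MAX_IO_KIB[volume_kind]
    if sequential:
        # Contiguous requests behave like one long transfer that is then
        # cut into maximum-size chunks.
        return ceil(sum(op_sizes_kib) / max_io)
    # Random requests cannot be merged; oversized ones are still split.
    return sum(ceil(size / max_io) for size in op_sizes_kib)


if __name__ == "__main__":
    print(count_iops("ssd", [1024], sequential=False))
    print(count_iops("ssd", [32] * 8, sequential=True))
    print(count_iops("hdd", [128] * 8, sequential=True))
```

Running the function against each row of the table is a quick way to check that the counts are internally consistent.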

Two metrics in CloudWatch denote I/O operations: VolumeReadOps and VolumeWriteOps. VolumeReadOps reports the total number of read operations completed in a specified period, and VolumeWriteOps reports the total number of write operations. To get the average IOPS for the period, divide the total by the number of seconds in that period.
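
As a minimal sketch of that division, assuming you have already retrieved the Sum statistic of each metric for one period (the function and argument names here are hypothetical):

```python
def average_iops(read_ops_sum: float, write_ops_sum: float, period_seconds: int) -> dict:
    """Convert per-period operation counts into average IOPS.

    read_ops_sum / write_ops_sum: Sum of VolumeReadOps / VolumeWriteOps
    over one period. period_seconds: length of that period in seconds.
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    read_iops = read_ops_sum / period_seconds
    write_iops = write_ops_sum / period_seconds
    return {
        "read_iops": read_iops,
        "write_iops": write_iops,
        "total_iops": read_iops + write_iops,
    }


if __name__ == "__main__":
    # 90,000 reads and 30,000 writes in a 5-minute (300 s) period.
    print(average_iops(90_000, 30_000, 300))
```

Comparing `total_iops` with the volume's provisioned or baseline IOPS shows how close the workload is to the volume's limit.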

Volume queue length and latency

Volume queue length measures the number of I/O requests waiting to be processed by a storage device. Latency, on the other hand, is the time an I/O operation (such as a read or write request) takes to complete, measured from when it is sent to the storage device until an acknowledgment is received. To avoid causing bottlenecks in the system or on the network connection to the storage device, it is crucial to carefully balance the queue length with the I/O size and latency.

The optimal queue length for a workload depends on the application's sensitivity to IOPS and latency. For example, transaction-intensive applications are more sensitive to increased I/O latency. They tend to perform better with SSD-backed volumes, which can sustain high IOPS at low latency as long as the queue length stays low and enough IOPS are available to the volume.

Meanwhile, applications that focus more on throughput tend to be less affected by increased I/O latency and perform better with HDD-backed volumes. These volumes can sustain high throughput by keeping a high queue length while performing large, sequential I/O operations. If the workload is not generating sufficient I/O requests to fully utilize the performance of the storage volume, it is possible it won’t deliver the provisioned IOPS or throughput.

CloudWatch exposes this value as VolumeQueueLength, which represents the number of read and write requests waiting to be completed during the specified period.
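
Latency itself is not one of the metrics listed in this article, but it can be estimated. The sketch below assumes you also collect VolumeTotalReadTime (or VolumeTotalWriteTime), a CloudWatch metric not otherwise covered here that reports the total seconds spent on completed operations in a period. Dividing it by the operation count gives an average per-operation latency. The function name is hypothetical.

```python
from typing import Optional


def average_latency_ms(total_time_seconds: float, ops_sum: float) -> Optional[float]:
    """Average latency per operation in milliseconds.

    total_time_seconds: Sum of VolumeTotalReadTime (or VolumeTotalWriteTime)
    for the period. ops_sum: Sum of VolumeReadOps (or VolumeWriteOps) for
    the same period. Returns None when no operations completed, because an
    average is undefined in that case.
    """
    if ops_sum <= 0:
        return None
    return total_time_seconds * 1000.0 / ops_sum


if __name__ == "__main__":
    # 45 seconds of accumulated read time across 90,000 reads.
    print(average_latency_ms(45.0, 90_000))
```

Reading this value alongside VolumeQueueLength helps tell whether a long queue is actually hurting response times.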

Read/write bandwidth

Two metrics in CloudWatch, VolumeReadBytes and VolumeWriteBytes, report the bytes read from and written to the volume during the specified period. Their Sum statistic gives the total number of bytes transferred. Use the following formulas, with Period in seconds, to derive bandwidth:

Read Bandwidth (KiB/s): sum(VolumeReadBytes) / Period / 1024
Write Bandwidth (KiB/s): sum(VolumeWriteBytes) / Period / 1024
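
The same formula as a small Python helper, shown as a sketch with a hypothetical function name. It works for either metric, as long as the byte count is the Sum statistic for the period:

```python
def bandwidth_kib_per_s(bytes_sum: float, period_seconds: int) -> float:
    """Average bandwidth in KiB/s from a per-period byte total.

    bytes_sum: Sum of VolumeReadBytes or VolumeWriteBytes for the period.
    period_seconds: length of the period in seconds.
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return bytes_sum / period_seconds / 1024


if __name__ == "__main__":
    # 300 MiB read during a 5-minute (300 s) period.
    print(bandwidth_kib_per_s(300 * 1024 * 1024, 300))
```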

Burst balance credit

Burst balance applies to volume types that use I/O credits or throughput credits: General Purpose SSD (gp2), Throughput Optimized HDD (st1), and Cold HDD (sc1). It is the percentage of credits remaining in the burst bucket, a reserve of credits that can be used to handle sudden increases in workload.

When the burst bucket is depleted, the volume's performance is limited to its baseline until credits are replenished. The type of credits used (I/O or throughput) depends on the volume type. The CloudWatch metric BurstBalance reports the remaining credits as a percentage.
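
To make the percentage more concrete, the sketch below estimates how long a gp2 volume can keep bursting before the bucket empties. The constants are assumptions based on the figures AWS publishes for gp2 (a bucket of 5.4 million I/O credits, bursts up to 3,000 IOPS, and a baseline of 3 IOPS per GiB with a 100 IOPS minimum); verify them against the current documentation before relying on the result. The function names are hypothetical.

```python
from typing import Optional

# Assumed gp2 figures; check the current AWS documentation.
GP2_BUCKET_CREDITS = 5_400_000   # I/O credits in a full burst bucket
GP2_BURST_IOPS = 3_000           # maximum burst rate
GP2_MAX_BASELINE_IOPS = 16_000   # baseline ceiling for large volumes


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS: 3 per GiB, at least 100, capped at the ceiling."""
    return min(max(100, 3 * size_gib), GP2_MAX_BASELINE_IOPS)


def seconds_until_depleted(size_gib: int, burst_balance_pct: float,
                           demand_iops: int = GP2_BURST_IOPS) -> Optional[float]:
    """Estimate seconds until the burst bucket is empty.

    burst_balance_pct is the BurstBalance value (0 to 100). Returns None
    when demand does not exceed the baseline, so the bucket is not draining.
    """
    drain_rate = demand_iops - gp2_baseline_iops(size_gib)
    if drain_rate <= 0:
        return None
    remaining_credits = GP2_BUCKET_CREDITS * burst_balance_pct / 100
    return remaining_credits / drain_rate


if __name__ == "__main__":
    # A 100 GiB volume with a full bucket, driven at the burst rate.
    print(seconds_until_depleted(100, 100.0))
```

An estimate like this can guide how early a BurstBalance alarm needs to fire for a given workload.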

Conclusion

This article has focused on the importance of monitoring as well as the six most important metrics to monitor the performance of EBS volumes: VolumeReadOps, VolumeWriteOps, VolumeReadBytes, VolumeWriteBytes, VolumeQueueLength, and BurstBalance.

It is important to baseline application performance under different loads and then make sense of these values in your context. It is recommended to set the monitoring threshold and alarms in accordance with your application’s performance requirements.
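
As one possible starting point, the sketch below builds the parameters for a CloudWatch alarm on BurstBalance. The threshold, period, alarm name, and function name are illustrative assumptions and should come from your own baseline. The dictionary keys follow the parameters of the CloudWatch PutMetricAlarm API, so the result could be passed to boto3 as shown in the final comment.

```python
def burst_balance_alarm_params(volume_id: str, threshold_pct: float = 20.0,
                               period_seconds: int = 300,
                               evaluation_periods: int = 3) -> dict:
    """Build PutMetricAlarm parameters for a low-BurstBalance alarm.

    The defaults are placeholders, not recommendations: the alarm fires
    when the average BurstBalance stays below threshold_pct for
    evaluation_periods consecutive periods.
    """
    return {
        "AlarmName": f"ebs-burst-balance-low-{volume_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EBS",
        "MetricName": "BurstBalance",
        "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold_pct,
        "ComparisonOperator": "LessThanThreshold",
    }


if __name__ == "__main__":
    params = burst_balance_alarm_params("vol-0123456789abcdef0")  # placeholder ID
    print(params)
    # With AWS credentials configured, the alarm could be created with:
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```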
