Most applications are backed by compute and storage, whether on-premises or in the cloud. Cloud providers such as AWS offer a variety of storage options. A popular one is Amazon Elastic Block Store (EBS), a high-performance, scalable managed block storage service whose volumes attach to EC2 (Elastic Compute Cloud) instances.
An EBS volume is a raw, unformatted block device that can be attached to an instance and mounted. Application performance depends on both compute and storage performance, so collecting and monitoring metrics is crucial for debugging performance issues in the stack.
In this article, we will look at various types of EBS available and key metrics to monitor. Fortunately, AWS collects all the important metrics to monitor the health of volumes through CloudWatch.
Before delving into the metrics, we should first learn more about EBS. As discussed above, these are unformatted devices attached to EC2 instances, which are formatted and mounted to form a file system where you can persist files.
EBS volumes are created and available in a specific availability zone. Previous generation volumes use traditional magnetic disks for storage, while the new generation offers solid-state drives (SSD), which perform better for random reads and writes.
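There are four types of volumes available: General Purpose SSD (gp2, gp3), Provisioned IOPS SSD (io1, io2), Throughput Optimized HDD (st1), and Cold HDD (sc1).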
Amazon CloudWatch is a managed service that allows developers to monitor various resources and applications running in the AWS Cloud.
CloudWatch lets you collect and track metrics, set alarms, and react to changes in AWS resources. CloudWatch can highlight system-wide resource utilization, operational health, and performance issues.
CloudWatch provides several pre-defined metrics for multiple AWS resources out of the box. To monitor the performance of EBS volumes, use the CloudWatch metrics discussed in the next section.
Several metrics are exposed through CloudWatch. Most importantly, we need to observe the I/O characteristics of the disk. AWS disks come with various performance options, and their performance depends on different factors such as VM sizing, operating system, and kernel settings.
Input/output operations per second (IOPS) is, as its name suggests, a unit that measures the number of I/O operations a storage device can perform in one second. The size of each I/O operation is measured in KiB, and the maximum amount of data that can be processed as a single I/O depends on the type of drive being used. For instance, the I/O size is capped at 256 KiB for SSD volumes and at 1,024 KiB for HDD volumes. This is because SSD volumes are more efficient at handling small or random I/O operations than HDD volumes.
To optimize performance, EBS combines small, physically sequential I/O operations into one larger I/O operation, up to the maximum I/O size. Likewise, EBS splits I/O operations larger than the maximum size into smaller I/O operations.
Volume type | Maximum I/O size | I/O operations from your application | Number of IOPS | Notes |
---|---|---|---|---|
SSD | 256 KiB | 1 x 1,024 KiB I/O operation | 4 | EBS splits the 1,024 KiB I/O operation into four 256 KiB operations. |
SSD | 256 KiB | 8 x sequential 32 KiB I/O operations | 1 | EBS combines the 8 sequential 32 KiB I/O operations into a single 256 KiB operation. |
SSD | 256 KiB | 8 x random 32 KiB I/O operations | 8 | Each random I/O operation is counted separately. |
HDD | 1,024 KiB | 1 x 1,024 KiB I/O operation | 1 | Single I/O operation. |
HDD | 1,024 KiB | 8 x sequential 128 KiB I/O operations | 1 | EBS combines the 8 sequential 128 KiB I/O operations into a single 1,024 KiB I/O operation. |
HDD | 1,024 KiB | 8 x random 32 KiB I/O operations | 8 | Each random I/O operation is counted separately. |
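The splitting and merging rules in the table can be expressed as a few lines of arithmetic. The sketch below is only an illustration of those rules, not how EBS is implemented: the function name `counted_iops` and its parameters are hypothetical, and it assumes that every request in a batch has the same size and that a sequential batch is merged perfectly.

```python
import math

# Maximum I/O sizes quoted in the text, in KiB.
MAX_IO_KIB = {"ssd": 256, "hdd": 1024}


def counted_iops(volume_type, io_size_kib, count, sequential):
    """Estimate how many I/O operations are counted for `count` requests
    of `io_size_kib` KiB each (hypothetical helper, simplified model)."""
    max_io = MAX_IO_KIB[volume_type]
    if io_size_kib > max_io:
        # Oversized requests are split into chunks of at most max_io.
        return count * math.ceil(io_size_kib / max_io)
    if sequential:
        # Small sequential requests are merged up to max_io.
        return math.ceil(count * io_size_kib / max_io)
    # Random requests are counted one by one.
    return count


if __name__ == "__main__":
    print(counted_iops("ssd", 1024, 1, sequential=False))
    print(counted_iops("ssd", 32, 8, sequential=True))
    print(counted_iops("hdd", 128, 8, sequential=True))
```

Running the helper against each row of the table is a quick way to check your own understanding of how an application's I/O pattern translates into consumed IOPS.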
Two metrics in CloudWatch denote IOPS: VolumeReadOps and VolumeWriteOps. VolumeReadOps reports the total number of read operations completed in the specified period; dividing that total by the number of seconds in the period gives the average read IOPS. VolumeWriteOps reports the total number of write operations in the same way, and dividing it by the number of seconds in the period gives the average write IOPS.
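As a rough sketch of that calculation, the helpers below turn per-period operation totals into average IOPS. The function names are hypothetical. The datapoint shape (dictionaries with `Timestamp` and `Sum` keys) is assumed to match what the boto3 CloudWatch `get_metric_statistics` call returns when the `Sum` statistic is requested; `fetch_read_ops` shows one way such datapoints might be retrieved and needs boto3 plus valid AWS credentials.

```python
def average_iops(ops_sum, period_seconds):
    """Average IOPS for one period: total operations / seconds in the period."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return ops_sum / period_seconds


def iops_series(datapoints, period_seconds):
    """Convert datapoints ({'Timestamp': ..., 'Sum': ...}) into
    (timestamp, average IOPS) pairs ordered by time."""
    ordered = sorted(datapoints, key=lambda d: d["Timestamp"])
    return [(d["Timestamp"], average_iops(d["Sum"], period_seconds)) for d in ordered]


def fetch_read_ops(volume_id, start, end, period_seconds=300):
    """Fetch VolumeReadOps sums for one volume (not exercised here)."""
    import boto3  # imported lazily so the helpers above work without it

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="VolumeReadOps",
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=start,
        EndTime=end,
        Period=period_seconds,
        Statistics=["Sum"],
    )
    return response["Datapoints"]
```

For example, a five-minute period (300 seconds) with a VolumeReadOps sum of 90,000 corresponds to an average of 300 read IOPS. The same helpers apply to VolumeWriteOps by changing the metric name.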
Volume queue length measures the number of I/O requests waiting to be processed by a storage device. Latency, on the other hand, is the time an I/O operation (such as a read or write request) takes to complete, measured from when it is sent to the storage device until an acknowledgment is received. To avoid causing bottlenecks in the system or on the network connection to the storage device, it is crucial to carefully balance the queue length with the I/O size and latency.
The optimal queue length for a workload depends on the application's sensitivity to IOPS and latency. For example, transaction-intensive applications are more sensitive to increased I/O latency. They tend to perform better with SSD-backed volumes, which can sustain high IOPS at low latency by keeping the queue length low and the number of available IOPS high.
Meanwhile, applications that focus more on throughput tend to be less affected by increased I/O latency and perform better with HDD-backed volumes. These volumes can sustain high throughput by keeping a high queue length while performing large, sequential I/O operations. If the workload does not generate enough I/O requests to fully utilize the storage volume, the volume may not deliver its provisioned IOPS or throughput.
In CloudWatch, this metric is named VolumeQueueLength, and it represents the count of I/O requests waiting to be processed.
Two metrics in CloudWatch, VolumeReadBytes and VolumeWriteBytes, provide the total number of bytes read from and written to the volume during the specified period. Both are reported in bytes. Use the following formulas, where Period is the length of the period in seconds, to derive bandwidth:
Read Bandwidth (KiB/s): sum(VolumeReadBytes) / Period / 1024
Write Bandwidth (KiB/s): sum(VolumeWriteBytes) / Period / 1024
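The two formulas translate directly into code. The following is a minimal sketch with hypothetical function names. The second helper goes one step beyond the formulas above and divides the byte total by the operation total (for example, VolumeReadBytes by VolumeReadOps over the same period) to estimate the average I/O size, which is an assumption about how you might combine the metrics rather than something CloudWatch reports directly.

```python
def bandwidth_kib_per_s(bytes_sum, period_seconds):
    """Bandwidth in KiB/s: sum(bytes) / Period / 1024."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return bytes_sum / period_seconds / 1024


def average_io_size_kib(bytes_sum, ops_sum):
    """Average size of one I/O operation in KiB over the same period.
    Returns 0.0 when no operations were recorded."""
    if ops_sum == 0:
        return 0.0
    return bytes_sum / ops_sum / 1024
```

For instance, 300 MiB (314,572,800 bytes) read during a 300-second period works out to 1,024 KiB/s of read bandwidth.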
Burst balance tracks the usage of I/O credits or throughput credits on General Purpose SSD (gp2), Throughput Optimized HDD (st1), and Cold HDD (sc1) volumes. It is the percentage of credits remaining in the burst bucket, a reserve of credits that can be used to handle sudden increases in workload.
When the burst bucket is depleted, the volume's performance may be limited until credits are replenished. The type of credits used (I/O or throughput) depends on the volume type. The CloudWatch metric that keeps track of these credits is BurstBalance, reported as a percentage.
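One way to make BurstBalance actionable is to estimate how long the remaining credits will last at the current rate of consumption. The sketch below is a naive linear extrapolation from two readings, not an AWS-provided calculation: the function name is hypothetical, and it assumes the workload keeps draining credits at a constant rate.

```python
def minutes_until_depleted(earlier_pct, later_pct, minutes_between):
    """Estimate minutes until BurstBalance reaches 0%, given two readings
    taken `minutes_between` minutes apart.
    Returns None when the balance is flat or recovering."""
    if minutes_between <= 0:
        raise ValueError("minutes_between must be positive")
    drop_per_minute = (earlier_pct - later_pct) / minutes_between
    if drop_per_minute <= 0:
        return None
    return later_pct / drop_per_minute
```

If the balance falls from 80% to 70% over five minutes, credits are draining at 2 percentage points per minute, so the remaining 70% would last roughly 35 minutes at that pace.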
This article has focused on the importance of monitoring as well as the six most important metrics to monitor the performance of EBS volumes: VolumeReadOps, VolumeWriteOps, VolumeReadBytes, VolumeWriteBytes, VolumeQueueLength, and BurstBalance.
It is important to baseline application performance under different loads and then interpret these values in your own context. Set monitoring thresholds and alarms in accordance with your application’s performance requirements.
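To illustrate how a threshold might be evaluated against a series of metric values, the sketch below flags a breach only when the most recent datapoints all cross the threshold, which avoids reacting to a single spike. The function name, parameters, and the thresholds in the usage note are hypothetical; appropriate values depend on your own baseline.

```python
def breaches(values, threshold, comparison="greater", datapoints_to_alarm=3):
    """Return True when the last `datapoints_to_alarm` values all breach
    the threshold. `values` is ordered oldest to newest."""
    if len(values) < datapoints_to_alarm:
        return False  # not enough data to decide
    recent = values[-datapoints_to_alarm:]
    if comparison == "greater":
        return all(v > threshold for v in recent)
    if comparison == "less":
        return all(v < threshold for v in recent)
    raise ValueError("comparison must be 'greater' or 'less'")
```

For example, `breaches(queue_lengths, 4)` would flag a VolumeQueueLength that stayed above 4 for three consecutive periods, while `breaches(burst_balances, 20, comparison="less")` would flag a BurstBalance that stayed below 20%.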