Thanks to advances in architectural strategies and technology, application services have become more resilient. They are highly scalable, released more quickly, and simpler to test.
A continuous monitoring strategy is required to keep track of the availability and performance of the underlying infrastructure needed to provide a high-quality application service to our customers. So, whenever a service fails, our operations team needs to nail down the exact problem quickly to prevent service degradation and a negative impact on the business.
Our contemporary distributed systems generate hundreds of metrics, including database metrics, host metrics, application metrics, infrastructure metrics, and so on. Continuously keeping track of all these metrics is impractical, and yet, choosing which metrics to monitor is critical for achieving a highly available and reliable service.
In this article, we'll delve deeper into what "Golden Signals" are and why they're critical for monitoring our distributed software systems. In addition, we'll go over the various monitoring techniques and related tools.
Site reliability engineering (SRE) is an IT strategy for managing highly reliable and scalable software systems. SRE relies on data to keep even the most distributed, complicated IT systems stable. It also applies software engineering to speed up software delivery and automate IT operations tasks while lowering risk. SRE practices sit very close to the IT operations landscape, with the core ideology of making production systems more reliable. Monitoring is an indispensable SRE tenet, a must-have for any software system or application to work properly. The more insight we have into our software and hardware systems, the better equipped we are to keep our customers satisfied.
The Golden Signals help teams track service health consistently across all applications and infrastructure. There are four of them: latency, traffic, errors, and saturation. They have become essential building blocks for effective monitoring because they reduce the number of metrics you need to watch to just four while still providing a comprehensive view of service quality from the customer's perspective.
Fig. 1: SRE’s Golden Signals

Latency is the total time it takes for a user to send a request and get a response back. For instance, if a web service communicates with a database service on the backend to verify a user, the time it takes to execute the database query is measured as part of the latency calculation.
Latency is usually measured on the server side, but it can also be measured on the client side. Both should be quantified, though client-side latency matters more to the user experience. You can use the 95th percentile of the collected latency data points to gauge how well your application is performing. Increased latency is a major sign of performance degradation, so the lower the latency, the better the service.
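As a rough illustration of the percentile approach, the sketch below computes a 95th-percentile latency from a set of collected request durations using Python's standard library. The sample values and the alert threshold are hypothetical, not figures from this article.

```python
# Minimal sketch: computing a 95th-percentile latency from collected samples.
# The sample durations and the 300 ms threshold below are illustrative only.
import statistics

latency_samples_ms = [112, 98, 130, 121, 540, 101, 95, 118, 109, 125]

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points;
# index 94 corresponds to the 95th percentile.
p95 = statistics.quantiles(latency_samples_ms, n=100)[94]

print(f"p95 latency: {p95:.1f} ms")
if p95 > 300:  # hypothetical SLO threshold
    print("Latency target at risk: investigate slow dependencies.")
```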
The error rate defines the rate of unsuccessful requests. The operations team must keep track of both the overall system error rate and the rate of errors occurring at specific service endpoints. These errors highlight infrastructure misconfigurations, outages, flaws in our application code, or broken dependencies. For instance, a sudden spike in the error rate might represent a service failure, database failure, or network outage.
To understand the health of a service, you need to understand errors and categorize them into critical and non-critical buckets. This also enables you to act quickly and take corrective measures.
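As a simple sketch of this idea, the snippet below derives an error rate from a batch of HTTP status codes and splits the errors into critical and non-critical buckets. The status codes and the classification rule (5xx as critical, 4xx as non-critical) are illustrative assumptions.

```python
# Minimal sketch: error rate plus critical / non-critical categorization.
# Sample status codes and the 4xx/5xx split are illustrative assumptions.
response_codes = [200, 200, 503, 404, 200, 500, 200, 200, 429, 200]

errors = [code for code in response_codes if code >= 400]
error_rate = len(errors) / len(response_codes)

critical = [code for code in errors if code >= 500]      # server-side failures
non_critical = [code for code in errors if code < 500]   # client-side errors

print(f"error rate: {error_rate:.1%}")
print(f"critical errors: {len(critical)}, non-critical errors: {len(non_critical)}")
```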
Traffic measures the volume of requests and responses moving through the network. It can take different forms, such as HTTP or FTP requests hitting our web servers or API endpoints. Tracking traffic also lets engineers distinguish genuine capacity problems from errors that occur even under minimal load.
You can monitor traffic trends to uncover capacity issues and misconfigurations and to inform projections. Monitoring traffic in your application also helps you prepare for future demand.
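As one way to picture this signal, the sketch below keeps recent request timestamps in memory and reports traffic as requests per second over a sliding window. The window size and function names are hypothetical.

```python
# Minimal sketch: requests-per-second over a sliding window.
# WINDOW_SECONDS and the helper names are illustrative assumptions.
import time
from collections import deque

WINDOW_SECONDS = 60
request_times = deque()  # timestamps of recent requests

def record_request() -> None:
    """Call once per incoming request."""
    request_times.append(time.monotonic())

def requests_per_second() -> float:
    """Average traffic over the last WINDOW_SECONDS, in requests per second."""
    cutoff = time.monotonic() - WINDOW_SECONDS
    while request_times and request_times[0] < cutoff:
        request_times.popleft()  # drop samples that fell outside the window
    return len(request_times) / WINDOW_SECONDS
```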
System components like hardware disks, memory, and networks hit their saturation point when demand surpasses a service's capacity. In a nutshell, saturation helps you understand the overall capacity of a service through resource utilization. Whenever a system nears full utilization of its resources, performance can decline. Rising tail latency is often the unintended consequence of a system- or application-level resource restriction.
Saturation can occur for any resource the application depends on, such as memory, IOPS, CPU, or database queries. For latency, the 99th percentile of service request latency can act as an early warning indicator of saturation. In a multi-tiered system, saturation can have a cascading effect: an upstream service waits on the saturated downstream service until it answers or finally times out, while new requests queue up behind it, leading to resource starvation.
Unfortunately, when load balancers and other automated scaling mechanisms are in place, it's easy to overlook saturation. However, load balancers can occasionally fail to perform their functions due to improperly configured systems, irregular scaling, and other issues. By monitoring saturation, teams can spot concerns early, adjust how they address them, and prevent problems from recurring.
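As a simple illustration, the sketch below uses the psutil library to read CPU, memory, and disk utilization and flag anything approaching saturation. The 80% threshold is an illustrative assumption, not a recommendation from this article.

```python
# Minimal sketch: flagging resource saturation with psutil.
# The 80% threshold is illustrative only.
import psutil

SATURATION_THRESHOLD = 80.0  # percent utilization

def check_saturation() -> dict:
    """Read utilization figures and list resources nearing saturation."""
    utilization = {
        "cpu_percent": psutil.cpu_percent(interval=1),      # CPU usage over a 1-second sample
        "memory_percent": psutil.virtual_memory().percent,  # RAM in use
        "disk_percent": psutil.disk_usage("/").percent,     # root filesystem usage
    }
    saturated = [name for name, value in utilization.items()
                 if value >= SATURATION_THRESHOLD]
    return {"utilization": utilization, "saturated": saturated}

if __name__ == "__main__":
    print(check_saturation())
```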
Additionally noteworthy are two supplementary techniques: RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors). RED focuses solely on monitoring your services from the client's perspective, regardless of the underlying infrastructure. The USE method, meanwhile, focuses on resource usage to quickly spot common bottlenecks; it only uses request failures as an external sign of trouble and cannot spot latency-based problems that can also harm your systems.
These signals are referred to as "golden" because they aim to measure the things that directly affect end users and the components doing the work; in other words, they provide accurate measurements of the variables that matter most.
The Golden Signals attempt to combine the finest aspects of both of these approaches. Ultimately, all three aim to simplify how you observe a complex distributed system and enhance crisis response, making them more valuable than a plethora of lower-level measurements like raw CPU, RAM, and interface utilization. Furthermore, you can use the Golden Signals in several ways, for example, for alerting, anomaly detection, fine-tuning, and capacity planning.
There are numerous tools available on the market to provide better visibility into your computing systems. We will focus on the pertinent open-source and commercial off-the-shelf (COTS) monitoring tools in the following section.
Open-source tools are beneficial for businesses with a shoestring tooling budget. Because they frequently allow for total customization, you can adapt them to your distributed system. However, deploying and customizing such tools takes time and specialized knowledge, and you are also responsible for ensuring their availability, security, and upgrades.
Prometheus and Grafana are two widely used open-source tool combinations for monitoring. Others include Checkmk, Zabbix, and Nagios.
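As a hedged sketch of how this combination is commonly used, the example below instruments a hypothetical Python service with the four Golden Signals using the prometheus_client library, so that Prometheus can scrape the metrics and Grafana can graph them. The metric names, port, and request handler are illustrative assumptions, not part of the original article.

```python
# Minimal sketch: exposing the four Golden Signals with prometheus_client.
# Metric names, the port, and the simulated handler are illustrative only.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUEST_LATENCY = Histogram("http_request_duration_seconds", "Request latency")  # latency
REQUEST_TOTAL = Counter("http_requests_total", "Total requests served")          # traffic
REQUEST_ERRORS = Counter("http_request_errors_total", "Failed requests")         # errors
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently in progress")   # saturation proxy

def handle_request() -> None:
    """Hypothetical request handler instrumented with the Golden Signals."""
    REQUEST_TOTAL.inc()
    IN_FLIGHT.inc()
    try:
        with REQUEST_LATENCY.time():               # records duration when the block exits
            time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
            if random.random() < 0.05:             # simulate occasional failures
                raise RuntimeError("backend unavailable")
    except RuntimeError:
        REQUEST_ERRORS.inc()
    finally:
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request()
```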
Managed monitoring tools are more expensive than open-source alternatives but tend to be more robust. Vendors provide expert assistance for integrating them into your distributed systems, and you no longer have to worry about their availability, security, or upgrades. Tools like Site24x7 provide an all-in-one monitoring solution for enterprise customers and service providers.
Site24x7 is a cloud-based, all-in-one monitoring solution for websites, applications, servers, and network devices, aimed at DevOps and IT operations teams. Beyond monitoring, it helps IT teams and DevOps experts quickly identify and resolve issues to improve overall system performance and availability.
Several essential capabilities of Site24x7 can help monitor the Golden Signals across your applications and infrastructure.
Ultimately, monitoring the Golden Signals through the Site24x7 suite can help IT teams and DevOps specialists observe and improve the performance of their servers, networks, and applications. This results in a high-quality user experience for your clients.
The Golden Signals are a rich and fascinating subject of study because many services still don't monitor or don't expose the proper metrics. When determining your service-level goals and selecting a monitoring plan to assure the dependability of your application, these four signals are an excellent starting point.
Many common tools still don't give us all the data we need to manage and maintain our infrastructure properly, even in the midst of the cloud and DevOps revolution. Once you've established monitoring for the Golden Signals, don't forget to use additional methods such as logging and tracing to increase observability.