Monitoring & Observability: The Hospital Heartbeat Monitor

Updated June 2026
Figure: Grafana metrics telemetry dashboard. Observability tools provide real-time metrics, logs, and traces to keep your server cluster healthy.

Hello, future SREs and DevOps Engineers! Today we are looking at the final piece of the DevOps puzzle: Monitoring & Observability. Once you have built servers using Terraform, containerized apps using Docker, and automated releases using CI/CD pipelines, how do you verify everything works properly? How do you know if a server is out of disk space, or if users are seeing white screens and error codes? Monitoring is the answer!

Let's learn how observability acts as a hospital heartbeat monitor for your server cluster.

The Hospital Heartbeat Monitor Metaphor

Imagine a patient recovering in a hospital's ICU. The doctors and nurses cannot sit next to the bed 24/7 checking the patient's pulse by hand. Instead, the patient is connected to an ICU monitor that constantly tracks heart rate, breathing, and blood oxygen levels. If the heart rate spikes past 120 or drops to 40, the monitor sounds a loud alarm so doctors can rush in and save the patient.

Monitoring & Observability is that ICU screen for server code. Your cloud servers run thousands of processes. Monitoring tools track their resource levels and sound an alarm (alerting) if CPU usage spikes, RAM runs out, or the website crashes.

The Three Pillars: Metrics, Logs, and Traces

To inspect a system's health, engineers rely on three distinct sources of information, called the Three Pillars of Observability:

1. Metrics (The Pulse)

Metrics are numbers that track performance over time. Think of them as the patient's heart rate. Metrics tell you what is happening.

Examples in DevOps: CPU utilization %, memory usage MB, network packet traffic, or the number of active visitors currently loading the site.
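To make this concrete, here is a minimal Python sketch that treats CPU utilization as a series of (timestamp, value) samples and summarizes it. The sample values and the helper name summarize are invented for illustration; a real setup would read the numbers from a metrics database rather than a hard-coded list.

```python
from statistics import mean

# Hypothetical CPU utilization samples: (seconds since start, percent busy).
cpu_samples = [(0, 35.0), (15, 42.0), (30, 91.0), (45, 88.0), (60, 40.0)]


def summarize(samples):
    """Return the average and peak value of a list of (timestamp, value) samples."""
    values = [value for _, value in samples]
    return {"avg": mean(values), "max": max(values)}


summary = summarize(cpu_samples)
print(f"avg={summary['avg']:.1f}% max={summary['max']:.1f}%")
```

The point is only that a metric is a stream of numbers with timestamps, which is what makes it cheap to store and easy to graph.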

2. Logs (The Doctor's Diary)

Logs are text records of events. Think of them as the doctor's notes: "12:00 PM: Patient drank water. 12:15 PM: Patient took medicine." Logs tell you why an error happened.

Examples in DevOps: "12:00:01 - User login failed - Incorrect password" or "12:00:05 - Connection to Database timed out".
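As a small sketch of where such lines come from, the snippet below uses Python's standard logging module to emit two timestamped events. The logger name "checkout" is a made-up service name, and the in-memory buffer is only there to keep the example self-contained.

```python
import io
import logging

# Write log lines to an in-memory buffer so the example is self-contained;
# a real service would usually log to stdout or a file.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))

logger = logging.getLogger("checkout")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger

logger.warning("User login failed - Incorrect password")
logger.error("Connection to Database timed out")

log_lines = buffer.getvalue().splitlines()
print("\n".join(log_lines))
```

Each line carries a timestamp, a severity level, and a message, which is the minimum a log aggregator needs to make the events searchable.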

3. Traces (The X-Ray)

A Trace tracks a single request as it travels through a complex network of microservices. Think of it as a barium swallow test where doctors track food moving from the mouth, to the stomach, to the intestines. Traces help locate where a request slowed down.

Examples in DevOps: Tracking a user's checkout click as it travels from the Frontend -> API Gateway -> Payment Processor -> Database -> back to Frontend, checking latency at each step.
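A trace can be pictured as a list of timed steps, one per service. The sketch below is a simplified model, not the data format of any particular tracing tool: the Span class, the service names, and all timings are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One unit of work inside a trace."""
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# Hypothetical spans for one checkout request (timings are invented).
trace = [
    Span("frontend", 0, 40),
    Span("api-gateway", 40, 70),
    Span("payment-processor", 70, 190),
    Span("database", 190, 11690),
]

# The step with the longest duration is where the request slowed down.
slowest = max(trace, key=lambda span: span.duration_ms)
print(f"slowest step: {slowest.service} ({slowest.duration_ms} ms)")
```

Tracing tools draw the same information as a timeline, so the slow step shows up as the longest bar.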

Real-World Scenario: Debugging Checkout Slowdown

During a flash sale, multiple users report that clicking the "Buy Ticket" button is taking 12 seconds instead of the normal 0.5 seconds. The engineers look at Grafana metrics and see a CPU spike. But which component is causing it?

To fix this:

  1. They check the Jaeger traces for the slow checkout requests and find one long span showing that the database call took 11.5 seconds.
  2. They open the database logs for that exact time range and find the entry: "Slow Query: SELECT * FROM seat_availability WHERE reservation_id = 9988".
  3. They add a database index on the column used for seat lookups, the query time drops to about 0.1 seconds, and ticket purchases are fast again!
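The effect of the index in step 3 can be reproduced on a toy scale with SQLite, which ships with Python. This is only a sketch under assumptions: the scenario does not say which database is in use, and the table layout, the seat_id column, and the index name idx_reservation are invented here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; only the table and column named in the slow query come from the scenario.
conn.execute(
    "CREATE TABLE seat_availability (seat_id INTEGER PRIMARY KEY, reservation_id INTEGER)"
)
conn.executemany(
    "INSERT INTO seat_availability (reservation_id) VALUES (?)",
    [(n,) for n in range(10_000)],
)

query = "SELECT * FROM seat_availability WHERE reservation_id = 9988"


def query_plan(sql):
    """Return SQLite's human-readable plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(query)  # without an index: a full table scan
conn.execute("CREATE INDEX idx_reservation ON seat_availability (reservation_id)")
plan_after = query_plan(query)   # with the index: a direct lookup

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index, the plan reports a scan of the whole table; afterwards it reports a search using idx_reservation. The exact wording of the plan text varies between SQLite versions.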

Core Observability Tools

Prometheus

An open-source metrics database. It periodically scrapes and collects metric numbers from your servers and Kubernetes clusters.
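Each scrape is an HTTP request to a target that answers with a plain-text page of metrics. As a rough sketch of what such a page contains, the helper below renders one metric in the Prometheus text format by hand. The function render_metric and the metric name node_disk_free_bytes are assumptions for illustration, label values are not escaped, and real applications would normally use an official client library instead.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


exposition = render_metric(
    "node_disk_free_bytes",  # hypothetical metric name
    "Free disk space in bytes.",
    "gauge",
    [({"host": "web-1"}, 1.5e9), ({"host": "web-2"}, 2.0e8)],
)
print(exposition, end="")
```

Every non-comment line is one time series: a metric name, optional labels in braces, and a numeric value. Prometheus adds the timestamp itself when it scrapes.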

Grafana

The visual dashboard tool. It connects to Prometheus and renders beautiful, real-time graphs, dials, and maps of server health.

Elasticsearch / Loki

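Log aggregators. They collect logs from all your servers into one place so you can search billions of lines quickly, much like a search engine. Elasticsearch indexes the full text of every line, while Loki indexes only labels (metadata) and scans the matching log streams at query time.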

Jaeger / Zipkin

Tracing tools. They visualize the path and timing of user requests across multiple container services to spot latency bottlenecks.

Pro-Tip: Avoid Alert Fatigue

Only set alerts for issues that require human action! If a server CPU spikes to 99% for 1 minute due to an automated clean-up job, don't wake up engineers. Only trigger alarms if the CPU stays at 99% for over 10 minutes and user page load times are affected.
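The rule above can be sketched as a small Python function. This is an illustration of the idea, not the syntax of any alerting tool: the function name should_page, its parameters, and the one-reading-per-minute assumption are all made up here.

```python
def should_page(cpu_samples, slow_page_loads, cpu_threshold=99.0, sustained_minutes=10):
    """Decide whether to page an engineer.

    cpu_samples: one CPU utilization reading per minute, oldest first.
    slow_page_loads: True if users are currently seeing slow pages.
    """
    if len(cpu_samples) < sustained_minutes:
        return False
    recent = cpu_samples[-sustained_minutes:]
    cpu_sustained = all(sample >= cpu_threshold for sample in recent)
    # Page only when the CPU problem is sustained AND users are affected.
    return cpu_sustained and slow_page_loads


# A one-minute spike from a clean-up job: no page.
brief_spike = [40.0] * 9 + [99.0]
# Ten straight minutes at 99% while users see slow pages: page.
sustained = [99.0] * 10

print(should_page(brief_spike, slow_page_loads=False))
print(should_page(sustained, slow_page_loads=True))
```

Requiring both conditions is the design choice that keeps the pager quiet: a busy CPU on its own is a symptom, and it only becomes an emergency once it lasts and users feel it.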

You Have Conquered the DevOps Foundations!

Congratulations! By mastering AWS, Linux, Git, Docker, Kubernetes, Terraform, CI/CD, and Monitoring, you have acquired the foundational toolkit of a DevOps Engineer. You are now ready to automate pipelines, secure architectures, and scale apps worldwide. Keep practicing, and happy engineering!

Test Your Knowledge

Answer these 25 questions to check your understanding of this module.

Question 1 of 25
What are the three pillars of observability?
A. Metrics, Logs, and Traces
B. Servers, Networks, and Databases
C. Alerts, CPU, and RAM
D. Dashboards, Codes, and Pipelines
Explanation: The three pillars of observability are metrics (numeric telemetry), logs (event records), and traces (request journeys).
Question 2 of 25
Which observability pillar represents numeric telemetry measured over time?
A. Logs
B. Traces
C. Metrics
D. Alerts
Explanation: Metrics are numeric values that track performance indicators like CPU usage, memory, and network throughput over time.
Question 3 of 25
Which tool is most commonly used to visualize Prometheus metrics in real-time graphs?
A. Grafana
B. Jaeger
C. Elasticsearch
D. Git
Explanation: Grafana is the industry-standard visualization tool that integrates with data sources like Prometheus to display dashboards.
Question 4 of 25
What is a 'Trace' in microservices monitoring?
A. A log file of database queries
B. A measurement of CPU heat
C. The end-to-end journey of a single request across multiple services
D. A list of open ports
Explanation: A trace tracks the path and timing of a single transaction/request as it propagates through a distributed system of microservices.
Question 5 of 25
Which tool is an open-source system monitoring and alerting toolkit that pulls/scrapes metrics?
A. Elasticsearch
B. Prometheus
C. Jenkins
D. Terraform
Explanation: Prometheus scrapes metric endpoints from systems and stores them in a time-series database.
Question 6 of 25
What is 'Alert Fatigue'?
A. When servers crash from too many alerts
B. When engineers ignore alerts because too many non-critical alarms are triggered
C. When monitoring tools run out of disk space
D. When a dashboard loads too slowly
Explanation: Alert fatigue occurs when users are overwhelmed by a high volume of frequent, low-priority alerts, leading to critical issues being missed.
Question 7 of 25
In HTTP status monitoring, which status code range represents server-side errors?
A. 2xx
B. 3xx
C. 4xx
D. 5xx
Explanation: 5xx status codes (like 500 Internal Server Error, 502 Bad Gateway) represent errors on the server side.
Question 8 of 25
Which logging level is typically used for critical errors that cause application crashes?
A. DEBUG
B. INFO
C. FATAL / ERROR
D. WARN
Explanation: FATAL or ERROR logs indicate severe failures that disrupt normal application operations.
Question 9 of 25
What does APM stand for in observability?
A. Application Port Manager
B. Automated Process Monitor
C. Application Performance Monitoring
D. Advanced Pipeline Metrics
Explanation: APM stands for Application Performance Monitoring, which tracks transaction response times, code exceptions, and dependencies.
Question 10 of 25
Which tool is commonly used to collect and query application logs?
A. Jaeger
B. Elasticsearch / Loki
C. Prometheus
D. Webpack
Explanation: Elasticsearch and Loki are popular log aggregation engines used to search and index massive log files.
Question 11 of 25
What is the main difference between Monitoring and Observability?
A. Monitoring tells you when something is wrong; Observability helps you understand why
B. Monitoring is free; Observability is paid
C. Monitoring uses logs; Observability uses databases
D. There is no difference
Explanation: Monitoring tracks known failure modes ('is the system up?'), while Observability provides deep system insights to troubleshoot new or unknown problems.
Question 12 of 25
What does 'Scraping' mean in Prometheus?
A. Deleting old metrics
B. Extracting files from Docker
C. Fetching metrics from target HTTP endpoints periodically
D. Restarting crashed pods
Explanation: Prometheus gathers metrics by sending HTTP requests to configured target endpoints (scraping them).
Question 13 of 25
Which of the following is an example of a 'Trace' visualization tool?
A. Jaeger
B. Ansible
C. Nginx
D. Docker Hub
Explanation: Jaeger is a popular open-source tool for distributed tracing, helping visualize the call graph of microservice requests.
Question 14 of 25
What does 'SLO' stand for in site reliability engineering?
A. Service Level Option
B. Service Level Objective
C. System Log Operator
D. Server Load Optimizer
Explanation: SLO stands for Service Level Objective, which is a target reliability level for a service (e.g., 99.9% uptime).
Question 15 of 25
What metric measures the percentage of time a service is operational and reachable?
A. Latency
B. Throughput
C. Availability / Uptime
D. Saturation
Explanation: Availability (or uptime) tracks the percentage of time a system is fully functional and serving requests.
Question 16 of 25
What is an 'SLA' in service agreements?
A. Service Level Agreement
B. Service Log Analyzer
C. System Load Alert
D. Scheduled Latency Assessment
Explanation: An SLA (Service Level Agreement) is a commitment between a service provider and a client regarding service reliability and performance, often with financial penalties if missed.
Question 17 of 25
What is 'PromQL'?
A. A programming language for Prometheus servers
B. A query language used to select and aggregate Prometheus time-series data
C. A protocol for sending logs
D. A database server
Explanation: PromQL (Prometheus Query Language) is Prometheus's own query language, used to retrieve and process the metrics data stored in Prometheus. Like Prometheus itself, it is open source.
Question 18 of 25
What do the 'Golden Signals' of monitoring typically include?
A. CPU, RAM, Disk, and Network
B. Latency, Traffic, Errors, and Saturation
C. Logs, Traces, Metrics, and Alarms
D. Users, Sales, Clicks, and Revenue
Explanation: The Google SRE book defines the four Golden Signals of monitoring as Latency, Traffic, Errors, and Saturation.
Question 19 of 25
Which tool is commonly used to collect and route logs/metrics as a daemon (agent) on a host?
A. Fluentd / Promtail
B. Jenkins
C. Webpack
D. Git
Explanation: Fluentd, Fluent Bit, Logstash, and Promtail are log shippers (agents) that run on hosts to collect and route logs.
Question 20 of 25
What is 'Synthetic Monitoring'?
A. Monitoring artificial intelligence systems
B. Simulating user transactions and paths periodically to test system availability
C. Viewing hardware telemetry
D. Disabling all alerts
Explanation: Synthetic monitoring uses simulated queries or user paths (pings, automated scripts) to proactively test if applications are reachable.
Question 21 of 25
What does 'Real User Monitoring' (RUM) track?
A. Telemetry from actual user interactions inside their web browsers
B. The CPU temp of the server
C. The count of servers running in AWS
D. The cost of cloud hosting
Explanation: RUM (Real User Monitoring) captures and analyzes every transaction of real users on a website or app, tracking client-side performance.
Question 22 of 25
In trace telemetry, what is a 'Span'?
A. The total memory of a container
B. A single unit of work (e.g., an individual database query or HTTP call) within a trace
C. The distance between server racks
D. A database table index
Explanation: A trace is made of multiple 'Spans'. Each span represents a single operation/unit of work with a start time, duration, and metadata.
Question 23 of 25
What is the standard format used by Prometheus to expose metric data?
A. XML
B. Prometheus text-based exposition format (OpenMetrics)
C. JSON
D. YAML
Explanation: Prometheus reads metrics exposed as plain text lines containing metric names, labels, and float64 values. This format was later standardized as OpenMetrics; OpenTelemetry is a separate project.
Question 24 of 25
Which alert severity level requires immediate intervention to prevent system downtime?
A. INFO
B. WARNING
C. CRITICAL / PAGER
D. NOTICE
Explanation: CRITICAL or PAGER alerts indicate severe failures (like a database crash) that require instant response from on-call engineers.
Question 25 of 25
What does 'System Saturation' measure?
A. The humidity in the server room
B. How full your system resources are (e.g., CPU queue length, disk capacity)
C. The number of active git branches
D. The security strength of passwords
Explanation: Saturation measures how close a resource is to being full or overloaded (e.g., memory limits or disk utilization).