Monitoring & Observability: The Hospital Heartbeat Monitor

Updated June 2026

Observability tools provide real-time metrics, logs, and traces to keep your server cluster healthy.

Hello, future SREs and DevOps Engineers! Today we are looking at the final piece of the DevOps puzzle: Monitoring & Observability. Once you have built servers using Terraform, containerized apps using Docker, and automated releases using CI/CD pipelines, how do you verify everything works properly? How do you know if a server is out of disk space, or if users are seeing white screens and error codes? Monitoring is the answer!

Let's learn how observability acts as a hospital heartbeat monitor for your server cluster.

The Hospital Heartbeat Monitor Metaphor

Imagine a patient recovering in a hospital's ICU. The doctors and nurses cannot sit next to the bed 24/7 checking their pulse manually. Instead, the patient is connected to an ICU monitor screen. This monitor constantly tracks their heart rate, breathing, and blood oxygen levels. If the heart rate spikes past 120 or drops to 40, the monitor rings a loud alarm so doctors can rush in and save the patient.

Monitoring & Observability is that ICU screen for server code. Your cloud servers run thousands of processes. Monitoring tools track their resource levels and sound an alarm (alerting) if CPU usage spikes, RAM runs out, or the website crashes.

The Three Pillars: Metrics, Logs, and Traces

To inspect a system's health, engineers rely on three distinct sources of information, called the Three Pillars of Observability:

1. Metrics (The Pulse)

Metrics are numbers tracking performance over time. Think of them as the patient's heart rate. Metrics tell you what is happening.

Examples in DevOps: CPU utilization (%), memory usage (MB), network traffic (packets per second), or the number of active visitors currently loading the site.

2. Logs (The Doctor's Diary)

Logs are text records of events. Think of them as the doctor's notes: "12:00 PM: Patient drank water. 12:15 PM: Patient took medicine." Logs tell you why an error happened.

Examples in DevOps: "12:00:01 - User login failed - Incorrect password" or "12:00:05 - Connection to Database timed out".
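As a rough sketch of why a consistent log format matters, the snippet below splits lines shaped like the examples above into fields so they can be filtered. The `parse_log_line` helper and the `time - event - detail` layout are assumptions made for illustration, not a standard log format.

```python
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    time: str    # "HH:MM:SS" exactly as written in the line
    event: str   # short description of what happened
    detail: str  # optional extra context, "" when absent


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Split 'time - event - detail' into fields; return None if malformed."""
    parts = [p.strip() for p in line.split(" - ", 2)]
    if len(parts) < 2 or not parts[0]:
        return None
    detail = parts[2] if len(parts) == 3 else ""
    return LogEntry(parts[0], parts[1], detail)


lines = [
    "12:00:01 - User login failed - Incorrect password",
    "12:00:05 - Connection to Database timed out",
]
entries = [e for e in map(parse_log_line, lines) if e is not None]

# Keep only the timeout events, the kind of filter a log search performs.
timeouts = [e for e in entries if "timed out" in e.event]
print(timeouts[0].time)
```

In practice, many teams emit structured logs (for example JSON) so that no custom parsing like this is needed.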

3. Traces (The X-Ray)

A Trace tracks a single request as it travels through a complex network of microservices. Think of it as a barium swallow test where doctors track food moving from the mouth, to the stomach, to the intestines. Traces help locate where a request slowed down.

Examples in DevOps: Tracking a user's checkout click as it travels from the Frontend -> API Gateway -> Payment Processor -> Database -> back to Frontend, checking latency at each step.
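To make the idea concrete, here is a minimal sketch that models one checkout request as a list of timed hops and picks out the slowest one. The `Span` class, the timings, and the assumption that hops run one after another without overlapping are all hypothetical simplifications; real tracing tools record nested, overlapping spans.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    service: str      # which hop handled this part of the request
    start_ms: int     # offset from the start of the trace
    duration_ms: int  # time spent in this hop alone


def slowest_span(spans: List[Span]) -> Span:
    """Return the span with the largest duration."""
    return max(spans, key=lambda s: s.duration_ms)


# Hypothetical timings for a single checkout click.
checkout_trace = [
    Span("Frontend", 0, 40),
    Span("API Gateway", 40, 25),
    Span("Payment Processor", 65, 180),
    Span("Database", 245, 900),
]

worst = slowest_span(checkout_trace)
total_ms = sum(s.duration_ms for s in checkout_trace)
print(f"{worst.service}: {worst.duration_ms} ms of {total_ms} ms")
```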

Real-World Scenario: Debugging Checkout Slowdown

During a flash sale, multiple users report that clicking the "Buy Ticket" button is taking 12 seconds instead of the normal 0.5 seconds. The engineers look at Grafana metrics and see a CPU spike. But which component is causing it?

To find and fix the cause:

1. They check Jaeger traces for the slow checkout requests and find one long span showing that the database call took 11.5 seconds.
2. They open the database logs for that exact time range and find the entry: "Slow Query: SELECT * FROM seat_availability WHERE reservation_id = 9988".
3. They add a database index on reservation_id to speed up seat lookups. The query latency drops to 0.1 seconds, and tickets process instantly!
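The effect of the index in the last step can be sketched with Python's built-in `sqlite3` module. SQLite is only a stand-in here, since the scenario does not name the database, and the column layout and index name are assumptions. The sketch asks the database for its query plan before and after the index is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the table name and reservation_id come from the scenario.
conn.execute(
    "CREATE TABLE seat_availability (seat_id INTEGER, reservation_id INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO seat_availability VALUES (?, ?, ?)",
    [(i, 9000 + i, "held") for i in range(2000)],
)

QUERY = "SELECT * FROM seat_availability WHERE reservation_id = 9988"


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's description of how it would execute QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    return " | ".join(row[3] for row in rows)  # the fourth column holds the detail text


plan_before = query_plan(conn)  # a full scan: every row is inspected
conn.execute("CREATE INDEX idx_seat_reservation ON seat_availability (reservation_id)")
plan_after = query_plan(conn)   # a search that uses the new index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan wording varies between SQLite versions, but the first plan reports a scan of the table and the second reports a search using the index.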

Core Observability Tools

Prometheus

An open-source metrics database. It periodically scrapes and collects metric numbers from your servers and Kubernetes clusters.
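What Prometheus actually reads when it scrapes is plain text, one sample per line, in the form `name{label="value"} number`. The sketch below renders a few samples in that shape using only the standard library. The metric names and the `render_metrics` helper are hypothetical, and label values are not escaped, so treat it as an illustration rather than a replacement for an official client library.

```python
from typing import Dict, Tuple

# (metric name, label pairs) -> current value
Samples = Dict[Tuple[str, Tuple[Tuple[str, str], ...]], float]


def render_metrics(samples: Samples) -> str:
    """Render samples as text lines in the Prometheus exposition style."""
    lines = []
    for (name, labels), value in samples.items():
        if labels:
            label_text = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


samples: Samples = {
    ("http_requests_total", (("method", "GET"), ("status", "200"))): 1027.0,
    ("node_memory_free_bytes", ()): 5.0e8,
}
print(render_metrics(samples), end="")
```

An application would serve text like this on an HTTP endpoint (conventionally `/metrics`), and Prometheus would fetch it on every scrape interval.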

Grafana

The visual dashboard tool. It connects to Prometheus and renders beautiful, real-time graphs, dials, and maps of server health.

Elasticsearch / Loki

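Log aggregators. They collect log lines from all servers into one central store so you can search them almost instantly, much like a search engine. Elasticsearch indexes the full text of every line, while Loki indexes only metadata labels.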

Jaeger / Zipkin

Tracing tools. They visualize the path and timing of user requests across multiple container services to spot latency bottlenecks.

Pro-Tip: Avoid Alert Fatigue

Only set alerts for issues that require human action! If a server CPU spikes to 99% for 1 minute due to an automated clean-up job, don't wake up engineers. Only trigger alarms if the CPU stays at 99% for over 10 minutes and user page load times are affected.
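The rule above can be sketched as a small function: page only when the CPU has stayed high for the whole window and users are also affected. The one-sample-per-minute input, the 2-second page load threshold, and the `should_page` name are assumptions for illustration; in a real setup this logic usually lives in the alerting tool's rule configuration.

```python
from typing import List, Tuple

# One sample per minute: (cpu_percent, page_load_seconds)
Sample = Tuple[float, float]


def should_page(samples: List[Sample],
                cpu_threshold: float = 99.0,
                load_threshold_s: float = 2.0,
                sustained_minutes: int = 10) -> bool:
    """Page only if the last `sustained_minutes` samples all breach both limits."""
    if len(samples) < sustained_minutes:
        return False
    recent = samples[-sustained_minutes:]
    return all(cpu >= cpu_threshold and load >= load_threshold_s
               for cpu, load in recent)


cleanup_job = [(35.0, 0.5)] * 9 + [(99.5, 0.6)]  # one-minute spike, users unaffected
busy_but_fast = [(99.5, 0.5)] * 10               # CPU pinned, pages still load quickly
real_outage = [(99.5, 4.0)] * 10                 # CPU pinned and pages are slow

print(should_page(cleanup_job), should_page(busy_but_fast), should_page(real_outage))
```

Requiring both conditions is the design choice that keeps short, harmless spikes from waking anyone up.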

You Have Conquered the DevOps Foundations!

Congratulations! By mastering AWS, Linux, Git, Docker, Kubernetes, Terraform, CI/CD, and Monitoring, you have acquired the foundational toolkit of a DevOps Engineer. You are now ready to automate pipelines, secure architectures, and scale apps worldwide. Keep practicing, and happy engineering!

Test Your Knowledge

Answer these 35 questions to check your understanding of this module.

Question 1 of 35

What are the three pillars of observability?

A. Metrics, Logs, and Traces

B. Servers, Networks, and Databases

C. Alerts, CPU, and RAM

D. Dashboards, Codes, and Pipelines

Explanation: The three pillars of observability are metrics (numeric telemetry), logs (event records), and traces (request journeys).

Question 2 of 35

Which observability pillar represents numeric telemetry measured over time?

A. Logs

B. Traces

C. Metrics

D. Alerts

Explanation: Metrics are numeric values that track performance indicators like CPU usage, memory, and network throughput over time.

Question 3 of 35

Which tool is most commonly used to visualize Prometheus metrics in real-time graphs?

A. Grafana

B. Jaeger

C. Elasticsearch

D. Git

Explanation: Grafana is the industry-standard visualization tool that integrates with data sources like Prometheus to display dashboards.

Question 4 of 35

What is a 'Trace' in microservices monitoring?

A. A log file of database queries

B. A measurement of CPU heat

C. The end-to-end journey of a single request across multiple services

D. A list of open ports

Explanation: A trace tracks the path and timing of a single transaction/request as it propagates through a distributed system of microservices.

Question 5 of 35

Which tool is an open-source system monitoring and alerting toolkit that pulls/scrapes metrics?

A. Elasticsearch

B. Prometheus

C. Jenkins

D. Terraform

Explanation: Prometheus scrapes metric endpoints from systems and stores them in a time-series database.

Question 6 of 35

What is 'Alert Fatigue'?

A. When servers crash from too many alerts

B. When engineers ignore alerts because too many non-critical alarms are triggered

C. When monitoring tools run out of disk space

D. When a dashboard loads too slowly

Explanation: Alert fatigue occurs when engineers are overwhelmed by a high volume of frequent, low-priority alerts, leading to critical issues being missed.

Question 7 of 35

In HTTP status monitoring, which status code range represents server-side errors?

A. 2xx

B. 3xx

C. 4xx

D. 5xx

Explanation: 5xx status codes (like 500 Internal Server Error, 502 Bad Gateway) represent errors on the server side.
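As a small illustration of how status code ranges turn into a metric, the sketch below groups codes by their first digit and computes the share of 5xx responses. The helper names and the sample list of codes are made up for the example.

```python
from collections import Counter
from typing import Iterable


def status_class(code: int) -> str:
    """Map 200 to '2xx', 404 to '4xx', 503 to '5xx', and so on."""
    return f"{code // 100}xx"


def server_error_rate(codes: Iterable[int]) -> float:
    """Fraction of responses that were 5xx (0.0 when there is no traffic)."""
    counts = Counter(status_class(c) for c in codes)
    total = sum(counts.values())
    return counts["5xx"] / total if total else 0.0


codes = [200, 200, 404, 500, 200, 502, 200, 301, 200, 200]
print(f"{server_error_rate(codes):.0%}")
```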

Question 8 of 35

Which logging level is typically used for critical errors that cause application crashes?

A. DEBUG

B. INFO

C. FATAL / ERROR

D. WARN

Explanation: FATAL or ERROR logs indicate severe failures that disrupt normal application operations.

Question 9 of 35

What does APM stand for in observability?

A. Application Port Manager

B. Automated Process Monitor

C. Application Performance Monitoring

D. Advanced Pipeline Metrics

Explanation: APM stands for Application Performance Monitoring, which tracks transaction response times, code exceptions, and dependencies.

Question 10 of 35

Which tool is commonly used to collect and query application logs?

A. Jaeger

B. Elasticsearch / Loki

C. Prometheus

D. Webpack

Explanation: Elasticsearch and Loki are popular log aggregation engines used to search and index massive log files.

Question 11 of 35

What is the main difference between Monitoring and Observability?

A. Monitoring tells you when something is wrong; Observability helps you understand why

B. Monitoring is free; Observability is paid

C. Monitoring uses logs; Observability uses databases

D. There is no difference

Explanation: Monitoring tracks known failure modes ('is the system up?'), while Observability provides deep system insights to troubleshoot new or unknown problems.

Question 12 of 35

What does 'Scraping' mean in Prometheus?

A. Deleting old metrics

B. Extracting files from Docker

C. Fetching metrics from target HTTP endpoints periodically

D. Restarting crashed pods

Explanation: Prometheus gathers metrics by sending HTTP requests to configured target endpoints (scraping them).

Question 13 of 35

Which of the following is an example of a 'Trace' visualization tool?

A. Jaeger

B. Ansible

C. Nginx

D. Docker Hub

Explanation: Jaeger is a popular open-source tool for distributed tracing, helping visualize the call graph of microservice requests.

Question 14 of 35

What does 'SLO' stand for in site reliability engineering?

A. Service Level Option

B. Service Level Objective

C. System Log Operator

D. Server Load Optimizer

Explanation: SLO stands for Service Level Objective, which is a target reliability level for a service (e.g., 99.9% uptime).

Question 15 of 35

What metric measures the percentage of time a service is operational and reachable?

A. Latency

B. Throughput

C. Availability / Uptime

D. Saturation

Explanation: Availability (or uptime) tracks the percentage of time a system is fully functional and serving requests.

Question 16 of 35

What is a 'SLA' in service agreements?

A. Service Level Agreement

B. Service Log Analyzer

C. System Load Alert

D. Scheduled Latency Assessment

Explanation: An SLA (Service Level Agreement) is a commitment between a service provider and a client regarding service reliability and performance, often with financial penalties if missed.

Question 17 of 35

What is 'PromQL'?

A. A programming language for Prometheus servers

B. A query language used to select and aggregate Prometheus time-series data

C. A protocol for sending logs

D. A database server

Explanation: PromQL (Prometheus Query Language) is the query language built into Prometheus, used to retrieve and process the metrics data it stores.

Question 18 of 35

What does 'Golden Signals' of monitoring typically include?

A. CPU, RAM, Disk, and Network

B. Latency, Traffic, Errors, and Saturation

C. Logs, Traces, Metrics, and Alarms

D. Users, Sales, Clicks, and Revenue

Explanation: The Google SRE book defines the four Golden Signals of monitoring as Latency, Traffic, Errors, and Saturation.

Question 19 of 35

Which tool is commonly used to collect and route logs/metrics as a daemon (agent) on a host?

A. Fluentd / Promtail

B. Jenkins

C. Webpack

D. Git

Explanation: Fluentd, Fluent Bit, and Logstash/Promtail are log shippers or agents that run on hosts to collect and route logs.

Question 20 of 35

What is 'Synthetic Monitoring'?

A. Monitoring artificial intelligence systems

B. Simulating user transactions and paths periodically to test system availability

C. Viewing hardware telemetry

D. Disabling all alerts

Explanation: Synthetic monitoring uses simulated queries or user paths (pings, automated scripts) to proactively test if applications are reachable.

Question 21 of 35

What does 'Real User Monitoring' (RUM) track?

A. Telemetry from actual user interactions inside their web browsers

B. The CPU temp of the server

C. The count of servers running in AWS

D. The cost of cloud hosting

Explanation: RUM (Real User Monitoring) captures and analyzes every transaction of real users on a website or app, tracking client-side performance.

Question 22 of 35

In trace telemetry, what is a 'Span'?

A. The total memory of a container

B. A single unit of work (e.g., an individual database query or HTTP call) within a trace

C. The distance between server racks

D. A database table index

Explanation: A trace is made of multiple 'Spans'. Each span represents a single operation/unit of work with a start time, duration, and metadata.

Question 23 of 35

What is the standard format used by Prometheus to expose metric data?

A. XML

B. Prometheus text-based exposition format

C. JSON

D. YAML

Explanation: Prometheus reads metrics exposed as plain text lines containing metric names, labels, and float64 values. This text exposition format is the basis of the OpenMetrics standard; OpenTelemetry is a separate telemetry framework.

Question 24 of 35

Which alert severity level requires immediate intervention to prevent system downtime?

A. INFO

B. WARNING

C. CRITICAL / PAGER

D. NOTICE

Explanation: CRITICAL or PAGER alerts indicate severe failures (like a database crash) that require instant response from on-call engineers.

Question 25 of 35

What does 'System Saturation' measure?

A. The humidity in the server room

B. How full your system resources are (e.g., CPU queue length, disk capacity)

C. The number of active git branches

D. The security strength of passwords

Explanation: Saturation measures how close a resource is to being full or overloaded (e.g., memory limits or disk utilization).

Question 26 of 35

What are the "Three Pillars of Observability"?

A. Users, Servers, Networks

B. Metrics (aggregation), Logs (discrete events), and Traces (request flow).

C. Prometheus, Grafana, Jaeger

D. Latency, Throughput, Errors

Explanation: Observability relies on Metrics (system stats), Logs (event text records), and Traces (end-to-end request journeys).

Question 27 of 35

In Prometheus, what is the role of the Pushgateway?

A. It acts as a cache for dashboards.

B. It allows short-lived batch jobs to push metrics, which Prometheus then pulls.

C. It routes alert notifications to Slack.

D. It is used to capture syslog log files.

Explanation: Prometheus is pull-based. Short-lived jobs (like batch scripts) may exit before Prometheus can scrape them, so they push their metrics to the Pushgateway, which holds them until the next scrape.

Question 28 of 35

Why must the "rate()" function in PromQL only be applied to Counter metrics?

A. Counters only increase; applying rate() to Gauges yields incorrect trends if values drop.

B. Gauges do not support time-series queries.

C. Counters measure average percentages.

D. PromQL returns syntax errors otherwise.

Explanation: rate() calculates per-second increases of counters. Gauges go up and down (like memory usage), making rate calculations logically incorrect.
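The reasoning can be sketched in a few lines. The `simple_rate` function below is a simplified stand-in, not PromQL's actual implementation (the real `rate()` also extrapolates to the edges of the time window). It treats any drop in value as a counter reset, which is exactly why feeding it a gauge gives a misleading answer.

```python
from typing import List, Tuple

# (unix_timestamp_seconds, value)
Point = Tuple[float, float]


def simple_rate(points: List[Point]) -> float:
    """Average per-second increase across the window, tolerating counter resets."""
    if len(points) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(points, points[1:]):
        # A drop is assumed to mean the process restarted and the counter
        # began again at 0, so the whole current value counts as new increase.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = points[-1][0] - points[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# A request counter that resets once (the service restarted after t=30).
requests_total = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
# A gauge: memory in MB that fell from 500 to 300, then rose to 400.
memory_mb = [(0, 500), (30, 300), (60, 400)]

print(simple_rate(requests_total))  # meaningful: about 1.67 requests per second
print(simple_rate(memory_mb))       # misleading: the drop is misread as a reset
```

For the gauge, memory actually ended 100 MB lower than it started, yet the function reports a positive rate because it mistakes the drop for a restart.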

Question 29 of 35

What is the advantage of using Alertmanager Silencing over disabling alerts?

A. Silencing deletes the alert records.

B. Silencing temporarily mutes notifications during known maintenance windows without altering config code.

C. Silencing routes notifications to secondary runners.

D. Silencing resolves the alerts in Slack.

Explanation: A silence matches alerts by label and mutes their notifications for a set duration, so no alerting rules or configuration files need to be edited for a maintenance window.

Question 30 of 35

How does Grafana Loki minimize storage overhead compared to Elasticsearch?

A. It stores logs inside the browser's cookies.

B. It only indexes metadata labels instead of full log text content.

C. It compresses logs using custom gzip rules.

D. It deletes logs older than 24 hours automatically.

Explanation: Loki indexes labels (e.g. env, app, container), leaving raw logs compressed. Elasticsearch full-text index structures are much larger.
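The trade-off can be illustrated with a toy in-memory store: only the label sets act as an index, and the log text is scanned line by line once the matching streams are found. The `LabelIndexedLogs` class is a hypothetical teaching model, not how Loki is implemented (it leaves out compression, chunking, and object storage entirely).

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Labels = FrozenSet[Tuple[str, str]]


class LabelIndexedLogs:
    """Toy store: only label sets are indexed; log text is scanned on demand."""

    def __init__(self) -> None:
        self._streams: Dict[Labels, List[str]] = defaultdict(list)

    def append(self, labels: Dict[str, str], line: str) -> None:
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: Dict[str, str], contains: str = "") -> List[str]:
        wanted = set(selector.items())
        hits: List[str] = []
        for labels, lines in self._streams.items():
            if wanted <= labels:  # narrow down by labels first (the small index)
                # Then brute-force scan the text of the matching streams only.
                hits.extend(line for line in lines if contains in line)
        return hits


store = LabelIndexedLogs()
store.append({"app": "checkout", "env": "prod"}, "payment ok")
store.append({"app": "checkout", "env": "prod"}, "db timeout")
store.append({"app": "search", "env": "prod"}, "db timeout")

print(store.query({"app": "checkout"}, contains="timeout"))
```

The index stays small because it grows with the number of distinct label sets, not with the number of words in the logs. The cost is that text searches have to scan the selected streams.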

Question 31 of 35

In distributed tracing, what is Context Propagation?

A. Re-rendering dashboard charts globally.

B. Passing tracing metadata (like the trace ID) between services, typically in HTTP headers, to link spans together.

C. Compressing trace logs inside storage DBs.

D. The configuration of alerting routes.

Explanation: Context propagation injects trace IDs into headers (like W3C Trace Context) to trace a request across backend microservices.
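A minimal sketch of the idea, using the `traceparent` header layout from W3C Trace Context (`version-traceid-spanid-flags`, all lowercase hex): the caller serializes its context into a header value, and the downstream service parses it, keeps the trace ID, and generates a fresh span ID. The helper names are hypothetical, and the sketch skips parts of the specification such as rejecting all-zero IDs; production code would normally rely on a tracing library such as OpenTelemetry.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version "00", 32-hex trace ID, 16-hex span ID, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str  # shared by every span in the request
    span_id: str   # identifies the span of the calling service
    flags: str     # "01" means the trace is sampled


def new_context() -> TraceContext:
    """Start a new trace at the edge of the system."""
    return TraceContext(secrets.token_hex(16), secrets.token_hex(8), "01")


def to_header(ctx: TraceContext) -> str:
    """Serialize the context into a traceparent header value."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags}"


def from_header(value: str) -> Optional[TraceContext]:
    """Parse an incoming traceparent value; return None if it is malformed."""
    match = _TRACEPARENT.match(value.strip())
    return TraceContext(*match.groups()) if match else None


def child_of(parent: TraceContext) -> TraceContext:
    """Keep the trace ID, create a fresh span ID for the downstream service."""
    return parent._replace(span_id=secrets.token_hex(8))


# Frontend starts the trace and sends the header with its outgoing request.
frontend_ctx = new_context()
outgoing_headers = {"traceparent": to_header(frontend_ctx)}

# The API service reads the header and continues the same trace.
incoming = from_header(outgoing_headers["traceparent"])
api_ctx = child_of(incoming)
print(api_ctx.trace_id == frontend_ctx.trace_id)
```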

Question 32 of 35

What are the "Four Golden Signals" of site reliability monitoring?

A. Latency, Traffic, Errors, and Saturation.

B. Memory, CPU, Disk, Network.

C. Metrics, Logs, Traces, Alerts.

D. Inbound, Outbound, Cache, Queues.

Explanation: The SRE handbook defines Latency (response time), Traffic (workload), Errors (failure rates), and Saturation (system bottleneck status).

Question 33 of 35

How do SLA, SLO, and SLI differ from each other?

A. They are exact synonyms for server metrics.

B. SLI is the metric; SLO is the target; SLA is the contract agreement defining consequences.

C. SLA is internal; SLO is customer-facing.

D. SLI is configured in Prometheus, SLO in Grafana, and SLA in Slack.

Explanation: Service Level Indicator (SLI) measures current values. Service Level Objective (SLO) sets the target. Service Level Agreement (SLA) is the legal contract.

Question 34 of 35

What is an Error Budget in SRE?

A. The financial cost of server outages.

B. The allowable unreliability threshold (100% - SLO) used to balance innovation against stability.

C. The number of alerts routed to engineers.

D. The storage allocated for system logs.

Explanation: An error budget is the maximum amount of unreliability the SLO allows. Once the budget is spent, teams typically pause feature releases in favor of reliability fixes.
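The arithmetic behind "100% - SLO" is easy to work through. The sketch below assumes a time-based availability SLO and a 30-day window, both of which are illustrative choices; the function names are hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability: (100% - SLO) of the window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60
    return (100.0 - slo_percent) / 100.0 * window_minutes


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the budget still unspent (negative once it is overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% SLO over 30 days allows 0.1% of 43,200 minutes.
print(f"{error_budget_minutes(99.9):.1f} minutes")                      # 43.2 minutes
print(f"{budget_remaining(99.9, downtime_minutes=10.8):.0%} left")      # 75% left
```

A negative result from `budget_remaining` is the signal that the budget is overspent for the window.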

Question 35 of 35

What is the distinction between Black-box and White-box monitoring?

A. Black-box checks external system responses (e.g. ping); White-box checks internal metrics (e.g. CPU, database profiles).

B. Black-box monitoring is legacy; White-box monitoring is cloud-native.

C. Black-box handles traces; White-box handles metrics logs.

D. They use different encryption protocols on scraping.

Explanation: Black-box monitoring checks system behavior from the outside. White-box monitoring tracks internal metrics exposed by the system or application itself.

Real-Time Interview Questions & Answers

1. What is the difference between Monitoring and Observability?

Answer: Monitoring tells you that something is failing by tracking predefined metrics (like CPU or error count). Observability lets you investigate why something is failing by analyzing the telemetry the system emits: metrics, logs, and traces.

Example: “Monitoring alerts us when API response time exceeds 2 seconds; Observability lets us trace request parameters to find the database query causing the bottleneck.”

2. What are SLIs, SLOs, and SLAs, and how do they differ?

Answer: An SLI (Service Level Indicator) is a metric measuring service performance (e.g., error rate). An SLO (Service Level Objective) is the target value for that indicator (e.g., error rate < 0.1%). An SLA (Service Level Agreement) is the legal commitment to customers (e.g., 99.9% uptime or pay a penalty).

Example: “We monitor our API success rate SLI daily to ensure we meet our internal 99.9% SLO and prevent SLA breaches.”

3. How do you troubleshoot a sudden spike in application latency?

Answer: I start by looking at APM dashboards to isolate the slow endpoints. Then, I check database load, thread pool utilization, and inspect distributed traces to pinpoint the bottleneck service.

Example: “I resolved a latency issue by analyzing Jaeger traces and finding that our API service was performing repeated queries due to an N+1 bug.”

4. What is the difference between Pull-based (Prometheus) and Push-based (CloudWatch) monitoring?

Answer: Pull-based systems scrape metrics periodically from exposed endpoints on client systems. Push-based systems require clients or agents to push metrics directly to a central repository over HTTP.

Example: “We use Prometheus to pull metrics from the /metrics endpoints exposed by our Kubernetes pods, and push custom system logs to CloudWatch using the unified agent.”

5. How do you design an alert configuration to avoid Alert Fatigue?

Answer: I configure alerts based on customer impact (symptom-based) rather than cause-based. I send non-critical alerts (like disk usage at 75%) to Slack, and route critical alerts (like API error rate > 5%) to PagerDuty.

Example: “We set up paging alerts only when our API success rate drops below 99% for 5 minutes, reducing noise from short spikes.”

6. How do you configure a CloudWatch alarm to trigger Auto Scaling?

Answer: I set up a metric alarm based on Average CPU Utilization exceeding a threshold (e.g., 75%) for a set period, then link it to an Auto Scaling Group scale-out policy.

Example: “I created a CloudWatch alarm that adds 1 instance to our ASG if our average CPU remains above 80% for 5 minutes.”

7. What is a Distributed Trace, and why is it useful in a microservice environment?

Answer: A distributed trace tracks a request's journey as it hops across multiple services. The tracing library injects a unique trace ID into request headers, allowing developers to see each span and a per-service latency breakdown.

Example: “We use OpenTelemetry to trace user checkouts, helping us identify that a latency issue was caused by a slow payment gateway call.”

8. How do you set up Grafana dashboards to display server metrics?

Answer: I configure Prometheus or CloudWatch as a data source in the Grafana settings, then write PromQL or CloudWatch queries to build panels visualizing CPU, RAM, and Disk I/O.

Example: “I imported the Node Exporter dashboard into Grafana to get instant visualization of our new worker nodes' system load.”

9. What is Root Cause Analysis (RCA), and what steps do you take during an outage?

Answer: RCA is the process of identifying the primary cause of an incident. During an outage, I check active alerts, review the centralized dashboard, isolate recent deployments, and inspect system logs.

Example: “During a DB outage, I checked the DB connections pool and resolved the root cause by killing a hung migration script.”

10. How do you troubleshoot a 'High Error Rate' alert on an API endpoint?

Answer: I check the NGINX or API Gateway logs to find the HTTP status codes. If they are 5xx, I query centralized application logs to check stack traces and search for code exceptions.

Example: “I resolved a 500 error spike by searching logs in Kibana and finding that the database server had run out of connections.”

11. What are the 'Four Golden Signals' of site reliability engineering monitoring?

Answer: The Four Golden Signals are Latency (time to serve request), Traffic (demand/requests per second), Errors (rate of failing requests), and Saturation (resource availability/exhaustion).

Example: “We build our core Kubernetes dashboards around the Four Golden Signals to keep a clear view of our cluster health.”

12. What is Log Aggregation, and why is it necessary?

Answer: Log aggregation collects logs from all servers and applications into a central repository (like Elasticsearch or Loki). This allows developers to query logs in one place instead of SSHing into individual hosts.

Example: “We use Filebeat to send container logs to Elasticsearch, making logs instantly searchable in Kibana during debugging.”

13. How do you monitor memory usage on EC2 instances when CloudWatch does not track it by default?

Answer: Memory is an operating system level metric. I install the unified CloudWatch agent on the instances, configure the `amazon-cloudwatch-agent.json` file to collect memory metrics, and start the agent.

Example: “I configured the CloudWatch agent to push memory and disk metrics to CloudWatch every 60 seconds, setting alarms on those parameters.”

14. How do you configure health checks for endpoint monitoring?

Answer: I configure external synthetic monitors (like Route 53 health checks or Pingdom) to send HTTP requests to a `/health` endpoint periodically, checking for 200 OK status codes.

Example: “We have a Route 53 health check configured to alert us via email and SMS if our landing page endpoint fails to load.”

15. What is Saturation in monitoring, and how do you calculate it?

Answer: Saturation measures how full your system resources are. It is calculated by checking resource metrics (like disk queue depth, memory swap usage, or database connection pool utilization).

Example: “We set a saturation alert on our PostgreSQL database connection limit when it reaches 85% capacity to proactively avoid connection failures.”
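A saturation check like the one in this example boils down to comparing current usage against a hard limit. The sketch below assumes a connection limit of 100 and a hypothetical current count; in a real setup both numbers would come from the database's own statistics.

```python
def saturation_percent(in_use: int, limit: int) -> float:
    """How full a resource is, as a percentage of its hard limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * in_use / limit


def is_saturated(in_use: int, limit: int, alert_at_percent: float = 85.0) -> bool:
    """True once usage reaches the alert threshold."""
    return saturation_percent(in_use, limit) >= alert_at_percent


max_connections = 100    # assumed connection limit for this example
active_connections = 86  # hypothetical current reading

print(saturation_percent(active_connections, max_connections))
print(is_saturated(active_connections, max_connections))
```

Alerting at 85% rather than 100% leaves time to react before new connections start failing.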
