Observability in Microservices — Prometheus, Grafana and OpenTelemetry
End-to-end observability for Spring Boot microservices: metrics with Prometheus, dashboards with Grafana, distributed tracing with OpenTelemetry, alerting and Kubernetes monitoring.
Introduction
When a single user request crosses six microservices, three databases, a message broker and a third-party API, the question is no longer *did it work?* but *where did it spend its time, and which hop failed?* Logs alone cannot answer that. Observability — the discipline of being able to ask new questions of a running system without shipping new code — is what makes complex distributed systems operable.
This tutorial builds a complete observability stack for Spring Boot microservices: metrics with Prometheus, dashboards with Grafana, distributed tracing with OpenTelemetry, logs routed to Loki, and alerting with AlertManager. You will instrument your services, scrape and store telemetry, build dashboards and SLO-driven alerts, and deploy the whole stack to Kubernetes.
Why observability matters
Three signals — metrics, logs and traces — answer the three questions that matter during an incident:
- Metrics: is the system healthy *right now*, and what is the trend?
- Logs: what exactly happened on this one request?
- Traces: where did the latency or error come from across services?
Without all three, you are debugging in the dark. The good news is that modern open-source tools (Prometheus, Grafana, OpenTelemetry, Loki, Tempo, AlertManager) integrate so well that a small team can run a Datadog-quality stack on their own infrastructure.
[Figure: Observability Architecture]
Real-world use cases
- Incident response. Page on SLO burn rate, then drill from a Grafana panel into traces and logs in seconds.
- Performance regression detection. Compare p95 latency week-over-week per endpoint.
- Capacity planning. Trend CPU, memory and request rate to predict the next scale event.
- Customer support. Pull every span and log for a given trace-id to answer "what happened to my order?"
- Cost attribution. Tag spans with team and feature labels to bill internal consumers.
Architecture overview
We deploy three telemetry pipelines that share a single collector:
- Services emit OTLP to the OpenTelemetry Collector.
- The collector fans out to Prometheus (metrics), Loki/ELK (logs) and Tempo/Jaeger (traces).
- Grafana queries all three and renders unified dashboards.
- AlertManager receives alerts from Prometheus and routes them to Slack, PagerDuty or email.
Step 1 — Install Prometheus
Run Prometheus locally with Docker:
```yaml
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports: ["9090:9090"]
```

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'spring-boot'
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['host.docker.internal:8080']
```

```bash
docker compose up -d prometheus
open http://localhost:9090
```
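Once Prometheus is up, the same data shown in its UI is available over the HTTP API, which is handy for smoke tests. The following is a minimal sketch, assuming Prometheus listens on localhost:9090 as in the compose file and using the standard `/api/v1/query` instant-query endpoint. The class name `PromQuery` and its methods are hypothetical names chosen for illustration.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class PromQuery {
    /** Builds the instant-query URL for a PromQL expression (URL-encodes the expression). */
    static URI queryUri(String baseUrl, String promql) {
        String encoded = URLEncoder.encode(promql, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/api/v1/query?query=" + encoded);
    }

    /** Sends the query and returns the raw JSON response body. Requires a running Prometheus. */
    static String query(String baseUrl, String promql) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(queryUri(baseUrl, promql)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        // Only prints the URL; call query(...) instead once the container is running.
        System.out.println(queryUri("http://localhost:9090", "up{job=\"spring-boot\"}"));
    }
}
```

A result value of `1` for the `up` series means the last scrape of the Spring Boot target succeeded.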
Step 2 — Instrument Spring Boot
Add Micrometer + the Prometheus registry:

```xml
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```

```yaml
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health, info, prometheus, metrics
  metrics:
    tags:
      application: ${spring.application.name}
```
Visit http://localhost:8080/actuator/prometheus. You should see the built-in metrics for the JVM and the HTTP server, plus datasource and Kafka metrics when those libraries are on the classpath.
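Each line that endpoint returns follows the Prometheus text exposition format: a metric name, optional labels in braces, and a value. To make that structure concrete, here is a minimal parsing sketch. It is an illustration only: the class `ExpositionLine` is hypothetical, and it deliberately ignores `# HELP`/`# TYPE` comment lines, escaped quotes inside label values, and special values such as `+Inf`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpositionLine {
    // metric name, optional {labels}, whitespace, sample value (an optional timestamp is ignored)
    private static final Pattern SAMPLE =
        Pattern.compile("^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\\{(.*)\\})?\\s+(\\S+).*$");
    private static final Pattern LABEL =
        Pattern.compile("([a-zA-Z_][a-zA-Z0-9_]*)=\"([^\"]*)\"");

    final String name;
    final Map<String, String> labels;
    final double value;

    private ExpositionLine(String name, Map<String, String> labels, double value) {
        this.name = name;
        this.labels = labels;
        this.value = value;
    }

    /** Parses one sample line; throws IllegalArgumentException for anything else. */
    static ExpositionLine parse(String line) {
        Matcher m = SAMPLE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a sample line: " + line);
        }
        Map<String, String> labels = new LinkedHashMap<>();
        if (m.group(2) != null) {
            Matcher l = LABEL.matcher(m.group(2));
            while (l.find()) {
                labels.put(l.group(1), l.group(2));
            }
        }
        return new ExpositionLine(m.group(1), labels, Double.parseDouble(m.group(3)));
    }

    public static void main(String[] args) {
        ExpositionLine s = parse("orders_created_total{application=\"orders\",source=\"api\"} 3.0");
        System.out.println(s.name + " " + s.labels + " " + s.value);
    }
}
```

Every distinct combination of name and label values is stored by Prometheus as its own time series, which is why label choice matters later in this tutorial.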
Add a custom metric:
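```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {

    private final Counter ordersCreated;

    // Spring injects the auto-configured MeterRegistry (backed by Prometheus here).
    public OrderController(MeterRegistry reg) {
        this.ordersCreated = Counter.builder("orders_created_total")
                .description("Orders successfully created")
                .tag("source", "api")
                .register(reg);
    }

    @PostMapping("/orders")
    public ResponseEntity<Void> create() {
        ordersCreated.increment(); // exposed as orders_created_total{source="api"}
        return ResponseEntity.ok().build();
    }
}
```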
Step 3 — Install Grafana
```yaml
# docker-compose.yml (continued)
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
```
Open Grafana, add Prometheus as a data source (http://prometheus:9090), and import the official JVM (Micrometer) dashboard (ID 4701) for an instant baseline.
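A custom panel for our counter:

```promql
rate(orders_created_total[5m])
```

A request-rate panel from the built-in HTTP metrics:

```promql
sum by (uri) (rate(http_server_requests_seconds_count[5m]))
```

p95 latency. The `_bucket` series only exist when histogram buckets are enabled, so set `management.metrics.distribution.percentiles-histogram.http.server.requests=true` first; otherwise this query returns no data:

```promql
histogram_quantile(0.95,
  sum by (le, uri) (rate(http_server_requests_seconds_bucket[5m])))
```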
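It helps to know what `histogram_quantile` does with those buckets: it finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below reproduces that idea in plain Java. It is a simplified illustration, not Prometheus's implementation; the class name and the sample bucket values are made up for the example.

```java
class HistogramQuantile {
    /**
     * Estimates a quantile from cumulative histogram buckets.
     *
     * @param q                quantile between 0 and 1, e.g. 0.95
     * @param upperBounds      bucket upper bounds ("le"), ascending; the last must be +Infinity
     * @param cumulativeCounts cumulative observation count for each bucket
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2) return Double.NaN;             // need at least one finite bucket plus +Inf
        double total = cumulativeCounts[n - 1];
        if (total == 0) return Double.NaN;        // no observations
        double rank = q * total;

        // First bucket whose cumulative count reaches the rank.
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) b++;

        // Rank falls into the +Inf bucket: the best answer is the highest finite bound.
        if (b == n - 1) return upperBounds[n - 2];

        double start = (b == 0) ? 0.0 : upperBounds[b - 1];
        double end = upperBounds[b];
        double below = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - below;
        if (inBucket == 0) return end;            // degenerate case, e.g. q = 0 with an empty first bucket

        // Linear interpolation assumes observations are spread evenly inside the bucket.
        return start + (end - start) * ((rank - below) / inBucket);
    }

    public static void main(String[] args) {
        double[] le = {0.1, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 90, 100, 100};     // hypothetical data: 100 requests
        System.out.println(quantile(0.95, le, counts));
    }
}
```

With these sample buckets the 95th request lands halfway through the 0.5 s to 1.0 s bucket, so the estimate is about 0.75 s. The takeaway is that accuracy depends on bucket boundaries: place a boundary near your SLO threshold.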
Step 4 — OpenTelemetry setup
OpenTelemetry standardises instrumentation, propagation, and export across languages. We will run the OpenTelemetry Collector as a standalone service next to the application containers and configure Spring Boot to push OTLP to it.
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  batch: {}
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  # Prometheus only accepts remote-write when started with
  # --web.enable-remote-write-receiver.
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  otlphttp/tempo:
    endpoint: http://tempo:4318
  # Recent contrib releases deprecate this exporter in favour of Loki's native
  # OTLP endpoint; check the release notes for the image tag you pin.
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    # memory_limiter goes first so it can push back before data is batched.
    metrics: { receivers: [otlp], processors: [memory_limiter, batch], exporters: [prometheusremotewrite] }
    traces: { receivers: [otlp], processors: [batch], exporters: [otlphttp/tempo] }
    logs: { receivers: [otlp], processors: [batch], exporters: [loki] }
```
```yaml
# docker-compose.yml (continued)
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes: ["./otel-collector-config.yaml:/etc/otelcol/config.yaml:ro"]
    ports: ["4317:4317", "4318:4318"]
```
Step 5 — Spring Boot instrumentation with OpenTelemetry
Add the OTel starter:

```xml
<dependency>
  <groupId>io.opentelemetry.instrumentation</groupId>
  <artifactId>opentelemetry-spring-boot-starter</artifactId>
  <version>2.6.0</version>
</dependency>
```

```yaml
# application.yml
otel:
  service:
    name: orders-service
  exporter:
    otlp:
      endpoint: http://otel-collector:4318
  traces:
    sampler: parentbased_traceidratio
    sampler.arg: 0.1   # 10% sampling
```
The starter now instruments incoming @RestController requests, RestTemplate and WebClient calls, JDBC queries, Kafka producers and consumers, and scheduled tasks. All spans created while handling one request share a single trace-id, which is propagated between services in the W3C traceparent header.
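The `parentbased_traceidratio` sampler configured above combines two rules: if the request arrives with a parent span, follow the caller's sampling decision; otherwise decide from the trace-id itself so that every service reaches the same verdict. The sketch below illustrates that logic in plain Java. It is an approximation for explanation only, not the OpenTelemetry SDK's code, and the class name `ParentBasedRatioSampler` is hypothetical.

```java
class ParentBasedRatioSampler {
    private final long upperBound;

    /** @param ratio fraction of root traces to keep, between 0.0 and 1.0 */
    ParentBasedRatioSampler(double ratio) {
        // Scale the ratio onto the range of a non-negative 63-bit number.
        // The cast saturates at Long.MAX_VALUE when ratio is 1.0.
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    /**
     * @param traceIdHex    the 32-character hex trace-id
     * @param parentSampled sampled flag of the incoming parent span, or null for a root span
     */
    boolean shouldSample(String traceIdHex, Boolean parentSampled) {
        if (parentSampled != null) {
            return parentSampled;   // keep the whole trace consistent with the caller
        }
        // Use the lower 64 bits of the trace-id as a deterministic pseudo-random number.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        return (low & Long.MAX_VALUE) < upperBound;
    }

    public static void main(String[] args) {
        ParentBasedRatioSampler sampler = new ParentBasedRatioSampler(0.1);
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";   // example id from the W3C spec
        System.out.println("root span sampled: " + sampler.shouldSample(traceId, null));
        System.out.println("child of sampled parent: " + sampler.shouldSample(traceId, true));
    }
}
```

Because the decision depends only on the trace-id and the parent flag, a trace is either recorded in full or not at all, which avoids traces with missing hops.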
A custom span for a business operation:
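```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

@Autowired Tracer tracer;

public Order checkout(Cart cart) {
    // The new span becomes a child of whatever span is current for this request.
    Span span = tracer.spanBuilder("checkout").startSpan();
    try (Scope s = span.makeCurrent()) {
        span.setAttribute("cart.size", cart.size());
        return doCheckout(cart);
    } finally {
        span.end(); // always end the span, even when doCheckout throws
    }
}
```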
Distributed tracing across services
The whole point of OpenTelemetry is that one trace-id flows from the user's request all the way through the system.
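What actually travels between services is the W3C `traceparent` HTTP header, which has the form `version-traceid-parentid-flags`. The instrumentation reads and writes it for you; the sketch below only illustrates the format. The class `TraceParent` is a hypothetical helper, and it handles version `00` only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TraceParent {
    // 00-<32 hex trace-id>-<16 hex parent span id>-<2 hex flags>
    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String parentId;
    final boolean sampled;

    private TraceParent(String traceId, String parentId, boolean sampled) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    /** Parses an incoming header; throws IllegalArgumentException if it is malformed. */
    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        // All-zero ids are defined as invalid by the specification.
        if (m.group(1).equals("0".repeat(32)) || m.group(2).equals("0".repeat(16))) {
            throw new IllegalArgumentException("All-zero id in traceparent: " + header);
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return new TraceParent(m.group(1), m.group(2), sampled);
    }

    /** Header for an outgoing call: same trace-id, the calling span becomes the parent. */
    String forOutgoingCall(String currentSpanId) {
        return String.format("00-%s-%s-%s", traceId, currentSpanId, sampled ? "01" : "00");
    }

    public static void main(String[] args) {
        TraceParent incoming = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(incoming.forOutgoingCall("b7ad6b7169203331"));
    }
}
```

The trace-id stays constant across every hop while the parent id changes at each one, which is how the backend reassembles the spans into a tree.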
[Figure: Distributed Tracing Across Microservices]
In Grafana, open Tempo, paste a trace-id from your logs, and you get a flame graph of every span with timings, attributes and links to the corresponding logs.
Step 6 — Dashboard creation
A production dashboard usually has four rows:
- RED: Requests per second, Error rate, Duration (p50/p95/p99) per service.
- USE: Utilization, Saturation, Errors for CPU, memory, disk.
- Business KPIs: orders/min, sign-ups/hour, revenue/day.
- Dependencies: DB pool utilization, Kafka lag, downstream call latency.
Example service-overview JSON snippet:

```json
{
  "title": "Orders Service",
  "panels": [
    {
      "title": "Requests/sec",
      "targets": [{ "expr": "sum(rate(http_server_requests_seconds_count{application=\"orders\"}[5m]))" }]
    },
    {
      "title": "Error rate",
      "targets": [{ "expr": "sum(rate(http_server_requests_seconds_count{application=\"orders\", status=~\"5..\"}[5m]))" }]
    },
    {
      "title": "p95 latency",
      "targets": [{ "expr": "histogram_quantile(0.95, sum by (le) (rate(http_server_requests_seconds_bucket{application=\"orders\"}[5m])))" }]
    }
  ]
}
```
Save the JSON in Git, mount it via a Grafana ConfigMap provisioner, and dashboards become code.
Step 7 — Alerting rules
Alert on symptoms (slow requests, high error rate) rather than causes (high CPU). A symptom-based alert maps directly to a user-visible problem.
```yaml
# rules/alerts.yml
groups:
  - name: orders-slo
    rules:
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_server_requests_seconds_count{application="orders", status=~"5.."}[5m]))
          /
          sum(rate(http_server_requests_seconds_count{application="orders"}[5m]))) > 0.02
        for: 10m
        labels: { severity: page }
        annotations:
          summary: "Orders service 5xx > 2% for 10m"
          runbook: "https://runbooks/orders/error-rate"
      - alert: SlowP95
        expr: histogram_quantile(0.95, sum by (le) (rate(http_server_requests_seconds_bucket{application="orders"}[5m]))) > 0.5
        for: 15m
        labels: { severity: ticket }
        annotations:
          summary: "Orders p95 > 500ms for 15m"
```
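The `for:` clause in these rules is what keeps a brief spike from paging anyone: an alert is first pending, and only fires once the expression has stayed true for the whole duration. The small state machine below sketches that behaviour. It is a simplified model for illustration, and the class `AlertRule` and its method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

class AlertRule {
    enum State { INACTIVE, PENDING, FIRING }

    private final double threshold;
    private final Duration holdFor;
    private Instant activeSince;   // null while the condition is false

    AlertRule(double threshold, Duration holdFor) {
        this.threshold = threshold;
        this.holdFor = holdFor;
    }

    /** Called once per evaluation interval with the current value of the expression. */
    State evaluate(Instant now, double value) {
        if (value <= threshold) {
            activeSince = null;            // condition cleared: the timer resets
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now;             // first evaluation above the threshold
        }
        boolean heldLongEnough = Duration.between(activeSince, now).compareTo(holdFor) >= 0;
        return heldLongEnough ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        AlertRule highErrorRate = new AlertRule(0.02, Duration.ofMinutes(10));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(highErrorRate.evaluate(t0, 0.05));
        System.out.println(highErrorRate.evaluate(t0.plus(Duration.ofMinutes(10)), 0.05));
    }
}
```

Note that a single evaluation below the threshold resets the timer, so a flapping signal may never fire; that is one reason burn-rate alerts over longer windows are more robust.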
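```yaml
# alertmanager.yml
# $SLACK_WEBHOOK and $PD_KEY are placeholders: Alertmanager does not expand
# environment variables, so substitute them at deploy time (for example with
# envsubst) or use api_url_file for the Slack webhook.
route:
  receiver: slack-default
  routes:
    - matchers: [severity="page"]
      receiver: pagerduty
receivers:
  - name: slack-default
    slack_configs:
      - channel: "#alerts"
        api_url: "$SLACK_WEBHOOK"
  - name: pagerduty
    pagerduty_configs:
      - service_key: "$PD_KEY"
```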
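The routing tree above reads top-down: an alert is tested against each child route in order, the first match handles it, and anything unmatched falls back to the root receiver. The sketch below models only that core behaviour with exact-match label comparison. It is an illustration with hypothetical names, and it omits Alertmanager features such as `continue`, regex matchers and grouping.

```java
import java.util.List;
import java.util.Map;

class Route {
    final Map<String, String> matchers;   // label name -> required value
    final String receiver;
    final List<Route> children;

    Route(Map<String, String> matchers, String receiver, List<Route> children) {
        this.matchers = matchers;
        this.receiver = receiver;
        this.children = children;
    }

    /** True when every matcher equals the corresponding alert label. */
    boolean matches(Map<String, String> labels) {
        return matchers.entrySet().stream()
                .allMatch(e -> e.getValue().equals(labels.get(e.getKey())));
    }

    /** Returns the receiver that would handle an alert carrying these labels. */
    String resolve(Map<String, String> labels) {
        for (Route child : children) {
            if (child.matches(labels)) {
                return child.resolve(labels);   // first matching child wins
            }
        }
        return receiver;                        // no child matched: use this route's receiver
    }

    public static void main(String[] args) {
        Route root = new Route(Map.of(), "slack-default",
                List.of(new Route(Map.of("severity", "page"), "pagerduty", List.of())));
        System.out.println(root.resolve(Map.of("alertname", "HighErrorRate", "severity", "page")));
        System.out.println(root.resolve(Map.of("alertname", "SlowP95", "severity", "ticket")));
    }
}
```

With the configuration from this step, `HighErrorRate` (severity `page`) resolves to `pagerduty` and `SlowP95` (severity `ticket`) falls through to `slack-default`.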
Step 8 — Kubernetes monitoring
In a cluster, replace static scrape configs with the kube-prometheus-stack Helm chart, which installs Prometheus, Grafana, AlertManager, node-exporter and kube-state-metrics in one shot.
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f values.yaml
```
A minimal values.yaml:
```yaml
grafana:
  adminPassword: change-me
  defaultDashboardsTimezone: utc
  persistence: { enabled: true, size: 10Gi }
alertmanager:
  config:
    route: { receiver: slack }
    receivers:
      - name: slack
        slack_configs:
          - channel: "#alerts"
            api_url: "$SLACK_WEBHOOK"
prometheus:
  prometheusSpec:
    retention: 30d
    resources:
      requests: { cpu: "500m", memory: "2Gi" }
```
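The kube-prometheus-stack chart discovers targets through ServiceMonitor and PodMonitor resources (see the ServiceMonitor under Configuration examples) and ignores prometheus.io/* annotations by default. If you prefer annotation-based discovery, add a pod-annotation scrape job through prometheus.prometheusSpec.additionalScrapeConfigs, then annotate the pods:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/actuator/prometheus"
```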
For OpenTelemetry traces and logs, deploy the OTel Collector as a DaemonSet so every node ships telemetry locally.
[Figure: Kubernetes Monitoring Stack]
Configuration examples
A ServiceMonitor to scrape a service via the Prometheus Operator:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: orders
  namespace: monitoring
  labels: { release: monitoring }
spec:
  namespaceSelector: { matchNames: [orders] }
  selector: { matchLabels: { app: orders } }
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 15s
```
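The fast-burn tier of a multi-window, multi-burn-rate SLO alert (recommended for any SLI). Both the 1h and the 5m window must exceed the threshold, so the alert fires quickly and also resets quickly once the problem stops:

```yaml
- alert: SLOBurnFast
  expr: |
    (
      sum(rate(http_server_requests_seconds_count{status=~"5.."}[1h]))
      / sum(rate(http_server_requests_seconds_count[1h]))
    ) > (14.4 * 0.001)   # 99.9% SLO = 0.1% budget, burn rate 14.4 over 1h
    and
    (
      sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
      / sum(rate(http_server_requests_seconds_count[5m]))
    ) > (14.4 * 0.001)   # short window confirms the burn is still happening
  for: 5m
  labels: { severity: page }
```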
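The factor 14.4 in that expression is not arbitrary. A burn rate is how many times faster than "exactly on budget" the error budget is being spent; spending 2% of a 30-day budget within one hour corresponds to 0.02 × 720 h / 1 h = 14.4. The sketch below works through the arithmetic. The class and method names are hypothetical, and the 30-day period is an assumption.

```java
class BurnRate {
    /** Error budget as a fraction of requests, e.g. 0.001 for a 99.9% SLO. */
    static double errorBudget(double slo) {
        return 1.0 - slo;
    }

    /** Burn rate that consumes the given fraction of the budget within the window. */
    static double burnRate(double budgetFractionConsumed, double windowHours, double periodHours) {
        return budgetFractionConsumed * periodHours / windowHours;
    }

    /** Error ratio to alert on for a given burn rate. */
    static double thresholdErrorRatio(double burnRate, double slo) {
        return burnRate * errorBudget(slo);
    }

    /** Total downtime the SLO allows over the period, in minutes. */
    static double allowedDowntimeMinutes(double slo, double periodDays) {
        return errorBudget(slo) * periodDays * 24 * 60;
    }

    public static void main(String[] args) {
        double slo = 0.999;
        double periodHours = 30 * 24;   // assumed 30-day SLO period
        double fast = burnRate(0.02, 1, periodHours);   // 2% of budget in 1 hour
        double slow = burnRate(0.05, 6, periodHours);   // 5% of budget in 6 hours
        System.out.printf("fast burn rate %.1f, alert above error ratio %.4f%n",
                fast, thresholdErrorRatio(fast, slo));
        System.out.printf("slow burn rate %.1f, alert above error ratio %.4f%n",
                slow, thresholdErrorRatio(slow, slo));
        System.out.printf("allowed downtime: %.1f minutes%n", allowedDowntimeMinutes(slo, 30));
    }
}
```

For a 99.9% SLO this gives a fast-burn threshold of roughly 1.44% errors and a slow-burn threshold of roughly 0.6%, which is how you would derive a second, slower tier to pair with the rule above.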
Security considerations
- Lock down /actuator endpoints. Expose only health, info and prometheus; secure with Spring Security or a network policy.
- TLS between collectors and backends. OTLP over HTTPS, Prometheus remote-write over HTTPS.
- Drop PII before export. Use OTel Collector processors (attributes/redact, filter) to strip emails, tokens, card numbers.
- Authentication on Grafana. Wire SSO (OIDC/Google/GitHub). Disable anonymous access.
- Retention and storage limits. Cap Prometheus retention; rotate Loki and Tempo storage in object storage with lifecycle rules.
- Multi-tenancy. In a shared cluster, use Mimir or per-tenant Prometheus to isolate teams.
Production best practices
- Define SLOs first. Pick 99.9% availability and 200 ms p95 *before* you build dashboards. Alerts derive from SLOs.
- Use the RED and USE methods. Every service gets Rate, Errors, Duration; every resource gets Utilization, Saturation, Errors.
- Standardize labels. application, environment, team, version on every metric.
- Sample traces. 100% sampling at scale is unaffordable. 1–10% with tail-based sampling for errors is the sweet spot.
- Treat dashboards and alerts as code. Store JSON in Git, deploy via Grafana provisioning.
- Alert on burn rate, not raw error count. Multi-window multi-burn-rate alerts catch fast and slow regressions.
- Run a chaos drill. Kill a pod, trip a circuit, fill a disk. Confirm the alert fires and the dashboard tells the story.
Common mistakes
- Alerting on CPU. It is a cause, not a symptom. Users do not feel CPU.
- Dashboards with 50 panels. Nobody can read them in an incident. Aim for ≤8 per screen.
- Unbounded label cardinality. Adding userId as a label explodes Prometheus memory.
- Logging structured data as a free-form string. Use JSON logs and let Loki parse fields.
- One global trace sampler. Sample errors at 100%, normal traffic at 1%.
- Ignoring exemplars. Exemplars link a metric data point to a trace — the single best feature for incident drill-down.
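The cardinality mistake in the list above is easy to underestimate because series counts multiply. In the worst case, a metric produces one time series per combination of label values, so the total is the product of each label's distinct values. The sketch below does that back-of-the-envelope calculation; the label counts are invented for illustration and `Cardinality` is a hypothetical helper.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class Cardinality {
    /** Worst-case number of series for one metric: the product of distinct values per label. */
    static long worstCaseSeries(Map<String, Long> distinctValuesPerLabel) {
        long series = 1;
        for (long distinct : distinctValuesPerLabel.values()) {
            series = Math.multiplyExact(series, distinct);   // throws on overflow instead of wrapping
        }
        return series;
    }

    public static void main(String[] args) {
        Map<String, Long> labels = new LinkedHashMap<>();
        labels.put("uri", 20L);       // hypothetical: 20 templated routes
        labels.put("status", 5L);
        labels.put("method", 4L);
        System.out.println("bounded labels: " + worstCaseSeries(labels));

        labels.put("userId", 100_000L);   // hypothetical: one value per user
        System.out.println("with userId:    " + worstCaseSeries(labels));
    }
}
```

In this made-up example, three bounded labels give at most 400 series per instance, while adding a per-user label raises the ceiling to 40 million. Identifiers like that belong in traces and logs, not in metric labels.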
Troubleshooting
- /actuator/prometheus returns 404. The registry dependency is missing, or the endpoint is not exposed.
- Metrics have no labels for service. Set management.metrics.tags.application.
- Trace IDs are missing in logs. Add %X{trace_id} to your Logback pattern and ensure the OTel instrumentation injects MDC.
- Prometheus memory grows unbounded. A label cardinality explosion, usually a route with a path parameter. Aggregate before storing.
- Grafana panel shows "No data". Wrong PromQL label selector or the metric simply has no data yet. Verify in Prometheus first.
- Tempo says trace not found. Sampling discarded it. Raise the sampling ratio or enable tail-based sampling for errors.
FAQ
Prometheus vs Datadog? Datadog is a managed product with broader integrations and a price tag that scales with hosts. Prometheus + Grafana is free, infinitely extensible, and the de-facto open standard in Kubernetes shops.
Do I need OpenTelemetry if I already have Micrometer? Micrometer gives you metrics. OpenTelemetry adds traces, logs, and a vendor-neutral export protocol. They coexist — Micrometer can export via OTLP.
Loki vs ELK? Loki indexes only labels, not log content — cheaper to run, perfect when logs are correlated by labels (job, namespace, trace-id). ELK indexes everything — more powerful full-text search, more expensive.
Tempo vs Jaeger? Tempo stores traces in object storage, scales horizontally, integrates natively with Grafana, and is optimized for lookup by trace ID. Jaeger has a richer query API and is a CNCF graduated project; Tempo is an open-source project from Grafana Labs.
How long should I retain metrics? 30 days hot in Prometheus, 1+ year in long-term storage such as Thanos, Cortex or Mimir, downsampled where the backend supports it.
How do I correlate logs and traces? Inject trace_id and span_id into your logger MDC and add them to the log line. Grafana auto-links from a log to the matching trace.
What is a good error budget? A 99.9% SLO gives about 43 minutes of allowed downtime per 30-day month. Burn 2% of the budget in an hour (a burn rate of 14.4) → page. Burn 5% over six hours (a burn rate of 6) → ticket.
Should I export from the app or via the OTel Collector? The Collector. It decouples export from your code, batches efficiently, applies redaction, and lets you change backends without redeploying services.
Do I need a separate Alertmanager per environment? One per Prometheus cluster is enough. Use routing trees and label-based silences to keep alerts targeted.
How much does a self-hosted stack cost? For a 50-service team: 2 vCPU and 8 GB for Prometheus, 1 vCPU and 4 GB for Grafana, 2 vCPU and 4 GB for the OTel Collector. Add object storage for long-term retention. Cheap compared to any SaaS at this scale.
Key takeaways
- Observability stands on three pillars — metrics, logs, traces — and you need all three.
- OpenTelemetry is the vendor-neutral standard for collecting and shipping telemetry.
- Prometheus + Grafana give you metrics and dashboards for free, on commodity hardware.
- Define SLOs first, alert on symptoms, and alert on burn rate, not raw counts.
- Treat dashboards, alerts and Collector configs as code in Git.
- In Kubernetes, the kube-prometheus-stack chart gets you to production in one command.
Related tutorials
- Spring Boot Microservices Architecture Explained
- Kubernetes Basics for Java Developers
- GitOps with ArgoCD
- Infrastructure as Code with Terraform
- Dockerizing a Spring Boot Application
- Backend & DevOps Roadmaps
Conclusion
Observability is what separates a team that *responds* to incidents from a team that *prevents* them. With Prometheus, Grafana and OpenTelemetry, you can ship a production-grade telemetry stack in a few hundred lines of YAML and a handful of Spring Boot dependencies. Instrument early, define SLOs deliberately, and your microservices stop being a black box — and start being a system you understand.
[Figure: Microservices Architecture]
Key takeaways
- Understand the core concepts behind observability in microservices with Prometheus, Grafana and OpenTelemetry in a production context.
- Apply the patterns to real DevOps & CI/CD systems, not just toy examples.
- Recognize the trade-offs, failure modes, and operational concerns before adopting them.
- Get a clear path to the next step — related tutorials, tools, and reference architectures.
Common mistakes
1. Copy-pasting code without understanding the trade-offs
It's tempting to ship a snippet from a blog post into production, but DevOps & CI/CD patterns only work when the failure modes are understood. Always reason about timeouts, retries, and consistency.
2. Skipping observability from day one
Structured logs, metrics, and traces are not optional. Wire them in before you ship — debugging DevOps & CI/CD systems without them is painful and expensive.
3. Optimizing too early
Premature caching, sharding, or microservice extraction adds operational cost. Validate the bottleneck with real measurements first.
4. Ignoring security defaults
Secrets in env files, open management ports, missing RBAC — these are the most common production incidents. Treat security as part of the definition of done.
Production best practices
Apply these before promoting your observability stack to a real production environment.
Scalability
Design DevOps & CI/CD services to scale horizontally. Keep request handlers stateless, push session and cache state to external stores (Redis, the database), and benchmark p95/p99 latency under realistic load before tuning.
Monitoring & Observability
Emit metrics (RED/USE), structured JSON logs, and distributed traces from day one. Wire dashboards and alerts to SLOs you actually care about — error rate, latency, saturation — not vanity metrics.
Logging
Log with correlation IDs, never log secrets or PII, and centralize logs (ELK, Loki, CloudWatch). Use levels deliberately: INFO for state changes, WARN for recoverable issues, ERROR for incidents.
Security
Apply least-privilege IAM, rotate secrets through a vault, validate every input, and patch dependencies on a schedule. For HTTP services, enable TLS everywhere and set sensible security headers.
Testing
Layer unit, integration, and contract tests. Run them in CI on every PR, and add smoke tests post-deploy. For DevOps & CI/CD systems, also run chaos and load tests before a major release.
Reliability & Rollouts
Ship with health checks, readiness probes, graceful shutdown, and a rollback strategy. Prefer canary or blue/green deploys over big-bang releases.
Frequently asked questions
Is this tutorial up to date?
Yes. This tutorial was last reviewed and updated on June 2, 2026. We revisit popular DevOps & CI/CD tutorials regularly to keep them aligned with current best practices.
What level is this tutorial aimed at?
It is written for working developers with some backend experience. Beginners can still follow along, and senior engineers will find production-grade patterns and trade-off discussions.
Do I need to follow every step in order?
The walkthrough is sequential because each step depends on the previous one. If you only need a specific concept, the table of contents at the top of the article lets you jump straight to that section.
Where can I find the source code?
The full source code is available on GitHub: https://github.com/masterlabsystems/observability-stack-demo. Fork it, run it locally, and adapt it to your own project.
