DevOps & CI/CD · 20 min read · By Liyabona Saki

Observability in Microservices — Prometheus, Grafana and OpenTelemetry

End-to-end observability for Spring Boot microservices: metrics with Prometheus, dashboards with Grafana, distributed tracing with OpenTelemetry, alerting and Kubernetes monitoring.

Introduction

When a single user request crosses six microservices, three databases, a message broker and a third-party API, the question is no longer *did it work?* but *where did it spend its time, and which hop failed?* Logs alone cannot answer that. Observability — the discipline of being able to ask new questions of a running system without shipping new code — is what makes complex distributed systems operable.

This tutorial builds a complete observability stack for Spring Boot microservices: metrics with Prometheus, dashboards with Grafana, distributed tracing with OpenTelemetry, logs routed to Loki, and alerting with AlertManager. You will instrument your services, scrape and store telemetry, build dashboards and SLO-driven alerts, and deploy the whole stack to Kubernetes.

Why observability matters

Three signals — metrics, logs and traces — answer the three questions that matter during an incident:

  • Metrics: is the system healthy *right now*, and what is the trend?
  • Logs: what exactly happened on this one request?
  • Traces: where did the latency or error come from across services?

Without all three, you are debugging in the dark. The good news is that modern open-source tools (Prometheus, Grafana, OpenTelemetry, Loki, Tempo, AlertManager) integrate so well that a small team can run a Datadog-quality stack on their own infrastructure.

Architecture

Figure: Observability Architecture. Order, Payment and User services (Micrometer, OTel SDK) send metrics, logs and traces over OTLP to the OpenTelemetry Collector, which feeds Prometheus (TSDB, PromQL), Loki/ELK (log store) and Tempo/Jaeger (trace store). Grafana queries these backends for dashboards, and AlertManager delivers alerts to Slack and PagerDuty.
Spring Boot services export metrics, logs and traces via OpenTelemetry. Prometheus scrapes metrics, Grafana visualizes them, and AlertManager fires alerts on SLO violations.

Real-world use cases

  • Incident response. Page on SLO burn rate, then drill from a Grafana panel into traces and logs in seconds.
  • Performance regression detection. Compare p95 latency week-over-week per endpoint.
  • Capacity planning. Trend CPU, memory and request rate to predict the next scale event.
  • Customer support. Pull every span and log for a given trace-id to answer "what happened to my order?"
  • Cost attribution. Tag spans with team and feature labels to bill internal consumers.

Architecture overview

We deploy three telemetry pipelines that share a single collector:

  • Services emit OTLP to the OpenTelemetry Collector.
  • The collector fans out to Prometheus (metrics), Loki/ELK (logs) and Tempo/Jaeger (traces).
  • Grafana queries all three and renders unified dashboards.
  • AlertManager receives alerts from Prometheus and routes them to Slack, PagerDuty or email.

Step 1 — Install Prometheus

Run Prometheus locally with Docker:

yaml
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports: ["9090:9090"]
yaml
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'spring-boot'
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['host.docker.internal:8080']
bash
docker compose up -d prometheus
open http://localhost:9090

Step 2 — Instrument Spring Boot

Add Micrometer + the Prometheus registry:

xml
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
yaml
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health, info, prometheus, metrics
  metrics:
    tags:
      application: ${spring.application.name}

Visit http://localhost:8080/actuator/prometheus and you should see a long list of built-in metrics covering the JVM and the HTTP server, plus datasource and Kafka metrics when those libraries are in use.

Add a custom metric:

java
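import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {
  private final Counter ordersCreated;

  // Register the counter once; Spring injects the auto-configured MeterRegistry.
  public OrderController(MeterRegistry reg) {
    this.ordersCreated = Counter.builder("orders_created_total")
        .description("Orders successfully created")
        .tag("source", "api")
        .register(reg);
  }

  @PostMapping("/orders")
  public ResponseEntity<?> create() {
    ordersCreated.increment(); // one increment per created order
    return ResponseEntity.ok().build();
  }
}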

Step 3 — Install Grafana

yaml
# docker-compose.yml (continued)
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin

Open Grafana at http://localhost:3000, add Prometheus as a data source (http://prometheus:9090, the address inside the Compose network), and import the community JVM (Micrometer) dashboard (ID 4701) for an instant baseline.

A custom panel for our counter:

promql
rate(orders_created_total[5m])
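
To make the `rate()` call above less of a black box, here is a minimal Java sketch of the underlying idea: the per-second increase of a counter across a window, where any drop in value is treated as a counter reset. It is a simplification, not Prometheus's implementation (the real function also extrapolates to the window edges), and the class and method names are hypothetical.

```java
/** Simplified illustration of a Prometheus-style rate(); names are hypothetical. */
class CounterRate {
    /**
     * Per-second increase over the scraped samples in one window.
     * timestamps are in seconds, values are cumulative counter readings.
     * Unlike Prometheus, this sketch does not extrapolate to the window edges.
     */
    static double perSecondRate(double[] timestamps, double[] values) {
        if (values.length < 2) {
            return Double.NaN; // not enough samples in the window
        }
        double increase = 0.0;
        for (int i = 1; i < values.length; i++) {
            double prev = values[i - 1];
            double curr = values[i];
            // A drop means the counter restarted from zero, so the whole
            // current value counts as increase since the reset.
            increase += (curr >= prev) ? (curr - prev) : curr;
        }
        double elapsed = timestamps[timestamps.length - 1] - timestamps[0];
        return increase / elapsed;
    }

    public static void main(String[] args) {
        double[] t = {0, 15, 30, 45};
        double[] v = {100, 130, 10, 40}; // third sample follows a restart
        System.out.println(perSecondRate(t, v));
    }
}
```

The reset handling is why a service restart does not show up as a negative spike on the panel.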

A request-rate panel from the built-in HTTP metrics:

promql
sum by (uri) (rate(http_server_requests_seconds_count[5m]))

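p95 latency (the _bucket series exist only when histogram buckets are enabled, for example with management.metrics.distribution.percentiles-histogram.http.server.requests: true):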

promql
histogram_quantile(0.95,
  sum by (le, uri) (rate(http_server_requests_seconds_bucket[5m])))
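
The p95 query above estimates a quantile from cumulative bucket counts rather than from raw latencies. The following Java sketch shows the linear-interpolation idea under simplifying assumptions (at least two buckets, the last one being +Inf, lower bound of the first bucket taken as 0). The class name and sample numbers are hypothetical, and edge-case handling in Prometheus itself is more thorough.

```java
/** Simplified illustration of histogram_quantile(); names are hypothetical. */
class HistogramQuantile {
    /**
     * q           quantile in [0, 1]
     * upperBounds ascending bucket bounds (the "le" label); the last is +Infinity
     * cumulative  cumulative observation count per bucket, same length
     */
    static double quantile(double q, double[] upperBounds, double[] cumulative) {
        if (upperBounds.length < 2 || upperBounds.length != cumulative.length) {
            throw new IllegalArgumentException("need at least two matching buckets");
        }
        int last = upperBounds.length - 1;
        double total = cumulative[last];
        if (total == 0) {
            return Double.NaN; // no observations in the window
        }
        double rank = q * total;
        int b = 0;
        while (b < last && cumulative[b] < rank) {
            b++; // first bucket whose cumulative count reaches the rank
        }
        if (b == last) {
            return upperBounds[last - 1]; // rank is in +Inf: report highest finite bound
        }
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double below = (b == 0) ? 0.0 : cumulative[b - 1];
        double inBucket = cumulative[b] - below;
        if (inBucket == 0) {
            return upperBounds[b];
        }
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[b] - lower) * (rank - below) / inBucket;
    }

    public static void main(String[] args) {
        double[] le = {0.1, 0.25, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 80, 95, 99, 100};
        System.out.println(quantile(0.9, le, counts));
    }
}
```

Because the result is interpolated inside a bucket, its accuracy depends on how closely the bucket bounds bracket your latency target.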

Step 4 — OpenTelemetry setup

OpenTelemetry standardises instrumentation, propagation, and export across languages. We will run the OpenTelemetry Collector as a standalone service next to the applications (a Compose service here, a DaemonSet later in Kubernetes) and configure Spring Boot to push OTLP to it.

```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  batch: {}
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  prometheusremotewrite:
    # Prometheus must be started with --web.enable-remote-write-receiver to accept writes
    endpoint: http://prometheus:9090/api/v1/write
  otlphttp/tempo:
    endpoint: http://tempo:4318
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    # memory_limiter goes first so it can apply back-pressure before batching
    metrics: { receivers: [otlp], processors: [memory_limiter, batch], exporters: [prometheusremotewrite] }
    traces:  { receivers: [otlp], processors: [batch], exporters: [otlphttp/tempo] }
    logs:    { receivers: [otlp], processors: [batch], exporters: [loki] }
```

yaml
# docker-compose.yml (continued)
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes: ["./otel-collector-config.yaml:/etc/otelcol/config.yaml:ro"]
    ports: ["4317:4317", "4318:4318"]

Step 5 — Spring Boot instrumentation with OpenTelemetry

Add the OTel starter:

xml
<dependency>
  <groupId>io.opentelemetry.instrumentation</groupId>
  <artifactId>opentelemetry-spring-boot-starter</artifactId>
  <version>2.6.0</version>
</dependency>
yaml
# application.yml
otel:
  service:
    name: orders-service
  exporter:
    otlp:
      endpoint: http://otel-collector:4318
  traces:
    sampler: parentbased_traceidratio
    sampler.arg: 0.1     # 10% sampling

The starter now instruments Spring MVC controllers, RestTemplate, WebClient, JDBC calls, Kafka producers and consumers, and scheduled tasks. Every span created while serving one request carries the same trace-id, propagated between services through W3C traceparent headers.
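
To show what travels in that header, here is a small Java sketch that parses a version-00 `traceparent` value and builds the header for an outgoing call. The instrumentation does this for you; the class name, method names and span IDs below are hypothetical and only illustrate the wire format (version, 32-hex trace-id, 16-hex parent span-id, 2-hex flags).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for the W3C traceparent header; names are hypothetical. */
class TraceParent {
    // version "00", 32-hex trace-id, 16-hex parent span-id, 2-hex flags
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String parentSpanId;
    final boolean sampled;

    private TraceParent(String traceId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    /** Returns the parsed header, or null when it is missing or malformed. */
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header);
        if (!m.matches() || m.group(1).matches("0+") || m.group(2).matches("0+")) {
            return null; // all-zero IDs are invalid
        }
        int flags = Integer.parseInt(m.group(3), 16);
        return new TraceParent(m.group(1), m.group(2), (flags & 0x01) != 0);
    }

    /** Header for an outgoing call: same trace-id, the current span becomes the parent. */
    String forChild(String currentSpanId) {
        // Only the sampled bit is carried over in this sketch.
        return "00-" + traceId + "-" + currentSpanId + "-" + (sampled ? "01" : "00");
    }

    public static void main(String[] args) {
        TraceParent incoming =
                parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(incoming.forChild("b7ad6b7169203331"));
    }
}
```

The trace-id stays constant across hops while the span-id changes, which is what lets the backend stitch spans from different services into one trace.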

A custom span for a business operation:

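```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

@Autowired Tracer tracer;

public Order checkout(Cart cart) {
  Span span = tracer.spanBuilder("checkout").startSpan();
  // makeCurrent() puts the span in the active context so child spans attach to it
  try (Scope s = span.makeCurrent()) {
    span.setAttribute("cart.size", cart.size());
    return doCheckout(cart);
  } finally {
    span.end(); // always end the span, even when doCheckout throws
  }
}
```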

Distributed tracing across services

The whole point of OpenTelemetry is that one trace-id flows from the user's request all the way through the system.

Architecture

Figure: Distributed Tracing Across Microservices. A user request (HTTP /checkout) enters the API Gateway with trace-id a3f9…, which is propagated in context headers to the User Service (span: auth, 12 ms), Order Service (span: create, 38 ms), Payment Service (span: charge, 142 ms) and PostgreSQL (span: insert, 18 ms).
A single user request propagates one trace ID through API Gateway, User, Order and Payment services down to the database. Each hop emits a span with latency, building an end-to-end timeline.

In Grafana, open Tempo, paste a trace-id from your logs, and you get a flame graph of every span with timings, attributes and links to the corresponding logs.

Step 6 — Dashboard creation

A production dashboard usually has four rows:

  • RED: Requests per second, Error rate, Duration (p50/p95/p99) per service.
  • USE: Utilization, Saturation, Errors for CPU, memory, disk.
  • Business KPIs: orders/min, sign-ups/hour, revenue/day.
  • Dependencies: DB pool utilization, Kafka lag, downstream call latency.

Example service-overview JSON snippet:

json
{
  "title": "Orders Service",
  "panels": [
    {
      "title": "Requests/sec",
      "targets": [{ "expr": "sum(rate(http_server_requests_seconds_count{application=\"orders\"}[5m]))" }]
    },
    {
      "title": "Error rate",
      "targets": [{ "expr": "sum(rate(http_server_requests_seconds_count{application=\"orders\", status=~\"5..\"}[5m]))" }]
    },
    {
      "title": "p95 latency",
      "targets": [{ "expr": "histogram_quantile(0.95, sum by (le) (rate(http_server_requests_seconds_bucket{application=\"orders\"}[5m])))" }]
    }
  ]
}

Save the JSON in Git, mount it into Grafana from a ConfigMap through dashboard provisioning, and dashboards become code.

Step 7 — Alerting rules

Alert on symptoms (slow requests, high error rate) rather than causes (high CPU). A symptom-based alert maps directly to a user-visible problem.

```yaml
# rules/alerts.yml
groups:
  - name: orders-slo
    rules:
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_server_requests_seconds_count{application="orders", status=~"5.."}[5m]))
           /
           sum(rate(http_server_requests_seconds_count{application="orders"}[5m]))) > 0.02
        for: 10m
        labels: { severity: page }
        annotations:
          summary: "Orders service 5xx > 2% for 10m"
          runbook: "https://runbooks/orders/error-rate"

      - alert: SlowP95
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_server_requests_seconds_bucket{application="orders"}[5m]))) > 0.5
        for: 15m
        labels: { severity: ticket }
        annotations:
          summary: "Orders p95 > 500ms for 15m"
```

yaml
# alertmanager.yml
route:
  receiver: slack-default
  routes:
    - matchers: [severity="page"]
      receiver: pagerduty
receivers:
  - name: slack-default
    slack_configs:
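      - channel: "#alerts"
        # Placeholder: Alertmanager does not expand environment variables,
        # so substitute the real webhook URL at deploy time.
        api_url: "$SLACK_WEBHOOK"
  - name: pagerduty
    pagerduty_configs:
      - service_key: "$PD_KEY"   # placeholder, substituted at deploy time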

Step 8 — Kubernetes monitoring

In a cluster, replace static scrape configs with the kube-prometheus-stack Helm chart, which installs Prometheus, Grafana, AlertManager, node-exporter and kube-state-metrics in one shot.

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f values.yaml

A minimal values.yaml:

yaml
grafana:
  adminPassword: change-me
  defaultDashboardsTimezone: utc
  persistence: { enabled: true, size: 10Gi }
alertmanager:
  config:
    route: { receiver: slack }
    receivers:
      - name: slack
        slack_configs:
          - channel: "#alerts"
            api_url: "$SLACK_WEBHOOK"
prometheus:
  prometheusSpec:
    retention: 30d
    resources:
      requests: { cpu: "500m", memory: "2Gi" }

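If your Prometheus uses annotation-based Kubernetes service discovery, annotate the Spring Boot pods as shown here. Note that the Prometheus Operator installed by kube-prometheus-stack ignores these annotations by default and discovers targets through ServiceMonitor or PodMonitor resources instead.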

yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/actuator/prometheus"

For OpenTelemetry traces and logs, deploy the OTel Collector as a DaemonSet so every node ships telemetry locally.

Architecture

Figure: Kubernetes Monitoring Stack. App pods (orders, payments, users) send OTLP to the OpenTelemetry Collector (DaemonSet); Prometheus (StatefulSet, TSDB) scrapes /metrics; Grafana (ConfigMap dashboards) queries it with PromQL; AlertManager routes alert rules to Slack, PagerDuty and email.
Inside the cluster, application pods emit telemetry to the OpenTelemetry Collector. Prometheus scrapes metrics, Grafana visualizes them, and AlertManager dispatches alerts to on-call channels.

Configuration examples

A ServiceMonitor to scrape a service via the Prometheus Operator:

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: orders
  namespace: monitoring
  labels: { release: monitoring }
spec:
  namespaceSelector: { matchNames: [orders] }
  selector: { matchLabels: { app: orders } }
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 15s

The fast-burn rule of a multi-burn-rate SLO alert (recommended for any SLI; a complete setup pairs it with longer windows and a slow-burn rule):

yaml
- alert: SLOBurnFast
  expr: |
    (
      sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
      / sum(rate(http_server_requests_seconds_count[5m]))
    ) > (14.4 * 0.001)   # 14.4x burn rate on a 0.1% error budget (99.9% SLO)
  for: 5m
  labels: { severity: page }
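
The factor 14.4 in the rule above is not arbitrary. Assuming a 30-day SLO period and a policy of paging when 2% of the error budget is consumed within one hour (both are assumptions here, since the article does not state them), the arithmetic can be sketched in Java as follows. The class and method names are hypothetical.

```java
/** Worked arithmetic behind a burn-rate alert threshold; names are hypothetical. */
class BurnRate {
    /**
     * Burn rate = how many times faster than "exactly on budget" errors arrive.
     * budgetFraction: share of the error budget consumed within the window
     * windowHours:    length of the alert window
     * periodHours:    length of the SLO period (30 days = 720 hours)
     */
    static double burnRate(double budgetFraction, double windowHours, double periodHours) {
        return budgetFraction * periodHours / windowHours;
    }

    /** Error ratio at which an alert with the given burn rate should fire. */
    static double errorRatioThreshold(double burnRate, double sloTarget) {
        return burnRate * (1.0 - sloTarget);
    }

    public static void main(String[] args) {
        double rate = burnRate(0.02, 1.0, 30 * 24.0);        // 2% of budget in 1 hour
        System.out.println(rate);                             // approximately 14.4
        System.out.println(errorRatioThreshold(rate, 0.999)); // approximately 0.0144
    }
}
```

In other words, the rule fires when roughly 1.44% of requests fail, the error ratio that would exhaust a 99.9% SLO's monthly budget in about two days if sustained.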

Security considerations

  • Lock down /actuator endpoints. Expose only health, info and prometheus; secure with Spring Security or a network policy.
  • TLS between collectors and backends. OTLP over HTTPS, Prometheus remote-write over HTTPS.
  • Drop PII before export. Use OTel Collector processors (attributes/redact, filter) to strip emails, tokens, card numbers.
  • Authentication on Grafana. Wire SSO (OIDC/Google/GitHub). Disable anonymous access.
  • Retention and storage limits. Cap Prometheus retention; expire old Loki and Tempo data in object storage with lifecycle rules.
  • Multi-tenancy. In a shared cluster, use Mimir or per-tenant Prometheus to isolate teams.

Production best practices

  • Define SLOs first. Pick targets, for example 99.9% availability and 200 ms p95, *before* you build dashboards. Alerts derive from SLOs.
  • Use the RED and USE methods. Every service gets Rate, Errors, Duration; every resource gets Utilization, Saturation, Errors.
  • Standardize labels. application, environment, team, version on every metric.
  • Sample traces. 100% sampling at scale is unaffordable. 1–10% with tail-based sampling for errors is the sweet spot.
  • Treat dashboards and alerts as code. Store JSON in Git, deploy via Grafana provisioning.
  • Alert on burn rate, not raw error count. Multi-window multi-burn-rate alerts catch fast and slow regressions.
  • Run a chaos drill. Kill a pod, trip a circuit, fill a disk. Confirm the alert fires and the dashboard tells the story.

Common mistakes

  • Alerting on CPU. It is a cause, not a symptom. Users do not feel CPU.
  • Dashboards with 50 panels. Nobody can read them in an incident. Aim for ≤8 per screen.
  • Unbounded label cardinality. Adding userId as a label explodes Prometheus memory.
  • Logging structured data as a free-form string. Use JSON logs and let Loki parse fields.
  • One global trace sampler. Sample errors at 100%, normal traffic at 1%.
  • Ignoring exemplars. Exemplars link a metric data point to a trace — the single best feature for incident drill-down.
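
To put a number on the unbounded-cardinality mistake in the list above: the worst-case series count for one metric is the product of the distinct values of each label. The Java sketch below uses made-up label counts (20 URIs, 4 methods, 5 status codes, 100,000 user IDs) purely as an illustration; the class name is hypothetical.

```java
/** Back-of-the-envelope series count per metric; names and numbers are hypothetical. */
class SeriesCardinality {
    /** Worst case: every combination of label values occurs at least once. */
    static long worstCaseSeries(long... distinctValuesPerLabel) {
        long series = 1;
        for (long n : distinctValuesPerLabel) {
            series = Math.multiplyExact(series, n); // throws on overflow instead of wrapping
        }
        return series;
    }

    public static void main(String[] args) {
        long bounded = worstCaseSeries(20, 4, 5);             // uri, method, status
        long withUserId = worstCaseSeries(20, 4, 5, 100_000); // same, plus userId
        System.out.println(bounded + " series without userId");
        System.out.println(withUserId + " series with userId");
    }
}
```

For a histogram, multiply again by the number of buckets, since each bucket is its own series.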

Troubleshooting

  • /actuator/prometheus returns 404. The micrometer-registry-prometheus dependency is missing, or prometheus is not listed in management.endpoints.web.exposure.include.
  • Metrics have no labels for service. Set management.metrics.tags.application.
  • Trace IDs are missing in logs. Add %X{trace_id} to your Logback pattern and make sure the OTel instrumentation (starter or agent) populates the MDC.
  • Prometheus memory grows unbounded. Usually a label cardinality explosion, often raw request paths with embedded IDs used as a label value. Use the route template as the label value, or aggregate the label away before storing.
  • Grafana panel shows "No data". Wrong PromQL label selector or the metric simply has no data yet. Verify in Prometheus first.
  • Tempo says trace not found. Sampling probably discarded it. Raise the sampling ratio or enable tail-based sampling for errors.
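
The "trace not found" case follows directly from how ratio sampling works: the decision is derived from the trace-id itself, so every service reaches the same verdict and an unsampled trace is dropped everywhere. The Java sketch below illustrates a parent-based, trace-id-ratio decision in the spirit of the `parentbased_traceidratio` sampler configured earlier. It is an assumption-based illustration, not the SDK's exact algorithm, and all names are hypothetical.

```java
/** Illustrative parent-based ratio sampler; not the OpenTelemetry SDK's exact algorithm. */
class RatioSampler {
    /**
     * traceIdHex:    32 lowercase hex characters
     * parentSampled: the parent's decision, or null for a root span
     * ratio:         fraction of root traces to keep, in [0, 1]
     */
    static boolean shouldSample(String traceIdHex, Boolean parentSampled, double ratio) {
        if (parentSampled != null) {
            return parentSampled; // parent-based: follow the caller's decision
        }
        if (ratio >= 1.0) {
            return true;
        }
        if (ratio <= 0.0) {
            return false;
        }
        // Treat the low 8 bytes of the trace-id as a random number and clear
        // the sign bit so the value lies in [0, Long.MAX_VALUE].
        long randomPart = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        long upperBound = (long) (ratio * Long.MAX_VALUE);
        return randomPart < upperBound;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println(shouldSample(traceId, null, 0.1));
        System.out.println(shouldSample(traceId, null, 0.5));
    }
}
```

Because the decision is deterministic per trace-id, raising the ratio keeps a superset of the traces kept at the lower ratio.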

FAQ

Prometheus vs Datadog? Datadog is a managed product with broader integrations and a price tag that scales with hosts. Prometheus + Grafana is free to license (you still pay for the infrastructure and the time to run it), highly extensible, and the de facto open standard in Kubernetes shops.

Do I need OpenTelemetry if I already have Micrometer? Micrometer gives you metrics. OpenTelemetry adds traces, logs, and a vendor-neutral export protocol. They coexist — Micrometer can export via OTLP.

Loki vs ELK? Loki indexes only labels, not log content — cheaper to run, perfect when logs are correlated by labels (job, namespace, trace-id). ELK indexes everything — more powerful full-text search, more expensive.

Tempo vs Jaeger? Tempo stores traces in object storage, scales horizontally, integrates natively with Grafana, and is optimized for lookup by trace ID. Jaeger has a richer query API. Both are open source: Jaeger is a CNCF project, while Tempo is maintained by Grafana Labs.

How long should I retain metrics? 30 days hot in Prometheus, 1+ year in long-term storage such as Thanos, Cortex or Mimir (downsampled where the backend supports it).

How do I correlate logs and traces? Inject trace_id and span_id into your logger MDC and add them to the log line. Grafana auto-links from a log to the matching trace.

What is a good error budget? A 99.9% SLO gives 43 minutes of allowed downtime per month. Burn 25% in an hour → page. Burn 5% over six hours → ticket.
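
The 43-minute figure can be checked with a few lines of Java. This sketch assumes a 30-day month and treats the budget as minutes of complete outage; the class and method names are hypothetical.

```java
/** Error-budget arithmetic for an availability SLO; names are hypothetical. */
class ErrorBudget {
    /** Minutes of full outage allowed in the period for the given SLO target. */
    static double allowedDowntimeMinutes(double sloTarget, int periodDays) {
        double periodMinutes = periodDays * 24.0 * 60.0;
        return periodMinutes * (1.0 - sloTarget);
    }

    public static void main(String[] args) {
        double budget = allowedDowntimeMinutes(0.999, 30);
        System.out.printf("Budget: %.1f minutes%n", budget);
        System.out.printf("25%% of budget: %.2f minutes%n", budget * 0.25);
        System.out.printf("5%% of budget: %.2f minutes%n", budget * 0.05);
    }
}
```

With these assumptions, 25% of the budget is about 10.8 minutes of full outage and 5% is about 2.2 minutes, which shows how little headroom a 99.9% target leaves.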

Should I export from the app or via the OTel Collector? The Collector. It decouples export from your code, batches efficiently, applies redaction, and lets you change backends without redeploying services.

Do I need a separate Alertmanager per environment? One per Prometheus cluster is enough. Use routing trees and label-based silences to keep alerts targeted.

How much does a self-hosted stack cost? For a 50-service team: 2 vCPU and 8 GB for Prometheus, 1 vCPU and 4 GB for Grafana, 2 vCPU and 4 GB for the OTel Collector. Add object storage for long-term retention. Cheap compared to any SaaS at this scale.

Key takeaways

  • Observability stands on three pillars — metrics, logs, traces — and you need all three.
  • OpenTelemetry is the vendor-neutral standard for collecting and shipping telemetry.
  • Prometheus + Grafana give you metrics and dashboards for free, on commodity hardware.
  • Define SLOs first, alert on symptoms, and alert on burn rate, not raw counts.
  • Treat dashboards, alerts and Collector configs as code in Git.
  • In Kubernetes, the kube-prometheus-stack chart gets you to production in one command.

Conclusion

Observability is what separates a team that *responds* to incidents from a team that *prevents* them. With Prometheus, Grafana and OpenTelemetry, you can ship a production-grade telemetry stack in a few hundred lines of YAML and a handful of Spring Boot dependencies. Instrument early, define SLOs deliberately, and your microservices stop being a black box — and start being a system you understand.

Architecture

Figure: Microservices Architecture. Web and mobile clients call an API Gateway (routing, auth) over REST; it fronts the Users, Orders and Billing services, which use PostgreSQL databases (Users DB, Orders DB), publish and subscribe on a Kafka event bus, and call external APIs (Stripe payments, SES / SendGrid email).
An API gateway routes traffic to independent services. Each service owns its data and communicates via REST or async events.

Avoid these

Common mistakes

  • 1. Copy-pasting code without understanding the trade-offs

    It's tempting to ship a snippet from a blog post into production, but DevOps & CI/CD patterns only work when the failure modes are understood. Always reason about timeouts, retries, and consistency.

  • 2. Skipping observability from day one

    Structured logs, metrics, and traces are not optional. Wire them in before you ship — debugging DevOps & CI/CD systems without them is painful and expensive.

  • 3. Optimizing too early

    Premature caching, sharding, or microservice extraction adds operational cost. Validate the bottleneck with real measurements first.

  • 4. Ignoring security defaults

    Secrets in env files, open management ports, missing RBAC — these are the most common production incidents. Treat security as part of the definition of done.

Ship it safely

Production best practices

Apply these before promoting Observability in Microservices — Prometheus, Grafana and OpenTelemetry to a real production environment.

Scalability

Design DevOps & CI/CD services to scale horizontally. Keep request handlers stateless, push session and cache state to external stores (Redis, the database), and benchmark p95/p99 latency under realistic load before tuning.

Monitoring & Observability

Emit metrics (RED/USE), structured JSON logs, and distributed traces from day one. Wire dashboards and alerts to SLOs you actually care about — error rate, latency, saturation — not vanity metrics.

Logging

Log with correlation IDs, never log secrets or PII, and centralize logs (ELK, Loki, CloudWatch). Use levels deliberately: INFO for state changes, WARN for recoverable issues, ERROR for incidents.

Security

Apply least-privilege IAM, rotate secrets through a vault, validate every input, and patch dependencies on a schedule. For HTTP services, enable TLS everywhere and set sensible security headers.

Testing

Layer unit, integration, and contract tests. Run them in CI on every PR, and add smoke tests post-deploy. For DevOps & CI/CD systems, also run chaos and load tests before a major release.

Reliability & Rollouts

Ship with health checks, readiness probes, graceful shutdown, and a rollback strategy. Prefer canary or blue/green deploys over big-bang releases.

Questions

Frequently asked questions

Is this tutorial up to date?

Yes. This tutorial was last reviewed and updated on June 2, 2026. We revisit popular DevOps & CI/CD tutorials regularly to keep them aligned with current best practices.

What level is this tutorial aimed at?

It is written for working developers with some backend experience. Beginners can still follow along, and senior engineers will find production-grade patterns and trade-off discussions.

Do I need to follow every step in order?

The walkthrough is sequential because each step depends on the previous one. If you only need a specific concept, the table of contents at the top of the article lets you jump straight to that section.

Where can I find the source code?

The full source code is available on GitHub: https://github.com/masterlabsystems/observability-stack-demo. Fork it, run it locally, and adapt it to your own project.
