Designing Event-Driven Microservices with Kafka and Spring Boot
A complete production guide to event-driven microservices with Kafka and Spring Boot — producers, consumers, topics, partitions, consumer groups, retry strategies, schema evolution and operational best practices.
Introduction
In a synchronous microservices architecture, every interaction is a network call: order calls inventory, inventory calls warehouse, warehouse calls notification. One slow downstream service slows the entire chain; one failure cascades through the system. The fix is to invert the dependency direction with events.
Event-driven microservices communicate by publishing and subscribing to immutable domain events through a broker like Apache Kafka. Producers don't know who consumes; consumers don't know who produces. Each service moves at its own pace, fails independently, and scales independently. This tutorial is a complete production walkthrough of building event-driven microservices with Kafka and Spring Boot — topics, partitions, consumer groups, retries, schema evolution and the operational practices that keep the system healthy.
Why event-driven
Three benefits compound as the system grows:
- Decoupling. Adding a new consumer requires zero changes to the producer. A new analytics service can subscribe to orders.events without anyone shipping new code in the order service.
- Resilience. If the payment service is down for two minutes, events queue in Kafka. When it recovers, it catches up. Synchronous calls would have failed and lost data.
- Scalability. Kafka partitions let each consumer parallelize trivially. A single topic can absorb millions of events per second.
The trade-off: eventual consistency, harder debugging (no stack trace across services), and a broker to operate.
[Figure: Event-Driven Microservices Architecture]
Real-world use cases
- E-commerce checkout — order, inventory, payment, shipping and notification services coordinate via events.
- Real-time analytics — every interesting event is logged to Kafka and consumed by streaming jobs.
- CDC pipelines — Debezium streams database changes to Kafka, fanning out to warehouses and search indexes.
- IoT telemetry — millions of devices write to Kafka; consumers process, aggregate and alert.
- User activity tracking — clickstreams flow to Kafka and are projected into recommendation systems.
Kafka fundamentals
Three concepts run everything:
- Topic — a named, append-only log. Producers write to it; consumers read from it.
- Partition — each topic is split into N partitions. Order is guaranteed *within* a partition, not across.
- Consumer group — a set of consumers that share work on a topic. Each partition is consumed by exactly one member of the group at a time.
If you have 6 partitions and 3 consumers in a group, each consumer handles 2 partitions. Add a fourth consumer — Kafka rebalances. Add a seventh — it sits idle (no partition to assign).
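The arithmetic above can be checked with a small simulation. This is a minimal sketch that deals partitions out round-robin; Kafka's real assignors (range, cooperative sticky and others) may hand out different partition numbers, but for a single topic the per-consumer counts in this example come out the same. The class and method names are illustrative and not part of any Kafka API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionAssignmentSketch {

    /** Deals partitions 0..partitions-1 out to the consumers one at a time, in order. */
    static Map<String, List<Integer>> assign(int partitions, List<String> consumers) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("A consumer group needs at least one member");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < partitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }
}
```

Calling `assign(6, List.of("c1", "c2", "c3"))` gives every consumer two partitions, while a group of seven consumers leaves the seventh with an empty list.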
[Figure: Kafka Topics, Partitions & Consumer Groups]
Step 1 — Set up Spring Boot + Kafka
```xml
<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
</dependency>
```
```yaml
spring:
  kafka:
    bootstrap-servers: kafka:9092
    producer:
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
      acks: all
      properties:
        enable.idempotence: true
        max.in.flight.requests.per.connection: 5
    consumer:
      group-id: orders
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.springframework.kafka.support.serializer.JsonDeserializer
      properties:
        spring.json.trusted.packages: "com.example.events"
      enable-auto-commit: false
      auto-offset-reset: earliest
    listener:
      ack-mode: manual
```
Two settings matter most:
- acks=all + enable.idempotence=true — durable, deduped producer.
- enable-auto-commit: false + ack-mode: manual — the consumer only commits the offset after successful processing.
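For readers configuring clients outside Spring Boot, the same settings map onto native Kafka client property names. The sketch below only builds plain `java.util.Properties` objects so it runs without a Kafka dependency; the class and method names are illustrative. The listener `ack-mode` has no counterpart here because it is a Spring Kafka container setting, not a Kafka client property.

```java
import java.util.Properties;

final class KafkaClientSettingsSketch {

    /** Producer durability settings, using native Kafka property names. */
    static Properties producerSettings() {
        Properties props = new Properties();
        props.setProperty("acks", "all");                                  // wait for all in-sync replicas
        props.setProperty("enable.idempotence", "true");                   // broker drops retried duplicates
        props.setProperty("max.in.flight.requests.per.connection", "5");   // upper limit allowed with idempotence
        return props;
    }

    /** Consumer settings that leave offset commits under the application's control. */
    static Properties consumerSettings() {
        Properties props = new Properties();
        props.setProperty("group.id", "orders");
        props.setProperty("enable.auto.commit", "false");   // commit only after successful processing
        props.setProperty("auto.offset.reset", "earliest"); // new groups start from the oldest record
        return props;
    }
}
```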
Step 2 — Define events
Events are immutable, versioned and broker-friendly.
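```java
import java.math.BigDecimal;
import java.util.UUID;

// Each public type lives in its own file; OrderPaid and OrderShipped are declared like OrderPlaced.
public sealed interface OrderEvent permits OrderPlaced, OrderPaid, OrderShipped {}

public record OrderPlaced(
        UUID eventId,        // unique per event, used for consumer-side dedupe
        UUID orderId,
        UUID customerId,
        BigDecimal total,
        int schemaVersion    // bumped on breaking schema changes
) implements OrderEvent {}
```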
Include eventId for consumer-side dedupe and schemaVersion from day one.
Step 3 — Producer
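```java
import lombok.RequiredArgsConstructor;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class OrderEventPublisher {

    private final KafkaTemplate<String, OrderEvent> kafka;

    public void publish(OrderEvent event) {
        // Key by aggregate ID so every event for one order lands on the same partition.
        String key = switch (event) {
            case OrderPlaced e -> e.orderId().toString();
            case OrderPaid e -> e.orderId().toString();
            case OrderShipped e -> e.orderId().toString();
        };
        // send() is asynchronous; inspect the returned future to detect delivery failures.
        kafka.send("orders.events", key, event);
    }
}
```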
Setting the Kafka key to the aggregate ID ensures all events for one order go to the same partition — and therefore arrive in order.
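Why a stable key implies a stable partition can be shown in a few lines. This is a sketch only: Kafka's default partitioner hashes the serialized key with murmur2, whereas the stand-in below uses `Arrays.hashCode`, so the actual partition numbers will differ from a real cluster. The property being illustrated (same key and same partition count always give the same partition) is the same. The names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class KeyPartitioningSketch {

    /** Maps a record key to a partition index in the range 0..numPartitions-1. */
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // Stand-in hash; Kafka's default partitioner uses murmur2 on the key bytes.
        int hash = Arrays.hashCode(keyBytes);
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(hash, numPartitions);
    }
}
```

The sketch also shows why changing the partition count later is painful: the modulus changes, so existing keys can move to a different partition.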
For at-least-once + atomic write semantics, combine this with the Outbox pattern instead of publishing directly.
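The idea behind the Outbox pattern can be sketched in memory, under the assumption that the two collections below stand in for an orders table and an outbox table in the same database. A real implementation would write both rows in one database transaction and run the relay as a scheduled job or a CDC connector; every name here is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

final class OutboxSketch {

    record OutboxRow(String key, String payload) {}

    private final List<String> orders = new ArrayList<>();       // stands in for the orders table
    private final Deque<OutboxRow> outbox = new ArrayDeque<>();   // stands in for the outbox table

    /** Business write and outbox write happen together, as they would inside one DB transaction. */
    synchronized void placeOrder(String orderId) {
        orders.add(orderId);
        outbox.addLast(new OutboxRow(orderId, "OrderPlaced:" + orderId));
    }

    /** Publishes the oldest rows first; a row is removed only after the publisher reports success. */
    synchronized int relay(Predicate<OutboxRow> publisher) {
        int published = 0;
        while (!outbox.isEmpty()) {
            if (!publisher.test(outbox.peekFirst())) {
                break; // leave the row in place and retry on the next run
            }
            outbox.removeFirst();
            published++;
        }
        return published;
    }

    synchronized int pending() {
        return outbox.size();
    }
}
```

If the publisher fails, the rows stay in the outbox and are sent on a later run, which gives at-least-once delivery; a crash between a successful publish and the row removal produces a duplicate, which is why consumers still need to dedupe.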
Step 4 — Consumer
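```java
import java.time.Instant;
import java.util.UUID;

import lombok.RequiredArgsConstructor;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Component
@RequiredArgsConstructor
public class InventoryListener {

    private final InventoryService inventory;
    private final ProcessedEventRepository processed;

    @KafkaListener(topics = "orders.events", groupId = "inventory", concurrency = "3")
    @Transactional
    public void on(OrderEvent event, Acknowledgment ack) {
        UUID eventId = eventId(event);
        // Duplicate delivery: the event was already handled, so just acknowledge it.
        if (processed.existsById(eventId)) {
            ack.acknowledge();
            return;
        }
        if (event instanceof OrderPlaced p) {
            inventory.reserve(p.orderId(), p.total());
        }
        // Record the event ID within this method's transaction.
        processed.save(new ProcessedEvent(eventId, Instant.now()));
        ack.acknowledge();
    }

    // Assumes every event type exposes eventId(), as OrderPlaced does.
    private static UUID eventId(OrderEvent event) {
        return switch (event) {
            case OrderPlaced e -> e.eventId();
            case OrderPaid e -> e.eventId();
            case OrderShipped e -> e.eventId();
        };
    }
}
```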
concurrency = "3" starts three consumer threads on this instance. With 6 partitions and 2 instances of 3 threads, each thread owns 1 partition.
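The dedupe check in the listener above can be exercised without a broker. In this minimal sketch a `HashSet` stands in for ProcessedEventRepository and a counter stands in for the inventory reservation; both substitutions, and the class name, are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

final class IdempotentConsumerSketch {

    private final Set<UUID> processed = new HashSet<>(); // stands in for ProcessedEventRepository
    private int reservations = 0;                        // stands in for inventory.reserve(...)

    /** Returns true if the event was applied, false if it was a duplicate and skipped. */
    boolean on(UUID eventId) {
        if (!processed.add(eventId)) {
            return false; // already handled: acknowledge and move on
        }
        reservations++;
        return true;
    }

    int reservations() {
        return reservations;
    }
}
```

Delivering the same event ID twice changes state only once, which is the behaviour at-least-once delivery requires from a handler.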
[Figure: Event-Driven Order Processing Workflow]
Step 5 — Retries and dead-letter topics
A consumer should retry transient failures and quarantine poison messages. Spring Kafka has first-class support.
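```java
import org.apache.kafka.common.TopicPartition;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.kafka.support.ExponentialBackOffWithMaxRetries;

@Configuration
public class KafkaErrorHandlingConfig { // class name is illustrative

    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> template) {
        // Failed records go to <topic>.DLT on the same partition number,
        // so the DLT needs at least as many partitions as the source topic.
        var recoverer = new DeadLetterPublishingRecoverer(template,
                (record, ex) -> new TopicPartition(record.topic() + ".DLT", record.partition()));

        // Up to 5 retries: 500 ms initial delay, doubling each time, capped at 10 s.
        var backoff = new ExponentialBackOffWithMaxRetries(5);
        backoff.setInitialInterval(500);
        backoff.setMultiplier(2.0);
        backoff.setMaxInterval(10_000);

        var handler = new DefaultErrorHandler(recoverer, backoff);
        // Failures that can never succeed on retry skip straight to the DLT.
        handler.addNotRetryableExceptions(IllegalArgumentException.class);
        return handler;
    }
}
```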
Transient errors retry with exponential backoff; records that exhaust their retries or throw a non-retryable exception land in orders.events.DLT for manual inspection. The DLT is a Kafka topic like any other, so you can build a simple admin UI that reads it.
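To see what the backoff settings above mean in practice, the delays can be computed by hand. The sketch below is a plain re-implementation of capped exponential backoff, not Spring's own class, so treat the numbers as the expected schedule rather than a guarantee; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

final class BackoffScheduleSketch {

    /** Returns the delay in milliseconds before each retry attempt. */
    static List<Long> delays(long initialMs, double multiplier, long maxMs, int maxRetries) {
        List<Long> delays = new ArrayList<>();
        long next = Math.min(initialMs, maxMs);
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            delays.add(next);
            // Grow the delay, but never beyond the configured ceiling.
            next = Math.min((long) (next * multiplier), maxMs);
        }
        return delays;
    }
}
```

With an initial interval of 500 ms, a multiplier of 2.0, a 10-second cap and 5 retries, the delays are 500, 1000, 2000, 4000 and 8000 ms, so a record that keeps failing reaches the DLT after roughly 15.5 seconds of waiting plus processing time.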
Step 6 — Schema evolution
Events live forever in Kafka. Plan for change.
- Add fields, never remove. Old consumers ignore unknown fields.
- Make new fields nullable or defaulted. Old producers won't set them.
- Use a schema registry (Confluent or Apicurio) to enforce compatibility at publish time.
- Bump schemaVersion for breaking changes. Consumers can route to old vs new handlers, or run a one-off projection to upgrade older events.
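Routing on schemaVersion might look like the following sketch. It works on a generic map rather than typed events to stay self-contained, and the `currency` field is a hypothetical version 2 addition invented for this example; neither it nor the class name comes from the tutorial's event model.

```java
import java.util.Map;

final class VersionRoutingSketch {

    /** Dispatches a decoded event payload to the handler for its schema version. */
    static String handle(Map<String, Object> payload) {
        // Fall back to version 1 if the field is missing.
        int version = ((Number) payload.getOrDefault("schemaVersion", 1)).intValue();
        return switch (version) {
            case 1 -> "v1 order " + payload.get("orderId");
            // "currency" is a hypothetical field added in version 2; default it when absent.
            case 2 -> "v2 order " + payload.get("orderId")
                    + " in " + payload.getOrDefault("currency", "USD");
            // Unknown versions fail loudly so the record can be routed to the DLT.
            default -> throw new IllegalArgumentException("Unsupported schemaVersion: " + version);
        };
    }
}
```

Throwing IllegalArgumentException for an unknown version fits the error handler from Step 5, which treats that exception as non-retryable.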
Production best practices
- Partition for parallelism. Pick a partition count that supports peak throughput; doubling later is operationally painful.
- Set replication factor ≥3. Anything less is not durable enough for production.
- Monitor consumer lag. kafka_consumergroup_lag is the single most important metric. Alert when it grows.
- Use idempotent producers and consumers. At-least-once means you *will* see duplicates.
- Tune max.poll.records and max.poll.interval.ms. A slow handler can be kicked out of the group; size batches to fit the SLO.
- Compact topics for state snapshots. Use cleanup.policy=compact on topics that represent the latest value per key (user profiles, configs).
- Separate hot and cold topics. Don't mix high-throughput firehoses with low-volume control events.
- Run a chaos test. Kill consumers, isolate brokers, watch the system rebalance and recover.
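Consumer lag, the metric singled out above, is the gap between the newest offset in each partition and the offset the group has committed. The sketch below computes it from two plain maps keyed by partition number; in a real system those values would come from the broker or a metrics exporter. Treating a partition with no committed offset as offset 0 is an assumption of this example, and the names are illustrative.

```java
import java.util.Map;

final class ConsumerLagSketch {

    /** Sums (log end offset - committed offset) over every partition of a topic. */
    static long totalLag(Map<Integer, Long> endOffsets, Map<Integer, Long> committedOffsets) {
        long lag = 0;
        for (Map.Entry<Integer, Long> entry : endOffsets.entrySet()) {
            // Assumption: a partition with no commit yet counts from offset 0.
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            lag += Math.max(0L, entry.getValue() - committed);
        }
        return lag;
    }
}
```

A lag that stays flat or drains is healthy; the alert condition described above is a lag that keeps growing over time.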
Security considerations
- Enable TLS on the broker; never run plaintext in production.
- Use SASL (SCRAM or OAuth) for client auth. mTLS is also fine if you have the PKI for it.
- Apply Kafka ACLs per topic and consumer group; least privilege from day one.
- Encrypt PII in event payloads. Anyone who operates the brokers can read what they store, so design with that assumption.
- Sign event payloads (for example with an HMAC or a digital signature) when events cross organizational boundaries, so consumers can verify origin and integrity.
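Encrypting PII per user, as recommended above, also enables deletion by discarding the key. The sketch below uses AES-GCM from the JDK with one key per user held in an in-memory map; a real system would keep keys in a KMS or vault. The class, method names and key-storage approach are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

final class PiiCryptoSketch {

    private static final int IV_BYTES = 12;
    private static final int TAG_BITS = 128;

    private final Map<String, SecretKey> keysByUser = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();

    /** Encrypts a PII value with the user's key; the output is IV followed by ciphertext. */
    byte[] encrypt(String userId, String pii) {
        try {
            SecretKey key = keysByUser.computeIfAbsent(userId, id -> newKey());
            byte[] iv = new byte[IV_BYTES];
            random.nextBytes(iv); // fresh IV for every message
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ciphertext = cipher.doFinal(pii.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.allocate(iv.length + ciphertext.length).put(iv).put(ciphertext).array();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Encryption failed", e);
        }
    }

    /** Decrypts a message produced by encrypt(); fails once the user's key has been removed. */
    String decrypt(String userId, byte[] message) {
        SecretKey key = keysByUser.get(userId);
        if (key == null) {
            throw new IllegalStateException("No key for user " + userId);
        }
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, message, 0, IV_BYTES));
            byte[] plain = cipher.doFinal(message, IV_BYTES, message.length - IV_BYTES);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Decryption failed", e);
        }
    }

    /** Discarding the key makes every stored ciphertext for this user unreadable. */
    void forget(String userId) {
        keysByUser.remove(userId);
    }

    private static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES is unavailable", e);
        }
    }
}
```

The events themselves stay in Kafka untouched; only the key is deleted, which suits an append-only log where individual records cannot be edited.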
Common mistakes
1. No idempotency. Treating Kafka as exactly-once end to end. It is not: delivery to your handler is at-least-once, so handlers must tolerate duplicates.
2. Synchronous workflow disguised as events. Service A publishes, waits, service B publishes back. That is RPC with extra steps and worse latency.
3. Too few partitions. Throughput is capped at one consumer per partition — you cannot scale past that.
4. Wrong key. Random keys break per-aggregate ordering; constant keys overload one partition.
5. No DLT. A poison message blocks the partition for everyone behind it.
6. Schema evolution without a registry. A breaking change quietly takes down every consumer.
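7. Long-running handlers. max.poll.interval.ms (5 minutes by default) covers the whole batch returned by one poll, so a handler that takes 60 seconds per record exceeds it after only a handful of records and triggers constant rebalances.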
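The poll-interval limit in mistake 7 can be turned into a sizing rule: the batch returned by one poll must finish before the interval expires. The sketch below derives a max.poll.records value from a worst-case per-record time; the safety factor and all names are assumptions chosen for illustration.

```java
final class PollBudgetSketch {

    /**
     * Largest batch that finishes within a fraction (safetyFactor) of max.poll.interval.ms,
     * given the worst-case handling time of a single record.
     */
    static int maxPollRecords(long maxPollIntervalMs, long worstCaseMsPerRecord, double safetyFactor) {
        long budgetMs = (long) (maxPollIntervalMs * safetyFactor);
        // Never return less than 1: the consumer has to make progress.
        return (int) Math.max(1L, budgetMs / worstCaseMsPerRecord);
    }
}
```

With the default interval of 300000 ms and a 50% safety margin, a handler that needs 60 seconds per record can take at most 2 records per poll, while a 100 ms handler can take 1500.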
Troubleshooting guide
- Consumer lag growing. Add consumer instances up to the partition count; profile the handler; check downstream dependencies.
- Frequent rebalances. Reduce handler work per poll, increase max.poll.interval.ms, or move blocking I/O off the consumer thread.
- Out-of-order events. Verify producer key matches the entity ID and there is no manual partition assignment.
- DLT growing. Inspect a few messages, fix the bug, replay from the DLT back to the main topic.
- Mysterious duplicates. Confirm enable.idempotence=true on the producer and consumer-side dedupe.
- Broker disk filling. Tune retention (retention.ms, retention.bytes); enable compaction where appropriate.
FAQ
1. Should every microservice be event-driven? No. Use sync calls for low-latency, single-result reads. Use events for state changes, fan-out and decoupling.
2. Kafka vs RabbitMQ? Kafka for high-throughput durable streams and replay; RabbitMQ for flexible routing and RPC-style messaging. Pick by use case, not vibes.
3. How do I trigger immediate downstream action? Publish the event and have the downstream consumer act on it. Real-time means consumer lag in milliseconds, not synchronous calls.
4. How do I run sagas? Use the choreography pattern (each step listens for the previous event) or orchestration (a saga coordinator publishes and listens). Both work over Kafka.
5. What about exactly-once? Kafka transactions provide exactly-once within Kafka (read-process-write). Across heterogeneous systems, idempotent at-least-once is the realistic target.
6. Where do I store consumer state? In the consumer's own database. Don't use Kafka as a database.
7. How many partitions do I need? Estimate peak throughput per partition (rule of thumb: 10–25 MB/s per partition for typical hardware) and add headroom for future consumers.
8. How do I version event schemas? Backward-compatible additions; major bumps for breaking changes; enforce via a schema registry. See the schema evolution section.
9. How do I handle PII deletion (GDPR)? Either tombstone the key in a compacted topic, or encrypt PII per-user with a key you can later revoke (crypto-shredding).
10. Can I use Kafka without microservices? Absolutely. Modular monoliths benefit from events too — they make boundaries explicit even within one deployable.
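The partition-count estimate from question 7 reduces to one line of arithmetic. The sketch below applies it; the per-partition throughput you plug in should come from your own benchmarks, and the headroom factor is an assumption of this example rather than a Kafka setting.

```java
final class PartitionCountSketch {

    /** Partitions needed for the peak load, scaled by a headroom factor and rounded up. */
    static int estimate(double peakMBps, double perPartitionMBps, double headroomFactor) {
        return (int) Math.ceil(peakMBps / perPartitionMBps * headroomFactor);
    }
}
```

For a 100 MB/s peak at 10 MB/s per partition with 50% headroom, `estimate(100, 10, 1.5)` returns 15.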
Key takeaways
- Event-driven microservices decouple services in time and dependency, trading instant consistency for resilience and scale.
- Partition by aggregate key, run consumers in groups, and treat at-least-once delivery as a first-class requirement.
- Pair every state change that publishes an event with the Outbox pattern to avoid dual-write bugs.
- Plan for schema evolution from day one; use a registry to enforce it.
- Monitor consumer lag, configure DLTs, and run chaos tests — the broker is the central nervous system of the platform.
Related tutorials
- The Outbox Pattern — Reliable Event Publishing in Microservices
- CQRS Pattern in Spring Boot — Separating Reads and Writes for Scale
- Spring Boot Kafka Tutorial
- Spring Boot Microservices Architecture Explained Step by Step
- Circuit Breaker with Resilience4j and Spring Boot
- Blue-Green Deployments with Kubernetes — Zero Downtime Releases
[Figure: Event-Driven Architecture]
Common mistakes
1. Copy-pasting code without understanding the trade-offs
It's tempting to ship a snippet from a blog post into production, but microservices patterns only work when the failure modes are understood. Always reason about timeouts, retries, and consistency.
2. Skipping observability from day one
Structured logs, metrics, and traces are not optional. Wire them in before you ship; debugging microservices systems without them is painful and expensive.
3. Optimizing too early
Premature caching, sharding, or microservice extraction adds operational cost. Validate the bottleneck with real measurements first.
4. Ignoring security defaults
Secrets in env files, open management ports, missing RBAC — these are the most common production incidents. Treat security as part of the definition of done.
Production best practices
Apply these before promoting the patterns in this tutorial to a real production environment.
Scalability
Design microservices to scale horizontally. Keep request handlers stateless, push session and cache state to external stores (Redis, the database), and benchmark p95/p99 latency under realistic load before tuning.
Monitoring & Observability
Emit metrics (RED/USE), structured JSON logs, and distributed traces from day one. Wire dashboards and alerts to SLOs you actually care about — error rate, latency, saturation — not vanity metrics.
Logging
Log with correlation IDs, never log secrets or PII, and centralize logs (ELK, Loki, CloudWatch). Use levels deliberately: INFO for state changes, WARN for recoverable issues, ERROR for incidents.
Security
Apply least-privilege IAM, rotate secrets through a vault, validate every input, and patch dependencies on a schedule. For HTTP services, enable TLS everywhere and set sensible security headers.
Testing
Layer unit, integration, and contract tests. Run them in CI on every PR, and add smoke tests post-deploy. For microservices systems, also run chaos and load tests before a major release.
Reliability & Rollouts
Ship with health checks, readiness probes, graceful shutdown, and a rollback strategy. Prefer canary or blue/green deploys over big-bang releases.
Frequently asked questions
Is this tutorial up to date?
Yes. This tutorial was last reviewed and updated on June 3, 2026. We revisit popular microservices tutorials regularly to keep them aligned with current best practices.
What level is this tutorial aimed at?
It is written for working developers with some backend experience. Beginners can still follow along, and senior engineers will find production-grade patterns and trade-off discussions.
Do I need to follow every step in order?
The walkthrough is sequential because each step depends on the previous one. If you only need a specific concept, the table of contents at the top of the article lets you jump straight to that section.
Where can I find the source code?
The full source code is available on GitHub: https://github.com/masterlabsystems/event-driven-kafka-demo. Fork it, run it locally, and adapt it to your own project.