Build Production-Ready Event-Driven Microservices with Go, NATS JetStream, and OpenTelemetry

Learn to build production-ready event-driven microservices with Go, NATS JetStream & OpenTelemetry. Complete guide with code examples, best practices & deployment tips.

I’ve spent the last few years building microservices that needed to handle millions of events daily, and I kept hitting the same walls—message loss during failures, opaque debugging sessions, and serialization bottlenecks. That’s why I decided to write about combining Go, NATS JetStream, and OpenTelemetry. This stack transformed how I approach event-driven systems, making them resilient and observable from day one. If you’re tired of midnight debugging calls, stick around—this might change your production game too.

Event-driven architectures shift how services communicate. Instead of direct API calls, services emit events when something meaningful happens. Other services listen and react independently. This loose coupling lets teams deploy faster and scale components separately. But what happens when a service goes down mid-event processing? That’s where persistence becomes critical.

NATS JetStream adds durable messaging to NATS’s high-performance core. It ensures messages survive restarts and network blips. Setting up a stream is straightforward. You define subjects, retention policies, and replication. Here’s how I configure a production stream in Go:

streamConfig := &nats.StreamConfig{
    Name:      "EVENTS",
    Subjects:  []string{"events.>"},
    Retention: nats.LimitsPolicy,
    MaxAge:    24 * time.Hour,
    Replicas:  3,
}
_, err := js.AddStream(streamConfig)
if err != nil {
    return fmt.Errorf("stream setup failed: %w", err)
}

This creates a stream that keeps events for 24 hours across three replicas. Why might you choose a limits policy over interest-based retention? It depends on whether you need all events stored or only those with active consumers.
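For contrast, here is a sketch of the interest-based alternative. The stream name and subjects are illustrative, not from a real system: with an interest policy, a message is discarded as soon as every consumer bound to the stream has acknowledged it, so nothing piles up on subjects nobody listens to.

```go
// Hypothetical stream shown for comparison with the limits policy above.
notifyConfig := &nats.StreamConfig{
    Name:      "NOTIFICATIONS",             // illustrative name
    Subjects:  []string{"notifications.>"},
    Retention: nats.InterestPolicy,         // discard once all consumers ack
    Replicas:  3,
}
_, err := js.AddStream(notifyConfig)
if err != nil {
    return fmt.Errorf("stream setup failed: %w", err)
}
```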

Protocol Buffers define your event contracts. They’re faster and more compact than JSON. I define events with clear versioning from the start:

message UserRegisteredEvent {
    EventMetadata metadata = 1;
    string user_id = 2;
    string email = 3;
    int32 schema_version = 4;
}

Generating Go code from this ensures type safety across services. How do you handle schema changes without breaking consumers? I add a version field and avoid removing fields—new services ignore what they don’t need.
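That tolerance rule can be sketched with a plain struct standing in for the protoc-generated type (field and function names here are illustrative, not the generated API): the handler reads only fields it understands and never fails on a version it hasn't seen.

```go
package main

import "fmt"

// UserRegisteredEvent is a simplified stand-in for the protoc-generated type.
type UserRegisteredEvent struct {
	UserID        string
	Email         string
	SchemaVersion int32
}

// handleUserRegistered tolerates newer schema versions: unknown fields are
// simply ignored, and a higher version number never causes a failure.
func handleUserRegistered(ev UserRegisteredEvent) string {
	switch {
	case ev.SchemaVersion >= 2:
		// v2+ events may carry extra fields; we still use only what we know.
		return fmt.Sprintf("welcome email to %s (schema v%d)", ev.Email, ev.SchemaVersion)
	default:
		return fmt.Sprintf("welcome email to %s (schema v1)", ev.Email)
	}
}

func main() {
	fmt.Println(handleUserRegistered(UserRegisteredEvent{UserID: "u1", Email: "a@b.co", SchemaVersion: 3}))
}
```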

Publishing events requires careful error handling. I use retries with exponential backoff and dead-letter queues for problematic messages:

func (p *EventPublisher) Publish(ctx context.Context, event []byte) error {
    // Start a span here rather than ending the caller's span.
    ctx, span := tracer.Start(ctx, "publish-event")
    defer span.End()

    for i := 0; i < maxRetries; i++ {
        ack, err := p.js.PublishAsync("events.user.registered", event)
        if err == nil {
            select {
            case <-ack.Ok():
                return nil // server acknowledged the write
            case <-ack.Err():
                // async publish failed; fall through to backoff and retry
            case <-time.After(5 * time.Second):
                // no ack in time; fall through to backoff and retry
            }
        }
        time.Sleep(time.Duration(math.Pow(2, float64(i))) * time.Second) // 1s, 2s, 4s...
    }
    return p.sendToDLQ(ctx, event)
}

This approach catches most transient failures. But what about guaranteed ordering? JetStream supports ordered consumers when you need it.

Consuming events demands equal attention. I use pull subscribers with acknowledgment modes:

sub, err := js.PullSubscribe("events.user.>", "notification-service")
if err != nil {
    log.Fatal().Err(err).Msg("subscribe failed")
}
for {
    msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
    if err != nil && err != nats.ErrTimeout {
        log.Error().Err(err).Msg("fetch failed")
        continue
    }
    for _, msg := range msgs {
        if process(msg) {
            msg.Ack()
        } else {
            msg.Nak() // Triggers redelivery
        }
    }
}

How do you prevent the same event from being processed twice? I include a unique event ID and check it against a processed events cache.
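A minimal sketch of that cache, in plain Go (JetStream can also deduplicate on the publish side via the Nats-Msg-Id header within its duplicates window; a production consumer-side cache would bound memory with a TTL or use Redis — the type below is illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// Dedup remembers processed event IDs so redeliveries can be acked and skipped.
type Dedup struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewDedup() *Dedup { return &Dedup{seen: make(map[string]struct{})} }

// FirstTime reports whether this event ID is new, marking it as seen.
func (d *Dedup) FirstTime(eventID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.seen[eventID]; ok {
		return false
	}
	d.seen[eventID] = struct{}{}
	return true
}

func main() {
	d := NewDedup()
	fmt.Println(d.FirstTime("evt-123")) // first delivery: process it
	fmt.Println(d.FirstTime("evt-123")) // redelivery: ack and skip
}
```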

Observability separates working systems from debuggable ones. OpenTelemetry provides distributed tracing across service boundaries. I instrument both publishing and consuming:

func processEvent(ctx context.Context, msg *nats.Msg) {
    // ctx should already carry the trace context extracted from msg.Header,
    // so this span links back to the producer's trace.
    ctx, span := tracer.Start(ctx, "process-event")
    defer span.End()

    span.SetAttributes(
        attribute.String("event.type", "user_registered"),
        attribute.String("messaging.message.id", msg.Header.Get("Nats-Msg-Id")),
    )
    // Process event...
}

Traces connect event flow from producers through queues to consumers. When an email fails to send, I see exactly where it broke. Have you ever traced a message across five services? It turns debugging hours into minutes.
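The glue that makes this work is context propagation through message headers. In real code you would use OpenTelemetry's propagation.TraceContext with a header carrier rather than touching the header by name; the stdlib-only sketch below (with a sample W3C traceparent value) just shows the mechanics of carrying the trace across the queue:

```go
package main

import "fmt"

// Headers stands in for nats.Header; both have the underlying
// type map[string][]string.
type Headers map[string][]string

// inject writes the W3C traceparent value into outgoing message headers
// on the producer side; extract reads it back on the consumer side.
// OpenTelemetry's propagators do exactly this, plus tracestate/baggage.
func inject(h Headers, traceparent string) {
	h["traceparent"] = []string{traceparent}
}

func extract(h Headers) string {
	if v, ok := h["traceparent"]; ok && len(v) > 0 {
		return v[0]
	}
	return ""
}

func main() {
	msg := Headers{}
	inject(msg, "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(extract(msg))
}
```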

Health checks and graceful shutdowns prevent data loss during deployments. Each service exposes a /health endpoint and listens for termination signals:

server := &http.Server{Addr: ":8080"}
go func() {
    // ErrServerClosed is the normal result of a graceful shutdown.
    if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
        log.Error().Err(err).Msg("Server stopped unexpectedly")
    }
}()

quit := make(chan os.Signal, 1)
signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
<-quit

ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
if err := server.Shutdown(ctx); err != nil {
    log.Error().Err(err).Msg("Graceful shutdown failed")
}
publisher.Close() // Flushes pending messages

This ensures in-flight events complete before shutdown. What’s your strategy for zero-downtime deployments?

Deploying to production involves monitoring key metrics—message throughput, processing latency, and error rates. I expose these via Prometheus and visualize traces in Jaeger. Docker Compose ties everything together locally:

services:
  nats:
    image: nats:latest
    command: ["-js"]   # enable JetStream
  jaeger:
    image: jaegertracing/all-in-one:latest

This setup mirrors production, catching issues early. How often do you test your failure scenarios before deploying?

Building event-driven microservices requires balancing performance with reliability. Go’s concurrency model fits perfectly with JetStream’s streaming capabilities. OpenTelemetry makes the entire system transparent. Start small—add tracing to one service, then expand. The investment pays off during incidents.

I’d love to hear about your experiences with event-driven systems. What challenges have you faced? Share your thoughts in the comments below, and if this helped, please like and share it with your team. Let’s build more resilient systems together.


