
Building Production-Ready Event-Driven Microservices: Go, NATS JetStream, OpenTelemetry Guide

Learn to build production-ready event-driven microservices with Go, NATS JetStream & OpenTelemetry. Complete tutorial with real examples & best practices.


I’ve been thinking a lot lately about how modern systems handle the constant flow of events while remaining observable and resilient. The challenge isn’t just processing messages—it’s doing so reliably at scale while maintaining visibility into what’s happening across dozens of services. That’s why I want to share a practical approach using Go, NATS JetStream, and OpenTelemetry.

When we talk about production-ready systems, we’re discussing more than just working code. We’re talking about systems that can handle failures gracefully, scale under pressure, and provide clear visibility when things go wrong. How do you ensure a message isn’t lost when a service restarts? What happens when your payment processor takes too long to respond?

Let me show you how we build our messaging foundation. We start with a robust NATS client that handles connections, reconnections, and stream setup:

func NewNATSClient(url string, logger *zap.Logger) (*NATSClient, error) {
    opts := []nats.Option{
        nats.ReconnectWait(2 * time.Second),
        nats.MaxReconnects(10),
        nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
            logger.Error("NATS disconnected", zap.Error(err))
        }),
    }
    
    nc, err := nats.Connect(url, opts...)
    if err != nil {
        return nil, fmt.Errorf("failed to connect to NATS: %w", err)
    }
    
    js, err := nc.JetStream()
    if err != nil {
        return nil, fmt.Errorf("failed to create JetStream: %w", err)
    }
    
    return &NATSClient{conn: nc, js: js, logger: logger}, nil
}

This setup gives us automatic reconnection, with a two-second wait between attempts, a cap of ten retries, and a log entry whenever the connection drops. But what about message durability? That’s where JetStream’s persistent storage comes in. We configure streams to retain messages even if consumers go offline:

func (nc *NATSClient) setupOrderStream() error {
    _, err := nc.js.AddStream(&nats.StreamConfig{
        Name:      "ORDERS",
        Subjects:  []string{"orders.*"},
        Retention: nats.WorkQueuePolicy,
        Storage:   nats.FileStorage,
        MaxAge:    24 * time.Hour,
    })
    return err
}

Now let’s talk about observability. Have you ever tried to trace a request through multiple services? OpenTelemetry makes this straightforward. We instrument our message publishing to automatically propagate trace context:

func (nc *NATSClient) PublishWithTrace(ctx context.Context, subject string, data []byte) error {
    ctx, span := nc.tracer.Start(ctx, "nats.publish")
    defer span.End()

    headers := make(nats.Header)
    propagator := propagation.TraceContext{}
    propagator.Inject(ctx, propagation.HeaderCarrier(headers))

    msg := &nats.Msg{
        Subject: subject,
        Data:    data,
        Header:  headers,
    }

    _, err := nc.js.PublishMsg(msg)
    if err != nil {
        span.RecordError(err)
        return err
    }

    return nil
}

This automatic trace propagation means we can follow a single order through validation, inventory checks, payment processing, and notification—all in one visual trace. It’s incredibly powerful when debugging complex flows.
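
On the consuming side, the same propagator runs in reverse: we extract the context from the incoming message headers before starting the processing span. Here’s a minimal sketch under the same assumptions as above; the SubscribeWithTrace name, the durable consumer option, and the handler signature are illustrative choices, not part of the service shown earlier:

func (nc *NATSClient) SubscribeWithTrace(subject, durable string, handle func(context.Context, *nats.Msg) error) (*nats.Subscription, error) {
    return nc.js.Subscribe(subject, func(msg *nats.Msg) {
        // Rebuild the trace context that PublishWithTrace injected.
        propagator := propagation.TraceContext{}
        ctx := propagator.Extract(context.Background(), propagation.HeaderCarrier(msg.Header))

        ctx, span := nc.tracer.Start(ctx, "nats.consume")
        defer span.End()

        if err := handle(ctx, msg); err != nil {
            span.RecordError(err)
            msg.Nak() // negative ack: JetStream will redeliver
            return
        }
        msg.Ack()
    }, nats.Durable(durable), nats.ManualAck())
}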

Error handling deserves special attention. In distributed systems, failures are inevitable. We implement retry logic with exponential backoff:

func withRetry(fn func() error, maxAttempts int) error {
    var err error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        err = fn()
        if err == nil {
            return nil
        }
        
        if attempt == maxAttempts {
            break
        }
        
        backoff := time.Duration(math.Pow(2, float64(attempt))) * time.Second
        time.Sleep(backoff)
    }
    return fmt.Errorf("after %d attempts: %w", maxAttempts, err)
}
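
A call site is then a one-liner around any fallible operation. For example, wrapping the traced publish from earlier, assuming client is the NATSClient we built above and payload is a serialized event:

err := withRetry(func() error {
    return client.PublishWithTrace(ctx, "orders.created", payload)
}, 5)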

But what happens when even retries fail? We need dead-letter queues and proper monitoring. That’s where our observability setup pays dividends—we can alert on failed messages and manually reprocess when necessary.
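JetStream doesn’t create dead-letter queues for us, so one common pattern is to park exhausted messages on a dedicated subject backed by its own stream. Here’s a rough sketch; the dlq.orders subject and the header names are conventions we’d choose ourselves, not anything built into NATS:

func (nc *NATSClient) deadLetter(msg *nats.Msg, cause error) error {
    // Use a subject outside "orders.*" so the parked message lands in a
    // separate DLQ stream instead of back in the work queue.
    dlq := &nats.Msg{
        Subject: "dlq.orders",
        Data:    msg.Data,
        Header:  nats.Header{},
    }
    dlq.Header.Set("Original-Subject", msg.Subject)
    dlq.Header.Set("Error", cause.Error())

    _, err := nc.js.PublishMsg(dlq)
    return err
}

An alert on the depth of that DLQ stream then tells us exactly when manual reprocessing is needed.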

Protocol Buffers give us efficient serialization and schema evolution. Here’s how we define our core events:

message OrderCreated {
  string order_id = 1;
  string customer_id = 2;
  repeated OrderItem items = 3;
  double total_amount = 4;
  google.protobuf.Timestamp created_at = 5;
  string trace_id = 6;
}
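
Publishing one of these events from Go is then just a marshal plus the traced publish from earlier. A sketch, assuming the generated code is imported as orderspb (the package name, IDs, and amount are illustrative):

event := &orderspb.OrderCreated{
    OrderId:     "ord-1042",
    CustomerId:  "cust-77",
    TotalAmount: 149.90,
    CreatedAt:   timestamppb.Now(),
}

payload, err := proto.Marshal(event)
if err != nil {
    return fmt.Errorf("marshal OrderCreated: %w", err)
}

if err := client.PublishWithTrace(ctx, "orders.created", payload); err != nil {
    return fmt.Errorf("publish OrderCreated: %w", err)
}

Consumers decode with proto.Unmarshal, and fields they don’t recognize from newer producers are simply ignored, which is what makes the schema evolution story safe.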

The beauty of this approach is how everything connects. Tracing links our services, JetStream ensures message durability, and Go’s efficiency keeps our resource usage low. We get production-ready resilience without sacrificing developer productivity.

Deployment considerations matter too. We use health checks and graceful shutdown to ensure zero-downtime deployments:

func (s *OrderService) HealthCheck() http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if s.natsClient.IsConnected() && s.db.IsConnected() {
            w.WriteHeader(http.StatusOK)
            return
        }
        w.WriteHeader(http.StatusServiceUnavailable)
    })
}

func (s *OrderService) Shutdown(ctx context.Context) error {
    s.logger.Info("shutting down gracefully")
    
    // Stop accepting new requests
    if err := s.server.Shutdown(ctx); err != nil {
        return fmt.Errorf("http server shutdown: %w", err)
    }
    
    // Complete current processing
    s.wg.Wait()
    
    // Close connections
    s.natsClient.Close()
    
    return nil
}
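
Wiring this into main takes only a few lines with os/signal. A sketch; the 30-second deadline and the svc and logger names are illustrative:

func main() {
    // ... construct logger, NATS client, HTTP server, and OrderService ...

    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    // Block until Kubernetes (or an operator) asks the process to stop.
    <-ctx.Done()

    // Give in-flight work a bounded window to finish.
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if err := svc.Shutdown(shutdownCtx); err != nil {
        logger.Error("graceful shutdown failed", zap.Error(err))
    }
}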

This architecture handles backpressure naturally through JetStream’s flow control. When services can’t keep up, messages persist until consumers are ready again. No data loss, no frantic monitoring alerts—just smooth, resilient operation.
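
One concrete way to get that behavior is a pull consumer: the worker fetches only as many messages as it can handle, and everything else simply waits in the stream. A minimal sketch; the order-workers durable name and the batch size are illustrative:

func (nc *NATSClient) consumeOrders(ctx context.Context, process func(*nats.Msg) error) error {
    sub, err := nc.js.PullSubscribe("orders.created", "order-workers")
    if err != nil {
        return err
    }

    for {
        if ctx.Err() != nil {
            return ctx.Err()
        }

        // Pull a small batch; if the stream is empty this just times out.
        msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
        if err != nil && !errors.Is(err, nats.ErrTimeout) {
            return err
        }

        for _, msg := range msgs {
            if err := process(msg); err != nil {
                msg.Nak() // leave it for redelivery
                continue
            }
            msg.Ack()
        }
    }
}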

The result? A system that scales horizontally, provides excellent observability, and handles real-world production loads. It’s not just theoretical—this pattern handles millions of events daily in production environments.

What questions do you have about implementing this in your own systems? I’d love to hear about your experiences and challenges with event-driven architectures.

If you found this useful, please share it with your team and leave a comment below about your own implementation stories. Let’s keep the conversation going about building better, more resilient systems together.



