
Build Production-Ready Event-Driven Microservices with Go, NATS JetStream, and OpenTelemetry



I’ve been thinking a lot lately about building microservices that can handle real production workloads. You know, the kind that don’t just work in development but actually survive the chaos of real-world deployment. That’s why I want to share my approach to creating event-driven systems using Go, NATS JetStream, and OpenTelemetry.

Have you ever wondered what separates a prototype from a production-ready system? It’s not just about making things work—it’s about building something that can handle failure gracefully, scale when needed, and provide clear visibility when things go wrong.

Let me show you how I structure these systems. First, we establish a solid foundation with NATS JetStream. This isn't just about sending messages: it's about creating durable streams that persist events and guarantee at-least-once delivery, even across service restarts and network failures.

// Setting up our JetStream connection with proper error handling
nc, err := nats.Connect("nats://localhost:4222",
    nats.MaxReconnects(10),
    nats.ReconnectWait(2*time.Second),
    nats.DisconnectErrHandler(func(c *nats.Conn, err error) {
        log.Printf("Disconnected from NATS: %v", err)
    }),
    nats.ReconnectHandler(func(c *nats.Conn) {
        log.Printf("Reconnected to %s", c.ConnectedUrl())
    }))
if err != nil {
    return fmt.Errorf("connection failed: %w", err)
}

js, err := nc.JetStream(nats.PublishAsyncMaxPending(256))
if err != nil {
    return fmt.Errorf("jetstream context failed: %w", err)
}

What happens when your service suddenly gets flooded with messages? Or when downstream services become temporarily unavailable? That’s where proper stream configuration comes into play.

// Creating a durable stream with replication
_, err = js.AddStream(&nats.StreamConfig{
    Name:        "ORDERS",
    Subjects:    []string{"order.*"},
    Retention:   nats.InterestPolicy,
    MaxMsgs:     1000000,
    MaxBytes:    1 * 1024 * 1024 * 1024, // 1GB
    Storage:     nats.FileStorage,
    Replicas:    3,
    Duplicates:  2 * time.Minute,
})
if err != nil {
    return fmt.Errorf("stream creation failed: %w", err)
}

Now, let’s talk about something crucial: observability. How can you trace a request as it jumps between services? OpenTelemetry provides the answer, giving you clear visibility across your entire system.

// Instrumenting our service with OpenTelemetry
func publishOrderEvent(ctx context.Context, js nats.JetStreamContext, order Order) error {
    ctx, span := tracer.Start(ctx, "publishOrderEvent")
    defer span.End()
    
    span.SetAttributes(
        attribute.String("order.id", order.ID),
        attribute.String("order.status", order.Status),
    )
    
    eventData, err := json.Marshal(order)
    if err != nil {
        span.RecordError(err)
        return fmt.Errorf("marshal failed: %w", err)
    }
    
    msg := &nats.Msg{
        Subject: "order.created",
        Data:    eventData,
        Header:  nats.Header{},
    }
    
    // Inject tracing context into message headers
    otel.GetTextMapPropagator().Inject(ctx, natsHeaderCarrier{msg.Header})
    
    ack, err := js.PublishMsg(msg)
    if err != nil {
        span.RecordError(err)
        return fmt.Errorf("publish failed: %w", err)
    }
    
    span.SetAttributes(attribute.Int64("nats.sequence", int64(ack.Sequence)))
    return nil
}
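The natsHeaderCarrier used above is not part of the NATS client library; it's a small adapter you write yourself so OpenTelemetry's propagator can read and write trace headers on the message. Here's a minimal sketch implementing the propagation.TextMapCarrier interface over the header's underlying map[string][]string type (note that, unlike nats.Header.Get, this plain-map version does not canonicalize header keys):

```go
package main

import "fmt"

// natsHeaderCarrier adapts a NATS message header (underlying type
// map[string][]string, the same as nats.Header) to OpenTelemetry's
// propagation.TextMapCarrier interface, which requires Get, Set, and Keys.
type natsHeaderCarrier struct {
	header map[string][]string
}

// Get returns the first value for key, or "" if the key is absent.
func (c natsHeaderCarrier) Get(key string) string {
	if vals := c.header[key]; len(vals) > 0 {
		return vals[0]
	}
	return ""
}

// Set replaces any existing values for key with a single value.
func (c natsHeaderCarrier) Set(key, value string) {
	c.header[key] = []string{value}
}

// Keys lists all header keys currently present.
func (c natsHeaderCarrier) Keys() []string {
	keys := make([]string, 0, len(c.header))
	for k := range c.header {
		keys = append(keys, k)
	}
	return keys
}

func main() {
	c := natsHeaderCarrier{header: map[string][]string{}}
	c.Set("traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(c.Get("traceparent"))
}
```

Because nats.Header and http.Header share the same underlying type, you could also convert the header directly to OpenTelemetry's built-in propagation.HeaderCarrier instead of writing an adapter; the explicit type above just makes the intent obvious.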

Error handling is where many systems fall short. But what if we could automatically retry failed operations with exponential backoff? This approach has saved me countless times during temporary outages.

// Robust message consumption with retry logic
sub, err := js.Subscribe("order.created", func(msg *nats.Msg) {
    ctx := otel.GetTextMapPropagator().Extract(context.Background(), natsHeaderCarrier{msg.Header})
    ctx, span := tracer.Start(ctx, "processOrder")
    defer span.End()
    
    var order Order
    if err := json.Unmarshal(msg.Data, &order); err != nil {
        span.RecordError(err)
        msg.Nak() // Negative acknowledgment
        return
    }
    
    if err := processOrder(ctx, order); err != nil {
        span.RecordError(err)
        if shouldRetry(err) {
            meta, _ := msg.Metadata() // JetStream metadata; error ignored for brevity
            msg.NakWithDelay(calculateBackoff(meta.NumDelivered))
        } else {
            msg.Term() // Terminal failure
        }
        return
    }
    
    msg.Ack() // Success
}, nats.ManualAck())

Building production-ready systems means thinking about deployment and monitoring from day one. I always include health checks, metrics endpoints, and proper logging from the very beginning.

// Health check endpoint with metrics
func healthHandler(w http.ResponseWriter, r *http.Request) {
    if nc.Status() != nats.CONNECTED {
        http.Error(w, "NATS disconnected", http.StatusServiceUnavailable)
        return
    }
    
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(map[string]interface{}{
        "status":    "healthy",
        "timestamp": time.Now().UTC(),
        "nats":      nc.ConnectedUrl(),
    })
}

The beauty of this approach is how everything works together. Go’s performance characteristics, NATS JetStream’s reliability, and OpenTelemetry’s observability create a foundation you can trust in production.

But here’s something to consider: how do you ensure message ordering while maintaining scalability? The answer often lies in thoughtful partitioning and consumer group strategies.
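One concrete way to partition, sketched here under the assumption that the order ID is your ordering key: hash the key onto a fixed set of partitioned subjects (order.created.0 through order.created.7) and attach one consumer per partition. Events for the same order always land on the same partition, preserving their relative order, while different orders fan out across consumers:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionSubject deterministically maps a stable key (e.g. an order ID)
// onto one of a fixed number of partitioned subjects, such as
// "order.created.3". The same key always yields the same partition.
func partitionSubject(prefix, key string, partitions uint32) string {
	h := fnv.New32a()
	h.Write([]byte(key)) // fnv's Write never returns an error
	return fmt.Sprintf("%s.%d", prefix, h.Sum32()%partitions)
}

func main() {
	// The same order always publishes to the same subject.
	fmt.Println(partitionSubject("order.created", "order-1042", 8))
	fmt.Println(partitionSubject("order.created", "order-1042", 8))
}
```

Recent NATS server releases also ship built-in deterministic subject partitioning via subject-mapping transforms, so check whether your server version can do this for you before hand-rolling it client-side.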

What I love about this architecture is its resilience. Services can fail, networks can partition, and messages can be delayed—yet the system continues to operate, eventually catching up when conditions improve.

Remember, building for production means anticipating failure at every level. It’s not about preventing errors entirely—that’s impossible—but about creating systems that handle errors gracefully and recover automatically.

I’d love to hear your thoughts on this approach. What challenges have you faced with event-driven architectures? Have you found particular strategies that work well in your production environments? Share your experiences in the comments below—let’s learn from each other’s journeys.

If this resonates with you, please like and share this with others who might benefit from these patterns. Your feedback helps shape future content and discussions around building reliable systems.
