Build Production-Ready Event-Driven Microservices with Go, NATS JetStream, and OpenTelemetry

I’ve spent years building distributed systems, and the shift to event-driven architectures has been one of the most impactful changes in my career. Recently, I found myself rebuilding a monolithic application into microservices, and the challenges of reliability and observability pushed me toward Go, NATS JetStream, and OpenTelemetry. The combination proved so effective that I want to share how you can create production-ready systems that scale gracefully while remaining observable.

Event-driven architectures transform how services communicate. Instead of direct HTTP calls that create tight coupling, services publish events that others can consume asynchronously. This approach improves resilience and scalability. How do you ensure messages aren’t lost when services restart?

Let’s start with NATS JetStream setup. I prefer using Docker Compose for local development. Here’s a basic configuration that includes both NATS and Jaeger for tracing:

services:
  nats:
    image: nats:2.9-alpine
    ports: ["4222:4222", "8222:8222"]
    command: ["--jetstream", "--store_dir=/data", "--http_port=8222"]
    volumes: ["nats_data:/data"]

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports: ["16686:16686", "14268:14268"]

volumes:
  nats_data:

Creating a connection manager in Go keeps communication with NATS reliable. This code configures automatic reconnection and obtains the JetStream context:

type ConnectionManager struct {
    conn *nats.Conn
    js   nats.JetStreamContext
}

func NewConnectionManager(cfg Config) (*ConnectionManager, error) {
    conn, err := nats.Connect(cfg.URL, nats.MaxReconnects(5))
    if err != nil {
        return nil, fmt.Errorf("connection failed: %w", err)
    }
    
    js, err := conn.JetStream()
    if err != nil {
        return nil, fmt.Errorf("JetStream context failed: %w", err)
    }
    
    return &ConnectionManager{conn: conn, js: js}, nil
}
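
This only partially answers the earlier durability question: JetStream persists messages, but only for subjects covered by a stream, so one has to exist before anything is published. Here's a minimal sketch of creating it at startup; the stream name ORDERS and the order.* subject filter are assumptions for this example:

func (cm *ConnectionManager) EnsureOrderStream() error {
    // Create the ORDERS stream (a no-op if it already exists with this config).
    // File storage keeps messages on disk so they survive broker restarts.
    _, err := cm.js.AddStream(&nats.StreamConfig{
        Name:     "ORDERS",
        Subjects: []string{"order.*"},
        Storage:  nats.FileStorage,
        Replicas: 1,
    })
    if err != nil {
        return fmt.Errorf("stream setup failed: %w", err)
    }
    return nil
}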

Building an event publisher requires careful attention to message durability. I always use acknowledgments and set reasonable timeouts. What happens if a consumer processes the same event twice?

Here’s a publisher service that handles order creation events:

func (p *Publisher) PublishOrderCreated(ctx context.Context, order Order) error {
    data, err := json.Marshal(order)
    if err != nil {
        return fmt.Errorf("marshal failed: %v", err)
    }
    
    ack, err := p.js.Publish("order.created", data, nats.Context(ctx))
    if err != nil {
        return fmt.Errorf("publish failed: %v", err)
    }
    
    log.Printf("Published message with sequence %d", ack.Sequence)
    return nil
}
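
As for the duplicate-processing question: one layer of protection is deduplication on the publish side. JetStream drops messages whose Nats-Msg-Id header it has already seen within the stream's duplicate window (two minutes by default). A sketch, assuming Order carries a unique, stable string ID:

func (p *Publisher) PublishOrderCreatedIdempotent(ctx context.Context, order Order) error {
    data, err := json.Marshal(order)
    if err != nil {
        return fmt.Errorf("marshal failed: %w", err)
    }

    // nats.MsgId sets the Nats-Msg-Id header used for server-side deduplication.
    ack, err := p.js.Publish("order.created", data,
        nats.Context(ctx),
        nats.MsgId(order.ID), // assumes a stable ID that repeats on retries
    )
    if err != nil {
        return fmt.Errorf("publish failed: %w", err)
    }

    log.Printf("Published message with sequence %d (duplicate=%v)", ack.Sequence, ack.Duplicate)
    return nil
}

Deduplication only covers that window, so consumers should still be written to tolerate the occasional redelivery.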

Consumer services need to handle events efficiently. I use queue groups for load balancing and durable consumers to survive restarts:

func (c *Consumer) ProcessOrders() error {
    // ManualAck is required because we call Ack/Nak ourselves; the durable name
    // lets the consumer resume from its last position after a restart.
    _, err := c.js.QueueSubscribe("order.created", "inventory-group",
        func(msg *nats.Msg) {
            var order Order
            if err := json.Unmarshal(msg.Data, &order); err != nil {
                log.Printf("Parse error: %v", err)
                return
            }

            if err := c.updateInventory(order); err != nil {
                log.Printf("Inventory update failed: %v", err)
                msg.Nak() // Negative acknowledgment for retry
                return
            }

            msg.Ack()
        }, nats.Durable("inventory-consumer"), nats.ManualAck())

    return err
}

Integrating OpenTelemetry provides crucial visibility into distributed workflows. Have you ever struggled to trace a request across multiple services?

This code adds tracing to event handlers:

func (h *Handler) handleOrderCreated(ctx context.Context, msg *nats.Msg) {
    tracer := otel.Tracer("order-processor")
    ctx, span := tracer.Start(ctx, "handleOrderCreated")
    defer span.End()
    
    // Add attributes to the span
    span.SetAttributes(
        attribute.String("message.subject", msg.Subject),
        attribute.Int("message.size", len(msg.Data)),
    )
    
    // Process the message
    if err := h.processOrder(ctx, msg.Data); err != nil {
        span.RecordError(err)
        msg.Nak()
        return
    }
    
    msg.Ack()
}
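
For that span to connect to the publisher's trace, the context has to travel inside the message. Below is a sketch of propagating it through NATS headers with the W3C trace-context propagator; it assumes the global propagator was configured during setup (for example otel.SetTextMapPropagator(propagation.TraceContext{})) and uses go.opentelemetry.io/otel/propagation:

// Publisher side: inject the current span context into the message headers.
func publishWithTrace(ctx context.Context, js nats.JetStreamContext, subject string, data []byte) error {
    msg := nats.NewMsg(subject)
    msg.Data = data
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(msg.Header))
    _, err := js.PublishMsg(msg, nats.Context(ctx))
    return err
}

// Consumer side: extract the parent context before starting the local span.
func extractTraceContext(msg *nats.Msg) context.Context {
    return otel.GetTextMapPropagator().Extract(context.Background(), propagation.HeaderCarrier(msg.Header))
}

Calling extractTraceContext(msg) and passing the result into handleOrderCreated links the consumer's span to the publisher's, so the whole path shows up as a single trace in Jaeger.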

Error handling requires thoughtful retry logic. I implement exponential backoff for failed messages:

func withRetry(fn func() error, maxAttempts int) error {
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        err := fn()
        if err == nil {
            return nil
        }
        
        if attempt == maxAttempts {
            return fmt.Errorf("final attempt failed: %w", err)
        }
        
        backoff := time.Duration(math.Pow(2, float64(attempt))) * time.Second
        time.Sleep(backoff)
    }
    return nil
}
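
In the consumer, a small wrapper like the hypothetical handleWithRetry below ties the two together: transient failures back off and retry before the message is negatively acknowledged. Keep the total backoff shorter than the consumer's AckWait, or the server will redeliver the message while it is still being retried.

func (c *Consumer) handleWithRetry(msg *nats.Msg, order Order) {
    if err := withRetry(func() error { return c.updateInventory(order) }, 3); err != nil {
        log.Printf("Inventory update failed after retries: %v", err)
        msg.Nak()
        return
    }
    msg.Ack()
}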

Testing event-driven systems demands mocking external dependencies. I use testcontainers for integration tests:

func TestOrderProcessing(t *testing.T) {
    ctx := context.Background()
    natsContainer, err := setupNATSContainer(ctx)
    if err != nil {
        t.Fatalf("Container setup failed: %v", err)
    }
    defer natsContainer.Terminate(ctx)
    
    // Test logic here
}
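
The setupNATSContainer helper isn't shown above; a sketch using github.com/testcontainers/testcontainers-go and its wait package could look like this:

func setupNATSContainer(ctx context.Context) (testcontainers.Container, error) {
    req := testcontainers.ContainerRequest{
        Image:        "nats:2.9-alpine",
        Cmd:          []string{"--jetstream"},
        ExposedPorts: []string{"4222/tcp"},
        WaitingFor:   wait.ForListeningPort("4222/tcp"),
    }
    return testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: req,
        Started:          true,
    })
}

The test then builds the NATS URL from the container's Host and MappedPort values and runs the real publisher and consumer against it.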

Deploying to production involves monitoring key metrics. I track message throughput, error rates, and processing latency. How do you know when your system needs scaling?
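
OpenTelemetry's metric API covers those three signals. A sketch of the instruments, assuming a meter provider has already been configured and using go.opentelemetry.io/otel/metric:

type Metrics struct {
    processed metric.Int64Counter
    failures  metric.Int64Counter
    latency   metric.Float64Histogram
}

func NewMetrics() (*Metrics, error) {
    meter := otel.Meter("order-processor")

    processed, err := meter.Int64Counter("orders.processed")
    if err != nil {
        return nil, err
    }
    failures, err := meter.Int64Counter("orders.failed")
    if err != nil {
        return nil, err
    }
    latency, err := meter.Float64Histogram("orders.process.duration", metric.WithUnit("ms"))
    if err != nil {
        return nil, err
    }

    return &Metrics{processed: processed, failures: failures, latency: latency}, nil
}

In the handler, processed.Add(ctx, 1) and latency.Record(ctx, float64(elapsed.Milliseconds())) feed the dashboards; a sustained climb in processing latency or in pending messages is usually the signal to add consumers.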

Configuration management becomes critical. I use environment variables for connection strings and timeouts:

type Config struct {
    NATSURL        string        `env:"NATS_URL,required"`
    MaxRetries     int           `env:"MAX_RETRIES" envDefault:"3"`
    ProcessTimeout time.Duration `env:"PROCESS_TIMEOUT" envDefault:"30s"`
}
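
Those struct tags follow the convention of the caarlos0/env package (an assumption here; any env-tag parser works similarly). Loading the configuration at startup is then a single call:

func LoadConfig() (Config, error) {
    var cfg Config
    // env.Parse fills the struct from NATS_URL, MAX_RETRIES, and PROCESS_TIMEOUT,
    // applying the envDefault values when a variable is unset.
    if err := env.Parse(&cfg); err != nil {
        return Config{}, fmt.Errorf("config load failed: %w", err)
    }
    return cfg, nil
}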

The beauty of this architecture lies in its flexibility. Services can be updated independently, and new consumers can join without affecting existing ones. I’ve seen systems handle 10x load increases with minimal changes.

Building these systems taught me the importance of observability. Without proper tracing, debugging distributed transactions feels like searching for a needle in a haystack.

If you found this helpful, please share it with others who might benefit. I’d love to hear about your experiences in the comments—what challenges have you faced with event-driven systems?
