
Building Production-Ready Event-Driven Microservices with Go, NATS, and OpenTelemetry


Here’s my perspective on building robust event-driven microservices with Go, NATS, and OpenTelemetry:


Lately, I’ve noticed many teams struggling with distributed systems complexity. Why do some microservice implementations become tangled webs of HTTP calls? What if we could create loosely coupled services that scale naturally? That’s why I want to share a practical approach using Go, NATS, and OpenTelemetry - a combination that’s powered several production systems I’ve designed.

When building event-driven architectures, defining clear contracts is crucial. Let’s examine how we establish shared event types:

// internal/common/events/events.go
type BaseEvent struct {
    ID          string    `json:"id"`
    Type        EventType `json:"type"`
    AggregateID string    `json:"aggregate_id"`
    Version     string    `json:"version"` // schema version; bump only on breaking changes
    Timestamp   time.Time `json:"timestamp"`
}

type OrderCreatedEvent struct {
    BaseEvent
    CustomerID string        `json:"customer_id"`
    Items      []OrderItem   `json:"items"`
}

Notice how each event carries its identity and context. How might this help when services evolve independently? The key is the Version field: consumers can accept old and new payloads side by side, which keeps schema changes backward-compatible.
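To keep those fields consistent, I like to stamp them in one place at creation time. Here's a minimal sketch - the constructor name, the EventTypeOrderCreated constant, and the google/uuid dependency are illustrative, not from the original codebase:

// Hypothetical constructor: stamps identity, version, and time once,
// so no call site can forget them.
func NewOrderCreatedEvent(orderID, customerID string, items []OrderItem) OrderCreatedEvent {
    return OrderCreatedEvent{
        BaseEvent: BaseEvent{
            ID:          uuid.NewString(),      // github.com/google/uuid
            Type:        EventTypeOrderCreated, // assumed EventType constant
            AggregateID: orderID,
            Version:     "v1", // bump only on breaking schema changes
            Timestamp:   time.Now().UTC(),
        },
        CustomerID: customerID,
        Items:      items,
    }
}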

For communication, NATS - with its JetStream persistence layer - provides the messaging backbone. (The original NATS Streaming/STAN client is deprecated, so the examples below use JetStream.) Here's how we initialize a robust connection:

// internal/common/messaging/nats.go
func NewNATSEventBus(natsURL string) (*NATSEventBus, error) {
    nc, err := nats.Connect(natsURL,
        nats.ReconnectWait(2*time.Second),
        nats.MaxReconnects(-1),
        nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
            log.Printf("Disconnected: %v", err)
        }),
        nats.ReconnectHandler(func(nc *nats.Conn) {
            log.Printf("Reconnected to %s", nc.ConnectedUrl())
        }),
    )
    if err != nil {
        return nil, err
    }

    // JetStream provides persistence, durable consumers, and redelivery.
    js, err := nc.JetStream()
    if err != nil {
        nc.Close()
        return nil, err
    }
    return &NATSEventBus{conn: nc, js: js}, nil
}

This configuration handles network volatility - essential for production. What happens during a cloud provider outage? The reconnection logic keeps services resilient.
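The mirror image matters on shutdown: drain the connection so buffered messages are processed rather than dropped. A minimal sketch, assuming the event bus keeps the core connection in its conn field:

// Close drains in-flight messages before closing the connection.
// Call it from a deferred shutdown hook in main.
func (b *NATSEventBus) Close() error {
    return b.conn.Drain()
}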

Publishing events with tracing context demonstrates OpenTelemetry integration:

func (b *NATSEventBus) Publish(ctx context.Context, subject string, event interface{}) error {
    ctx, span := b.tracer.Start(ctx, "Publish:"+subject)
    defer span.End()

    data, err := json.Marshal(event)
    if err != nil {
        span.RecordError(err)
        return err
    }

    msg := &nats.Msg{
        Subject: subject,
        Data:    data,
        Header:  nats.Header{},
    }

    // Inject trace context into message headers so consumers
    // can continue the same trace.
    carrier := propagation.MapCarrier{}
    otel.GetTextMapPropagator().Inject(ctx, carrier)
    for k, v := range carrier {
        msg.Header.Set(k, v)
    }

    // Publish the message we built - headers included.
    if _, err := b.js.PublishMsg(msg); err != nil {
        span.RecordError(err)
        return err
    }
    return nil
}

Notice how we propagate trace context through message headers. Why is this vital? It enables tracking requests across service boundaries, turning fragmented logs into coherent journeys.
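One JetStream prerequisite the publish path relies on: the subject must be captured by a stream, or the publish is rejected. A one-time provisioning sketch - the "ORDERS" name and "orders.>" filter are assumptions for this example:

// Create (or confirm) a stream covering all order subjects.
if _, err := js.AddStream(&nats.StreamConfig{
    Name:     "ORDERS",
    Subjects: []string{"orders.>"},
}); err != nil {
    log.Printf("stream setup (may already exist): %v", err)
}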

For message consumers, consider this subscription pattern:

func (b *NATSEventBus) Subscribe(subject string, handler EventHandler) error {
    _, err := b.js.Subscribe(subject, func(msg *nats.Msg) {
        // Extract tracing context from the incoming headers
        carrier := propagation.MapCarrier{}
        for key := range msg.Header {
            carrier.Set(key, msg.Header.Get(key))
        }
        ctx := otel.GetTextMapPropagator().Extract(context.Background(), carrier)

        ctx, span := b.tracer.Start(ctx, "Consume:"+subject)
        defer span.End()

        if err := handler(ctx, &Message{
            Subject: msg.Subject,
            Data:    msg.Data,
        }); err != nil {
            span.RecordError(err)
            return // no ack: JetStream redelivers the message
        }
        msg.Ack()
    }, nats.Durable("order-processor"), nats.ManualAck())
    return err
}

This implements durable subscriptions and trace propagation. What if a service crashes mid-processing? The durable consumer plus manual acknowledgment ensure that unacked messages are redelivered when the service restarts.
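Redelivery also means at-least-once semantics, so handlers must tolerate duplicates. Here's a sketch of deduplication keyed on the event ID - the processed store with its Seen/Mark methods (Redis, a database table) and the apply helper are assumptions:

func (s *OrderService) handleOrderCreated(ctx context.Context, msg *Message) error {
    var evt OrderCreatedEvent
    if err := json.Unmarshal(msg.Data, &evt); err != nil {
        return err
    }
    // Duplicates are normal under redelivery; skip events already applied.
    seen, err := s.processed.Seen(ctx, evt.ID)
    if err != nil || seen {
        return err
    }
    if err := s.apply(ctx, evt); err != nil {
        return err
    }
    return s.processed.Mark(ctx, evt.ID)
}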

For resilience, combine circuit breakers with exponential backoff:

// internal/order/service.go
// The breaker lives at package scope: a breaker created inside Process
// would reset its failure counts on every call and never trip.
var orderBreaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:    "order_processor",
    Timeout: 30 * time.Second,
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures > 5
    },
})

func (s *OrderService) Process(ctx context.Context, order Order) error {
    _, err := orderBreaker.Execute(func() (interface{}, error) {
        return nil, s.eventBus.Publish(ctx, "orders.created", order)
    })

    if err != nil {
        // Fall back to a retry subject with exponential backoff
        op := func() error {
            return s.eventBus.Publish(ctx, "orders.retry", order)
        }
        return backoff.Retry(op, backoff.NewExponentialBackOff())
    }

    return nil
}

This combination prevents cascading failures while ensuring eventual delivery. How many retries make sense? That depends on your business context - payment processing needs different handling than notifications.
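The backoff library lets you encode that business decision explicitly. A sketch with bounded retries - every limit below is an illustrative choice, not a recommendation:

bo := backoff.NewExponentialBackOff()
bo.InitialInterval = 500 * time.Millisecond
bo.MaxInterval = 10 * time.Second
bo.MaxElapsedTime = 2 * time.Minute // stop retrying after two minutes total

// Cap the attempt count too: payments might warrant fewer attempts plus
// an alert, while notifications can tolerate a longer tail.
return backoff.Retry(op, backoff.WithMaxRetries(bo, 10))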

For deployment, our Dockerfile keeps the final image minimal:

FROM golang:1.20-alpine AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o order-service ./cmd/order-service

FROM scratch
COPY --from=builder /app/order-service /order-service
ENTRYPOINT ["/order-service"]

Using scratch images reduces attack surfaces and speeds deployments. But what about runtime dependencies? That’s why we statically compile with CGO disabled.
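One dependency scratch can't satisfy is the system CA bundle: if the service dials NATS or the trace collector over TLS, certificate verification fails in an empty image. A common fix, assuming the builder stage ships the bundle (golang:alpine does):

# Copy root CAs so TLS connections verify inside the scratch image.
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/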

Finally, OpenTelemetry gives us observability superpowers:

// internal/common/tracing/tracing.go
func InitTracing(serviceName string) (func(), error) {
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint())
    if err != nil {
        return nil, err
    }
    
    tp := tracesdk.NewTracerProvider(
        tracesdk.WithBatcher(exporter),
        tracesdk.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String(serviceName),
        )),
    )

    otel.SetTracerProvider(tp)
    return func() { _ = tp.Shutdown(context.Background()) }, nil
}

This instruments every service with Jaeger tracing (newer OpenTelemetry releases deprecate the dedicated Jaeger exporter in favor of OTLP, which Jaeger also ingests). How do we correlate events across services? The trace propagation we implemented earlier creates connected timelines in our observability tools.
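Each service wires this up at startup; a minimal usage sketch:

shutdown, err := tracing.InitTracing("order-service")
if err != nil {
    log.Fatalf("init tracing: %v", err)
}
defer shutdown() // flush buffered spans before the process exits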

Building production-ready systems requires balancing simplicity and resilience. Start with well-defined events, add deliberate redundancy, and bake in observability from day one. What challenges have you faced with event-driven architectures? Share your experiences below - I’d love to hear what works for your teams. If this approach resonates, consider sharing it with others facing similar design decisions.



