Production-Ready Event-Driven Microservices: Go, NATS JetStream, and OpenTelemetry Complete Guide

Learn to build production-ready event-driven microservices with Go, NATS JetStream & OpenTelemetry. Master distributed tracing, resilience patterns & cloud deployment.

I’ve been thinking about how modern systems handle complex transactions across distributed services. It started when I noticed how often traditional request-response patterns struggle with scalability and resilience. That’s why I want to share a practical approach using Go, NATS JetStream, and OpenTelemetry for building robust event-driven systems. You’ll see how these technologies work together to create production-ready microservices.

Our journey begins with NATS JetStream configuration. This messaging layer provides durable streams and durable consumers, which are essential for reliable event delivery. Here’s how I initialize the connection:

// internal/common/messaging/jetstream.go
func NewJetStreamClient(config *JetStreamConfig, logger *zap.Logger) (*JetStreamClient, error) {
    opts := []nats.Option{
        nats.MaxReconnects(config.MaxReconnects),
        nats.ReconnectWait(config.ReconnectWait),
        nats.Timeout(config.Timeout),
    }
    
    conn, err := nats.Connect(config.URL, opts...)
    if err != nil {
        return nil, fmt.Errorf("failed to connect to NATS: %w", err)
    }
    
    js, err := conn.JetStream(nats.MaxWait(30*time.Second))
    if err != nil {
        return nil, fmt.Errorf("failed to create JetStream context: %w", err)
    }
    
    return &JetStreamClient{conn: conn, js: js}, nil
}

Why does connection resilience matter? Because network failures happen, and our system must handle them gracefully. The MaxReconnects option ensures we automatically recover from temporary disruptions.
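
Reconnection is more useful when it is visible. Here’s a small addition I’d make inside NewJetStreamClient, a sketch that uses the standard nats.go event handlers together with the zap logger already passed in:

// Log connection-state changes so flapping and reconnect storms show up in the logs.
opts = append(opts,
    nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
        logger.Warn("NATS disconnected", zap.Error(err))
    }),
    nats.ReconnectHandler(func(nc *nats.Conn) {
        logger.Info("NATS reconnected", zap.String("url", nc.ConnectedUrl()))
    }),
    nats.ClosedHandler(func(_ *nats.Conn) {
        logger.Error("NATS connection closed")
    }),
)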

For event publishing, I implement idempotency keys to prevent duplicate processing. The key must be stable for a given logical event (a fresh random ID on every publish would give retries a new identity and defeat deduplication), so the caller supplies it:

func (c *JetStreamClient) PublishEvent(ctx context.Context, subject, eventID string, event interface{}) error {
    data, err := json.Marshal(event)
    if err != nil {
        return fmt.Errorf("failed to marshal event: %w", err)
    }
    
    _, err = c.js.Publish(subject, data, nats.MsgId(eventID), nats.Context(ctx))
    return err
}

Notice the MsgId option? That’s our safeguard against duplicate messages: JetStream drops any message whose ID it has already seen within the stream’s duplicate window.
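
A quick way to see this in action is to publish the same payload with the same event ID twice inside the duplicate window and compare the acknowledgements (the subject and ID below are just examples):

// The second publish with an already-seen MsgId is dropped by the stream;
// its ack comes back with Duplicate set to true.
ack1, err := js.Publish("orders.created", data, nats.MsgId("order-1234"))
ack2, err2 := js.Publish("orders.created", data, nats.MsgId("order-1234"))
if err == nil && err2 == nil {
    fmt.Println(ack1.Duplicate, ack2.Duplicate) // false true
}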

Now, what about consumers? Here’s a subscription that processes orders with acknowledgment control:

func (c *JetStreamClient) ProcessOrders(ctx context.Context) error {
    sub, err := c.js.PullSubscribe("orders.>", "order-processor")
    if err != nil {
        return fmt.Errorf("failed to subscribe: %w", err)
    }
    
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
        }
        
        // A timeout here just means no messages were available; keep polling.
        msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
        if err != nil && !errors.Is(err, nats.ErrTimeout) {
            return fmt.Errorf("fetch failed: %w", err)
        }
        
        for _, msg := range msgs {
            var order models.Order
            if err := json.Unmarshal(msg.Data, &order); err != nil {
                msg.Term() // malformed payload; do not redeliver
                continue
            }
            
            // Business logic here
            if processOrder(order) {
                msg.Ack()
            } else {
                msg.Nak()
            }
        }
    }
}

This pattern allows us to explicitly acknowledge successful processing or request redelivery on failures. The Fetch method with batch size 10 balances throughput and resource usage.
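
When failures look transient, an immediate Nak can turn into a tight redelivery loop. With a reasonably recent nats.go client you can ask JetStream to delay the redelivery instead; a small variation on the loop body above:

// Same acknowledgment decision as above, but back off the redelivery
// rather than requesting it immediately.
if processOrder(order) {
    msg.Ack()
} else {
    msg.NakWithDelay(30 * time.Second)
}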

For tracing, OpenTelemetry gives us visibility across services. I instrument handlers like this:

// internal/order/handler.go
func CreateOrder(ctx context.Context, msg *nats.Msg) {
    // Extract the trace context from the message headers before starting
    // the span, so the span becomes a child of the publisher's span.
    prop := otel.GetTextMapPropagator()
    carrier := messaging.NewNATSCarrier(msg.Header)
    ctx = prop.Extract(ctx, carrier)
    
    tracer := otel.Tracer("order-service")
    ctx, span := tracer.Start(ctx, "create_order")
    defer span.End()
    
    // Processing logic
    span.AddEvent("Order validation started")
    if valid := validateOrder(msg.Data); !valid {
        span.RecordError(errors.New("invalid order"))
    }
}

How do we connect traces across services? The NATSCarrier propagates trace context through message headers, maintaining the chain.
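
The carrier itself is small. Here’s roughly what messaging.NewNATSCarrier might look like, assuming it simply adapts nats.Header (which behaves like http.Header) to OpenTelemetry’s propagation.TextMapCarrier interface:

// internal/common/messaging/carrier.go
// NATSCarrier satisfies propagation.TextMapCarrier over NATS message headers.
type NATSCarrier struct {
    header nats.Header
}

func NewNATSCarrier(h nats.Header) NATSCarrier {
    return NATSCarrier{header: h}
}

func (c NATSCarrier) Get(key string) string { return c.header.Get(key) }

func (c NATSCarrier) Set(key, value string) { c.header.Set(key, value) }

func (c NATSCarrier) Keys() []string {
    keys := make([]string, 0, len(c.header))
    for k := range c.header {
        keys = append(keys, k)
    }
    return keys
}

On the publishing side, prop.Inject(ctx, carrier) writes the current span context into the headers before the message is sent.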

For resilience, I combine circuit breakers with exponential backoff:

// The circuit breaker must be shared across messages; creating a new one
// per call would reset its counts and it would never trip.
var paymentBreaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name: "payment-service",
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures > 5
    },
})

func PaymentHandler(msg *nats.Msg) {
    _, err := paymentBreaker.Execute(func() (interface{}, error) {
        return processPayment(msg.Data)
    })
    
    if err != nil {
        err = backoff.Retry(func() error {
            return retryPayment(msg.Data)
        }, backoff.NewExponentialBackOff())
    }
    
    if err != nil {
        msg.Nak() // still failing; let JetStream redeliver later
        return
    }
    msg.Ack()
}

This approach prevents cascading failures while giving temporary issues time to resolve.

Performance optimization comes from worker pools. I use this pattern for high-throughput services:

// internal/notification/worker_pool.go
func StartEmailWorkers(js nats.JetStreamContext, num int) (*nats.Subscription, error) {
    jobs := make(chan *nats.Msg, 100)
    
    for w := 1; w <= num; w++ {
        go func(id int) {
            for msg := range jobs {
                if err := sendEmail(msg.Data); err != nil {
                    msg.Nak() // redeliver on failure
                    continue
                }
                msg.Ack()
            }
        }(w)
    }
    
    // Deliver messages into the buffered channel. The caller keeps the
    // subscription and unsubscribes it during shutdown; deferring
    // Unsubscribe here would tear the subscription down immediately.
    sub, err := js.ChanSubscribe("notifications.email", jobs)
    if err != nil {
        return nil, fmt.Errorf("failed to subscribe: %w", err)
    }
    return sub, nil
}

Each worker processes messages concurrently, while the channel acts as a buffer during spikes.

When deploying, I add graceful shutdown handling:

func main() {
    server := startHTTPServer()
    
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
    
    <-sig
    log.Println("Shutdown signal received")
    
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    
    if err := server.Shutdown(ctx); err != nil {
        log.Fatal("Forced shutdown:", err)
    }
}

This allows in-flight requests to complete before termination, preventing data loss.
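
In an event-driven service, shutting down the HTTP server is only half the job: the NATS connection should be drained as well, so buffered and in-flight messages are handled before the process exits. A minimal sketch, adding a method to the JetStreamClient from earlier:

// Drain unsubscribes, lets in-flight messages finish processing,
// and then closes the underlying connection.
func (c *JetStreamClient) Drain() error {
    return c.conn.Drain()
}

Calling client.Drain() right after server.Shutdown(ctx) keeps the two shutdown paths aligned.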

For dead-letter handling, I set explicit failure thresholds. The duplicate window and discard policy belong on the stream, while the delivery limit belongs on the consumer:

// Stream configuration: dedup window and discard policy live on the stream
_, err := js.AddStream(&nats.StreamConfig{
    Name:       "ORDERS",
    Subjects:   []string{"orders.>"},
    Discard:    nats.DiscardOld,
    Duplicates: 5 * time.Minute,
})

// Delivery attempts are limited on the consumer, not the stream
_, err = js.AddConsumer("ORDERS", &nats.ConsumerConfig{
    Durable:    "order-processor",
    AckPolicy:  nats.AckExplicitPolicy,
    MaxDeliver: 5,
})

When a message exhausts its five delivery attempts, JetStream stops redelivering it and publishes a MAX_DELIVERIES advisory. A small handler subscribed to that advisory can copy the offending message to a dead-letter stream for investigation.
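
Here’s a minimal sketch of such a handler; the dlq.orders subject is illustrative and assumes a separate DLQ stream bound to it:

// Subscribe to the advisory JetStream publishes when a message hits MaxDeliver,
// then copy the original message onto a dead-letter subject.
func StartDLQHandler(nc *nats.Conn, js nats.JetStreamContext) (*nats.Subscription, error) {
    const advisory = "$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.ORDERS.order-processor"
    
    return nc.Subscribe(advisory, func(m *nats.Msg) {
        var adv struct {
            StreamSeq uint64 `json:"stream_seq"`
        }
        if err := json.Unmarshal(m.Data, &adv); err != nil {
            return
        }
        // Fetch the original message by its stream sequence and republish it.
        raw, err := js.GetMsg("ORDERS", adv.StreamSeq)
        if err != nil {
            return
        }
        js.Publish("dlq.orders", raw.Data)
    })
}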

What metrics should we monitor? These four are critical (a small instrumentation sketch follows the list):

  • Message processing latency
  • Error rates per consumer
  • Pending message counts
  • Circuit breaker state changes
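
Here’s a small sketch of how the first two might be recorded with the OpenTelemetry metrics API; the instrument names, the handle function, and the consumer attribute are illustrative:

// Create the instruments once, at startup.
meter := otel.Meter("order-service")
latency, _ := meter.Float64Histogram("message.processing.duration", metric.WithUnit("ms"))
failures, _ := meter.Int64Counter("message.processing.errors")

// Inside the consumer loop, wrap the business logic.
attrs := metric.WithAttributes(attribute.String("consumer", "order-processor"))
start := time.Now()
err := handle(msg)
latency.Record(ctx, float64(time.Since(start).Milliseconds()), attrs)
if err != nil {
    failures.Add(ctx, 1, attrs)
}

Pending counts can be read from js.ConsumerInfo (its NumPending field), and gobreaker’s Settings accept an OnStateChange callback that a counter can hook into.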

I’ve found that combining NATS JetStream with OpenTelemetry creates systems that are both robust and observable. The patterns we’ve covered handle real-world challenges like network partitions, processing failures, and traffic spikes. Try implementing these in your next project - you’ll notice how much simpler distributed systems become when events drive your architecture.

If this helped you, share it with others facing similar challenges. Have questions or improvements? Let me know in the comments below.
