Build Production-Ready Event-Driven Microservices with Go, NATS JetStream, and OpenTelemetry Guide

Learn to build scalable event-driven microservices with Go, NATS JetStream, and OpenTelemetry. Complete guide with production patterns, tracing, and deployment.

I’ve been reflecting on the challenges of building scalable and reliable microservices in today’s distributed systems. Recently, I found myself grappling with how to handle event-driven architectures effectively, especially when it comes to production readiness. That’s why I decided to share my journey in building event-driven microservices using Go, NATS JetStream, and OpenTelemetry. Let me walk you through the key aspects that make this combination powerful.

Event-driven architectures shift how services communicate. Instead of direct calls, services emit events that others react to. This approach improves scalability and resilience. Have you ever considered how this impacts system design when services need to handle thousands of events per second?

Go’s concurrency model makes it ideal for high-throughput systems. Goroutines and channels allow efficient message processing. NATS JetStream adds persistence and guarantees, while OpenTelemetry provides visibility into distributed workflows.
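To make that concrete, here is a minimal, self-contained sketch of the goroutine-and-channel fan-out pattern that underpins high-throughput message processing. The function names and worker count are illustrative, not part of any library API:

```go
package main

import (
	"fmt"
	"sync"
)

// processEvents fans incoming events out to a fixed number of workers.
// The worker count and channel buffer sizes are illustrative values.
func processEvents(events <-chan string, workers int) []string {
	results := make(chan string, workers)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ev := range events {
				results <- "processed:" + ev // stand-in for real handling
			}
		}()
	}

	// Close results once every worker has drained the input channel.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	events := make(chan string, 3)
	for _, e := range []string{"a", "b", "c"} {
		events <- e
	}
	close(events)
	fmt.Println(len(processEvents(events, 2))) // all three events processed
}
```

The same shape scales from three events to thousands: the channel is the queue, and the worker count is the throughput knob.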

Let’s start with event definitions. Clear event schemas are crucial for consistency across services.

type Event struct {
    Metadata EventMetadata `json:"metadata"`
    Payload  interface{}   `json:"payload"`
}

type EventMetadata struct {
    EventID       string    `json:"event_id"`
    EventType     string    `json:"event_type"`
    Timestamp     time.Time `json:"timestamp"`
    CorrelationID string    `json:"correlation_id"`
}

type OrderItem struct {
    ProductID string  `json:"product_id"`
    Quantity  int     `json:"quantity"`
    Price     float64 `json:"price"`
}

type OrderCreatedPayload struct {
    OrderID     string      `json:"order_id"`
    CustomerID  string      `json:"customer_id"`
    Items       []OrderItem `json:"items"`
    TotalAmount float64     `json:"total_amount"`
}

Each event includes metadata for tracing and correlation. This structure helps in debugging and monitoring. How do you ensure events remain backward compatible as your system evolves?

Producing events requires reliability. I use a publisher interface that handles connection management and retries.

func (p *JetStreamPublisher) Publish(ctx context.Context, event *models.Event) error {
    data, err := json.Marshal(event)
    if err != nil {
        return fmt.Errorf("marshal event: %w", err)
    }
    
    subject := fmt.Sprintf("events.%s", event.Metadata.EventType)
    // The error from PublishAsync only covers local failures such as a full
    // pending buffer. The returned PubAckFuture (or js.PublishAsyncComplete
    // at batch boundaries) must still be checked to confirm server storage.
    _, err = p.js.PublishAsync(subject, data)
    if err != nil {
        return fmt.Errorf("publish event: %w", err)
    }
    
    return nil
}

This code marshals events and publishes them to JetStream. Async publishing improves throughput but requires careful error handling.

Consuming events demands similar attention. Services must process messages without losing data or creating bottlenecks.

func (c *EventConsumer) Start(ctx context.Context) error {
    sub, err := c.js.PullSubscribe("events.>", "order-service")
    if err != nil {
        return err
    }
    
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
        }
        
        msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
        if err != nil {
            if errors.Is(err, nats.ErrTimeout) {
                continue
            }
            return err
        }
        
        for _, msg := range msgs {
            // Unbounded goroutines can overwhelm the service under load;
            // cap concurrency with a worker pool or semaphore in production.
            go c.processMessage(ctx, msg)
        }
    }
}

Pull subscriptions allow controlling message flow. Processing messages in goroutines prevents blocking. What strategies do you use to handle peak loads without overwhelming your services?
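One common answer to the peak-load question is a counting semaphore: a buffered channel that caps how many handlers run at once, so a burst of fetched messages queues instead of spawning unbounded goroutines. A minimal sketch, with illustrative names:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// handleBatch processes a batch with at most maxInFlight concurrent
// handlers, instead of one unbounded goroutine per message.
func handleBatch(msgs []string, maxInFlight int) int64 {
	sem := make(chan struct{}, maxInFlight) // counting semaphore
	var wg sync.WaitGroup
	var processed int64

	for _, m := range msgs {
		sem <- struct{}{} // blocks while maxInFlight handlers are running
		wg.Add(1)
		go func(m string) {
			defer func() { <-sem; wg.Done() }()
			_ = m // stand-in for real message handling
			atomic.AddInt64(&processed, 1)
		}(m)
	}
	wg.Wait()
	return processed
}

func main() {
	fmt.Println(handleBatch([]string{"a", "b", "c", "d"}, 2))
}
```

The semaphore also gives natural backpressure: while all slots are busy, the loop stops fetching, which is exactly what a pull consumer wants under load.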

Distributed tracing with OpenTelemetry reveals how events move through the system. It connects related operations across service boundaries.

func processOrder(ctx context.Context, order models.Order) error {
    ctx, span := otel.Tracer("order-service").Start(ctx, "processOrder")
    defer span.End()
    
    span.SetAttributes(
        attribute.String("order.id", order.ID),
        attribute.Float64("order.amount", order.TotalAmount),
    )
    
    event := createOrderCreatedEvent(order)
    return publisher.Publish(ctx, event)
}

Spans track function execution and add context to events. This visibility is invaluable when diagnosing issues in production.
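For the trace to survive the hop through JetStream, the producer has to inject the trace context into the event itself; OpenTelemetry's TextMapPropagator does this by writing the W3C traceparent header into a map-like carrier. Here is a stdlib sketch of that idea (the `Headers` type and `inject`/`extract` helpers are illustrative stand-ins, not the otel API):

```go
package main

import "fmt"

// Headers is a map carrier, the shape OpenTelemetry's TextMapPropagator
// writes into. In real code otel's propagation.MapCarrier plays this role.
type Headers map[string]string

// inject copies the current trace context into the event's metadata so the
// consumer can continue the same trace. The IDs here are example values.
func inject(h Headers, traceID, spanID string) {
	// W3C Trace Context format: version-traceid-spanid-flags
	h["traceparent"] = fmt.Sprintf("00-%s-%s-01", traceID, spanID)
}

func extract(h Headers) (string, bool) {
	tp, ok := h["traceparent"]
	return tp, ok
}

func main() {
	meta := Headers{}
	inject(meta, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
	tp, _ := extract(meta)
	fmt.Println(tp)
}
```

On the consumer side, extracting that value and starting the handler span from it is what stitches producer and consumer into one trace in the backend.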

Error handling must be proactive. Dead letter queues capture failed messages for later analysis.

func (c *EventConsumer) processMessage(ctx context.Context, msg *nats.Msg) {
    var event models.Event
    if err := json.Unmarshal(msg.Data, &event); err != nil {
        // A malformed payload can never succeed; Term stops redelivery
        // instead of Nak, which would loop the poison message forever.
        c.sendToDLQ(msg, "invalid format")
        msg.Term()
        return
    }
    
    if err := c.handler.Handle(ctx, event); err != nil {
        // Handler failures may be transient: Nak schedules a redelivery.
        c.sendToDLQ(msg, err.Error())
        msg.Nak()
        return
    }
    
    msg.Ack()
}

Unacknowledged messages are redelivered, while acknowledged ones are removed. This pattern ensures at-least-once processing.
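At-least-once delivery means handlers must tolerate duplicates. JetStream can deduplicate broker-side via the Nats-Msg-Id header within its dedupe window, but a consumer-side guard is a useful second line of defense. A minimal sketch, with illustrative names:

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper tracks already-processed event IDs so redeliveries under
// at-least-once semantics don't trigger duplicate side effects.
// A production version would bound or expire this set.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// Process returns true only the first time a given event ID is seen.
func (d *Deduper) Process(eventID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, dup := d.seen[eventID]; dup {
		return false
	}
	d.seen[eventID] = struct{}{}
	return true
}

func main() {
	d := NewDeduper()
	fmt.Println(d.Process("evt-1"), d.Process("evt-1"))
}
```

This is why the event metadata carries a unique event ID: it is the idempotency key that turns at-least-once delivery into effectively-once processing.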

Health checks and graceful shutdowns maintain system stability. Services should stop cleanly without losing in-flight messages.

func (s *Service) healthCheck() http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if s.natsConn.Status() != nats.CONNECTED {
            http.Error(w, "NATS disconnected", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}

Regular health checks verify dependencies. During shutdown, services finish current work before exiting.

Testing event-driven systems involves verifying event production and consumption. I use in-memory NATS servers for isolated tests.

func TestOrderCreation(t *testing.T) {
    nc, js := setupTestJetStream(t)
    defer nc.Close()
    
    publisher := messaging.NewJetStreamPublisher(js)
    service := NewOrderService(publisher)
    
    order := models.Order{ID: "test-123"}
    err := service.CreateOrder(context.Background(), order)
    assert.NoError(t, err)
    
    // Verify the event was published. GetMsg takes the stream name
    // (assumed here to be "EVENTS") and a sequence number, not a subject.
    msg, err := js.GetMsg("EVENTS", 1)
    assert.NoError(t, err)
    assert.Contains(t, string(msg.Data), "test-123")
}

Mocking external services ensures tests remain fast and reliable. How do you balance test coverage with execution speed in your projects?

Deployment requires monitoring and alerting. Prometheus metrics and Grafana dashboards track system health.

func recordMetrics() *prometheus.CounterVec {
    ordersProcessed := prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "orders_processed_total",
            Help: "Total number of orders processed",
        },
        []string{"status"},
    )
    prometheus.MustRegister(ordersProcessed)
    // Handlers increment it: ordersProcessed.WithLabelValues("success").Inc()
    return ordersProcessed
}

Metrics provide real-time insights into system performance. Alerts notify teams of anomalies before they impact users.

Building production-ready systems is an iterative process. Start simple, add observability, and gradually introduce complexity. I’ve found that focusing on clear event contracts and robust error handling pays off in the long run.

I hope this exploration helps you in your own projects. If you found these insights valuable, please like, share, and comment with your experiences or questions. Let’s continue the conversation and learn from each other’s journeys in distributed systems.