Building Production-Ready Event-Driven Microservices with Go NATS and OpenTelemetry

Learn to build scalable event-driven microservices with Go, NATS JetStream & OpenTelemetry. Master distributed tracing, resilience patterns & production deployment.

Building Production-Ready Event-Driven Microservices with Go NATS and OpenTelemetry

I’ve been thinking about building event-driven microservices recently. Why? Because modern applications demand systems that can scale dynamically while maintaining reliability. When services need to communicate across distributed environments, traditional request-response models often fall short. This led me to explore Go, NATS, and OpenTelemetry - a powerful combination for production-grade systems. Let’s walk through what I’ve learned.

Our order processing system will use NATS JetStream as its messaging backbone. Why JetStream? It provides persistent storage and exactly-once delivery semantics. Here’s how we configure a stream:

// Create inventory stream
_, err := nc.js.AddStream(&nats.StreamConfig{
  Name:     "INVENTORY",
  Subjects: []string{"inventory.*"},
  Storage:  nats.FileStorage,
  MaxAge:   24 * time.Hour,
})
if err != nil {
  return fmt.Errorf("stream creation failed: %w", err)
}

Protocol Buffers give us efficient serialization. Notice how we version our events - ever wondered what happens when schemas change?

syntax = "proto3";

message OrderEvent {
  string event_id = 1;
  string order_id = 2;
  string customer_id = 3;
  repeated OrderItem items = 4;
  string status = 5;
  string version = 6;  // Critical for evolution
}

The publisher service uses connection pooling. What happens during network partitions? Our implementation handles reconnections:

func (p *Publisher) PublishOrder(ctx context.Context, event *pb.OrderEvent) error {
  data, err := proto.Marshal(event)
  if err != nil {
    return err
  }

  _, err = p.js.PublishAsync("orders.created", data)
  if err != nil {
    // How would you handle backpressure here?
    p.retryQueue <- event 
    return err
  }
  return nil
}

Consumer services use concurrent workers. Why multiple subscriptions? They prevent head-of-line blocking:

func StartInventoryConsumer(js nats.JetStreamContext) {
  sub, _ := js.PullSubscribe("inventory.check", "inventory-group")
  
  for i := 0; i < 5; i++ { // 5 concurrent workers
    go func(workerID int) {
      for {
        msgs, _ := sub.Fetch(10, nats.MaxWait(5*time.Second))
        for _, msg := range msgs {
          processInventoryMessage(msg)
        }
      }
    }(i)
  }
}

OpenTelemetry traces cross-service calls. See how context propagates through events:

func processOrder(msg *nats.Msg) {
  carrier := propagation.MapCarrier{}
  carrier.Decode(msg.Header)

  ctx := otel.GetTextMapPropagator().Extract(context.Background(), carrier)
  ctx, span := tracer.Start(ctx, "order_processing")
  defer span.End()

  // Processing logic
  span.SetAttributes(attribute.String("order.id", orderID))
}

For resilience, we implement circuit breakers. What thresholds make sense for your workload?

breaker := gobreaker.NewCircuitBreaker(gobreaker.Settings{
  Name:        "payment-service",
  Timeout:     30 * time.Second,
  ReadyToTrip: func(counts gobreaker.Counts) bool {
    return counts.ConsecutiveFailures > 5
  },
})

result, err := breaker.Execute(func() (interface{}, error) {
  return paymentClient.Process(order)
})

Testing event-driven systems requires simulating failures. Try intentionally breaking your consumer - does it recover gracefully?

func TestOrderProcessing_Retry(t *testing.T) {
  mockSvc := &MockInventoryService{shouldFail: true}
  processor := NewProcessor(mockSvc)

  // First attempt should fail
  err := processor.HandleOrder(testOrder)
  require.Error(t, err)

  // Reset failure mode
  mockSvc.shouldFail = false

  // Retry should succeed
  err = processor.RetryFailedOrders()
  require.NoError(t, err)
}

Deployment needs health checks. Notice the /ready endpoint:

FROM golang:1.19-alpine
COPY . /app
RUN go build -o /order-service

HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost:8080/ready || exit 1

CMD ["/order-service"]

Performance tuning revealed surprising bottlenecks. Enabling JetStream compression cut message size by 60%:

_, err := js.AddStream(&nats.StreamConfig{
  Name: "NOTIFICATIONS",
  Compression: nats.S2Compression, // Efficient compression
})

This journey taught me that production readiness isn’t about single components - it’s how they interact. What challenges have you faced with distributed systems? I’d love to hear your experiences - share your thoughts in the comments below. If this resonated with you, consider sharing it with others facing similar architectural decisions.

// Our Network

More from our team

Explore our publications across finance, culture, tech, and beyond.

// More Articles

Similar Posts