
Build Production-Ready Event-Driven Microservices with Go, NATS JetStream, and OpenTelemetry



I’ve been thinking a lot about how modern e-commerce systems handle thousands of transactions without collapsing under pressure. What makes them resilient? How do they track orders across distributed services? This led me to explore event-driven architectures using Go, NATS JetStream, and OpenTelemetry. Let’s walk through building a production-ready order processing system together.

First, we structure our project with clear separation of concerns. The cmd directory houses our microservices, while internal contains shared components. We define our event contracts using Protocol Buffers - this schema-first approach prevents breaking changes. Notice how each event includes a correlation ID? That’s our golden thread for tracing transactions across services.

protoc --go_out=. --go_opt=paths=source_relative proto/events.proto
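
The event contract itself might look like the following sketch. The message and field names here are illustrative, not the article's actual schema; the point is that the correlation ID travels inside every event:

```protobuf
syntax = "proto3";

package events;

option go_package = "internal/events";

import "google/protobuf/timestamp.proto";

// OrderCreated is emitted by the order service when a new order is accepted.
message OrderCreated {
  string order_id       = 1;
  string correlation_id = 2; // the "golden thread" for cross-service tracing
  string customer_id    = 3;
  repeated LineItem items = 4;
  google.protobuf.Timestamp created_at = 5;
}

message LineItem {
  string sku         = 1;
  int32  quantity    = 2;
  int64  price_cents = 3;
}
```

Because consumers are generated from the same .proto file, adding a field is backward compatible, while renaming or renumbering one is caught at the schema level rather than at 3 a.m. in production.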

Our messaging backbone uses NATS JetStream. Why choose it? Persistent storage, at-least-once delivery (with exactly-once publishing semantics available through message deduplication), and native horizontal scaling. The connection setup includes crucial production features: automatic reconnects and error handling. Core NATS can route millions of messages per second on modest hardware, and JetStream layers durability on top of that.

func NewNATSClient(url string) (*NATSClient, error) {
    opts := []nats.Option{
        nats.ReconnectWait(2 * time.Second),
        nats.MaxReconnects(-1), // keep retrying indefinitely
        nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
            log.Printf("NATS disconnected: %v", err)
        }),
    }
    conn, err := nats.Connect(url, opts...)
    if err != nil {
        return nil, err
    }
    js, err := conn.JetStream()
    if err != nil {
        return nil, err
    }
    return &NATSClient{conn: conn, js: js}, nil
}

When publishing events, we inject OpenTelemetry context directly into message headers. This allows trace propagation across service boundaries. How else could we correlate events in a complex payment failure scenario?

func (nc *NATSClient) PublishEvent(ctx context.Context, subject string, event proto.Message) error {
    data, err := proto.Marshal(event)
    if err != nil {
        return err
    }
    headers := make(nats.Header)
    // Propagate the current trace context into the message headers
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(headers))
    msg := &nats.Msg{Subject: subject, Header: headers, Data: data}
    _, err = nc.js.PublishMsg(msg)
    return err
}

The order service initiates the workflow by publishing OrderCreated events. But what happens when inventory reservation fails? That’s where our saga orchestrator shines. It coordinates compensating actions across services - reversing reservations, refunding payments, and notifying customers. We implement this using state machines with persistent storage.

For observability, we instrument everything. Metrics track message throughput and error rates, while traces follow transactions across four services. Our health checks integrate with Kubernetes readiness probes:

// internal/health/server.go
func NewServer() *gin.Engine {
    router := gin.Default()
    router.GET("/live", func(c *gin.Context) { c.JSON(200, gin.H{"status": "alive"}) })
    router.GET("/ready", checkDatabaseConnection)
    return router
}

Testing event-driven systems requires simulating failures. We use JetStream's negative acknowledgments and redelivery to test edge cases:

// In payment_service_test.go
sub, _ := js.PullSubscribe("PAYMENT", "payment-test", nats.BindStream("ORDERS"))
msgs, err := sub.Fetch(1, nats.MaxWait(2*time.Second))
if err != nil || len(msgs) == 0 {
    t.Fatal("expected a pending payment message")
}
// Simulate a processing failure: negatively acknowledge with a delay
_ = msgs[0].NakWithDelay(time.Minute)
// Verify redelivery occurs after the delay

Containerization ensures consistency from development to production. Our Docker Compose file spins up NATS, Jaeger, and Prometheus alongside services. Resource limits prevent cascading failures - payment service gets CPU priority during peak loads.
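
A sketch of what the relevant part of such a Compose file could look like. Service names, images, and limits are illustrative, not the article's actual file:

```yaml
services:
  nats:
    image: nats:latest
    command: ["-js"]            # enable JetStream
  payment:
    build: ./cmd/payment
    deploy:
      resources:
        limits:
          cpus: "1.0"           # CPU headroom for payment under peak load
          memory: 256M
  jaeger:
    image: jaegertracing/all-in-one:latest
  prometheus:
    image: prom/prometheus:latest
```

Capping memory per service means one leaking consumer gets restarted instead of starving its neighbors.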

Deploying this? Start small. Run NATS in clustered mode first. Gradually add services while monitoring OpenTelemetry metrics. Remember to set JetStream retention policies matching your business needs - 7 days for orders, 30 days for audits.

The real magic happens when components interact. An order flows through reservation, payment, and notification services - each step emitting events. If payment fails, the saga rolls back inventory reservations within seconds. Customers get real-time updates while our system maintains consistency.

What separates this from a basic tutorial? Production-grade patterns:

  • Idempotent message processing
  • Exponential backoff retries
  • Trace context propagation
  • Resource-based health checks
  • Schema versioning

I encourage you to try implementing the inventory service yourself. How would you handle concurrent reservations for limited stock? Share your approach in the comments!

If you found this useful, please like and share. I’d love to hear about your event-driven architecture challenges. What patterns have worked best in your projects?



