
Building Event-Driven Microservices with NATS, Go, and Kubernetes: Complete Production Guide

Master building production-ready event-driven microservices with NATS, Go & Kubernetes. Complete guide with JetStream, error handling, monitoring & scaling.


I’ve been building distributed systems for over a decade, and I still remember the first time I faced a production outage caused by tightly coupled microservices. That moment sparked my journey into event-driven architecture, and today I want to share a battle-tested approach using NATS, Go, and Kubernetes. This isn’t just theoretical—it’s the same system that now handles millions of events daily across our e-commerce platform.

Why did I choose this specific stack? NATS provides incredible performance with a simple design, Go offers the right balance of productivity and control, and Kubernetes gives us the operational maturity needed for production workloads. Together, they create a foundation that scales predictably and handles failures gracefully.

Let me show you how we structure our events. Every message in our system follows a consistent format that includes metadata for tracing and correlation. This consistency pays dividends when debugging distributed workflows.

// EventMetadata travels with every event and carries the fields we
// rely on for tracing and correlation across services.
type EventMetadata struct {
    ID            string    `json:"id"`
    Type          string    `json:"type"`
    Source        string    `json:"source"`
    Timestamp     time.Time `json:"timestamp"`
    CorrelationID string    `json:"correlation_id"`
}

// BaseEvent pairs metadata with a type-specific payload
// (shape inferred from its use below).
type BaseEvent struct {
    Metadata EventMetadata `json:"metadata"`
    Data     any           `json:"data"`
}

// NewEventMetadata (defined elsewhere) assigns a fresh event ID and
// timestamp and records the originating service.
func NewOrderCreatedEvent(orderID string, items []OrderItem) BaseEvent {
    metadata := NewEventMetadata("order.created", "order-service")
    data := OrderCreatedData{
        OrderID:   orderID,
        Items:     items,
        CreatedAt: time.Now().UTC(),
    }
    return BaseEvent{Metadata: metadata, Data: data}
}

Have you ever wondered how to ensure messages aren’t lost during network partitions? That’s where NATS JetStream comes in. Unlike traditional message queues, JetStream provides persistence without sacrificing performance. We run it as a three-node cluster in Kubernetes for high availability.

Our NATS configuration focuses on reliability. We set appropriate memory and disk limits, configure cluster routing for redundancy, and use separate accounts for different service types. This isolation prevents one noisy service from affecting others.

jetstream: {
    store_dir: "/data/jetstream"
    max_mem: 1G
    max_file: 10G
}

cluster: {
    name: "ecommerce-cluster"
    routes: [
        "nats://nats-0.nats.default.svc.cluster.local:6222"
        "nats://nats-1.nats.default.svc.cluster.local:6222"
    ]
}
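The consumer code later in this post assumes a stream named ORDERS. Here's a minimal sketch of how that stream might be declared with the Go JetStream client; the subjects, replica count, and retention window are illustrative assumptions, not our exact production values.

// Sketch: declaring the ORDERS stream used by the consumers below.
// Assumes github.com/nats-io/nats.go/jetstream; limits are illustrative.
func ensureOrdersStream(ctx context.Context, js jetstream.JetStream) error {
    _, err := js.CreateOrUpdateStream(ctx, jetstream.StreamConfig{
        Name:     "ORDERS",
        Subjects: []string{"order.>", "payment.>"},
        Storage:  jetstream.FileStorage, // persisted under store_dir above
        Replicas: 3,                     // one replica per node in the cluster
        MaxAge:   24 * time.Hour,        // retention window; tune per workload
    })
    return err
}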

Building the messaging client was a learning experience. I initially underestimated the importance of proper connection handling. Now, our client includes comprehensive reconnection logic and proper resource cleanup.

func NewNATSClient(config NATSConfig) (*NATSClient, error) {
    opts := []nats.Option{
        nats.MaxReconnects(5),
        nats.ReconnectWait(2 * time.Second),
        nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
            slog.Error("NATS disconnected", "error", err)
        }),
        nats.ReconnectHandler(func(nc *nats.Conn) {
            slog.Info("NATS reconnected", "url", nc.ConnectedUrl())
        }),
    }

    // Connect to the configured server URL (the in-cluster service
    // address) rather than nats.DefaultURL.
    nc, err := nats.Connect(config.URL, opts...)
    if err != nil {
        return nil, fmt.Errorf("connect failed: %w", err)
    }

    js, err := jetstream.New(nc)
    if err != nil {
        nc.Close() // avoid leaking the connection on partial failure
        return nil, fmt.Errorf("jetstream init failed: %w", err)
    }

    return &NATSClient{conn: nc, js: js}, nil
}

What separates production-ready services from prototypes? Error handling and observability. We instrument every service with OpenTelemetry for distributed tracing and structured logging. When something goes wrong—and it will—we can trace the entire flow across service boundaries.
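As one concrete sketch of that instrumentation, here's a handler wrapped in an OpenTelemetry span; the tracer name and status handling are illustrative assumptions, and in a real system we would also propagate trace context through message headers.

// Sketch: tracing a message handler with OpenTelemetry.
// Assumes go.opentelemetry.io/otel and go.opentelemetry.io/otel/codes.
func (s *OrderService) handleWithTracing(ctx context.Context, msg jetstream.Msg) error {
    tracer := otel.Tracer("order-service")
    // In real code the returned context would be threaded into downstream calls.
    _, span := tracer.Start(ctx, "process "+msg.Subject())
    defer span.End()

    if err := s.handlePaymentMessage(msg); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "message processing failed")
        return err
    }
    return nil
}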

Our consumer services use durable consumers with explicit acknowledgments. This pattern guarantees at-least-once processing, even when services restart, which means handlers must be idempotent to tolerate the occasional redelivery. We've found that the right acknowledgment mode depends on the delivery guarantees your domain actually needs.

func (s *OrderService) processPaymentEvents(ctx context.Context) error {
    consumer, err := s.js.CreateOrUpdateConsumer(ctx, "ORDERS", jetstream.ConsumerConfig{
        Durable:   "order-payment-processor",
        AckPolicy: jetstream.AckExplicitPolicy,
    })
    if err != nil {
        return fmt.Errorf("create consumer: %w", err)
    }

    // Messages() returns an iterator, not a channel.
    it, err := consumer.Messages()
    if err != nil {
        return fmt.Errorf("open message iterator: %w", err)
    }
    defer it.Stop()

    for {
        msg, err := it.Next()
        if err != nil {
            return err // iterator stopped or connection lost
        }
        if err := s.handlePaymentMessage(msg); err != nil {
            slog.Error("Failed processing message", "error", err)
            continue // no Ack: JetStream will redeliver the message
        }
        msg.Ack()
    }
}

Kubernetes deployment taught us valuable lessons about resource management. We use horizontal pod autoscaling driven by JetStream consumer lag (pending message counts) alongside CPU usage. Each service includes liveness and readiness probes that check both application health and NATS connection status.
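To make that concrete, here's a minimal sketch of a readiness handler that reports healthy only while the NATS connection is up; the endpoint path and wiring are illustrative assumptions.

// Sketch: readiness endpoint that fails when the NATS connection drops.
// Assumes net/http and github.com/nats-io/nats.go; /readyz is illustrative.
func readinessHandler(nc *nats.Conn) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if nc == nil || nc.Status() != nats.CONNECTED {
            http.Error(w, "nats connection unavailable", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}

The Kubernetes readinessProbe then simply issues an httpGet against that path, and the pod is pulled out of rotation whenever the connection drops.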

Monitoring event-driven systems requires different thinking. Instead of watching request rates, we monitor event throughput, processing latency, and dead letter queues. We built custom Grafana dashboards that show the entire event flow—from producers through to consumers.
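As a hedged example of what we export, here's a sketch using the Prometheus Go client; the metric names and labels are illustrative, not our exact dashboard schema.

// Sketch: event-processing metrics with the Prometheus Go client.
// Assumes github.com/prometheus/client_golang/prometheus; names illustrative.
var (
    eventsProcessed = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "events_processed_total",
            Help: "Events processed, labeled by subject and outcome.",
        },
        []string{"subject", "outcome"},
    )
    processingSeconds = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "event_processing_seconds",
            Help: "Time spent handling a single event.",
        },
        []string{"subject"},
    )
)

func init() {
    prometheus.MustRegister(eventsProcessed, processingSeconds)
}

func observe(subject string, start time.Time, err error) {
    outcome := "ok"
    if err != nil {
        outcome = "error"
    }
    eventsProcessed.WithLabelValues(subject, outcome).Inc()
    processingSeconds.WithLabelValues(subject).Observe(time.Since(start).Seconds())
}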

How do you test such asynchronous systems? We combine unit tests for business logic with integration tests that spin up real NATS servers. Our test containers include scenarios for network partitions and service failures to ensure our retry mechanisms work as expected.
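For readers who want a starting point, here's a sketch of such a test using an embedded NATS server with JetStream enabled; it uses the nats-server package directly, standing in for our internal test harness.

// Sketch: integration test against an embedded NATS server.
// Assumes github.com/nats-io/nats-server/v2/server; setup is illustrative.
func TestOrderFlow(t *testing.T) {
    srv, err := server.NewServer(&server.Options{
        Port:      -1, // pick a random free port
        JetStream: true,
        StoreDir:  t.TempDir(),
    })
    if err != nil {
        t.Fatal(err)
    }
    go srv.Start()
    defer srv.Shutdown()
    if !srv.ReadyForConnections(5 * time.Second) {
        t.Fatal("embedded NATS server did not start in time")
    }

    nc, err := nats.Connect(srv.ClientURL())
    if err != nil {
        t.Fatal(err)
    }
    defer nc.Close()

    // From here: create the ORDERS stream, publish test events, and
    // assert that the consumer under test acknowledges them.
}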

After several production deployments, I can share some hard-earned wisdom. Always start with simple retry logic before implementing complex patterns. Use correlation IDs religiously—they’re worth their weight in gold during incident investigation. And never underestimate the power of good documentation for your event schemas.
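By "simple retry logic" I mean nothing fancier than a bounded loop with backoff, like this sketch (the attempt count and delay are illustrative):

// Sketch: bounded retry with linear backoff; limits are illustrative.
func withRetry(ctx context.Context, attempts int, fn func() error) error {
    var err error
    for i := 0; i < attempts; i++ {
        if err = fn(); err == nil {
            return nil
        }
        slog.Warn("operation failed, retrying", "attempt", i+1, "error", err)
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(time.Duration(i+1) * time.Second):
        }
    }
    return fmt.Errorf("after %d attempts: %w", attempts, err)
}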

The beauty of this architecture shines during peak loads. When holiday traffic hit our platform last year, the system scaled seamlessly because events buffered naturally in NATS, preventing cascading failures. Services could process at their own pace without dropping requests.

I’m passionate about this approach because it creates systems that are both robust and understandable. The clear separation between services, combined with well-defined events, makes the entire system easier to maintain and evolve over time.

If this guide helps you build better systems, I’d love to hear about your experiences. Please share this with colleagues who might benefit, and leave a comment about your own event-driven journey. Your insights could help others in our community navigate similar challenges.
