I’ve spent years building distributed systems, and one challenge always stands out: reliably handling massive event streams. Recently, while designing a global e-commerce platform, I needed to process 50,000+ orders per second without losing a single transaction. That’s when I turned to Apache Kafka and Go. The combination of Kafka’s durable event streaming and Go’s concurrency model creates an unstoppable force for high-throughput systems. Let me show you how to build production-ready solutions that won’t buckle under pressure.
Kafka organizes data into topics divided into partitions. Each partition maintains strict message order. Producers write to partitions while consumers read in groups. But how do you prevent message loss when servers crash? Partition replication is Kafka’s answer - multiple brokers hold copies of your data. When designing your event schema, include these essentials: unique ID, event type, timestamp, and version.
type BaseEvent struct {
    ID        string    `json:"id"`
    Type      string    `json:"type"`
    Timestamp time.Time `json:"timestamp"`
    Version   string    `json:"version"`
}
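Concrete events can then embed this envelope. As an illustration (OrderCreated and its fields are hypothetical, not something Kafka prescribes), the version field is what lets consumers handle old and new payloads side by side:
// Hypothetical domain event embedding the shared envelope.
type OrderCreated struct {
    BaseEvent
    OrderID  string  `json:"order_id"`
    Amount   float64 `json:"amount"`
    Currency string  `json:"currency"`
}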
Setting up your development environment is straightforward. Use Docker Compose to run Kafka locally. This minimal configuration gives you a single Kafka broker backed by Zookeeper; add a web UI container such as Kafka UI on top if you want one:
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
Initialize your Go project with critical libraries:
go get github.com/IBM/sarama@v1.41.2
go get github.com/prometheus/client_golang@v1.17.0
Now let’s create a high-performance producer. The secret? Asynchronous sending with batching and compression. Sarama’s AsyncProducer handles thousands of messages per second without blocking your application. But what happens when messages fail to deliver? We handle errors in a separate goroutine:
func NewProducer(brokers []string) (sarama.AsyncProducer, error) {
    config := sarama.NewConfig()
    config.Producer.Return.Successes = true
    config.Producer.RequiredAcks = sarama.WaitForAll // all in-sync replicas must ack, so an acknowledged message survives a broker crash
    config.Producer.Compression = sarama.CompressionSnappy
    config.Producer.Flush.Messages = 500 // batch up to 500 messages per flush

    producer, err := sarama.NewAsyncProducer(brokers, config)
    if err != nil {
        return nil, err
    }

    // Drain the result channels so the async producer never blocks.
    go func() {
        for {
            select {
            case success := <-producer.Successes():
                log.Printf("Delivered msg to %s/%d", success.Topic, success.Partition)
            case err := <-producer.Errors():
                log.Printf("Delivery failed: %v", err)
            }
        }
    }()

    return producer, nil
}
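To put the producer to work, here is a minimal sketch of publishing one of the events defined earlier. Marshalling with encoding/json and keying by order ID (so every event for the same order lands on the same partition and stays ordered) are choices of this sketch, not requirements:
// Imports assumed: "encoding/json", "github.com/IBM/sarama".
func PublishOrderCreated(producer sarama.AsyncProducer, evt OrderCreated) error {
    payload, err := json.Marshal(evt)
    if err != nil {
        return err
    }
    producer.Input() <- &sarama.ProducerMessage{
        Topic: "orders",
        Key:   sarama.StringEncoder(evt.OrderID), // same key -> same partition -> ordered
        Value: sarama.ByteEncoder(payload),
    }
    return nil // delivery results arrive on Successes()/Errors()
}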
For consumers, we use consumer groups. Each service instance joins the group and gets assigned partitions. If one instance fails, Kafka automatically rebalances the partitions. How do we ensure we don’t lose messages during rebalances? Careful offset management is key:
func StartConsumerGroup(brokers, topics []string, groupID string) {
    config := sarama.NewConfig()
    config.Version = sarama.V2_1_0_0 // consumer groups require Kafka >= 0.10.2
    config.Consumer.Offsets.Initial = sarama.OffsetOldest
    config.Consumer.Group.Rebalance.GroupStrategies = []sarama.BalanceStrategy{
        sarama.BalanceStrategyRange,
    }

    group, err := sarama.NewConsumerGroup(brokers, groupID, config)
    if err != nil {
        log.Fatal(err)
    }

    handler := &ConsumerHandler{}
    for {
        // Consume returns on every rebalance; loop so this instance rejoins the group.
        if err := group.Consume(context.Background(), topics, handler); err != nil {
            log.Printf("Consume error: %v", err)
        }
    }
}
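The handler is where that offset management actually happens. Here is a minimal sketch of the ConsumerHandler referenced above (processMessage stands in for your own business logic): each message is marked only after it has been processed, so the worst a rebalance can do is re-deliver in-flight messages rather than drop them. We extend this same method with metrics in the monitoring section below.
type ConsumerHandler struct{}

func (h *ConsumerHandler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (h *ConsumerHandler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

func (h *ConsumerHandler) ConsumeClaim(session sarama.ConsumerGroupSession,
    claim sarama.ConsumerGroupClaim) error {
    for msg := range claim.Messages() {
        if err := processMessage(msg); err != nil { // processMessage: your own logic
            log.Printf("processing failed at offset %d: %v", msg.Offset, err)
            continue // retry or route to the DLQ instead of marking
        }
        session.MarkMessage(msg, "") // mark only after successful processing
    }
    return nil
}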
Exactly-once processing is achievable with Kafka transactions. With Sarama this means a transactional producer: set config.Producer.Transaction.ID, config.Producer.Idempotent = true, config.Producer.RequiredAcks = sarama.WaitForAll, and config.Net.MaxOpenRequests = 1. Then wrap your produce operations in a transaction and commit consumer offsets atomically:
producer.BeginTxn()
producer.Input() <- &sarama.ProducerMessage{Topic: "orders", Value: orderMsg}
// Commit the consumed offsets in the same transaction as the produced message.
producer.AddOffsetsToTxn(offsets, "payment-group")
if err := producer.CommitTxn(); err != nil {
    producer.AbortTxn()
}
For dead letter queues, create a dedicated topic. When a message repeatedly fails processing, route it to the DLQ:
if processAttempts > 3 {
    dlqMsg := &sarama.ProducerMessage{
        Topic: "dead_letters",
        Value: sarama.ByteEncoder(msg.Value),
    }
    producer.Input() <- dlqMsg
}
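It also pays to record where the failed message came from and why, so the DLQ can actually be triaged. A sketch using record headers (the header keys and the processErr variable are conventions of this example, not part of the code above):
// Set before producer.Input() <- dlqMsg. Imports assumed: "strconv", "github.com/IBM/sarama".
dlqMsg.Headers = []sarama.RecordHeader{
    {Key: []byte("original_topic"), Value: []byte(msg.Topic)},
    {Key: []byte("error"), Value: []byte(processErr.Error())},
    {Key: []byte("attempts"), Value: []byte(strconv.Itoa(processAttempts))},
}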
Monitoring is non-negotiable in production. Track these critical metrics:
- Message production rate
- Consumer lag
- Error rates
- Processing latency
Here’s how we instrument our consumer with Prometheus:
var (
    messagesProcessed = promauto.NewCounter(prometheus.CounterOpts{
        Name: "kafka_messages_processed_total",
        Help: "Total processed messages",
    })
)

func (h *ConsumerHandler) ConsumeClaim(session sarama.ConsumerGroupSession,
    claim sarama.ConsumerGroupClaim) error {
    for msg := range claim.Messages() {
        // Processing logic
        messagesProcessed.Inc()
        session.MarkMessage(msg, "")
    }
    return nil
}
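Prometheus needs an HTTP endpoint to scrape, and it is convenient to serve the health check from the same server. A sketch, assuming port 8080 to match the Kubernetes probe below and using promhttp from client_golang:
// Imports assumed: "net/http", "github.com/prometheus/client_golang/prometheus/promhttp".
func serveMetricsAndHealth() {
    mux := http.NewServeMux()
    mux.Handle("/metrics", promhttp.Handler())
    mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })
    go func() {
        if err := http.ListenAndServe(":8080", mux); err != nil {
            log.Fatalf("metrics server: %v", err)
        }
    }()
}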
When deploying to Kubernetes, configure liveness probes and resource limits. This deployment snippet caps resource usage and restarts the pod if the health check fails:
containers:
  - name: order-service
    image: my-registry/order-service:v1.2
    resources:
      limits:
        memory: "512Mi"
        cpu: "500m"
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
Performance tuning separates good systems from great ones. For producers, increase batch size and enable compression. For consumers, adjust fetch sizes and process messages concurrently:
config.Consumer.Fetch.Min = 1_000_000      // 1MB minimum per fetch: the broker returns bigger batches, at some latency cost
config.Consumer.Fetch.Default = 10_000_000 // 10MB fetched per request by default
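For the concurrent half of that advice, here is a sketch of bounded fan-out inside a partition claim (handleMessage and the worker count are assumptions of this example). Note the trade-off: offsets get marked out of order, so a crash can skip a failed message; keep handlers idempotent, or mark only the lowest contiguous offset if you need strict at-least-once semantics.
// Imports assumed: "log", "sync", "github.com/IBM/sarama".
func consumeParallel(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim,
    workers int, handleMessage func(*sarama.ConsumerMessage) error) {
    sem := make(chan struct{}, workers) // caps in-flight messages per partition
    var wg sync.WaitGroup
    for msg := range claim.Messages() {
        sem <- struct{}{}
        wg.Add(1)
        go func(m *sarama.ConsumerMessage) {
            defer wg.Done()
            defer func() { <-sem }()
            if err := handleMessage(m); err != nil {
                log.Printf("processing failed at offset %d: %v", m.Offset, err)
                return // retry or route to the DLQ instead of marking
            }
            session.MarkMessage(m, "")
        }(msg)
    }
    wg.Wait()
}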
I recently optimized a system from 5,000 to 85,000 messages/second by:
- Switching to Snappy compression
- Increasing producer batch size to 1MB
- Using parallel processing per partition
- Tuning consumer fetch parameters
What bottlenecks might you encounter? Common culprits include serialization overhead and network saturation. Profile your services with pprof during load tests.
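Enabling pprof over HTTP is a one-liner with the standard library; port 6060 is just a convention:
// The blank import registers the /debug/pprof/* handlers on http.DefaultServeMux.
import _ "net/http/pprof"

func startPprof() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}
Then go tool pprof http://localhost:6060/debug/pprof/profile captures a CPU profile while the load test runs.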
Building event-driven systems with Kafka and Go has transformed how I design resilient architectures. The patterns we’ve covered will handle your most demanding workloads. If this guide solved real problems for you, please share it with your team. Have questions about specific implementation details? Let’s discuss in the comments!