The Hidden Cost of Overfetching — How a NAT Gateway Spike Led Us Down a Rabbit Hole
The alert came in on a Tuesday. NAT gateway inbound traffic had spiked to over 100MB/s. Nothing had changed in the deployment topology — no new features, no traffic growth, no config changes we were aware of. Just a migration to new infrastructure.
That “nothing changed” is always a lie, of course. Something always changes. The question is finding it.
This is the story of two overfetching bugs that had been hiding in production for months, quietly burning memory and network bandwidth, waiting for a migration to make them visible.
The System
Before diving in, here’s the shape of the system we were debugging:
Three entry points — a gRPC server, a Kafka consumer, and a Pub/Sub subscriber — all feeding a shared workflow engine that reads and writes to MongoDB. All inbound traffic passes through the NAT gateway.
The spike was on the inbound side. That meant something was pulling a lot of data into the system, not pushing it out.
The First Hypothesis: Network Configuration
The first instinct on any NAT spike is network config. New infrastructure, new VPC settings, something misconfigured at the routing layer. We checked subnet peering, VPC flow logs, firewall rules. Everything looked correct.
The traffic was real and it was coming from legitimate sources — Kafka brokers and MongoDB. This wasn’t a DDoS or a routing loop. Something in the application was genuinely pulling 100MB/s through the gateway.
Culprit 1: The Kafka MaxFetchBytes Multiplier
We started with Kafka because it was the highest-throughput entry point. Pulling up the consumer configuration:
config := kafka.ReaderConfig{
	Brokers:  brokers,
	Topic:    topic,
	GroupID:  groupID,
	MaxBytes: 300 * 1024 * 1024, // 300MB
}
300MB MaxBytes. That’s the maximum amount of data the consumer will fetch in a single request. Already high, but not immediately alarming depending on message volume.
Then we looked at the driver wrapper. The Kafka client library we were using multiplied MaxBytes by 2 internally when allocating the protocol buffer for the fetch response. A safety margin baked into the driver.
300MB × 2 = 600MB per fetch request per topic.
With multiple topics and multiple consumer instances, the math got uncomfortable fast.
The Fix
Two changes. First, reduce MaxBytes to something reasonable for actual message sizes. Second, switch from the dynamic protocol buffer allocation (which applied the ×2 multiplier) to a fixed buffer size:
config := kafka.ReaderConfig{
	Brokers:  brokers,
	Topic:    topic,
	GroupID:  groupID,
	MaxBytes: 10 * 1024 * 1024, // 10MB — right-sized for actual messages
	// Fixed buffer instead of dynamic allocation
	// avoids the internal ×2 multiplier
}
This alone brought the inbound traffic down significantly. But not to baseline. Something else was still pulling data.
Culprit 2: The Unprojected MongoDB Query
With Kafka addressed, we turned to the heap. Using Go’s pprof to capture a heap profile:
curl http://localhost:6060/debug/pprof/heap > heap.prof
go tool pprof heap.prof
(pprof) top10
Showing nodes accounting for 2.1GB, 94% of 2.3GB total
      flat  flat%   sum%        cum   cum%
     1.4GB    60%    60%      1.4GB    60%  encoding/json.(*decodeState).object
     0.4GB    17%    77%      0.4GB    17%  go.mongodb.org/mongo-driver/bson.Unmarshal
...
1.4GB in JSON decode, 400MB in BSON unmarshal. Something was deserializing enormous documents into memory and holding them there.
Tracing back through the call stack, we found the workflow engine using a generic persistence method to fetch documents from MongoDB:
// Generic method — fetches the entire document
func (r *Repository) FindByID(ctx context.Context, id string) (*WorkflowDocument, error) {
	result := r.collection.FindOne(ctx, bson.M{"_id": id})
	var doc WorkflowDocument
	if err := result.Decode(&doc); err != nil {
		return nil, err
	}
	return &doc, nil
}
WorkflowDocument was a large struct — dozens of fields, nested arrays, embedded documents accumulated over years of feature development. The generic FindByID pulled the entire thing every time.
The callers? Most of them only needed two or three fields.
The Fix
Add projection support to the repository layer so callers can declare exactly what they need:
func (r *Repository) FindByID(
	ctx context.Context,
	id string,
	projection bson.M,
) (*WorkflowDocument, error) {
	opts := options.FindOne()
	if len(projection) > 0 {
		opts.SetProjection(projection)
	}
	result := r.collection.FindOne(ctx, bson.M{"_id": id}, opts)
	var doc WorkflowDocument
	if err := result.Decode(&doc); err != nil {
		return nil, err
	}
	return &doc, nil
}
Callers now declare their intent explicitly:
// Before — pulled 2MB, used 200 bytes
doc, err := repo.FindByID(ctx, id, nil)
status := doc.Status

// After — pulls exactly what's needed
doc, err := repo.FindByID(ctx, id, bson.M{
	"status":    1,
	"updatedAt": 1,
})
status := doc.Status
The heap profile after the fix told the story clearly — JSON decode dropped from 1.4GB to under 100MB. The OOMKilled replicas stopped. Inbound traffic returned to baseline.
Why the Migration Made It Visible
This is the part worth understanding — these bugs existed before the migration. So why did moving to new infrastructure expose them?
The old infrastructure had the MongoDB instance co-located in the same VPC, connected over a private network with no NAT gateway in the path. The new infrastructure routed MongoDB traffic through the NAT gateway. Suddenly, every byte pulled from MongoDB became visible as metered inbound traffic.
The bugs didn’t get worse. The measurement point changed.
This is a pattern worth internalising: infrastructure migrations don’t just move your workload, they change what’s observable. A migration is one of the best debugging tools you have — not because it causes problems, but because it illuminates ones that were already there.
What pprof Told Us
The heap profile was the key to finding the MongoDB issue. A few commands that were most useful:
# Capture a heap profile from a running service
curl http://localhost:6060/debug/pprof/heap > heap.prof
# Top allocations, sorted by cumulative size
go tool pprof heap.prof
(pprof) top10 -cum
# Visualise the full call graph
go tool pprof -http=:8080 heap.prof
# Inuse vs alloc — inuse shows what's currently held in memory
go tool pprof -inuse_space heap.prof
The -inuse_space flag was particularly useful here — it showed what was being held in memory, not just what had been allocated and freed. That’s what surfaced the 1.4GB of JSON decode state being kept alive while the workflow processed documents.
The Broader Pattern: Overfetching is a Network Problem
Both fixes were about the same root cause — fetching more data than needed. One at the Kafka layer, one at the database layer. Both manifested as a network problem because data has to travel to get to your application.
This is easy to miss because overfetching feels like a performance problem local to the service. In reality it’s distributed — every byte you pull unnecessarily is a byte that crosses a network, consumes memory, and occupies a CPU deserializing it. In a cloud environment with metered egress and NAT gateways, it becomes a cost problem too.
A few questions worth asking regularly across any system with this shape:
- Kafka: What is MaxBytes set to, and does it reflect actual message sizes or a “just to be safe” value someone set once and never revisited?
- MongoDB: Are queries using projection? Is the generic FindByID being called by callers that only need a subset of fields?
- gRPC: Are field masks being used to limit response payload size on large response types?
- General: What does your heap profile look like? Where is memory being held, and is the data actually being used?
The Results
After both fixes — Kafka MaxBytes right-sized with fixed buffer allocation, MongoDB queries with explicit projection:
- NAT gateway inbound traffic dropped from 100MB/s back to baseline
- Heap usage reduced by ~80% on the workflow engine replicas
- OOMKilled events stopped entirely
- Kafka consumer lag improved — smaller fetch sizes meant faster processing cycles
The system didn’t get faster because we optimised a hot path. It got faster because we stopped carrying weight it never needed to carry.
If you’ve hit something similar — a migration making a pre-existing problem visible — I’d love to hear about it.