Case study
Bedrock load-testing platform — 10× capacity at 90% lower cost
Rebuilt Bedrock's load-testing platform on a hybrid Amazon EC2 + ECS architecture integrated with Gatling Enterprise Cloud. Result: ~10× more capacity at ~90% lower cost — by treating FinOps as an architecture problem, not a finance ticket.
- Amazon EC2
- Amazon ECS
- Gatling Enterprise Cloud
- AWS
- FinOps
- performance engineering
- load testing
TL;DR
I rebuilt Bedrock’s load-testing platform on a hybrid Amazon EC2 + Amazon ECS architecture integrated with Gatling Enterprise Cloud. Result: ~10× more load capacity at ~90% lower cost — by treating FinOps as an architecture problem, not a finance ticket.
Context
At Bedrock Streaming, load testing is not optional. With 20M+ weekly viewers and dozens of product teams shipping continuously, performance validation is part of the delivery culture. Every team needs to simulate real traffic patterns against real services, often at very high concurrency.
Before I touched the platform, load testing lived in two worlds:
- Individual teams running scripts (sometimes Postman collections) on their own
- A centralized setup based on Gatling Cloud, introduced by the DevOps team before I arrived
The intention was good: provide a managed, scalable load-testing solution. The problem was the hosting model.
Gatling Cloud provisions ephemeral machines billed by the minute, with relatively small instance templates. There was no real distinction between “number of users simulated” and “number of machines required.” If you wanted to simulate more users, you had to start more machines.
Two technical limits amplified the problem:
- Performance limits of small instances
- Linux's ephemeral port limit per source IP (≈64k), which becomes a hard ceiling when you try to simulate large numbers of concurrent users from a single machine
To simulate realistic user traffic at Bedrock’s scale, teams had to spawn many machines per test. Costs scaled linearly with ambition.
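To make that ceiling concrete, here is the rough arithmetic. The figures are illustrative assumptions, not our exact numbers:

```python
# Back-of-the-envelope math for the old model (illustrative figures only).
EPHEMERAL_PORTS_PER_IP = 64_000      # ~64k usable source ports per IP on Linux
TARGET_CONCURRENT_USERS = 500_000    # a large test at Bedrock's scale

# Worst case: one long-lived connection per virtual user, so each user
# holds one ephemeral port on the generator's single IP.
machines_needed = -(-TARGET_CONCURRENT_USERS // EPHEMERAL_PORTS_PER_IP)  # ceiling division
print(machines_needed)  # 8 machines just for ports; CPU/memory limits push it higher

# Cost then grows linearly with that count: double the users, double the
# machines, double the bill.
```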
As more teams started doing serious performance testing, the bill started exploding.
Constraints
Several constraints shaped the redesign:
- AWS-native infrastructure was the norm at Bedrock
- 50+ teams needed self-service load testing
- The solution had to remain compatible with Gatling Cloud (tool choice predated me)
- Tests had to be cheap enough that teams would not hesitate to run them often
- Very high concurrency targets (hundreds of thousands to millions of virtual users)
- No appetite for a long migration or tool change — this had to be an architectural optimization, not a cultural one
Trigger
There was no incident.
No management mandate.
This was a proactive initiative I pushed because the trajectory was obvious: more teams → more tests → runaway cost growth.
I framed it as: if we don’t change the architecture now, we will soon have to tell teams to test less. That is the opposite of what you want in a streaming platform.
Decision — Why hybrid EC2 + ECS
Because Gatling Cloud already relied on ECS + EC2 under the hood, the solution had to stay compatible with that model.
The key shift was conceptual:
Stop renting many small machines from Gatling Cloud. Start running a few very large machines inside our own AWS account.
I designed a hybrid model:
- A control plane running on Amazon ECS
- Dedicated, large EC2 instances acting as load generators, registered into Gatling Cloud
Instead of Gatling provisioning ephemeral workers for us, we provided our own workers: much bigger, far fewer, much cheaper.
This solved both problems:
- Port/IP limits: large instances with multiple ENIs and IPs
- Performance limits: far more CPU and memory per worker
And most importantly: cost.
What I built
High-level architecture
CI / Manual trigger
│
▼
ECS Control Plane
│
▼
Large EC2 Load Generators (self-managed)
│
▼
Gatling Cloud Orchestration
│
▼
Targets inside Bedrock VPC
The idea
Gatling Cloud allows external load generators to connect to it. Instead of letting Gatling spin up small instances billed by the minute, I created a pool of very large EC2 instances in our infrastructure that would attach to Gatling as workers.
Fewer machines, but far more powerful.
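As a minimal sketch, this is roughly what building that pool could look like with boto3. The AMI, subnet, instance type and region below are placeholders, and the actual attachment of each generator to Gatling Cloud is handled by Gatling's own tooling, not shown here:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-3")  # region is an assumption

# Launch a small pool of large, self-managed load generators.
# AMI, subnet and instance type are placeholders, not the real values.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # image with the Gatling load-generator agent baked in
    InstanceType="c5n.9xlarge",            # few machines, lots of CPU and network headroom
    MinCount=4,
    MaxCount=4,
    SubnetId="subnet-0123456789abcdef0",   # inside the VPC, close to the systems under test
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "role", "Value": "gatling-load-generator"}],
    }],
)
generator_ids = [i["InstanceId"] for i in response["Instances"]]
print("Load generator pool:", generator_ids)
```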
Why this changes everything
Previously:
- To simulate more users → add more Gatling machines → linear cost growth
Now:
- To simulate more users → use the headroom of already-running large machines → near-zero marginal cost
We moved from a per-test provisioning model to a capacity pool model.
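The difference is easiest to see as a tiny cost model. All capacities and rates below are placeholder assumptions chosen to show the shape of the two models, not our actual figures:

```python
import math

SMALL_MACHINE_CAPACITY = 60_000   # users one small ephemeral worker can realistically hold
SMALL_MACHINE_HOURLY = 1.0        # per-minute vendor billing, expressed per hour
LARGE_POOL_CAPACITY = 2_000_000   # headroom of the always-on pool of large instances

def per_test_provisioning_cost(users: int, hours: float) -> float:
    """Old model: every test rents its own fleet of small machines."""
    machines = math.ceil(users / SMALL_MACHINE_CAPACITY)
    return machines * SMALL_MACHINE_HOURLY * hours

def capacity_pool_marginal_cost(users: int) -> float:
    """New model: the pool is already paid for; a test only consumes headroom."""
    if users > LARGE_POOL_CAPACITY:
        raise ValueError("test exceeds pool headroom; scale the pool first")
    return 0.0

# Doubling the simulated users doubles the old bill, not the new one.
print(per_test_provisioning_cost(500_000, 1.0))   # grows linearly with ambition
print(capacity_pool_marginal_cost(1_000_000))     # ~0: headroom, not new machines
```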
ECS control plane
ECS hosted the orchestration layer that:
- Registered/deregistered EC2 load generators
- Managed lifecycle and connectivity with Gatling Cloud
- Allowed teams to trigger tests the same way as before (no workflow change)
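A rough sketch of the kind of lifecycle logic that control plane could run, reusing the pool-tag convention from the earlier sketch. The registration handshake with Gatling Cloud itself goes through Gatling's own mechanism and is not shown:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-3")   # region is an assumption
POOL_TAG_KEY, POOL_TAG_VALUE = "role", "gatling-load-generator"

def running_generators() -> list[str]:
    """List the EC2 load generators currently in the pool."""
    resp = ec2.describe_instances(Filters=[
        {"Name": f"tag:{POOL_TAG_KEY}", "Values": [POOL_TAG_VALUE]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    return [i["InstanceId"]
            for r in resp["Reservations"] for i in r["Instances"]]

def scale_pool(desired: int) -> None:
    """Keep the pool at the desired size without changing how teams trigger tests."""
    current = running_generators()
    if len(current) > desired:
        ec2.terminate_instances(InstanceIds=current[desired:])
    elif len(current) < desired:
        # launching more generators would wrap run_instances() as in the earlier sketch
        ...
```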
From the teams’ perspective, nothing changed.
From the bill’s perspective, everything changed.
Networking and realism
These EC2 instances lived in Bedrock’s AWS network, close to the systems under test. They could simulate traffic with many IPs and realistic concurrency without hitting the per-machine port ceiling that plagued the old model.
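A minimal example of how an instance can be given extra source IPs (the ENI ID and count are placeholders); the load generator then spreads its outgoing connections across those addresses:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-3")   # region is an assumption

# Give a load generator extra private IPs so each one brings its own
# ephemeral port range.
ec2.assign_private_ip_addresses(
    NetworkInterfaceId="eni-0123456789abcdef0",
    SecondaryPrivateIpAddressCount=7,     # 1 primary + 7 secondary IPs on this interface
)
```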
Capacity jump
Before: we were effectively capped at roughly 500,000 concurrent users, because going further meant spawning an unreasonable number of small machines.
After: we could exceed 2,000,000 concurrent users using a handful of very large instances.
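The rough capacity math, under the same illustrative assumptions as above (the IP count per instance and the pool size are assumptions, not measured values):

```python
ips_per_instance = 8        # primary + secondary private IPs on one large generator
ports_per_ip = 64_000       # Linux ephemeral port ceiling per source IP
instances = 5               # "a handful" of very large machines

max_concurrent_users = instances * ips_per_instance * ports_per_ip
print(f"{max_concurrent_users:,}")   # 2,560,000 -> comfortably above the 2M mark
```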
Not because Gatling changed. Because the infrastructure did.
Implementation time
The whole platform redesign took about one month.
No migration plan was needed. Teams kept using Gatling the same way. They just benefited from a different backend.
Outcome
The headline numbers
- ~10× more load capacity
- ~90% cost reduction
This was not measured with synthetic benchmarks. It was visible directly on the AWS and Gatling invoices.
Instead of paying for dozens of small ephemeral instances per test, we paid for a few large instances running continuously in our account.
Cultural effect
The most important effect was psychological:
Teams stopped worrying about how expensive their tests were.
They could test more, test bigger, test longer.
That is exactly what you want in performance engineering.
FinOps as architecture
This project is the best example of something I strongly believe:
Latency, errors, and euros per user belong on the same dashboard.
The cost problem was not solved by negotiation with a vendor or by asking teams to be careful. It was solved by changing the architecture.
What I’d do differently
Very little.
The solution was intentionally simple and pragmatic because the tool choice (Gatling) and workflow were fixed constraints.
If I had more time, I would probably add richer observability and historical reporting around test runs and resource consumption. But the core architectural decision is something I would reuse as-is.