Case study
DevOps at scale on critical infrastructure — GitLab CI forge for 450+ apps & Chaos Engineering GameDays at Enedis (via Klanik)
- GitLab CI
- Kubernetes
- HashiCorp Vault
- Sonatype Nexus
- SonarQube
- Terraform
- EKS
- Grafana
- Chaos Engineering
- Discord
TL;DR
At Enedis, France’s primary electricity distribution operator, I helped build from scratch a Kubernetes-based GitLab CI forge serving 450+ applications, then co-designed large-scale Chaos Engineering GameDays simulating DDoS, database corruption and secret leaks. The common goal: make teams anticipate failure instead of reacting to it.
Context
Enedis operates ~95% of France’s electricity distribution grid. This is critical national infrastructure: reliability, traceability, and security are not “best practices” — they are operational requirements.
When I joined (on assignment via Klanik, January 2022), there was no centralized CI/CD forge. Teams had heterogeneous pipelines, scattered tooling, and uneven security practices. Some projects had decent automation; others were closer to handcrafted scripts. There was no single entry point, no standard templates, no unified way to handle secrets, artifacts, scans, or runners at scale.
At the same time, hundreds of applications (≈450 GitLab projects) and a very large developer population needed CI/CD every day.
Our DevOps team (5 people) was tasked with an ambitious objective:
Build a single, industrialized CI/CD forge for the entire company, on Kubernetes, that could scale up and down automatically, and make “good DevOps” the default for everyone.
Later in the mission, I also co-built a Chaos Engineering platform used for large GameDays involving 100+ participants, designed not to “break things for fun” but to educate teams on observability, monitoring, and anticipating production failures.
These two storylines share the same philosophy: platforms and exercises as tools to change engineering behavior at scale.
Constraints
- Critical infrastructure context → we could not afford instability in the forge itself
- High security requirements → strict secrets handling, auditability, traceability
- 450+ applications, many teams, many languages and stacks
- Need for auto-scaling runners → cost control + performance
- Need to integrate existing tools rather than replace everything
- For GameDays: 100+ participants, multi-team coordination, realistic scenarios on real clusters
Decision — Industrializing the GitLab CI forge
We decided to make GitLab the single CI/CD entry point for the company, backed by a Kubernetes platform running:
- Auto-scaled GitLab runners (HPA)
- Centralized secrets with HashiCorp Vault
- Artifact repository with Sonatype Nexus
- Code quality & security scans via SonarQube
- Infrastructure as Code deployments with Terraform
The key was not just hosting runners. The key was templates.
We built 40+ modular CI/CD templates covering:
- Multiple languages and frameworks
- Build, test, Docker build, push, release
- Security scans by default
- Terraform plan/apply pipelines
- Standardized deployment patterns
Any new project at Enedis could onboard and instantly get a full DevOps lifecycle by including a few lines in .gitlab-ci.yml.
This turned CI/CD from a per-team responsibility into a shared platform capability.
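To make that concrete, here is a minimal sketch of what a consuming project's .gitlab-ci.yml could look like. The forge project path, template file names, application name, and Vault path are hypothetical placeholders, not the actual Enedis templates:

```yaml
# Illustrative consumer pipeline; project path, template names and Vault path are assumptions.
include:
  - project: 'devops/ci-templates'       # hypothetical central forge project
    ref: 'v2.3.0'                        # templates are versioned, so upgrades are explicit
    file:
      - '/blocks/build-java.yml'
      - '/blocks/docker.yml'
      - '/blocks/security-scans.yml'
      - '/blocks/terraform.yml'

variables:
  APP_NAME: "billing-api"                # hypothetical application name

deploy:
  stage: deploy
  # Secrets pulled from HashiCorp Vault at job runtime via GitLab's secrets:vault
  # integration, instead of living in CI/CD variables.
  secrets:
    DB_PASSWORD:
      vault: billing-api/db/password@kv  # hypothetical path: <path>/<field>@<engine mount>
  script:
    - ./deploy.sh "$APP_NAME"
```

A handful of include lines gives a new repository build, scan, packaging, Terraform and deployment jobs without writing any pipeline logic of its own.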
Template architecture (core idea)
- Heavy use of include
- Versioned templates
- Composable blocks (build, test, scan, docker, deploy)
- Opinionated defaults, overridable when needed
Teams didn’t have to reinvent pipelines. They consumed building blocks.
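On the provider side, each block is an ordinary GitLab CI file with opinionated defaults that consumers can override. Here is a minimal sketch of what a scan block could look like (job names, image tag, and variables are illustrative, not the real Enedis templates):

```yaml
# blocks/security-scans.yml (illustrative sketch, not the actual template)
variables:
  SONAR_HOST_URL: "https://sonarqube.internal.example"  # hypothetical default, overridable per project

.sonar-base:                       # hidden job: shared configuration, never runs on its own
  stage: test
  image: sonarsource/sonar-scanner-cli:5.0
  variables:
    SONAR_USER_HOME: "${CI_PROJECT_DIR}/.sonar"
  cache:
    key: "${CI_JOB_NAME}"
    paths:
      - .sonar/cache

sonarqube-scan:
  extends: .sonar-base             # opinionated default: the scan runs as soon as the block is included
  script:
    - sonar-scanner -Dsonar.projectKey="${CI_PROJECT_NAME}"
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
```

A team that needs different behavior overrides the variables or redefines the job with extends, rather than rewriting the scan logic.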
Decision — Chaos Engineering as pedagogy
Separately, an internal entity (NIDC) wanted to run large GameDays.
With another DevOps engineer, I built a fully automated, reproducible chaos platform:
- Real Kubernetes clusters created for the event
- Entire infra provisioned with Terraform
- Automated disaster scenarios:
  - DDoS
  - Database corruption
  - Secret leaks
- Per-team observability spaces with Grafana dashboards
- Discord automation: one channel per team, event orchestration, instructions, monitoring
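The write-up doesn't show how the Discord side was scripted; purely as an illustration of the kind of automation involved, a job like the following could push scenario instructions into a team's channel through a Discord incoming webhook (the webhook variable, message, and URLs are made up, and showing it as a CI job is just a convenient sketch):

```yaml
# Illustrative Discord orchestration job; webhook variable, message and URLs are assumptions.
announce-scenario:
  image: curlimages/curl:8.8.0
  script:
    - |
      curl -sS -X POST "$TEAM_DISCORD_WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d '{"content": "Scenario 2 is live: your database just disappeared. Dashboards: https://grafana.example/team-01"}'
  rules:
    - if: '$GAMEDAY == "true"'     # only runs when the GameDay pipeline is triggered
```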
The goal was competitive and educational: teams had to detect, understand, and fix issues as fast as possible.
This wasn’t about breaking infra. It was about teaching teams:
Logs, metrics, monitoring, and tests in production are not optional.
What I built
The forge platform (high level)
Developers → GitLab CI → K8s runners (HPA)
↓
Vault / Nexus / Sonar
↓
Docker registry / Terraform / Deploy
- Runners scaled automatically depending on load (see the HPA sketch below)
- Secrets injected dynamically via Vault
- Artifacts stored in Nexus
- Security scans everywhere by default
- Forge itself designed to scale down when idle
We even designed an automated disaster recovery plan (DRP) for the forge itself.
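The autoscaling details aren't spelled out here; as a rough sketch of CPU-driven scaling on a runner Deployment (namespace, names, and thresholds are assumptions):

```yaml
# Illustrative HorizontalPodAutoscaler for the runner fleet; names and thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-runner
  namespace: ci-forge
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-runner
  minReplicas: 2            # keep a small warm pool so pipelines start quickly
  maxReplicas: 30           # cap the fleet to control cost at peak
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out before runners saturate
```

The same mechanism lets the forge scale back down to the minimum replica count when pipelines go quiet.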
The GameDay platform
Terraform → EKS clusters → Injected failures
↓
Grafana dashboards
↓
Discord per team
Everything was ready before the event. On the day, we only had to “press start”.
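The write-up doesn't name the tool used to inject failures; as one illustrative way such a scenario could be automated on a Kubernetes cluster, a Chaos Mesh experiment (Chaos Mesh is an assumption here, not something the original mentions) can kill a pod of a team's workload so the team has to spot it through their dashboards:

```yaml
# Illustrative Chaos Mesh experiment; Chaos Mesh, namespaces and labels are assumptions.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: gameday-kill-db
  namespace: team-01          # hypothetical per-team namespace
spec:
  action: pod-kill
  mode: one                   # affect a single matching pod
  selector:
    namespaces:
      - team-01
    labelSelectors:
      app: orders-db          # hypothetical target workload
```

Whatever the actual injection mechanism, the point is that each failure is declarative, repeatable, and scoped to a single team's environment.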
Outcome
Forge
- 450+ applications onboarded
- Standardized CI/CD across the company
- Security scans by default everywhere
- Massive improvement in developer onboarding
- Reliable, auto-scaled runner platform
- A forge that teams trusted and adopted
Chaos GameDays
- 100+ participants
- Teams discovering gaps in monitoring and alerting they didn’t know they had
- Clear behavioral shift between the first and later GameDays:
  - Teams came better prepared
  - Monitoring was taken seriously afterward
- Educational impact far beyond the event itself
What I’d do differently
- Invest earlier in documentation for templates (adoption would have been even faster)
- Productize the GameDay platform as a reusable internal product
- Measure more formal metrics on developer experience improvements