
DevOps at scale on critical infrastructure — GitLab CI forge for 450+ apps & Chaos Engineering GameDays at Enedis (via Klanik)

At Enedis, France's primary electricity distribution operator, I helped build from scratch a Kubernetes-based GitLab CI forge serving 450+ applications, then co-designed large-scale Chaos Engineering GameDays simulating DDoS, database corruption and secret leaks. The common goal: make teams anticipate failure instead of reacting to it.

Enedis · via Klanik · 8 min read
  • GitLab CI
  • Kubernetes
  • HashiCorp Vault
  • Sonatype Nexus
  • SonarQube
  • Terraform
  • EKS
  • Grafana
  • Chaos Engineering
  • Discord


Context

Enedis operates ~95% of France’s electricity distribution grid. This is critical national infrastructure: reliability, traceability, and security are not “best practices” — they are operational requirements.

When I arrived (mission via Klanik, Jan 2022), there was no centralized CI/CD forge. Teams had heterogeneous pipelines, scattered tooling, and uneven security practices. Some projects had decent automation, others were closer to handcrafted scripts. There was no single entry point, no standard templates, no unified way to handle secrets, artifacts, scans, or runners at scale.

At the same time, hundreds of applications (≈450 GitLab projects) and a very large developer population needed CI/CD every day.

Our DevOps team (5 people) was tasked with an ambitious objective:

Build a single, industrialized CI/CD forge for the entire company, on Kubernetes, that could scale up and down automatically, and make “good DevOps” the default for everyone.

Later in the mission, I also co-built a Chaos Engineering platform used for large GameDays involving 100+ participants, designed not to “break things for fun” but to educate teams on observability, monitoring, and anticipating production failures.

These two storylines share the same philosophy: platforms and exercises as tools to change engineering behavior at scale.

Constraints

  • Critical infrastructure context → we could not afford instability in the forge itself
  • High security requirements → strict secrets handling, auditability, traceability
  • 450+ applications, many teams, many languages and stacks
  • Need for auto-scaling runners → cost control + performance
  • Need to integrate existing tools rather than replace everything
  • For GameDays: 100+ participants, multi-team coordination, realistic scenarios on real clusters

Decision — Industrializing the GitLab CI forge

We decided to make GitLab the single CI/CD entry point for the company, backed by a Kubernetes platform running:

  • Auto-scaled GitLab runners (HPA)
  • Centralized secrets with HashiCorp Vault
  • Artifact repository with Sonatype Nexus
  • Code quality & security scans via SonarQube
  • Infrastructure as Code deployments with Terraform
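As one illustration of how Vault-backed secrets can be wired into GitLab CI — the paths, mount names, and Vault URL below are hypothetical, and the source doesn't specify the exact integration mechanism — GitLab's native `secrets:` keyword can inject a Vault secret into a job at runtime:

```yaml
deploy:
  stage: deploy
  id_tokens:
    VAULT_ID_TOKEN:
      aud: https://vault.example.com   # illustrative Vault address
  secrets:
    DB_PASSWORD:
      vault: myapp/db/password@kv      # <path>/<field>@<mount> — all names hypothetical
      token: $VAULT_ID_TOKEN
      file: false                      # expose as a variable rather than a file
  script:
    - ./deploy.sh                      # DB_PASSWORD is available here, never stored in GitLab
```

The point of this pattern is that credentials live only in Vault, are fetched per job, and never appear in project CI/CD variables.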

The key was not just hosting runners. The key was templates.

We built 40+ modular CI/CD templates covering:

  • Multiple languages and frameworks
  • Build, test, Docker build, push, release
  • Security scans by default
  • Terraform plan/apply pipelines
  • Standardized deployment patterns

Any new project at Enedis could onboard and instantly get a full DevOps lifecycle by including a few lines in .gitlab-ci.yml.

This turned CI/CD from a per-team responsibility into a shared platform capability.

Template architecture (core idea)

  • Heavy use of include
  • Versioned templates
  • Composable blocks (build, test, scan, docker, deploy)
  • Opinionated defaults, overridable when needed

Teams didn’t have to reinvent pipelines. They consumed building blocks.
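To make the include-based composition concrete — the project path, template files, version tag, and job names here are illustrative, not Enedis' actual ones — a consuming project's pipeline could look like:

```yaml
# .gitlab-ci.yml of a consuming project (all names hypothetical)
include:
  - project: 'platform/ci-templates'   # central, versioned template repository
    ref: 'v2.3.0'                      # pinned template version
    file:
      - '/blocks/build.yml'
      - '/blocks/test.yml'
      - '/blocks/scan.yml'
      - '/blocks/docker.yml'
      - '/blocks/deploy.yml'

variables:
  APP_NAME: "my-service"

# Opinionated defaults stay overridable when a team needs to deviate:
test:
  variables:
    TEST_COMMAND: "mvn verify"
```

A few lines of `include` give a new project the full build → test → scan → docker → deploy lifecycle, while overrides remain local and explicit.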

Decision — Chaos Engineering as pedagogy

Separately, an internal entity (NIDC) wanted to run large GameDays.

With another DevOps engineer, I built a fully automated, reproducible chaos platform:

  • Real Kubernetes clusters created for the event
  • Entire infra provisioned with Terraform
  • Automated disaster scenarios:
    • DDoS
    • Database corruption
    • Secret leaks
  • Per-team observability spaces with Grafana dashboards
  • Discord automation: one channel per team, event orchestration, instructions, monitoring
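The source doesn't name the failure-injection tooling, but as a sketch of how one automated scenario could be expressed declaratively, here is a Chaos-Mesh-style manifest (team namespace and labels hypothetical) that kills a team's database pods:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: gameday-db-failure
  namespace: chaos-testing
spec:
  action: pod-kill          # simulate a sudden database outage
  mode: all                 # affect every matching pod
  selector:
    namespaces:
      - team-blue           # one namespace per participating team (illustrative)
    labelSelectors:
      app: postgres
```

Expressing scenarios as versioned manifests is what makes a GameDay reproducible: the same failures can be replayed identically for every team and every edition.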

The goal was competitive and educational: teams had to detect, understand, and fix issues as fast as possible.

This wasn’t about breaking infra. It was about teaching teams:

Logs, metrics, monitoring, and tests in production are not optional.

What I built

The forge platform (high level)

Developers → GitLab CI → K8s runners (HPA)
                 ↓
      Vault / Nexus / Sonar
                 ↓
      Docker registry / Terraform / Deploy
  • Runners scaled automatically depending on load
  • Secrets injected dynamically via Vault
  • Artifacts stored in Nexus
  • Security scans everywhere by default
  • Forge itself designed to scale down when idle
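An HPA-driven runner setup like the one described can be sketched as follows — replica bounds, namespace, and the CPU threshold are illustrative assumptions, not the forge's actual values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-runner
  namespace: ci             # hypothetical namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-runner
  minReplicas: 2            # scale down when the forge is idle (cost control)
  maxReplicas: 30           # absorb peak pipeline load
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add runners above 70% average CPU
```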

We even designed an automated disaster recovery plan (DRP) for the forge itself.

The GameDay platform

Terraform → EKS clusters → Injected failures
                 ↓
      Grafana dashboards
                 ↓
      Discord channels per team

Everything was ready before the event. On the day, we only had to “press start”.

Outcome

Forge

  • 450+ applications onboarded
  • Standardized CI/CD across the company
  • Security scans by default everywhere
  • Massive improvement in developer onboarding
  • Reliable, auto-scaled runner platform
  • A forge that teams trusted and adopted

Chaos GameDays

  • 100+ participants
  • Teams discovering gaps in monitoring and alerting they didn’t know they had
  • Clear behavioral shift between first and later GameDays:
    • Teams came better prepared
    • Monitoring was taken seriously afterward
  • Educational impact far beyond the event itself

What I’d do differently

  • Invest earlier in documentation for templates (adoption would have been even faster)
  • Productize the GameDay platform as a reusable internal product
  • Measure developer experience improvements with more formal metrics
