Case study

Progressive delivery at Bedrock — Argo Rollouts on a 1000-node cluster

Bedrock Streaming · 8 min read

Context

Bedrock Streaming serves around 40 million weekly viewers across multiple regions on top of an EKS platform — over 1,000 nodes, 50+ product teams, constant deploys.

When I joined, every team’s CI ran through GitHub Actions with home-grown templates that had been built brick by brick over the years and never properly refactored. It worked, but it was fragile under change. Every team had slightly different deployment needs, and every adjustment to the templates risked breaking someone. For a small platform team, that was unsustainable.

Constraints

  • Deployment-window sensitivity is real. A bad rollout to a streaming path hits paying customers, in real time, on devices we don’t own.
  • Teams own their releases, not the platform team. We needed a system that pushed accountability back, not one that centralized rollout decisions in a small group of SREs.
  • Network stack edge cases — we used both Nginx and Gloo, plus KEDA for scale-to-zero on parts of the platform. Gloo and KEDA both had only partial upstream support in Argo Rollouts at the time.

Decision

Move delivery to ArgoCD and progressive deployment to Argo Rollouts. Two goals:

  1. Free platform engineers from per-team rollout babysitting.
  2. Push real ownership of the release strategy back to product teams, with sane defaults that worked out of the box.

Argo Rollouts was the right fit because it supported metric-based gates natively (via AnalysisTemplate) and didn’t require teams to rewrite their applications.

What I built

A library of overridable templates

Every team got a default canary strategy in their Helm chart, with sensible weight steps and gates. Teams that needed something different could override surgically — change weights, swap in custom metrics, move to blue/green — without forking the whole template.
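
To make that concrete, here is a minimal sketch of the kind of default the shared chart rendered (names, weights and durations are illustrative, not our exact values):

    # Illustrative default rendered by the shared Helm library chart.
    # Selector, pod template and traffic routing are omitted for brevity.
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: my-service
    spec:
      strategy:
        canary:
          canaryService: my-service-canary
          stableService: my-service-stable
          steps:
          - setWeight: 5
          - pause: {duration: 5m}
          - analysis:                        # metric gate after the first slice
              templates:
              - templateName: default-gates
              args:
              - name: app-name
                value: my-service
              - name: error-rate-threshold
                value: "0.05"                # loose threshold early on
          - setWeight: 25
          - pause: {duration: 10m}
          - analysis:
              templates:
              - templateName: default-gates
              args:
              - name: app-name
                value: my-service
              - name: error-rate-threshold
                value: "0.01"                # tightened as the canary advances
          - setWeight: 100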

Metric-based rollback, beyond health checks

Health checks tell you a pod is up. They don’t tell you the pod is serving the experience your users expect. We standardized on a small core of business-meaningful signals that every team got out of the box:

  • Apdex score (New Relic’s user-satisfaction index) — a single number capturing both latency and the share of users actually impacted.
  • 5xx error rate, with thresholds tightening progressively as the canary advanced through its weight steps.
  • Success rate on canary traffic.
  • Response time, same tightening pattern.

Plus per-team custom metrics, plugged into the same gating mechanism via AnalysisTemplate.
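
Each of those signals was a metric entry in an AnalysisTemplate. A sketch of the shape, using the New Relic provider (queries, thresholds and secret names are examples; the exact NRQL and result fields depend on how the application reports to New Relic):

    # Illustrative AnalysisTemplate backing the default gates.
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: default-gates
    spec:
      args:
      - name: app-name
      - name: error-rate-threshold
        value: "0.05"                        # overridden per weight step
      metrics:
      - name: apdex
        interval: 1m
        failureLimit: 1
        successCondition: result.score >= 0.9
        provider:
          newRelic:
            profile: newrelic-secret         # Secret holding the account id and API key
            query: |
              SELECT apdex(duration, t: 0.5) FROM Transaction
              WHERE appName = '{{args.app-name}}' SINCE 2 minutes ago
      - name: error-rate
        interval: 1m
        failureLimit: 1
        successCondition: result.errorRate <= {{args.error-rate-threshold}}
        provider:
          newRelic:
            profile: newrelic-secret
            query: |
              SELECT filter(count(*), WHERE httpResponseCode LIKE '5%') / count(*) AS errorRate
              FROM Transaction WHERE appName = '{{args.app-name}}' SINCE 2 minutes ago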

The reason we went beyond health checks: a pod can be Ready and serving 5xx all day long, or it can be serving 200s but at a latency that destroys the user experience. Apdex + error rate + success rate caught all three failure modes — broken pods, broken responses, degraded experience — which is what made the gates trustworthy enough to actually run unattended.

Stateful workloads got a different default

Canary works when you can run two versions in parallel on the same traffic without breaking anything. So we offered three paths:

  • Canary for stateless services (default).
  • Blue/green override (sketched after this list) for services holding session state in memory or anything where v1+v2 cohabitation would cause inconsistency.
  • Plain rolling update with health checks and queue-lag monitoring for Kafka / SQS workers — there’s no traffic to split for those.
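
For services on the blue/green path, the override stayed small. The values keys below are hypothetical, but the rendered strategy is standard Argo Rollouts:

    # Hypothetical team-side override in values.yaml
    rollout:
      strategy: blueGreen

    # Rendered Rollout strategy
    strategy:
      blueGreen:
        activeService: my-service-active
        previewService: my-service-preview
        autoPromotionEnabled: false          # hold until the gates pass
        prePromotionAnalysis:                # same default gates, run against the preview stack
          templates:
          - templateName: default-gates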

Schema migrations were handled team-side with expand-and-contract discipline (additive change first, then code that uses it, then later removal of the old path). Documented and reviewed during onboarding. Databases themselves were explicitly out of scope.

A real rollback that fired

One of the streaming-path backends shipped a regression on a hot endpoint — under real traffic patterns, the new query plan hammered the database harder than expected. Staging missed it because traffic shape was different.

In production, the canary got a small slice of traffic. Apdex on the canary slice dropped meaningfully below threshold within the first weight step. The AnalysisRun aborted automatically. The stable version kept serving the rest, so user impact was contained for the duration of the canary window — minutes, not hours.

Without the gate, that regression would have rolled out fleet-wide on a standard rolling update. That’s exactly the case Argo Rollouts pays for itself on.

A false positive that taught us more

Early on, we had cases where a canary rolled back not because the deployed code was bad, but because an upstream internal dependency was degraded. The service under test was correctly returning 5xx to its callers, the gates fired, the rollout aborted — but the fix was somewhere else entirely.

The lesson was about being precise on what a gate is actually measuring. We started distinguishing errors by origin in the analysis queries — separating “this service failed” from “this service propagated a failure from a dependency it doesn’t own.” Where it made sense we added bypass logic for specific upstream-attributable error classes.
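
One way to express that distinction is an extra gate that only counts errors the service itself owns. The sketch below assumes the application tags transactions where it is merely propagating a dependency failure with a custom attribute (here upstreamFailure); that attribute and its name are illustrative assumptions:

    # Hypothetical metric entry: count only errors this service owns.
    # Assumes the application sets a custom `upstreamFailure` attribute on
    # transactions where it is propagating a dependency's error.
    - name: owned-error-rate
      interval: 1m
      failureLimit: 1
      successCondition: result.errorRate <= 0.01
      provider:
        newRelic:
          profile: newrelic-secret
          query: |
            SELECT filter(count(*), WHERE httpResponseCode LIKE '5%' AND upstreamFailure IS NULL) / count(*) AS errorRate
            FROM Transaction WHERE appName = '{{args.app-name}}' SINCE 2 minutes ago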

The mental model I took away:

A canary gate should measure the quality of the change you just shipped, not the health of the whole call graph.

If your gate fires on signals you don’t control, you’ll either roll back for the wrong reason, or — worse — teams will stop trusting rollbacks and start ignoring them.

Open source contributions

Our network stack had edge cases that weren’t well supported upstream. I contributed back to ArgoCD and Argo Rollouts to extend the controller and CRDs to play nicely with KEDA (scale-to-zero) and with Gloo (an ingress controller we used alongside Nginx). That work let us run progressive delivery cleanly on parts of the platform that weren’t on the standard Nginx path, and it meant we weren’t carrying internal forks long-term.
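
To give a flavor of the KEDA side: a Rollout exposes the scale subresource, so a ScaledObject can target it directly, including scale-to-zero. A rough sketch (queue details, names and thresholds are illustrative; authentication is omitted):

    # Illustrative KEDA ScaledObject pointed at a Rollout instead of a Deployment.
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: my-worker
    spec:
      scaleTargetRef:
        apiVersion: argoproj.io/v1alpha1
        kind: Rollout
        name: my-worker
      minReplicaCount: 0                     # scale-to-zero
      maxReplicaCount: 20
      triggers:
      - type: aws-sqs-queue
        metadata:
          queueURL: https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue
          queueLength: "5"
          awsRegion: eu-west-1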

Outcome

  • Platform team stopped being the rollout bottleneck.
  • 50+ product teams gained real progressive delivery with metric-based gates instead of just “is the pod up?”.
  • Stateful workloads got a discipline (expand/contract, blue/green) and a written playbook, not a hand-wave.
  • The unusual bits got pushed upstream, so we weren’t maintaining internal forks of ArgoCD or Argo Rollouts long-term.

What I’d do differently

Earlier focus on gate observability. The first version of metric gates was a black box for product teams — when a rollout aborted, they saw “AnalysisRun failed” and not much else. We later added Slack notifications with the actual metric values and which threshold tripped. That should have been there on day one, not bolted on after the first false-positive frustrations.
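
For reference, the notification wiring goes through the Argo Rollouts notification engine. A rough sketch (channel, wording and trigger condition are illustrative; the real messages also carried the metric values and the tripped threshold):

    # Illustrative notification wiring for aborted rollouts.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: argo-rollouts-notification-configmap
      namespace: argo-rollouts
    data:
      service.slack: |
        token: $slack-token                  # key in the notification secret
      template.rollout-aborted: |
        message: |
          Rollout {{.rollout.metadata.name}} was aborted.
      trigger.on-rollout-aborted: |
        - send: [rollout-aborted]
          when: rollout.status.abort == true

    # Teams subscribe per Rollout with an annotation:
    #   notifications.argoproj.io/subscribe.on-rollout-aborted.slack: my-team-channel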
