Smart Canary Deployments
TL;DR #
Canary deployments? Yeah, I’d never build that myself. It’s either already set up somewhere and you don’t have visibility into it, or nobody bothers with it. So I thought, why not start from scratch and see what’s involved? Turns out it’s actually pretty straightforward with argo-rollouts.
But not just simple rollouts — we’re gonna make it smart. When it sees errors in the metrics, it automatically aborts the deployment and stops traffic to the failing version.
Start Easy #
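Here's roughly what the Rollout manifest can look like. Treat it as a minimal sketch: the image, port, and labels are placeholders, but the canary steps mirror the flow described further down (20%, manual pause, 50%, a short wait, then 100%).

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rpc-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rpc-server
  template:
    metadata:
      labels:
        app: rpc-server
    spec:
      containers:
        - name: rpc-server
          image: registry.example.com/rpc-server:v1   # placeholder image
          ports:
            - containerPort: 8080                     # placeholder port
  strategy:
    canary:
      steps:
        - setWeight: 20       # scale the canary to ~20% of the replica count
        - pause: {}           # wait here until someone promotes manually
        - setWeight: 50
        - pause:
            duration: 30s     # short automatic wait
        - setWeight: 100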
Well, that’s it! Simple rollouts done! Of course, you’ll need argo-rollouts installed.
Note: You can find the complete version in arch-manifest
Let’s start by deploying the first version and checking the rollout status:
$ kubectl-argo-rollouts get rollout rpc-server
NAME KIND STATUS AGE INFO
⟳ rpc-server Rollout ✔ Healthy 16s
└──# revision:1
└──⧉ rpc-server-5ff779f75f ReplicaSet ✔ Healthy 16s stable
├──□ rpc-server-5ff779f75f-fbmj8 Pod ✔ Running 16s ready:1/1
├──□ rpc-server-5ff779f75f-qn27n Pod ✔ Running 16s ready:1/1
└──□ rpc-server-5ff779f75f-vslc6 Pod ✔ Running 16s ready:1/1
Now let's deploy a new version by updating the pod template, for example by bumping the container image.
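One way to do that is the rollouts plugin's set image command (the registry and tag here are just placeholders):

$ kubectl-argo-rollouts set image rpc-server rpc-server=registry.example.com/rpc-server:v2

Once the rollout picks up the change, it pauses for manual promotion and you'll see revision 2 like this: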
$ kubectl-argo-rollouts get rollout rpc-server
NAME KIND STATUS AGE INFO
⟳ rpc-server Rollout ॥ Paused 23h
├──# revision:2
│ └──⧉ rpc-server-b9d7c8675 ReplicaSet ✔ Healthy 23h canary
│ └──□ rpc-server-b9d7c8675-6lpxh Pod ✔ Running 23h ready:1/1,restarts:1
└──# revision:1
└──⧉ rpc-server-5ff779f75f ReplicaSet ✔ Healthy 23h stable
├──□ rpc-server-5ff779f75f-fbmj8 Pod ✔ Running 23h ready:1/1,restarts:1
├──□ rpc-server-5ff779f75f-qn27n Pod ✔ Running 23h ready:1/1,restarts:2
└──□ rpc-server-5ff779f75f-vslc6 Pod ✔ Running 23h ready:1/1,restarts:1
Let’s say you’ve checked the canary status — now we can move to the next step of promoting the deployment:
$ kubectl-argo-rollouts promote rpc-server
# After about 1 minute of waiting, you'll see the second version rolled out completely
$ kubectl-argo-rollouts get rollout rpc-server
NAME KIND STATUS AGE INFO
⟳ rpc-server Rollout ✔ Healthy 23h
├──# revision:2
│ └──⧉ rpc-server-b9d7c8675 ReplicaSet ✔ Healthy 23h stable
│ ├──□ rpc-server-b9d7c8675-ftm8w Pod ✔ Running 2m2s ready:1/1
│ ├──□ rpc-server-b9d7c8675-4cvbz Pod ✔ Running 100s ready:1/1
│ └──□ rpc-server-b9d7c8675-dg4z9 Pod ✔ Running 69s ready:1/1
└──# revision:1
└──⧉ rpc-server-5ff779f75f ReplicaSet • ScaledDown 23h
What happens during rollout:
- setWeight 20: 80% stable, 20% canary
- Pause: manual promotion required
- setWeight 50: 50% stable, 50% canary
- Wait 30 seconds
- setWeight 100: 0% stable, 100% canary
- Rollout complete: the canary becomes the new stable
The Problem with Basic Rollouts #
Notice something important here: we're only using replica counts for traffic splitting, not actual request percentages. When we set setWeight: 20, we're telling Argo Rollouts to spin up enough canary pods to cover roughly 20% of the total replica count. In the example above, that meant a single canary pod running next to the three stable ones.
But here’s the catch: This doesn’t guarantee that exactly 20% of your traffic hits the canary pods. Load balancing, connection pooling, and client behavior can all skew the actual traffic distribution.
For true traffic-based canary deployments, we need a service mesh to control the actual request routing, not just pod counts (there are other ways, of course).
Istio #
There are many ways to implement traffic routing with rollouts; I just love Istio. With the combination of VirtualService and DestinationRule, we can control actual request traffic percentages, not just pod counts.
Istio’s service mesh intercepts all traffic and routes it according to our weight specifications. When Argo Rollouts updates these weights during a canary deployment, Istio ensures the exact traffic distribution we want.
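Here's a sketch of those two resources using subset-level routing; the host name, route name, and labels are assumptions for this example and need to match your own service:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: rpc-server
spec:
  hosts:
    - rpc-server                    # assumed in-mesh host
  http:
    - name: primary                 # named route the Rollout will manage
      route:
        - destination:
            host: rpc-server
            subset: stable
          weight: 100
        - destination:
            host: rpc-server
            subset: canary
          weight: 0
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: rpc-server
spec:
  host: rpc-server
  subsets:
    - name: stable
      labels:
        app: rpc-server
    - name: canary
      labels:
        app: rpc-server

During a rollout, the controller rewrites the route weights and tags the subsets with the canary and stable pod-template hashes, so you don't touch these by hand after the initial setup.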
Now add the traffic routing policy to your rollout:
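A sketch of the strategy section, assuming the VirtualService and DestinationRule above (the names have to match whatever you actually created):

  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: rpc-server            # the VirtualService above
            routes:
              - primary                 # the named http route whose weights get updated
          destinationRule:
            name: rpc-server            # the DestinationRule above
            stableSubsetName: stable
            canarySubsetName: canary
      steps:                            # same steps as before
        - setWeight: 20
        - pause: {}
        - setWeight: 50
        - pause:
            duration: 30s
        - setWeight: 100

With this in place, setWeight actually means a share of requests: Argo Rollouts adjusts the VirtualService weights and Istio routes traffic accordingly, regardless of how many canary pods are running.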
Bonus: Istio gives you observability into this traffic split through metrics and distributed tracing, so you can see exactly how your canary is performing under real load.
Analysis #
This is where it gets smart. Instead of manually checking metrics and deciding whether to promote or roll back, we let Prometheus metrics decide automatically.
The ClusterAnalysisTemplate defines what “success” looks like, and Argo Rollouts will continuously monitor these metrics during the canary phase.
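Here's a sketch of such a template, wired to the behavior described below: a 95% success threshold, checks every 5 minutes, abort after 3 failed checks. The template name, the Prometheus address, and the exact Istio metric labels are assumptions for your cluster:

apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name               # filled in by the Rollout
  metrics:
    - name: success-rate
      interval: 5m                     # re-evaluate every 5 minutes
      failureLimit: 3                  # abort the rollout after 3 failed checks
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090   # assumed address
          query: |
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}",response_code!~"5.*"}[5m]))
            /
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}"}[5m]))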
Now integrate this analysis template into your rollout strategy:
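Again a sketch; this attaches the template as background analysis that runs for the whole canary phase. The startingStep and the arg value are assumptions:

  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-rate
            clusterScope: true         # it references the ClusterAnalysisTemplate above
        startingStep: 1                # begin analysis after the first setWeight step
        args:
          - name: service-name
            value: rpc-server
      # trafficRouting and steps stay exactly as in the previous snippet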
What this does:
- Success rate monitoring — Queries Istio metrics to calculate the percentage of successful requests (non-5xx responses)
- Automatic abort on failure — If success rate drops below 95% for 3 consecutive checks, rollout automatically aborts and stops traffic to canary
- Continuous analysis — Runs every 5 minutes during the canary phase, ensuring real-time monitoring
- Smart traffic control — Prevents bad deployments from receiving production traffic
Here's what the rollout status looks like once analysis runs are in play, including one that failed:
$ kubectl-argo-rollouts get rollout rpc-server
NAME KIND STATUS AGE INFO
⟳ rpc-server Rollout ॥ Paused 4d18h
└──# revision:2
├──⧉ rpc-server-b9d7c8675 ReplicaSet ✔ Healthy 4d18h stable
│ ├──□ rpc-server-b9d7c8675-ftm8w Pod ✔ Running 3d18h ready:1/1
│ ├──□ rpc-server-b9d7c8675-4cvbz Pod ✔ Running 3d18h ready:1/1
│ └──□ rpc-server-b9d7c8675-dg4z9 Pod ✔ Running 3d18h ready:1/1
├──α rpc-server-b9d7c8675-2 AnalysisRun ⚠Error 3d18h ⚠5
└──α rpc-server-b9d7c8675-2.1 AnalysisRun ✔ Successful 3d18h ✔ 1
Notice the AnalysisRun entries showing the automated analysis results: one failed with errors, one succeeded. This is your canary deployment making intelligent decisions based on real metrics, automatically aborting when things go wrong and keeping traffic on the stable version.
In this case I identified and resolved the underlying issue, then continued with the rollout promotion.
Wrapping Up #
That’s it! You’ve built a smart canary deployment system that:
- Uses Istio for precise traffic control — Real percentage-based routing, not just pod counts
- Monitors real metrics with Prometheus — Success rates, error rates, whatever matters to your app
- Automatically prevents bad deployments — Stops traffic to failing versions without manual intervention
- Integrates seamlessly — Works with your existing Kubernetes setup
The beauty of this approach is that it removes the guesswork. Instead of manually deciding whether a canary is “good enough,” you define what success looks like upfront and let the system automatically abort failed deployments while keeping traffic on the stable version.
Just data-driven, automated progressive delivery with intelligent failure handling.