It’s 2 AM. Your team just pushed a critical security patch to production. Users are still actively shopping, chatting, streaming—completely unaware that you’re replacing the very servers handling their requests. No downtime. No service interruption. Just a seamless transition from old to new.

This is the promise of rolling updates, and when done right, it’s one of the most elegant deployment strategies available to modern DevOps teams.

What Are Rolling Updates?

A rolling update is a deployment strategy that incrementally replaces instances of your application with new versions, ensuring that some instances remain available throughout the deployment process. Instead of taking down your entire application, updating it, and bringing it back up (the “big bang” approach), you update instances in batches while maintaining service availability.

Think of it like replacing planks on a bridge while people are still walking across—challenging, but absolutely possible with the right approach.

Why Rolling Updates Matter

The Zero-Downtime Imperative

In today’s always-on digital economy, downtime is measured not just in lost revenue, but in damaged reputation and customer trust. Consider these realities:

  • Financial Impact: For many e-commerce platforms, even 5 minutes of downtime during peak hours can mean tens of thousands in lost revenue
  • User Experience: Modern users expect 24/7 availability. A maintenance window announcement feels antiquated
  • Competitive Advantage: The ability to deploy features and fixes without service interruption is a genuine competitive differentiator

Faster Feedback Loops

Rolling updates enable a progressive rollout model. You can:

  • Deploy to a small percentage of your fleet first
  • Monitor metrics and error rates in real-time
  • Detect issues before they impact all users
  • Roll back quickly if problems emerge

Risk Mitigation

By updating incrementally, you create natural checkpoints. If 1 of your 10 instances fails after its update, you've impacted 10% of capacity, not 100%. This built-in safety mechanism is invaluable in production environments.

How Rolling Updates Work: The Mechanics

At its core, a rolling update follows this pattern:

  1. Remove an instance from the load balancer (or service mesh)
  2. Stop the old version on that instance
  3. Start the new version on that instance
  4. Run health checks to verify the new version is working
  5. Add the instance back to the load balancer
  6. Repeat for the next instance
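
In code, the loop looks roughly like the following Go sketch. It is purely illustrative: Instance and LoadBalancer are hypothetical stand-ins for whatever primitives your orchestrator or tooling actually provides.

package deploy

import "time"

// Instance and LoadBalancer are hypothetical stand-ins, not a real API.
type Instance interface {
    StopOld() error  // stop the old version (after draining)
    StartNew() error // start the new version
    Healthy() bool   // health checks pass
}

type LoadBalancer interface {
    Deregister(i Instance) // stop routing new traffic to i
    Register(i Instance)   // put i back into rotation
}

// rollingUpdate replaces instances one at a time, never moving on
// until the current instance passes its health checks.
func rollingUpdate(lb LoadBalancer, instances []Instance) error {
    for _, inst := range instances {
        lb.Deregister(inst) // step 1: remove from the load balancer
        if err := inst.StopOld(); err != nil { // step 2
            return err
        }
        if err := inst.StartNew(); err != nil { // step 3
            return err
        }
        for !inst.Healthy() { // step 4: wait for health checks
            time.Sleep(time.Second)
        }
        lb.Register(inst) // step 5: back into rotation; loop is step 6
    }
    return nil
}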

The Critical Details

The devil is in the details. Here’s what actually happens behind the scenes:

Health Checks Are Everything

Your orchestrator relies on health checks to determine if a new instance is ready. You need two types:

  • Liveness probes: Is the application running?
  • Readiness probes: Is the application ready to serve traffic?

A common mistake is using the same endpoint for both. Your app might be “alive” (process running) but not “ready” (database connections not established).
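
A minimal Go sketch of the two endpoints, assuming a package-level db (*sql.DB) that is initialized at startup:

package main

import (
    "database/sql"
    "net/http"
)

var db *sql.DB // assumed to be initialized at startup

func main() {
    // Liveness: the process is up. No dependency checks here.
    http.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })

    // Readiness: only report ready once dependencies are reachable.
    http.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
        if err := db.Ping(); err != nil {
            w.WriteHeader(http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    http.ListenAndServe(":8080", nil)
}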

Connection Draining

When you remove an instance from a load balancer, existing connections need time to complete. Good rolling update implementations:

  • Stop sending new requests to the instance
  • Wait for existing requests to complete (connection draining)
  • Only then terminate the old version

Skip this, and you’ll drop active requests—exactly what you’re trying to avoid.

Surge and Unavailable Instances

In Kubernetes terms:

  • maxSurge: How many extra instances you can create during update (e.g., 25% means for 4 instances, you can temporarily have 5)
  • maxUnavailable: How many instances can be down during update (e.g., 25% means for 4 instances, at most 1 can be down)

These parameters control update speed vs. resource usage. More surge = faster updates but higher temporary resource consumption.
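
When the values are percentages, Kubernetes rounds maxSurge up and maxUnavailable down. A sketch for a 10-replica Deployment:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%        # 25% of 10 = 2.5, rounded up: up to 3 extra pods
    maxUnavailable: 25%  # 25% of 10 = 2.5, rounded down: at most 2 down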

Implementation Approaches

Kubernetes RollingUpdate Strategy

Kubernetes makes rolling updates a first-class citizen:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: app
        image: myapp:v2
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10

Why this configuration?

  • maxSurge: 1, maxUnavailable: 0: Never reduces capacity below desired state. Always maintains 4 instances, temporarily scaling to 5 during update.
  • readinessProbe: Waits 5 seconds after startup, then checks every 5 seconds. The instance only receives traffic once this passes.
  • livenessProbe: Gives the app 15 seconds to start, then checks every 10 seconds. If it fails, the kubelet restarts the container.

AWS ECS Rolling Deployments

Amazon ECS offers similar capabilities with different terminology:

{
  "deploymentConfiguration": {
    "maximumPercent": 200,
    "minimumHealthyPercent": 100,
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  }
}

  • minimumHealthyPercent: 100: Never drop below desired capacity
  • maximumPercent: 200: Can double capacity during deployment
  • deploymentCircuitBreaker: Automatically rolls back failed deployments

Docker Swarm

Docker Swarm’s approach is similarly straightforward:

docker service update \
  --update-parallelism 2 \
  --update-delay 30s \
  --update-failure-action rollback \
  --image myapp:v2 \
  web-app

This updates 2 containers at a time, waits 30 seconds between batches, and automatically rolls back if updates fail.

Best Practices for Production

1. Implement Comprehensive Health Checks

Your health checks should verify:

  • Application process is running
  • Critical dependencies are reachable (database, cache, etc.)
  • Application can serve requests (not just that the port is open)

Example of a robust readiness check in Go (it verifies dependencies, so it belongs on the readiness endpoint, not liveness):

package health

import (
    "database/sql"
    "encoding/json"
    "net/http"

    "github.com/go-redis/redis" // pre-v8 API: Ping() takes no context argument
)

// db and redisClient are assumed to be initialized at application startup.
var (
    db          *sql.DB
    redisClient *redis.Client
)

func healthCheck(w http.ResponseWriter, r *http.Request) {
    // Check database connectivity
    if err := db.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]string{
            "status": "unhealthy",
            "reason": "database unreachable",
        })
        return
    }

    // Check Redis connectivity
    if err := redisClient.Ping().Err(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]string{
            "status": "unhealthy",
            "reason": "cache unreachable",
        })
        return
    }

    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(map[string]string{
        "status": "healthy",
    })
}

2. Set Appropriate Timeouts

Configure timeouts at every layer:

  • Health check timeout: How long to wait for health check response
  • Connection draining timeout: How long to wait for existing connections
  • Update timeout: Overall time limit for the entire rolling update

In Kubernetes:

spec:
  progressDeadlineSeconds: 600  # Fail the rollout if not done in 10 min
  minReadySeconds: 10           # Wait 10s after ready before continuing
  template:
    spec:
      containers:
      - name: app
        readinessProbe:
          timeoutSeconds: 5     # Health check must respond in 5s
          successThreshold: 1   # Must succeed once to be ready
          failureThreshold: 3   # Allow 3 failures before marking unready

3. Use PodDisruptionBudgets (Kubernetes)

Prevent rolling updates from causing capacity issues during node maintenance:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: web-app

This ensures at least 3 pods are always available, even during voluntary disruptions like node drains.

4. Implement Graceful Shutdown

Your application should handle termination signals properly:

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    server := &http.Server{Addr: ":8080"}

    go func() {
        if err := server.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // Wait for interrupt signal
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
    <-quit

    log.Println("Shutting down server...")

    // Give existing requests 30 seconds to complete
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if err := server.Shutdown(ctx); err != nil {
        log.Fatal("Server forced to shutdown:", err)
    }

    log.Println("Server exited gracefully")
}

5. Monitor Deployment Progress

Use metrics to track rollout health. Key metrics to watch:

  • Error rate: Should not spike during rollout
  • Response latency: Should remain stable
  • Request success rate: Should stay at baseline
  • Health check failures: Should be zero or minimal

Set up alerts that automatically pause or rollback deployments if these metrics degrade.
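
Until that automation exists, kubectl provides the manual equivalents:

# Pause the rollout while you investigate
kubectl rollout pause deployment/web-app

# Resume once metrics look healthy
kubectl rollout resume deployment/web-app

# Or revert to the previous revision
kubectl rollout undo deployment/web-app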

Common Pitfalls and How to Avoid Them

Pitfall 1: Database Migrations

Problem: New code expects new schema, but rolling update means old and new code run simultaneously.

Solution: Use backward-compatible migrations:

  1. Deploy code that works with both old and new schema
  2. Run migration to add new columns/tables (don’t remove old ones yet)
  3. Deploy code that uses only new schema
  4. Run migration to remove old columns/tables

This “expand-contract” pattern ensures compatibility during rollout.
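
As a sketch of step 1, here is hypothetical Go code that tolerates both schemas: it prefers a new display_name column but falls back to the legacy username column while display_name is still unpopulated (table and column names are illustrative):

package users

import "database/sql"

// displayName works throughout the rollout window: COALESCE returns
// the new column when populated and the legacy column otherwise.
func displayName(db *sql.DB, userID int64) (string, error) {
    var name string
    err := db.QueryRow(
        `SELECT COALESCE(display_name, username) FROM users WHERE id = $1`,
        userID,
    ).Scan(&name)
    return name, err
}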

Pitfall 2: Incorrect Health Checks

Problem: Health check returns 200 before app is truly ready, causing traffic to fail.

Solution:

  • Separate readiness from liveness
  • Test dependencies in readiness probe
  • Use initialDelaySeconds to account for startup time
  • Don’t check external services in liveness (they’re not under your control)

Pitfall 3: Insufficient Resources

Problem: Using maxSurge creates a temporary resource spike that can hit cluster capacity limits.

Solution:

  • Set resource requests/limits on pods
  • Ensure cluster has headroom for surge instances
  • Or use maxSurge: 0 with maxUnavailable: 1 for resource-constrained environments (slower but no extra resources needed)
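
For the resource-constrained case, the strategy block would look like this sketch:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 0        # never create extra pods
    maxUnavailable: 1  # update by taking one pod down at a time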

Pitfall 4: Ignoring Connection Draining

Problem: Terminating pods immediately drops active connections.

Solution:

  • Configure terminationGracePeriodSeconds
  • Implement graceful shutdown in application
  • Ensure load balancer respects connection draining

In Kubernetes:

spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 10"]

The sleep gives the load balancer time to deregister the pod before shutdown begins.

Pitfall 5: Overly Aggressive Rollouts

Problem: Updating many instances simultaneously increases risk if new version has issues.

Solution:

  • Start with conservative maxSurge and maxUnavailable
  • Use progressive delivery (canary) for high-risk changes
  • Implement automatic rollback based on metrics

When NOT to Use Rolling Updates

Rolling updates aren’t always the right choice:

Use Blue-Green Instead When:

  • Database migrations are complex: Can’t easily make backward-compatible changes
  • Need instant rollback: Blue-green offers near-instantaneous rollback by switching traffic
  • Testing new infrastructure: Want to validate entire new stack before switching

Use Canary Instead When:

  • High-risk changes: Want to test with small percentage of users first
  • Need gradual exposure: Progressive rollout based on metrics
  • A/B testing features: Want controlled comparison between versions

Use Recreate Strategy When:

  • Breaking changes: New and old versions absolutely cannot coexist
  • Resource constrained: Can’t afford any temporary surge in resource usage
  • Development/staging environments: Downtime is acceptable

Monitoring and Validation

Successful rolling updates require robust monitoring:

Pre-Deployment Checks

# Verify image exists
docker pull myapp:v2

# Check resource quotas
kubectl describe quota

# Verify configuration
kubectl diff -f deployment.yaml

During Deployment

# Watch rollout status
kubectl rollout status deployment/web-app

# Monitor pod events
kubectl get events --watch

# Check logs of new pods
kubectl logs -f deployment/web-app

Post-Deployment Validation

  • Run smoke tests against production (see the sketch below)
  • Verify metrics dashboards (Grafana, Datadog, etc.)
  • Check error tracking (Sentry, Rollbar, etc.)
  • Monitor SLA metrics (uptime, latency, error rate)
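
A smoke test can be as simple as a scripted probe of a known endpoint; the URL below is hypothetical:

# Hypothetical smoke test: probe a known endpoint, fail on any non-2xx
for i in $(seq 1 20); do
  curl -fsS https://api.example.com/health/ready > /dev/null || exit 1
  sleep 1
done
echo "Smoke test passed"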

Real-World Example: Updating a Production API

Let’s walk through a real scenario:

Context: API serving 1000 req/sec, 10 instances, need to deploy security patch

Deployment configuration:

replicas: 10
rollingUpdate:
  maxSurge: 2        # Can temporarily have 12 instances
  maxUnavailable: 0  # Never drop below 10 instances

Timeline:

  1. T+0s: Deployment starts, creates 2 new pods (v2)
  2. T+15s: New pods pass health checks, join load balancer
  3. T+16s: Terminate 2 old pods, create 2 more new pods
  4. T+31s: Next 2 v2 pods ready, 2 more v1 pods terminated
  5. T+90s: Last 2 v1 pods terminated; all 10 instances now run v2 and the pod count settles back at 10

Result: Update completed in 90 seconds with zero dropped requests.

Key success factors:

  • Fast health checks (5s interval)
  • Proper readiness probe (waited for DB connections)
  • Sufficient surge capacity (20% overhead acceptable)
  • Application graceful shutdown (30s drain time)

Conclusion

Rolling updates represent a mature, battle-tested approach to zero-downtime deployments. They’re not magic—they require careful configuration, robust health checks, and monitoring—but when implemented correctly, they enable teams to deploy with confidence.

The key takeaways:

  1. Health checks are non-negotiable: Invest time in making them accurate
  2. Understand your parameters: maxSurge and maxUnavailable control everything
  3. Plan for the worst: Implement graceful shutdown and connection draining
  4. Monitor actively: Deployments should be observable events, not black boxes
  5. Know the alternatives: Rolling updates aren’t always the right choice

In production, the ability to update systems without downtime isn’t just a nice-to-have—it’s table stakes. Master rolling updates, and you’ll have a deployment strategy that scales from small services to massive distributed systems.

Now go forth and deploy fearlessly. Your users won’t even notice.