
Deploying with Confidence - Canary and Shadow Deployments with Nginx



In today's fast-paced development environment, deploying new features and updates to production can be a nerve-wracking experience, like walking a tightrope. How do you ensure that changes don't disrupt the user experience? One wrong step and your users could face outages, performance issues, or critical bugs. This fear often leads to slow, cautious release cycles that hinder innovation. Canary and shadow deployments are your safety nets for low-risk releases, and Nginx, as a versatile reverse proxy and load balancer, is an excellent tool for implementing them.

The Deployment Dilemma

Imagine launching a new feature that crashes your production environment, or rolling out an update that introduces subtle data corruption. Traditional "big bang" deployments, where a new version replaces the old one entirely, are inherently risky. If something goes wrong, the impact is immediate and widespread, leading to frantic rollbacks, unhappy customers, and potential revenue loss. Even well-tested software can behave unexpectedly in a live production environment with its unique traffic patterns and data.

This is where progressive delivery strategies like Canary and Shadow deployments shine, allowing you to test in production with controlled exposure.

What are Canary Deployments?

Inspired by the canaries once used in coal mines to detect toxic gases, a canary deployment releases a new version of your application to a small subset of your users (the "canary group") before making it available to everyone. Think of it as a test group. If the new version performs well and no issues are detected, you gradually increase the traffic routed to it until it serves 100% of users. If problems arise, you can quickly revert the small canary group to the old version, minimizing the impact.

Here are key characteristics of canary deployments:

  • Limited exposure: Only a small percentage of users are routed to the new version.
  • Real-world testing: Allows you to observe the new version's performance under real-world conditions.
  • Rollback possibility: If issues arise, you can quickly revert to the stable version.

What are Shadow Deployments?

Shadow deployments, also known as traffic mirroring, send a copy of live production traffic to a new version of your application running in parallel, without affecting the user experience. The key difference from canary deployments is that responses from the shadowed (new) version are discarded and never reach the client. This lets you test the new application's behavior (performance, errors, resource consumption) under realistic load without impacting live users.

Key features of shadow deployments include:

  • Mirroring traffic: Real user requests are copied and sent to the new version.
  • Performance comparison: Enables you to compare the performance of the new version against the production version.
  • Zero user impact: Users are not affected by any issues with the new version.
  • Bug Detection: Uncover bugs that only manifest under specific production conditions.
  • High Fidelity Testing: Test with real production data and request patterns.

Why Nginx?

Nginx is a lightweight, scalable reverse proxy and load balancer that can handle 100K+ requests per second, and its dynamic configuration support allows hot reloads without downtime. Moreover, its built-in modules make the different deployment strategies straightforward to implement: split_clients for canary and mirror for shadow deployments.

NGINX Plus, the commercial version of Nginx, adds a key-value store for HTTP traffic. It provides an API for dynamically maintaining values that can be used as part of the NGINX Plus configuration, without requiring a configuration reload. This lets us update the split percentage at runtime without reinstalling or restarting the Nginx server.

Using Nginx for Canary Deployments

Nginx, acting as a reverse proxy and load balancer, can intelligently route traffic based on various criteria, making it a natural fit for managing canary releases. Nginx facilitates canary deployments through:

  • Percentage-based Routing - Route a tiny percentage of traffic to the new version.
  • User Segment-based Routing - Route traffic based on specific headers, cookies (e.g., internal beta testers), or IP addresses, as sketched below.
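
For user-segment routing, a map on a request cookie or header can select the upstream group. Below is a minimal sketch; the beta_tester cookie name and the backend_v1/backend_v2 upstream names are illustrative:

# map is only valid at http level
map $cookie_beta_tester $segment_backend {
    "1"      backend_v2;   # beta testers go to the canary
    default  backend_v1;   # everyone else stays on stable
}

server {
    listen 80;

    location / {
        proxy_pass http://$segment_backend;
    }
}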

Nginx Configuration Example (Percentage-based):

A common approach for percentage-based routing in Nginx, without any extra modules, is to hash a client identifier (such as the IP address or a cookie value) using the built-in split_clients directive.

Nginx.conf

http {

  # Define the traffic split. split_clients hashes the given string
  # (here the client IP, so each client gets a consistent decision)
  # and buckets requests by percentage. You can also hash on $uri,
  # $http_user_agent, a cookie, or a combination of these.

  split_clients "$remote_addr" $canary_version {
    10%     "v2";   # 10% of clients to the new version
    *       "v1";   # the rest to the current version
  }

  upstream backend_v1 {
    server 10.0.0.1:80;
  }

  upstream backend_v2 {
    server 10.0.0.2:80;
  }

  server {
    listen 80;

    location / {
      # Route based on the split decision ($canary_version is "v1" or "v2")
      proxy_pass http://backend_$canary_version;

      # Retry the next upstream server on connection errors or timeouts
      proxy_next_upstream error timeout;
    }
  }
}

In this example:

  • 90% of the traffic is routed to the stable server.
  • 10% is routed to the canary server.
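
After editing the configuration, validate it and apply it with a hot reload; Nginx picks up the new split without dropping connections:

nginx -t          # validate the configuration
nginx -s reload   # apply it without downtime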

Implementing Shadow Deployments with Nginx


Shadow deployments can be implemented with Nginx's ngx_http_mirror_module. Its mirror directive is designed for exactly this purpose, sending a copy of each incoming request to a mirror location while the original request proceeds normally.

Nginx.conf

http {

  upstream primary {
    server 10.0.0.1:80;   # Current production (v1)
  }

  upstream shadow {
    server 10.0.0.2:80;   # New version (v2)
  }

  server {
    listen 80;

    location / {
      # Send primary traffic to v1
      proxy_pass http://primary;

      # Duplicate each request to v2; responses from the mirror
      # subrequest are discarded and never reach the client
      mirror /mirror;
      mirror_request_body on;
    }

    location = /mirror {
      internal;   # not reachable from outside

      proxy_pass http://shadow$request_uri;

      # Forward the full request (body and headers) to the shadow
      proxy_pass_request_body on;
      proxy_pass_request_headers on;
    }
  }
}

This setup sends traffic to the stable server while also mirroring it to the shadow server for evaluation.
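
To confirm that mirroring is working, watch the shadow backend's access log while sending requests through the proxy. The log path below is a common default and may differ on your system:

# On the shadow host (10.0.0.2)
tail -f /var/log/nginx/access.log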

Zero-Downtime Releases with Built-in Safety Nets

Here's a complete Nginx configuration implementing canary deployments, shadow deployments, and fallback mechanisms that handle failures automatically. The fallback location instantly reroutes standard errors thrown by the new system to the stable version, and it can be configured for other error codes as we see fit. Together these act as built-in safety nets for zero-downtime releases.

# ===== ERROR HANDLING =====
# Note: proxy_intercept_errors must be enabled in the proxied
# location for error_page to catch errors returned by an upstream.

error_page 500 502 503 504 = /fallback;

location = /fallback {
    # Final fallback to stable; $request_uri preserves the original URI
    proxy_pass http://<stable_backend>$request_uri;
    access_log /var/log/nginx/fallback.log;
}

But there can also be functional or other issues where no standard error code is thrown. In that scenario, we should be able to increase or decrease the split percentage for incoming traffic.

Nginx.conf

http {

    # ===== UPSTREAM DEFINITIONS =====

    upstream stable_backend {
        server 10.0.0.1:80;          # Primary stable version
        server 10.0.0.2:80;          # Secondary stable instance
        keepalive 32;
    }

    upstream canary_backend {
        server 10.0.0.3:80;          # Canary version
        server 10.0.0.4:80 backup;   # Fallback server, used when the canary fails
    }

    upstream shadow_backend {
        server 10.0.0.5:80;          # Shadow version
    }

    # ===== CANARY DEPLOYMENT =====
    # Traffic split: 10% to canary, 90% to stable.
    # Note: split_clients and map are only valid at http level,
    # not inside a server block.

    split_clients "${remote_addr}${http_user_agent}" $canary_version {
        10%     "canary";
        *       "stable";
    }

    map $canary_version $split_backend {
        "canary"  "canary_backend";
        default   "stable_backend";
    }

    # ===== ADMIN OVERRIDES =====
    # Header-based routing: force the canary for testing, or force
    # stable as an emergency fallback. Overrides beat the split.

    map $http_x_canary $routed_backend {
        "true"    "canary_backend";
        default   $split_backend;
    }

    map $http_x_force_stable $backend {
        "true"    "stable_backend";
        default   $routed_backend;
    }

    # Health check endpoint

    server {
        listen 127.0.0.1:9000;
        location /health {
            access_log off;
            add_header Content-Type text/plain;
            return 200 "OK";
        }
    }

    # Main server configuration

    server {
        listen 80;
        server_name app.example.com;

        # ===== SHADOW DEPLOYMENT TOGGLE =====
        # Flip to 0 to stop mirroring without touching the locations.

        set $shadow_active 1;

        # ===== MAIN LOCATION BLOCK =====

        location / {
            # Primary request routing
            proxy_pass http://$backend;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

            # Fallback layer 1: retry the next server in the group
            # (e.g., the backup) on errors and timeouts
            proxy_next_upstream error timeout http_500 http_502 http_503 http_504;

            # Fallback layer 2: intercept upstream 5xx responses so
            # the error_page below can reroute them to stable
            proxy_intercept_errors on;

            # Active health checks are an NGINX Plus feature:
            # health_check uri=/health interval=5s fails=2 passes=2;

            # Shadow traffic (mirroring)
            mirror /mirror;
            mirror_request_body on;
        }

        # ===== SHADOW MIRROR LOCATION =====

        location = /mirror {
            internal;

            # Conditional mirroring
            if ($shadow_active = 0) { return 204; }   # Skip if shadow inactive

            proxy_pass http://shadow_backend$request_uri;
            proxy_set_header X-Shadow-Request "true";
            proxy_set_header Host $host;

            # Mirror responses are discarded by Nginx; forward the
            # full request to the shadow regardless
            proxy_ignore_client_abort on;
            proxy_pass_request_body on;
            proxy_pass_request_headers on;

            # Fast-fail settings so a slow shadow cannot tie up workers
            proxy_connect_timeout 1s;
            proxy_read_timeout 2s;
            proxy_send_timeout 2s;
        }

        # ===== ERROR HANDLING =====

        # Fallback layer 3: reroute persistent 5xx errors to stable
        error_page 500 502 503 504 = /fallback;
        location = /fallback {
            # Final fallback to stable; $request_uri preserves the original URI
            proxy_pass http://stable_backend$request_uri;
            access_log /var/log/nginx/fallback.log;
        }
    }
}

The above Nginx setup has a three-layer fallback system: failed canary requests are retried against the backup server (proxy_next_upstream), intercepted 5xx responses are rerouted to the stable backend via the /fallback location, and the X-Force-Stable header provides a manual emergency override. An active health check for the canary can be enabled with NGINX Plus, and the tight timeouts on the mirror location prevent the shadow from affecting the primary path.

Administrators can override routing and force canary or stable behaviour during testing:

curl -H "X-Canary: true" http://app.example.com/

curl -H "X-Force-Stable: true" http://app.example.com/

Operation Playbook for Canary Deployment

During the transition phase, it is important to constantly monitor the new system for errors, failures, and performance regressions. If there is an issue, we should be able to take remedial or fallback action swiftly. If the new system is performing well, we can gradually increase its share of the traffic while continuing to monitor it.

As mentioned above, NGINX Plus gives a major advantage here: it can seamlessly ramp traffic on the new system up or down. By leveraging the key-value store API, the split percentage can be updated at runtime. Let's see how.

First, define the mapping of the different split percentages in the Nginx configuration file.

# Set up a key‑value store to specify the percentage to send to each upstream group based on the 'Host' header.

    keyval_zone zone=split:64k state=/etc/nginx/state_files/split.json;
    keyval $host $split_level zone=split;

    split_clients $split_param $split0 {
        *   old;
    }

    split_clients $split_param $split5 {
        5%  new;
        *   old;
    }

    split_clients $split_param $split10 {
        10% new;
        *   old;
    }

    split_clients $split_param $split25 {
        25% new;
        *   old;
    }

    split_clients $split_param $split50 {
        50% new;
        *   old;
    }

    split_clients $split_param $split100 {
        *   new;
    }


    map $split_level $migration {
        0        $split0;
        5        $split5;
        10       $split10;
        25       $split25;
        50       $split50;
        100      $split100;
        default  $split0;
    }


# In each 'split_clients' block above, '$split_param' controls which application receives each request. For a production application, we set it to '$remote_addr' (the client IP address).

server {
    ...
    set $split_param $remote_addr;
    ...
}
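
For the split to take effect, the request must be proxied to whichever group $migration selects. A minimal sketch, assuming upstream groups named old and new are defined to match the values used in the split_clients blocks:

location / {
    proxy_pass http://$migration;
}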

We then call the NGINX Plus API, first to create the split keyval attribute and subsequently to change the split percentage.

For example,

curl -iX POST -d '{"<ip_address>":"50"}' http://localhost:8008/api/9/http/keyvals/split/
curl -iX PATCH -d '{"<ip_address>":"0"}' http://localhost:8008/api/9/http/keyvals/split/

The first call creates the split attribute with value 50, i.e. a 50% split. The second call updates the split to 0%, i.e. no requests go to the new system. We can choose any value from the mapping to increase or decrease the split. Note that since the keyval zone is keyed on $host, the key must match the Host header with which clients reach the service.

To view the current split:

curl -X GET http://localhost:8008/api/9/http/keyvals/split/

Please note that in these examples the NGINX Plus API server listens on port 8008; an appropriate port can be chosen in the API configuration. If the port is exposed, the API can also be called from external systems, so it is recommended to keep it locked down.
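
For reference, the NGINX Plus API itself is enabled with the api directive inside a dedicated server block. A minimal sketch with access restricted to the local machine:

server {
    listen 8008;

    location /api {
        api write=on;      # allow POST/PATCH updates, not just GET
        allow 127.0.0.1;   # keep the API local
        deny all;
    }
}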

Steps

  1. Verify the normal flow by sending all traffic to the old system. Use the instruction below to set a 0% split to the new system.

     curl -iX PATCH -d '{"<ip_address>":"0"}' http://localhost:8008/api/9/http/keyvals/split/

     In case of issues, verify that the old system URL is set correctly and that the network is configured to accept requests from the NGINX Plus server.

  2. Once the old system is verified, split the traffic to send 5% of it to the new system.

     curl -iX PATCH -d '{"<ip_address>":"5"}' http://localhost:8008/api/9/http/keyvals/split/

     Verify that the system is working as expected and that the new system handles its share of the traffic successfully.

     • If there is no traffic to the new system, check that the new system URL is correct and that the network is configured to accept requests from the NGINX Plus server. The new system's log files can reveal other issues.
     • If the log files show an error or other issue with the new system, roll the split back to 0% as in step 1 and let all traffic pass to the old system.

  3. After a successful step 2, increase the traffic to the new system to 10% and monitor performance.

     curl -iX PATCH -d '{"<ip_address>":"10"}' http://localhost:8008/api/9/http/keyvals/split/

     If error rates increase, decrease the traffic to the new system to 5% and repeat step 2.

  4. On a successful step 3, further increase the traffic to the new system to 25%. If issues arise, roll back to a 10% split.

  5. Again, increase the traffic to 50% after a successful step 4.

  6. To verify performance under dynamic load fluctuations, decrease the load back to 25%.

     curl -iX PATCH -d '{"<ip_address>":"25"}' http://localhost:8008/api/9/http/keyvals/split/

  7. Finally, if performance has been satisfactory in the earlier steps, increase the load to the new system to 100%.

     curl -iX PATCH -d '{"<ip_address>":"100"}' http://localhost:8008/api/9/http/keyvals/split/

  8. The proxy's fallback location automatically transfers traffic to the old system when the new system throws a configured error code, so such failures may not be immediately visible. It is therefore important to constantly monitor the error codes in the new system's logs and take appropriate action when necessary.
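
For step 8, a quick way to spot error codes is to scan the new system's access log for 5xx statuses. The log path and field position below are illustrative and assume the common combined log format:

# Count 5xx responses in the new system's access log
awk '$9 ~ /^5/ {count++} END {print count+0}' /var/log/nginx/access.log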

 

Conclusion

Canary and Shadow deployments, powerfully enabled by Nginx, transform the intimidating process of deploying new software into a controlled, low-risk, and highly informative exercise. By intelligently directing or mirroring traffic, you gain the confidence to innovate rapidly, test rigorously in real environments, and ensure a seamless experience for your users.

 
