GCP: Cloud Run/Function to handle requests to GKE cluster during maintenance

At some point, there will be a system change significant enough that a maintenance window needs to be scheduled with customers. But that does not mean end-user traffic or client integrations will stop sending requests to your services.

During this outage, we need to present end-users with a maintenance page indicating that the overall solution is not available.

An in-cluster service could be appropriate if the change will not cause cluster-wide or service-level interruptions. But if availability cannot be ensured, one way to address this is to reroute all traffic from the HTTPS LB to an independent Cloud Run app or Cloud Function that delivers a maintenance page and HTTP status code.

We assume in this article that the GKE cluster is using Anthos Service Mesh/Istio and is fronted by a public GCP HTTPS Load Balancer created by an Ingress object. The solution in this article can use either an independent Cloud Run app or a Cloud Function to create a serverless NEG, whose url-map can be swapped in at the target-https-proxies level as illustrated below.

LoadBalancer Solution Overview

A GCP global external HTTPS Load Balancer is composed of multiple resources (forwarding rule, target HTTPS proxy, url-map, and backend services), as described in the documentation.
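
For orientation, this chain can be inspected read-only with gcloud before making any changes. An incoming request flows forwarding rule, to target HTTPS proxy, to url-map, to backend service; the sketch below only assumes a global external HTTPS LB already exists in the project (which the Ingress-created LB does).

# entry point: public IP address and port
gcloud compute forwarding-rules list --global

# each HTTPS forwarding rule points at a target HTTPS proxy (TLS termination)
gcloud compute target-https-proxies list

# each target HTTPS proxy points at a url-map (host/path routing rules)
gcloud compute url-maps list

# each url-map routes to one or more backend services (GKE NEGs, instance groups, serverless NEGs)
gcloud compute backend-services list --global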

In order to capture all incoming requests to our public GCP HTTPS LB and direct them to our maintenance Cloud Run/Function, we will modify the existing target-https-proxies object. Instead of pointing at the url-maps object of your GKE cluster, we will instead point it at the url-maps object of the maintenance backend service, as shown below.

You may question why we are modifying at this level, instead of modifying the url-maps default service or updating the membership or capacity/rate levels of the backend-services to make this change.

We could make this change by modifying the existing url-maps default service, but non-trivial production configurations have additional path-matcher and host rules that would need to be removed, which introduces complexity.

However, we could not make this change at the backend-services level, because a backend-service for a serverless NEG cannot have a health check, while a backend-service for an unmanaged instance group or a regular NEG requires one. So even if we wanted to do the heavy work of shifting membership and rates/capacity of the backend service, this approach is not technically viable.
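
As a quick illustration of that constraint (hypothetical resource names, not part of this walkthrough and not required for it): a VM-based backend-service is created with a health check attached, while a backend-service intended for a serverless NEG is created without one, since serverless NEGs do not support health checks.

# hypothetical VM-based backend: a health check is attached at creation
gcloud compute health-checks create http example-hc --port=80
gcloud compute backend-services create example-vm-backend --load-balancing-scheme=EXTERNAL --protocol=HTTP --health-checks=example-hc --global

# hypothetical serverless NEG backend (same pattern used later in this article): no health check
gcloud compute backend-services create example-serverless-backend --load-balancing-scheme=EXTERNAL --global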

Independent backend #1, Cloud Run maintenance app

The first independent maintenance application we will create is a Python-based Cloud Run application.

# grab project code
git clone https://github.com/fabianlee/gcp-cloudrun-maintenance-app.git
cd gcp-cloudrun-maintenance-app

# deploy and do test curl
$ ./deploy_to_cloudrun.sh
...
Cloud Run URL: https://gcp-cloudrun-maintenance-app-2t3stxot4q-ue.a.run.app
...

From a browser, the Cloud Run app looks like below.
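
If you prefer a command-line check, a quick curl against the URL printed by the deploy script (yours will differ from the example URL above; the 'run_url' variable below is just a placeholder for that value) should return the maintenance page and its HTTP status code.

# substitute the URL printed by your deploy script
run_url=https://gcp-cloudrun-maintenance-app-xxxxx-ue.a.run.app

# show just the HTTP status code, then the maintenance page body
curl -s -o /dev/null -w "%{http_code}\n" $run_url
curl -s $run_url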

With the deployment proven, we create a regional serverless Network Endpoint Group from the Cloud Run application, which then lets us create a backend-service and url-map.

# show Cloud Run apps
gcloud run services list

# setup variables
app_name=gcp-cloudrun-maintenance-app

# get region of Cloud Run app
region=$(gcloud run services list --format="value(metadata.labels)" --filter="metadata.name = $app_name" | grep -Po "location=\K[^;]*")
gcloud run services describe $app_name --region=$region

# create regional serverless NEG from Cloud Run app
gcloud compute network-endpoint-groups create ${app_name}-neg --region=$region --network-endpoint-type=serverless --cloud-run-service=$app_name
gcloud compute network-endpoint-groups list

# backend is global to match global GCP LB
gcloud compute backend-services create ${app_name}-backend --load-balancing-scheme=EXTERNAL --global
gcloud compute backend-services add-backend ${app_name}-backend --global --network-endpoint-group=${app_name}-neg --network-endpoint-group-region=$region

# show backend
gcloud compute backend-services list --global
gcloud compute backend-services describe ${app_name}-backend --global --format=json

# create global url-map (shows up as load balancer name in console web UI)
gcloud compute url-maps create ${app_name}-lb1 --default-service=${app_name}-backend --global
gcloud compute url-maps add-path-matcher ${app_name}-lb1 --path-matcher-name=${app_name}-hostmatcher --path-rules="/*=${app_name}-backend" --default-service=${app_name}-backend --global
gcloud compute url-maps describe ${app_name}-lb1

# get url-map link
cloud_run_url_map_link=$(gcloud compute url-maps describe ${app_name}-lb1 --format="value(selfLink)")
echo "cloud_run_url_map_link = $cloud_run_url_map_link"

This ‘cloud_run_url_map_link’ is the value we will later substitute into the target-https-proxies to swap the load balancer traffic.

Independent backend #2, Cloud Function maintenance service

The second independent maintenance service we will create is a Python-based Cloud Function.

# get project code
git clone https://github.com/fabianlee/gcp-https-lb-vms-cloudfunc.git
cd gcp-https-lb-vms-cloudfunc
cd roles/gcp-cloud-function-gen2/files

# deploy and do test curl
$ ./deploy_to_cloud_function.sh
...
Cloud Function at: https://maintgen2-2t3stxot4q-ue.a.run.app
...

From a browser, the Cloud Function looks like below.

With the deployment proven, we create a regional serverless Network Endpoint Group from the Cloud Function, which then lets us create a backend-service and url-map.

# show Cloud Functions
gcloud functions list
# notice Function also has entry in Cloud Run list
gcloud run services list

# setup variables
app_name=maintgen2

# get region of Cloud Function, available from its Cloud Run object
region=$(gcloud run services list --format="value(metadata.labels)" --filter="metadata.name = $app_name" | grep -Po "location=\K[^;]*")
gcloud functions describe $app_name --region=$region

# create regional serverless NEG from Cloud Function
gcloud compute network-endpoint-groups create ${app_name}-neg --region=$region --network-endpoint-type=serverless --cloud-function-name=$app_name
gcloud compute network-endpoint-groups list

# backend is global to match global GCP LB
gcloud compute backend-services create ${app_name}-backend --load-balancing-scheme=EXTERNAL --global
gcloud compute backend-services add-backend ${app_name}-backend --global --network-endpoint-group=${app_name}-neg --network-endpoint-group-region=$region

# show backend
gcloud compute backend-services list --global
gcloud compute backend-services describe ${app_name}-backend --global --format=json

# create global url-map (shows up as load balancer name in console web UI)
gcloud compute url-maps create ${app_name}-lb1 --default-service=${app_name}-backend --global
gcloud compute url-maps add-path-matcher ${app_name}-lb1 --path-matcher-name=${app_name}-hostmatcher --path-rules="/*=${app_name}-backend" --default-service=${app_name}-backend --global
gcloud compute url-maps describe ${app_name}-lb1

# get url-map link
cloud_function_url_map_link=$(gcloud compute url-maps describe ${app_name}-lb1 --format="value(selfLink)")
echo "cloud_function_url_map_link = $cloud_function_url_map_link"

This ‘cloud_function_url_map_link’ is the value we will later substitute into the target-https-proxies to swap the load balancer traffic.

Swap url-maps to change Load balancer traffic during maintenance

Save current Load Balancer settings

The first step is to save the current url-maps link, so traffic can be re-routed after the maintenance period is over.

# view Ingress objects deployed to GKE cluster
kubectl get Ingress -A

# select the 'Ingress' object representing your target LB
k8s_ingress_ns=<namespaceOfIngress>
k8s_ingress_name=<nameOfIngress>
kubectl get ingress $k8s_ingress_name -n $k8s_ingress_ns

# show load balancer used by cluster Ingress
lb_name=$(kubectl get ingress -n $k8s_ingress_ns $k8s_ingress_name -o=custom-columns=LB:".metadata.annotations.ingress\.kubernetes\.io/url-map" --no-headers)
lb_ip_name=$(kubectl get ingress $k8s_ingress_name -n $k8s_ingress_ns -o=jsonpath="{.metadata.annotations.kubernetes\.io/ingress\.global-static-ip-name}")
echo "cluster is using GCP LB name/IP name: $lb_name/$lb_ip_name"
gcloud compute url-maps list
gcloud compute url-maps describe $lb_name

# show target-proxies used by LB
target_proxy_name=$(kubectl get ingress -n $k8s_ingress_ns $k8s_ingress_name -o=custom-columns=LB:".metadata.annotations.ingress\.kubernetes\.io/https-target-proxy" --no-headers)
echo "cluster is using target proxy: $target_proxy_name"
gcloud compute target-https-proxies describe $target_proxy_name

# show url-map used by LB
saved_url_map=$(gcloud compute target-https-proxies describe $target_proxy_name --format="value(urlMap)")
echo "saved_url_map = $saved_url_map"

This ‘saved_url_map’ value is the one we will replace in the target-https-proxies when the maintenance period is over.

Disable Kubernetes Ingress to GCP synchronization

The Kubernetes Ingress controller periodically synchronizes its expected values to the GCP Load Balancer. If this is left enabled, it will revert the changes we are making to the underlying target-https-proxies, so we need to pause this synchronization process.

The easiest way to do this is to intentionally “break” it by setting an incorrect value in the Ingress annotations.  This will halt the Ingress synchronization.

echo "existing name of global LB IP is: $lb_ip_name"

# overwrite annotation with non-existent name
kubectl annotate --overwrite ingress $k8s_ingress_name -n $k8s_ingress_ns 'kubernetes.io/ingress.global-static-ip-name'=DOESNOTEXIST

# will now say 'Error syncing to GCP', which is what we want
kubectl describe ingress $k8s_ingress_name -n $k8s_ingress_ns  --show-events | grep Error

I have chosen to use the global-static-ip-name annotation, but any of the standard annotations would work.

Swap in traffic from Cloud Run application

# swap in Cloud Run url-map
gcloud compute target-https-proxies update $target_proxy_name --url-map="$cloud_run_url_map_link" --global

Within 60-240 seconds, you may see several brief blips in availability while the GCP resources are updated, and then all traffic to the external HTTPS LB will be served by the Cloud Run application.
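
If you want to watch the cutover happen, a simple polling loop against the LB's public IP works. This sketch assumes the static IP name saved earlier in 'lb_ip_name', and uses curl -k because the LB certificate will not match the raw IP (substitute your real hostname if you prefer).

# resolve the LB public IP from the global static IP name saved earlier
lb_ip=$(gcloud compute addresses describe $lb_ip_name --global --format="value(address)")

# poll every 5 seconds; the HTTP status code/body changes once the maintenance url-map is live
while true; do
  date +%T
  curl -sk -o /dev/null -w "%{http_code}\n" https://$lb_ip/
  sleep 5
done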

Swap in traffic from Cloud Function service

In the same manner as above, the traffic can be rerouted to the url-maps of our Cloud Function.

# swap in Cloud Function url-map
gcloud compute target-https-proxies update $target_proxy_name --url-map="$cloud_function_url_map_link" --global

Within 60-240 seconds, you may see several brief blips in availability while the GCP resources are updated, and then all traffic to the external HTTPS LB will be served by the Cloud Function.

Restore original Load Balancer settings

When the scheduled maintenance period is over, and the GKE cluster and services are ready to serve traffic again, the original url-maps link can be swapped back in and Ingress synchronization re-enabled by correcting the annotation value.

# restore original url-map
gcloud compute target-https-proxies update $target_proxy_name --url-map="$saved_url_map" --global

# resume normal Ingress synchronization by correcting the annotation
kubectl annotate --overwrite ingress $k8s_ingress_name -n $k8s_ingress_ns 'kubernetes.io/ingress.global-static-ip-name'=$lb_ip_name

Within 60-240 seconds, you may see several brief blips in availability while the GCP resources are updated, and then traffic will be restored to the GKE cluster and the Anthos Service Mesh.

 

REFERENCES

fabianlee github, Python Flask/gunicorn Cloud Run maintenance app

fabianlee github, Python Cloud Function maintenance app

google ref, what is Cloud Run

google codelabs, example code for Cloud Run

google ref, gcloud run deploy

google ref, gcloud run services describe

google ref, Artifact registry and creating docker repo

google ref, Python runtimes available

google ref, Cloud Run with sample Python

google ref, Cloud Run troubleshooting

google ref, Cloud Run building containers

google ref, balancing modes (conn,rate,utilization)

google ref, external HTTPS LB for Cloud Functions

google ref, serverless neg concepts and diagrams

google ref, external HTTPS LB overview

google ref, decision tree for choosing a load balancer type

google ref, external HTTPS LB architecture explanation and diagram

google ref, backend-services

google ref, target-https-proxies update

google ref, creating LB health checks

google ref, http-health-checks

Kishore Jagganath, nuts and bolts of NEGs