GCP: Enabling autoUpgrade for node-pools to reduce manual maintenance

GKE cluster upgrades do not need to be a manual process. A GKE cluster can be upgraded automatically by subscribing it to an appropriate release channel and assigning a sensible maintenance window. As long as adequate pod disruption budgets, replica counts, and ingress are configured, these upgrades can happen without interrupting availability.

To check the current settings of your GKE cluster, use the commands below.

# get list of clusters and their locations
gcloud container clusters list

# set cluster name from list above
cluster_name=xxxxxx
# set either --region or --zone from cluster above
location_flag="--region=xxxxx"

# list node pools, then set node pool name from the output
gcloud container node-pools list --cluster=$cluster_name $location_flag
nodepool_name=xxxxx

# which release channel, if any?
gcloud container clusters describe $cluster_name $location_flag --format="value(releaseChannel)"
# is autoUpgrade true for each node pool?
gcloud container clusters describe $cluster_name $location_flag --format="value(nodePools.management)"

# maintenance window
gcloud container clusters describe $cluster_name $location_flag --format="value(maintenancePolicy)"
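The three checks above can be bundled into one helper so the same values are printed for any cluster. This is a minimal sketch, not an official tool: the function names `gke_location_flag` and `gke_upgrade_settings` are made up for this article, and the zone-versus-region test assumes the usual GCP naming where zones end in a dash plus a single letter (for example `us-central1-a`) and regions do not.

```shell
# Hypothetical helpers; the names are illustrative and not part of gcloud.

# Print "--zone=<loc>" when the location ends in "-<letter>" (e.g. us-east1-b),
# otherwise "--region=<loc>". Assumes standard GCP location naming.
gke_location_flag() {
  case "$1" in
    *-[a-z]) echo "--zone=$1" ;;
    *)       echo "--region=$1" ;;
  esac
}

# Print release channel, per-pool management settings, and maintenance policy.
# usage: gke_upgrade_settings <cluster_name> <location>
gke_upgrade_settings() {
  local cluster="$1" flag field
  flag=$(gke_location_flag "$2")
  for field in releaseChannel nodePools.management maintenancePolicy; do
    echo "== $field"
    gcloud container clusters describe "$cluster" "$flag" --format="value($field)"
  done
}
```

Calling `gke_upgrade_settings "$cluster_name" us-central1` would run the same describe commands shown above with `--region=us-central1`.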

Example: Cluster set to auto-upgrade on STABLE channel, but only on weekend nights

For a production cluster, you may want to stay current on the well-tested, slower-moving STABLE channel and upgrade only on Friday and Saturday nights, so that any issues can be corrected before many customers are affected.

# view versions available in STABLE channel
gcloud container get-server-config --format "json(channels)" $location_flag | jq '.channels[] | select(.channel=="STABLE")'

# set release channel at cluster level
gcloud container clusters update $cluster_name $location_flag --release-channel STABLE

# maintenance time set to FRI+SAT 0500-1000 UTC (1a-6a ET)
# the date is only a template; pick any Friday
gcloud container clusters update $cluster_name $location_flag --maintenance-window-start="2022-05-13T05:00:00Z" --maintenance-window-end "2022-05-13T10:00:00Z" --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=FR,SA"

# enable auto-upgrade at node pool level
# this WILL cause a node pool rebuild if first time set
gcloud container node-pools update $nodepool_name --cluster=$cluster_name $location_flag --enable-autoupgrade
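Only the channel, window times, and weekdays differ between environments, so the three update commands can be generated from parameters and reviewed before anything is applied. The sketch below assumes a hypothetical function name, `gke_autoupgrade_plan`, and placeholder cluster, region, and pool names; it only prints the commands from this example and never calls gcloud itself.

```shell
# Hypothetical helper: prints the commands for review, does not run them.
# usage: gke_autoupgrade_plan <cluster> <location_flag> <nodepool> <channel> <start> <end> <days>
gke_autoupgrade_plan() {
  local cluster="$1" loc="$2" pool="$3" channel="$4" start="$5" end="$6" days="$7"
  echo "gcloud container clusters update $cluster $loc --release-channel $channel"
  echo "gcloud container clusters update $cluster $loc" \
       "--maintenance-window-start=\"$start\"" \
       "--maintenance-window-end=\"$end\"" \
       "--maintenance-window-recurrence=\"FREQ=WEEKLY;BYDAY=$days\""
  echo "gcloud container node-pools update $pool --cluster=$cluster $loc --enable-autoupgrade"
}

# Example matching the STABLE weekend schedule (placeholder names)
gke_autoupgrade_plan my-cluster --region=us-east1 default-pool STABLE \
  2022-05-13T05:00:00Z 2022-05-13T10:00:00Z FR,SA
```

Printing first keeps the node-pool change, which the comment above warns may rebuild the pool, as a deliberate last step.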

Example: Cluster set to auto-upgrade on RAPID channel, but only on weekday mornings

For a development cluster, you may want to stay on the leading-edge RAPID channel, with upgrades happening during North American mornings, when issues can be addressed quickly by engineering groups as they start their work day.

# view versions available in RAPID channel
gcloud container get-server-config --format "json(channels)" $location_flag | jq '.channels[] | select(.channel=="RAPID")'

# set release channel at cluster level
gcloud container clusters update $cluster_name $location_flag --release-channel RAPID

# maintenance time set to MON-FRI 1000-1500 UTC (6a-11a ET)
# the date is only a template; pick any Monday
gcloud container clusters update $cluster_name $location_flag --maintenance-window-start="2022-05-16T10:00:00Z" --maintenance-window-end "2022-05-16T15:00:00Z" --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=MO,TU,WE,TH,FR"

# enable auto-upgrade at node pool level
# this WILL cause a node pool rebuild if first time set
gcloud container node-pools update $nodepool_name --cluster=$cluster_name $location_flag --enable-autoupgrade
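Following the comment above that any date on the right weekday will do, the date can be computed instead of looked up on a calendar. This is a minimal sketch that assumes GNU `date` (the `-d "@<epoch>"` form is not available in BSD/macOS `date`); `next_weekday_utc` is a hypothetical helper name.

```shell
# Print the next UTC date (YYYY-MM-DD) that falls on the given ISO weekday.
# Weekday numbering follows `date +%u`: 1=Monday ... 7=Sunday.
# Assumes GNU date.
next_weekday_utc() {
  local target="$1" now today delta
  now=$(date -u +%s)
  today=$(date -u -d "@$now" +%u)
  # days to add, always 1..7 so the result is strictly in the future
  delta=$(( (target - today + 6) % 7 + 1 ))
  date -u -d "@$(( now + delta * 86400 ))" +%Y-%m-%d
}

# Build a 10:00-15:00 UTC window anchored on the next Monday
day=$(next_weekday_utc 1)
echo "--maintenance-window-start=${day}T10:00:00Z --maintenance-window-end=${day}T15:00:00Z"
```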

Example: Disable cluster auto-upgrade

If you experience availability issues during these maintenance windows, disable auto-upgrade and perform manual upgrades on a development cluster, paying close attention to pod disruption budgets and the number of replicas used for deployments.

Disabling auto-upgrade and clearing the release channel does not require rebuilding the node pool. These settings can be applied without any disruption to service.

# set empty release channel at cluster level
gcloud container clusters update $cluster_name $location_flag --release-channel None

# disable autoUpgrade at node-pool level
gcloud container node-pools update $nodepool_name --cluster=$cluster_name $location_flag --no-enable-autoupgrade
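After toggling the setting in either direction, it can be worth reading it back from the node pool rather than assuming the update took effect. The sketch below wraps `gcloud container node-pools describe` in a hypothetical helper, `gke_autoupgrade_enabled`; it assumes the `value()` format prints `True` for an enabled pool and prints nothing or `False` otherwise.

```shell
# Hypothetical helper: exits 0 only when the pool reports autoUpgrade as true.
# usage: gke_autoupgrade_enabled <nodepool> <cluster> <location_flag>
gke_autoupgrade_enabled() {
  local state
  state=$(gcloud container node-pools describe "$1" --cluster="$2" "$3" \
            --format="value(management.autoUpgrade)")
  case "$state" in
    True|true) return 0 ;;
    *)         return 1 ;;   # empty or False is treated as disabled
  esac
}
```

A typical check would be `gke_autoupgrade_enabled "$nodepool_name" "$cluster_name" "$location_flag" && echo "auto-upgrade is on"`.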

NOTES

List recent cluster-level operations, including completed upgrades:

gcloud container operations list --sort-by=start_time

List the versions available for each channel type:

gcloud container get-server-config --format "yaml(channels)" $location_flag

To speed up node pool recreation, raise the surge upgrade settings:

# current values
gcloud container clusters describe $cluster_name $location_flag --format="value(nodePools.upgradeSettings)"

# set higher values for more parallelism of node rebuild
gcloud container node-pools update $nodepool_name --cluster=$cluster_name --max-surge-upgrade=3 --max-unavailable-upgrade=3 $location_flag
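As a rough guide to what these values buy: with surge upgrades, GKE works on up to max-surge plus max-unavailable nodes at a time, so the number of upgrade rounds for a pool is roughly the node count divided by that sum, rounded up. The sketch below does only this arithmetic. The function name `upgrade_rounds` is hypothetical, and real upgrade time also depends on pod eviction and node startup.

```shell
# Estimate how many rounds a node pool upgrade needs.
# usage: upgrade_rounds <node_count> <max_surge> <max_unavailable>
upgrade_rounds() {
  local nodes="$1" parallel=$(( $2 + $3 ))
  if [ "$parallel" -le 0 ]; then
    echo "max-surge + max-unavailable must be at least 1" >&2
    return 1
  fi
  # integer ceiling of nodes / parallel
  echo $(( (nodes + parallel - 1) / parallel ))
}

upgrade_rounds 12 1 0   # one node at a time: prints 12
upgrade_rounds 12 3 3   # the values set above: prints 2
```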