Anthos GKE on-prem is a managed platform that brings GKE clusters to on-premises datacenters. In this article, I will walk through the steps required to perform a minor-version upgrade from 1.9.1 to 1.9.2 on VMware.
I will be using the same environment and config files described in my Anthos 1.9 installation article.
Overview
The proper order for a minor-version upgrade is:
- Download the newer gkeadm tool
- Upgrade the Admin Workstation
- Prepare the regular bundle
- Upgrade the User Clusters
- Upgrade the Admin Cluster
Log in to the initial seed VM
The initial seed VM is the guest used to create the Admin Workstation.
cd anthos-nested-esx-manual
export project_path=$(realpath .)

# login to initial VM used to create the Admin Workstation
ssh -i ./tf-kvm-seedvm/id_rsa ubuntu@192.168.140.220
Download the newer gkeadm tool (from the Seed VM)
Download the newer 1.9 gkeadm tool. From the previous installation we have gkeadm version 1.9.1-gke.6; now we need to download version 1.9.2-gke.4.
cd ~/seedvm
gsutil cp gs://gke-on-prem-release/gkeadm/1.9.2-gke.4/linux/gkeadm ./gkeadm192
chmod +x gkeadm192
./gkeadm192 version
Upgrade the Admin Workstation (from the Seed VM)
Use the admin-ws-config.yaml that was used to initially set up the Admin Workstation, along with the resulting information file, whose name matches the name of the Admin Workstation (gke-admin-ws).
$ ./gkeadm192 upgrade admin-workstation --config admin-ws-config.yaml --info-file gke-admin-ws
Running validations...
- Validation Category: Tools
    - [SUCCESS] gcloud
    - [SUCCESS] ssh
    - [SUCCESS] ssh-keygen
    - [SUCCESS] scp
- Validation Category: Config Check
    - [SUCCESS] Config
- Validation Category: Internet Access
    - [SUCCESS] Internet access to required domains
- Validation Category: GCP Access
    - [SUCCESS] Read access to GKE on-prem GCS bucket
- Validation Category: vCenter
    - [SUCCESS] Credentials
    - [SUCCESS] vCenter Version
    - [SUCCESS] ESXi Version
    - [SUCCESS] Datacenter
    - [SUCCESS] Datastore
    - [SUCCESS] Resource Pool
    - [SUCCESS] Folder
    - [SUCCESS] Network
All validation results were SUCCESS.

Upgrading admin workstation "gke-admin-ws" from version "1.9.1-gke.6" to version "1.9.2-gke.4"...
Generating local backup of admin workstation VM "gke-admin-ws"... DONE
Reusing VM template "gke-on-prem-admin-appliance-vsphere-1.9.2-gke.4" that already exists in vSphere.
Do not cancel (double ctrl-c) while the admin workstation "gke-admin-ws" is being decommissioned. Doing so may result in an unrecoverable state.
Decommissioning original admin workstation VM "gke-admin-ws"... DONE
Do not cancel (double ctrl-c) once the new admin workstation VM has been created. Doing so may result in an unrecoverable state.
Creating admin workstation VM "gke-admin-ws-1-9-2-gke-4-1639101422"... DONE
Waiting for admin workstation VM "gke-admin-ws-1-9-2-gke-4-1639101422" to be assigned an IP.... DONE

******************************************
Admin workstation VM successfully created:
- Name:    gke-admin-ws-1-9-2-gke-4-1639101422
- IP:      192.168.140.221
- SSH Key: /home/ubuntu/.ssh/gke-admin-workstation
******************************************

Deleting admin workstation VM "gke-admin-ws"... DONE
Renaming new admin workstation "gke-admin-ws-1-9-2-gke-4-1639101422" to "gke-admin-ws"

Printing gkectl and docker versions on admin workstation...
gkectl version
gkectl 1.9.2-gke.4 (git-bc0f7f419)
Add --kubeconfig to get more version information.

docker version
Client:
 Version:           19.03.2
 API version:       1.40
 Go version:        go1.12.9
 Git commit:        6a30dfca03
 Built:             Tue Oct 13 16:38:16 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.2
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.9
  Git commit:       6a30dfca03
  Built:            Tue Oct 13 14:47:06 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11-0ubuntu0~18.04.1~anthos1
  GitCommit:
 runc:
  Version:          1.0.0~rc95-0ubuntu1~18.04.1~anthos1
  GitCommit:
 docker-init:
  Version:          0.18.0
  GitCommit:

Checking NTP server on admin workstation...
ntptime
ntp_gettime() returns code 0 (OK)
  time e55d346d.05a5119c  Fri, Dec 10 2021  2:05:33.022, (.022050549),
  maximum error 9500 us, estimated error 0 us, TAI offset 0
ntp_adjtime() returns code 0 (OK)
  modes 0x0 (),
  offset 188.028 us, frequency 0.000 ppm, interval 1 s,
  maximum error 9500 us, estimated error 0 us,
  status 0x2001 (PLL,NANO),
  time constant 2, precision 0.001 us, tolerance 500 ppm,

Getting component access service account...
Preparing "credential.yaml" for gkectl...
Copying files to admin workstation...
    - vcenter.ca.pem
    - anthos-allowlisted.json
    - /tmp/gke-on-prem-vcenter-credentials717079708/credential.yaml
Updating admin-cluster.yaml for gkectl...

********************************************************************
Admin workstation is ready to use.

WARNING: file already exists at "/home/ubuntu/seedvm/gke-admin-ws". Overwriting.
Admin workstation information saved to /home/ubuntu/seedvm/gke-admin-ws
This file is required for future upgrades
SSH into the admin workstation with the following command:
ssh -i /home/ubuntu/.ssh/gke-admin-workstation ubuntu@192.168.140.221
********************************************************************
This command backs up the files on your current Admin Workstation (kubeconfig, root certs, and JSON key files), creates a newer Admin Workstation, and then copies those files back onto it.
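As an optional sanity check, you can confirm from the seed VM that those files landed on the rebuilt workstation. This is just a sketch; it assumes the files sit in the ubuntu home directory as they did in my installation article, so adjust names and paths to your environment:

# hypothetical spot-check of the copied files on the new Admin WS
ssh -i /home/ubuntu/.ssh/gke-admin-workstation ubuntu@192.168.140.221 "ls -l vcenter.ca.pem anthos-allowlisted.json admin-cluster.yaml"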
As stated in the output of the command, you will see a new VM named “gke-admin-ws-1-9-2-gke-4-xxxxxx” appear in vCenter. This is only a temporary name; once the older Admin WS is deleted, the new VM is renamed back to “gke-admin-ws”.
The backing vmdk disk for the AdminWS (‘dataDiskName’ in admin-ws-config.yaml) is re-attached to this new VM.
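If you want to verify the re-attached disk from the CLI rather than the vCenter UI, one option is govc. This is only a sketch, under the assumption that govc is installed and the GOVC_URL/GOVC_USERNAME/GOVC_PASSWORD variables point at this vCenter; the backing file shown should match the ‘dataDiskName’ value:

# assumes govc is installed and GOVC_* environment variables are exported for this vCenter
# list the disks attached to the rebuilt Admin Workstation and their backing vmdk files
govc device.info -vm gke-admin-ws 'disk-*'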
Test the connection by checking the uptime of the new Admin WS.
ssh -i /home/ubuntu/.ssh/gke-admin-workstation ubuntu@192.168.140.221 "uptime"

# get AdminWS private key from seed, then exit back to host
cat /home/ubuntu/.ssh/gke-admin-workstation
exit
Then cat the gke-admin-workstation private key so we can copy it back to our host and log in to the new Admin WS.
cd $project_path/needed_on_adminws

# paste in AdminWS private key
vi gke-admin-workstation

# clear old fingerprint to AdminWS
ssh-keygen -f ~/.ssh/known_hosts -R 192.168.140.221

# login to new Admin WS
ssh -i $project_path/needed_on_adminws/gke-admin-workstation ubuntu@192.168.140.221

# reset the ssh server timeout (destroyed during rebuild)
./adminws_ssh_increase_timeout.sh

# back to host
exit
Prepare regular bundle (from the Admin WS)
The regular bundle should already be available, since this is the 1.9.2-gke.4 version of the Admin Workstation.
# login to new Admin WS
ssh -i $project_path/needed_on_adminws/gke-admin-workstation ubuntu@192.168.140.221

# view current bundles on Admin Workstation
# the new 1.9.2-gke.4 should already be here
$ ls -l /var/lib/gke/bundles
-rw-r--r-- 1 root root 6932607013 Jan  1  2000 gke-onprem-vsphere-1.9.2-gke.4-full.tgz
-rw-r--r-- 1 root root     278349 Jan  1  2000 gke-onprem-vsphere-1.9.2-gke.4.tgz

# view bundles currently in use by admin and user clusters
$ gkectl version --kubeconfig /home/ubuntu/kubeconfig --details
gkectl version: 1.9.2-gke.4 (git-bc0f7f419)
onprem user cluster controller version: 1.9.1-gke.6
current admin cluster version: 1.9.1-gke.6
current user cluster versions (VERSION: CLUSTER_NAMES):
- 1.9.1-gke.6: user1
available admin cluster versions:
- 1.9.1-gke.6
available user cluster versions:
- 1.9.1-gke.6
Info: The admin workstation and gkectl can be upgraded to "1.10" if needed.
Info: The admin cluster can't be upgraded to "1.10", because there are still "1.9" user clusters.
This shows us that the Admin and User clusters are still running 1.9.1-gke.6. Now we need to prepare the 1.9.2-gke.4 bundle.
# prepare bundle
$ gkectl prepare --bundle-path /var/lib/gke/bundles/gke-onprem-vsphere-1.9.2-gke.4.tgz --kubeconfig /home/ubuntu/kubeconfig
- Validation Category: Config Check
    - [SUCCESS] Config
- Validation Category: Internet Access
    - [SUCCESS] Internet access to required domains
- Validation Category: GCP
    - [SUCCESS] GCP service
    - [SUCCESS] GCP service account
- Validation Category: Container Registry
    - [SUCCESS] Docker registry access
- Validation Category: VCenter
    - [SUCCESS] Credentials
    - [SUCCESS] vCenter Version
    - [SUCCESS] ESXi Version
    - [SUCCESS] Datacenter
    - [SUCCESS] Datastore
    - [SUCCESS] Resource pool
    - [SUCCESS] Folder
    - [SUCCESS] Network
All validation results were SUCCESS.

Logging in to gcr.io/gke-on-prem-release
Reusing VM template "gke-on-prem-ubuntu-1.9.2-gke.4" that already exists in vSphere.
Reusing VM template "gke-on-prem-cos-1.9.2-gke.4" that already exists in vSphere.
Applying Bundle CRD YAML... DONE
Applying Bundle CRs... DONE
Applied bundle in the admin cluster.
Upgrade the User Clusters (from the Admin WS)
Before upgrading to 1.9.2-gke.4, make sure the cluster is registered under Anthos > Clusters in the Cloud Console (https://console.cloud.google.com).
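One way to confirm registration without the Console is the gcloud CLI. A minimal sketch, assuming gcloud on the Admin WS is authenticated against the same Google Cloud project used for registration:

# list Anthos hub/fleet memberships; the user cluster should appear in this list
gcloud container hub memberships list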
Also, there needs to be at least one free IP address in the user-block.yaml file to accommodate the serial creation of a new worker node.
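A rough way to check this, assuming the block file defines one ‘ip:’ entry per address as in my installation article, is to compare the number of defined addresses against the nodes currently consuming them:

# number of static IPs defined for user cluster nodes
grep -c 'ip:' user-block.yaml
# number of user cluster nodes currently using an address
kubectl --kubeconfig user1-kubeconfig get nodes --no-headers | wc -l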
Run the gkectl command as shown below, using the Admin Cluster kubeconfig and the config used to originally set up the User Cluster.
$ gkectl upgrade cluster --kubeconfig /home/ubuntu/kubeconfig --config user-cluster.yaml -v 3
Reading config with version "v1"
- Validation Category: Config Check
    - [SUCCESS] Config
- Validation Category: Ingress
    Running validation check for "User cluster Ingress"... /
    W1210 03:03:57.549068    3916 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1
    - [SUCCESS] User cluster Ingress
- Validation Category: OS Images
    - [SUCCESS] User OS images exist
- Validation Category: Cluster Health
    Running validation check for "Admin cluster health"... |
    - [SUCCESS] Admin cluster health
    Running validation check for "Admin PDB"... |
    W1210 03:04:00.106744    3916 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable i
    - [SUCCESS] Admin PDB
    - [SUCCESS] User cluster health
    Running validation check for "User PDB"... |
    W1210 03:04:01.541725    3916 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable i
    - [SUCCESS] User PDB
- Validation Category: Reserved IPs
    - [SUCCESS] Admin cluster reserved IP for upgrading cluster
    - [SUCCESS] User cluster reserved IP for upgrading a user cluster
- Validation Category: GCP
    - [SUCCESS] GCP service
    - [SUCCESS] GCP service account
- Validation Category: Container Registry
    - [SUCCESS] Docker registry access
- Validation Category: VCenter
    - [SUCCESS] Credentials
    - [SUCCESS] VSphere CSI Driver
All validation results were SUCCESS.

Upgrading to bundle version: "1.9.2-gke.4"
Updating onprem cluster controller component status in admin package deployment... DONE
Upgrading onprem user cluster controller...
...
Upgrading onprem user cluster controller... -
Upgrading onprem user cluster controller... DONE
Reading config with version "v1"
Seesaw Upgrade Summary:
  OS Image updated (old -> new): "gke-on-prem-ubuntu-1.9.1-gke.6" -> "gke-on-prem-ubuntu-1.9.2-gke.4"
Upgrading loadbalancer "seesaw-for-user1"
Deleting LB VM: seesaw-for-user1-lhpp8lnvk9-1... DONE
Creating new LB VMs... DONE
Saved upgraded Seesaw group information of "seesaw-for-user1" to file: seesaw-for-user1.yaml
Waiting LBs to become ready... DONE
Updating create-config secret... DONE
Loadbalancer "seesaw-for-user1" is successfully upgraded.
Skipping admin cluster backup since clusterBackup section is not set in admin cluster seed config
Waiting for user cluster "user1" to be ready... |
Waiting for user cluster "user1" to be ready... DONE
Creating or updating user cluster control plane workloads: deploying user-kube-apiserver-base, user-control-plane-base, user-control-plane-clusterapi-vsphere, user-control-plane-etcddefrag, user-control-plane-konnectivity-server: 0/1 statefulsets are ready...
Creating or updating user cluster control plane workloads...
Creating or updating user cluster control plane workloads: 15/16 pods are ready...
Creating or updating node pools: pool-1: hasn't been seen by controller yet...
Creating or updating node pools: pool-1: 1/3 replicas are updated...
Creating or updating node pools: pool-1: 2/3 replicas are updated...
Creating or updating node pools: pool-1: Creating or updating node pool...
Creating or updating node pools: pool-1: 4/3 replicas show up...
Creating or updating addon workloads: 3/4 machines are ready...
Creating or updating addon workloads: 44/50 pods are ready...
Cluster is running...
Skipping admin cluster backup since clusterBackup section is not set in admin cluster seed config
Done upgrading user cluster user1.
Done upgrade
This upgrades the load balancers first, then the User Cluster control plane and finally the User Cluster worker nodes.
During this process, in vCenter you will see the newer “gke-on-prem-ubuntu-1.9.2-gke.4” template being cloned as newer worker nodes are spun up to replace the older versions.
Also, invocations of “kubectl get nodes” will show node versions being updated serially as new nodes are brought in and older ones deleted. There are small windows of time when there are N+1 worker nodes.
# called half-way through the upgrade process
# notice one more than usual 3 nodes
# 1 disabled older, soon to be destroyed
# 2 worker nodes at older version, and 1 at newer version
$ kubectl --kubeconfig user1-kubeconfig get nodes
NAME         STATUS                        ROLES    AGE     VERSION
user-host1   Ready                         <none>   3h37m   v1.21.5-gke.400
user-host2   NotReady,SchedulingDisabled   <none>   3h37m   v1.21.5-gke.400
user-host3   Ready                         <none>   3h37m   v1.21.5-gke.400
user-host4   Ready                         <none>   2m39s   v1.21.5-gke.1200
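Rather than re-running that command by hand during the rollout, you can poll it on an interval; a simple option, assuming the watch utility is present on the Admin WS:

# refresh the node list every 30 seconds while the node pool rolls over
watch -n 30 "kubectl --kubeconfig user1-kubeconfig get nodes"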
If this upgrade failed half-way through, you would invoke the exact same command but with the “--skip-validation-all” flag to resume the upgrade.
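For example, a resumed run would look like the original invocation with the extra flag (only needed if the first attempt died part-way):

# resume a partially completed user cluster upgrade, skipping the preflight validations
gkectl upgrade cluster --kubeconfig /home/ubuntu/kubeconfig --config user-cluster.yaml --skip-validation-all -v 3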
Checking kubectl version at this point would show the User Cluster at the newer version, while the Admin cluster is still at the older version.
# User Cluster server version at newer v1.21.5-gke.1200
$ kubectl --kubeconfig user1-kubeconfig version | grep 'Server Version'
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5-gke.1200", GitCommit:"90a16981ade07f163a0233adb631b42ac1fc53ff", GitTreeState:"clean", BuildDate:"2021-10-04T09:25:23Z", GoVersion:"go1.16.7b7", Compiler:"gc", Platform:"linux/amd64"}

# Admin Cluster server version still at older 1.21.5-gke.400
$ kubectl --kubeconfig kubeconfig version | grep 'Server Version'
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5-gke.400", GitCommit:"d26178e3a4094b39049028dbfb24f716f342de3b", GitTreeState:"clean", BuildDate:"2021-09-22T09:29:07Z", GoVersion:"go1.16.7b7", Compiler:"gc", Platform:"linux/amd64"}
Upgrade the Admin Cluster (from the Admin WS)
There needs to be at least one free IP address in the admin-block.yaml file to accommodate the serial creation of new master nodes.
Run the gkectl command as shown below, using the Admin Cluster kubeconfig and the config used to originally set up the Admin Cluster.
$ gkectl upgrade admin --kubeconfig kubeconfig --config admin-cluster.yaml -v 3
Reading config with version "v1"
Reading bundle at path: "/var/lib/gke/bundles/gke-onprem-vsphere-1.9.2-gke.4-full.tgz".
- Validation Category: Config Check
    - [SUCCESS] Config
- Validation Category: OS Images
    - [SUCCESS] Admin OS images exist
- Validation Category: Cluster Health
    Running validation check for "Admin cluster health"... |
    - [SUCCESS] Admin cluster health
    - [SUCCESS] Admin PDB
    - [SUCCESS] All user clusters health
- Validation Category: Reserved IPs
    - [SUCCESS] Admin cluster reserved IP for upgrading cluster
- Validation Category: GCP
    - [SUCCESS] GCP service
    - [SUCCESS] GCP service account
- Validation Category: Container Registry
    - [SUCCESS] Docker registry access
- Validation Category: VCenter
    - [SUCCESS] Credentials
All validation results were SUCCESS.

Upgrading to bundle version "1.9.2-gke.4"
Reading config with version "v1"
Seesaw Upgrade Summary:
  OS Image updated (old -> new): "seesaw-os-image-v1.8-20211118-3370aa09b5" -> "gke-on-prem-ubuntu-1.9.2-gke.4"
Upgrading loadbalancer "seesaw-for-gke-admin"
Deleting LB VM: seesaw-for-gke-admin-v88mp6gkgm-1... DONE
Creating new LB VMs... DONE
Saved upgraded Seesaw group information of "seesaw-for-gke-admin" to file: seesaw-for-gke-admin.yaml
Waiting LBs to become ready... DONE
Updating create-config secret... DONE
Loadbalancer "seesaw-for-gke-admin" is successfully upgraded.
Skipping admin cluster backup since clusterBackup section is not set in admin cluster seed config
Creating cluster "gkectl" ...
DEBUG: docker/images.go:67] Pulling image: gcr.io/gke-on-prem-release/kindest/node:v0.11.1-gke.25-v1.21.5-gke.1200
...
 ✓ Ensuring node image (gcr.io/gke-on-prem-release/kindest/node:v0.11.1-gke.25-v1.21.5-gke.1200) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✓ Waiting ≤ 5m0s for control-plane = Ready ⏳
 • Ready after 34s 💚
Waiting for external cluster control plane to be healthy... |
Waiting for external cluster control plane to be healthy... DONE
Applying admin bundle to external cluster
Applying Bundle CRD YAML... DONE
Applying Bundle CRs... DONE
W1205 21:52:08.950054    5945 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
...
Waiting for external cluster cluster-api to be ready... DONE
Resuming existing create/update internal cluster with existing kubeconfig at /home/ubuntu/kubeconfig. Remove file or specify another path if resume is not desired
Pivoting existing Cluster API objects from internal to external cluster... DONE
Waiting for cluster to be ready for external cluster... DONE
Provisioning master vm for internal cluster via external cluster
Creating cluster object gke-admin-t9dpd on external cluster... DONE
Applying admin master bundle components... /
...
Applying admin master bundle components... DONE
Creating master... DONE
Updating admin cluster checkpoint... DONE
Updating external cluster object with master endpoint... DONE
Creating internal cluster
Getting internal cluster kubeconfig... DONE
Waiting for internal cluster control plane to be healthy... /
Waiting for internal cluster control plane to be healthy... DONE
Applying admin bundle to internal cluster
Applying Bundle CRD YAML... DONE
Applying Bundle CRs... -
I1205 22:06:52.439823    5945 request.go:655] Throttling request took 1.129237238s, request: GET:https://192.168.141.222
Applying Bundle CRs... DONE
...
Waiting for internal cluster cluster-api to be ready... DONE
Pivoting Cluster API objects from external to internal cluster... DONE
Waiting for admin addon and user master nodes in the internal cluster to become ready... DONE
Waiting for control plane to be ready... DONE
Waiting for kube-apiserver VIP to be configured on the internal cluster... DONE
Applying admin node bundle components... /
I1205 22:21:02.563106    5945 request.go:655] Throttling request took 1.119178224s, request: GET:https://192.168.141.222
Applying admin node bundle components... DONE
Creating node Machines in internal cluster... DONE
Pruning unwanted admin node bundle components... /
...
Applying admin addon bundle to internal cluster... DONE
Waiting for admin cluster system workloads to be ready... DONE
Waiting for admin cluster machines and pods to be ready... DONE
Pruning unwanted admin base bundle components... -
...
Pruning unwanted admin addon bundle components... DONE
Waiting for admin cluster system workloads to be ready... DONE
Waiting for admin cluster machines and pods to be ready... DONE
Cleaning up external cluster... DONE
Skipping admin cluster backup since clusterBackup section is not set in admin cluster seed config
Trigger reconcile on user cluster 'user1/user1-gke-onprem-mgmt' to upgrade its user master VMs to the same version "1.9.2-gke.4" as the admin cluster
Waiting for reconcile to complete... DONE
Done upgrading admin cluster.
This upgrades the load balancers first, then the Admin Cluster nodes.
Invocations of “kubectl get nodes” will show node versions being updated serially as new nodes are brought in and older ones deleted. There are small windows of time when there are N+1 nodes.
# called half-way through the upgrade process
# notice 2 nodes at older version, and 2 at newer version
$ kubectl --kubeconfig kubeconfig get nodes
NAME          STATUS   ROLES                  AGE     VERSION
admin-host1   Ready    control-plane,master   25m     v1.21.5-gke.1200
admin-host2   Ready    <none>                 4h49m   v1.21.5-gke.400
admin-host3   Ready    <none>                 4h49m   v1.21.5-gke.400
admin-host5   Ready    <none>                 6m26s   v1.21.5-gke.1200
Below you can see that kubectl now reports the newer version when connecting to the Admin Cluster.
# Admin Cluster server version now at newer v1.21.5-gke.1200
$ kubectl --kubeconfig kubeconfig version | grep 'Server Version'
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5-gke.1200", GitCommit:"90a16981ade07f163a0233adb631b42ac1fc53ff", GitTreeState:"clean", BuildDate:"2021-10-04T09:25:23Z", GoVersion:"go1.16.7b7", Compiler:"gc", Platform:"linux/amd64"}
Admin Cluster upgrades are not resumable; if the upgrade fails, you will want to contact Google Support.
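Before opening that support case, it is worth collecting diagnostics; gkectl has diagnose subcommands for this, sketched below against the Admin Cluster kubeconfig (the snapshot tarball name is chosen by the tool):

# health check of the admin cluster
gkectl diagnose cluster --kubeconfig /home/ubuntu/kubeconfig
# gather logs and configs into a snapshot tarball for Google Support
gkectl diagnose snapshot --kubeconfig /home/ubuntu/kubeconfig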
REFERENCES