GCP: troubleshooting nodepool replica changes for Anthos on-premise

If you are running gkectl commands that scale or modify your worker nodes and hit a problem, the first place to look is the gkectl logs. If you need to dig deeper, there are a couple of CRDs that can assist in troubleshooting.

First, make sure you have executed gkectl with high verbosity “-v5”, and examine the logs on the Admin Workstation at “/home/ubuntu/.config/gke-on-prem/logs”.
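As a rough sketch of that log review, the helper below lists the most recently modified files in the log directory and prints any lines mentioning an error or failure. The function name `scan_gkectl_logs` and the `error|fail` search pattern are assumptions for illustration; only the default directory comes from the path above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scan the newest gkectl log files for error lines.
# $1 = log directory (defaults to the Admin Workstation path from this article)
# $2 = how many of the newest files to scan (default 3)
scan_gkectl_logs() {
  local log_dir="${1:-/home/ubuntu/.config/gke-on-prem/logs}"
  local max_files="${2:-3}"
  # ls -1t sorts by modification time, newest first
  ls -1t "$log_dir" 2>/dev/null | head -n "$max_files" | while IFS= read -r f; do
    echo "== $f"
    # -i ignore case, -n show line numbers, -s stay quiet on unreadable entries
    grep -insE 'error|fail' "$log_dir/$f" || echo "(no matches)"
  done
}
```

Calling `scan_gkectl_logs` with no arguments scans the three newest files in the default directory; pass a different directory or count if your logs live elsewhere.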

If you need to dig deeper into why new or rebuilt worker nodes are not being provisioned, then look into the ‘machinedeployment’ and ‘machine’ CRDs.

# will show on-premise provider details and expected replica counts
kubectl describe machinedeployment
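To compare expected and actual replica counts at a glance, something like the following sketch may help. It assumes the machinedeployment objects expose `spec.replicas` and `status.readyReplicas` (as Cluster API style resources usually do); verify the field names against the `describe` output for your version. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print name, desired replicas and ready replicas,
# tab-separated, one machinedeployment per line.
# Assumes the fields spec.replicas and status.readyReplicas exist.
machinedeployment_replicas() {
  kubectl get machinedeployment \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.replicas}{"\t"}{.status.readyReplicas}{"\n"}{end}'
}

# Hypothetical filter: read the tab-separated lines on stdin and print only
# the deployments whose ready count differs from the desired count.
# A missing ready count is treated as 0.
flag_replica_mismatch() {
  awk -F'\t' '{
    ready = $3 + 0
    if ($2 + 0 != ready) print $1 ": desired=" $2 " ready=" ready
  }'
}
```

Usage would be `machinedeployment_replicas | flag_replica_mismatch`; no output means every node pool reports as many ready replicas as requested.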

Then examine the events on any ‘machine’ object that might indicate an error during the creation or modification of a worker node VM.

# describes every 'machine' object (use 'kubectl get machines' for a short list)
kubectl describe machines

# look at details of specific machine
# e.g. ipam error if IP cannot be allocated
kubectl describe machine <machineId>

# view any events involving Machine
kubectl get events --field-selector involvedObject.kind=Machine --all-namespaces
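On a busy cluster the event list can be long. A possible refinement, sketched below, narrows it to Warning events and sorts them by time so the most recent problems appear last. The function name is hypothetical; the `type` field selector and the `--sort-by` flag are standard kubectl options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show only Warning events that involve Machine objects,
# ordered by last occurrence (oldest first, newest at the bottom).
machine_warning_events() {
  kubectl get events --all-namespaces \
    --field-selector involvedObject.kind=Machine,type=Warning \
    --sort-by=.lastTimestamp
}
```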

If one of the ‘machine’ objects has an error, it can be deleted and the Admin Cluster will attempt to recreate it.
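A minimal sketch of that delete step follows. The wrapper name `recreate_machine` is hypothetical; it refuses to run without a machine id, deletes the object, and then lists the machines so you can check whether a replacement appears.

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete one errored 'machine' object, then list
# the remaining machines so the recreation attempt can be observed.
recreate_machine() {
  local machine_id="$1"
  if [ -z "$machine_id" ]; then
    echo "usage: recreate_machine <machineId>" >&2
    return 1
  fi
  kubectl delete machine "$machine_id"
  kubectl get machines
}
```

Re-run `kubectl get machines` after a short wait if the replacement is not listed immediately.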

Also, if additional static IP entries are required to support new node replicas, these can be added manually by editing the cluster as below.

# if new static IP definitions required
kubectl edit cluster

 

REFERENCES

google, gkectl update cluster

google, troubleshooting gkectl issues

google, managing and creating node pools

bluematador.com, kubectl get events with involvedObject.kind

netapp docs, deploying additional user clusters with on-prem anthos