Kubernetes: fixing x509 certificate errors from metrics-server on a K3s cluster

K3s deploys metrics-server by default, but on a multi-node cluster it will fail unless you add the names of all the nodes to the kube-apiserver certificate. Symptoms of this problem include:

  • The metrics-server deployment throws x509 errors in its log
  • “kubectl top pods” returns an error
  • HorizontalPodAutoscalers based on CPU/memory utilization are not evaluated
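To see which node names or IPs are failing verification, the metrics-server log can be filtered for the x509 errors. The bash function below is a minimal sketch; x509_failed_hosts is a hypothetical helper name, and it assumes the errors use the standard wording of Go's crypto/x509 hostname errors (“certificate is valid for A, B, not HOST” or “cannot validate certificate for HOST because …”). The rest of the log line format is not assumed.

```shell
#!/usr/bin/env bash
# Read metrics-server log lines on stdin and print each distinct host that
# failed certificate verification. Assumes the wording of Go's crypto/x509
# hostname error messages:
#   "x509: certificate is valid for A, B, not HOST"
#   "x509: cannot validate certificate for HOST because it doesn't contain any IP SANs"
x509_failed_hosts() {
  sed -n \
    -e 's/.*x509: certificate is valid for .*, not \([^\\" ]*\).*/\1/p' \
    -e 's/.*x509: cannot validate certificate for \([^ ]*\) because.*/\1/p' |
    sort -u
}
```

Usage, assuming the bundled deployment name in kube-system:

kubectl logs -n kube-system deployment/metrics-server | x509_failed_hosts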

Root problem

If you are on a single-node K3s cluster, all K3s components run on the same host, so the certificate SAN names are always valid. metrics-server must communicate with the kubelet on each host, and since the K3s kubelet, kube-apiserver, and metrics-server are all on the same hostname (e.g. ‘mymaster’) and use the same certificate (serving-kube-apiserver.crt), the certificate is evaluated as valid.

But on a multi-node K3s cluster, when metrics-server securely reaches out to the kubelet of another node (e.g. ‘mynode1’) for metrics, that node's name may not match any SAN name in the certificate, so the certificate is evaluated as invalid.

You can view the SAN names of the certificate used by the kubelet (which in K3s is embedded in the k3s server process rather than running as a separate process) by running the following command on the K3s master.

sudo openssl x509 -in /var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt -text -noout | grep "Subject Alternative Name:" -A1

If the SAN names do not include the hostnames and/or IPs of all your cluster nodes, you will see the x509 errors described above.
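Comparing the SAN list against your node names by eye is error-prone on larger clusters. As a minimal sketch, the hypothetical bash function missing_sans takes the SAN values line printed by the openssl command above (the line after the “Subject Alternative Name:” header) plus a list of hostnames/IPs, and prints the ones that are not in the certificate.

```shell
#!/usr/bin/env bash
# Print every name/IP argument that is NOT present in a SAN line.
#   $1  : SAN values line as printed by openssl, e.g.
#         "DNS:mymaster, DNS:localhost, IP Address:127.0.0.1"
#   $2..: node hostnames and/or IPs to look for
missing_sans() {
  local san_line="$1"
  shift
  local -a entries cleaned=()
  local entry name found
  IFS=',' read -ra entries <<< "$san_line"
  for entry in "${entries[@]}"; do
    entry="${entry#"${entry%%[![:space:]]*}"}"   # trim leading whitespace
    entry="${entry%"${entry##*[![:space:]]}"}"   # trim trailing whitespace
    entry="${entry#DNS:}"                        # drop openssl's type prefixes
    entry="${entry#"IP Address:"}"
    cleaned+=("$entry")
  done
  for name in "$@"; do
    found=0
    for entry in "${cleaned[@]}"; do
      if [ "$entry" = "$name" ]; then found=1; break; fi
    done
    if [ "$found" -eq 0 ]; then
      printf '%s\n' "$name"
    fi
  done
}
```

Usage on the K3s master, checking every node name known to the cluster:

san=$(sudo openssl x509 -in /var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt -text -noout | grep "Subject Alternative Name:" -A1 | tail -n1)
missing_sans "$san" $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')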

Resolution

metrics-server reuses the certificate used by the kubelet/kube-apiserver, so we need to modify the kube-apiserver certificate to add the names of the worker nodes to its SAN list.

This can be done with the ‘tls-san’ flag. If you use “/etc/rancher/k3s/config.yaml” to configure the K3s master, add the hostnames and/or IPs of the worker nodes in your cluster:

tls-san: [ "node1", "node2", "192.168.122.214", "192.168.122.215" ]
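With many worker nodes, the tls-san line can be generated rather than typed. This is a sketch: tls_san_line is a hypothetical helper that only formats its arguments in the same inline-list form as the example above; it does not edit config.yaml for you.

```shell
#!/usr/bin/env bash
# Format the arguments as a config.yaml line of the form:
#   tls-san: [ "name1", "name2", ... ]
tls_san_line() {
  local out="tls-san: ["
  local sep=" "
  local name
  for name in "$@"; do
    out+="${sep}\"${name}\""
    sep=", "
  done
  printf '%s ]\n' "$out"
}
```

Usage, feeding it every node name and InternalIP known to the cluster:

tls_san_line $(kubectl get nodes -o jsonpath='{.items[*].metadata.name} {.items[*].status.addresses[?(@.type=="InternalIP")].address}')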

Then run the certificate rotation process from the K3s master to regenerate the api-server certificate.

sudo systemctl stop k3s
sudo k3s --debug certificate rotate --service api-server
sudo systemctl start k3s
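The three steps above can be wrapped in one function so that K3s is started again even if the rotation command fails. This is a hypothetical sketch: rotate_apiserver_cert and the DRY_RUN variable are illustrative names, not part of K3s. With DRY_RUN=1 it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Run a command, or only print it when DRY_RUN=1.
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Stop K3s, rotate the api-server certificate, and start K3s again.
# K3s is restarted even if the rotation fails; the rotation's exit status
# is returned so the caller can tell whether it worked.
rotate_apiserver_cert() {
  local rc=0
  run_cmd sudo systemctl stop k3s || return 1
  run_cmd sudo k3s --debug certificate rotate --service api-server || rc=$?
  run_cmd sudo systemctl start k3s || return 1
  return "$rc"
}
```

Running DRY_RUN=1 rotate_apiserver_cert first is a cheap way to review the exact commands before touching the node.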

This will add these entries to the SAN list of the certificate and resolve the x509 errors coming from metrics-server when trying to communicate with other nodes.

You can validate the change to the certificate's SAN list with the openssl command from the “Root problem” section above.

Client Validation

After giving metrics-server a couple of minutes to stabilize and gather information from the nodes in your cluster, ‘kubectl top’ should work if the problem is resolved.

kubectl top pods
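Because the metrics can take a couple of minutes to appear, polling may be more convenient than rerunning the command by hand. A minimal sketch using a hypothetical retry helper (not part of kubectl):

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempt limit is reached.
#   $1: maximum number of attempts
#   $2: seconds to sleep between attempts
#   rest: the command to run
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, retry 12 10 kubectl top nodes keeps trying for roughly two minutes and returns non-zero if metrics never become available.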

Comment on other resolutions

Many other threads and GitHub issues describe problems with the K3s metrics-server collecting data; suggested workarounds include:

  • Setting ‘.spec.template.spec.hostNetwork=true’
  • Adding the ‘--kubelet-insecure-tls’ flag to ‘.spec.template.spec.containers[0].args’
  • Disabling the bundled K3s metrics-server with ‘--disable metrics-server’ when invoking K3s, and then installing metrics-server from its official site
  • Adding the ‘requestheader-allowed-names’ flag as a kube-apiserver-arg

I did not find any of these necessary once I added all the worker nodes to the SAN names of the certificate as described.

NOTES

Show the certificate being served by kube-apiserver, by connecting to it with openssl

# read the API server URL from the kubeconfig and strip the scheme, leaving host:port
IP_and_port=$(yq '.clusters[0].cluster.server' < "${KUBECONFIG:-$HOME/.kube/config}" | sed 's#https://##')
IP=$(echo "$IP_and_port" | cut -d: -f1)

# show SAN names of certificate
echo | openssl s_client -showcerts -servername "$IP" -connect "$IP_and_port" 2>/dev/null | openssl x509 -inform pem -noout -text | grep "Subject Alternative Name" -A1

Show the arguments used by the metrics-server deployment

$ kubectl get deployment metrics-server -n kube-system -o=yaml | yq '.spec.template.spec.containers[0].args'
- --cert-dir=/tmp
- --secure-port=10250
- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
- --kubelet-use-node-status-port
- --metric-resolution=15s
- --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
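To check whether any of the workarounds listed under “Comment on other resolutions” (for example ‘--kubelet-insecure-tls’) are still in place, the args list printed above can be searched. has_arg is a hypothetical bash helper, a sketch that assumes the yq output format shown above: one “- --flag” or “- --flag=value” item per line.

```shell
#!/usr/bin/env bash
# Succeed if FLAG ($1) appears in a YAML args list read on stdin, as printed
# by yq (one "- --flag" or "- --flag=value" item per line).
has_arg() {
  local flag="$1" line item
  while IFS= read -r line || [ -n "$line" ]; do
    item="${line#- }"   # drop the YAML list marker
    # compare only the part before any "=", so "--flag" and "--flag=value" both match
    if [ "${item%%=*}" = "$flag" ]; then
      return 0
    fi
  done
  return 1
}
```

Usage:

kubectl get deployment metrics-server -n kube-system -o=yaml | yq '.spec.template.spec.containers[0].args' | has_arg --kubelet-insecure-tls && echo "insecure kubelet TLS is still enabled"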