The HorizontalPodAutoscaler (HPA) allows you to dynamically scale the replica count of your Deployment based on criteria such as memory or CPU utilization, which makes it a great way to handle spikes in utilization while still keeping your cluster size and infrastructure costs under control.
In order for HPA to evaluate CPU and memory utilization and take action, you need the metrics-server enabled in your cluster as a prerequisite.
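If you are not sure whether the metrics-server is available, a quick check like the one below will tell you (this assumes it is deployed into the kube-system namespace, which is typical but can vary by distribution).

# check for a metrics-server deployment (namespace may vary by distribution)
$ kubectl get deployment metrics-server -n kube-system

# if resource metrics are being served, 'kubectl top' returns per-pod CPU/memory usage
$ kubectl top pods -n default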
Example Deployment/Service
The first thing to apply to your Kubernetes cluster is the sample golang-hello-world-web Service and Deployment. This is a tiny containerized web server I wrote in GoLang (source).
# apply into Kubernetes cluster
$ kubectl apply -f https://raw.githubusercontent.com/fabianlee/alpine-apache-benchmark/main/kubernetes-hpa/golang-hello-world-web.yaml
service/golang-hello-world-web-service created
deployment.apps/golang-hello-world-web created

# wait for deployment to be ready
$ kubectl rollout status deployment golang-hello-world-web -n default --timeout=90s
deployment "golang-hello-world-web" successfully rolled out

# Deployment has '1' replica
$ kubectl get deployment golang-hello-world-web
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
golang-hello-world-web   1/1     1            1           66s

# and exposed via Service
$ kubectl get service golang-hello-world-web-service
NAME                             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
golang-hello-world-web-service   ClusterIP   10.43.227.130   <none>        8080/TCP   46s
Notice this deployment has an explicit ‘spec.template.spec.containers[0].resources’ block which sets CPU and memory requests and limits. This makes the HPA evaluation of “high” utilization explicit, because the HPA measures utilization as a percentage of these requests.
# spec.template.spec.containers[0]
resources:
  requests:
    cpu: 100m    # 1/10th of a core (1000m is 1 vCPU)
    memory: 32M  # 32 megabytes
  limits:
    cpu: 250m    # 1/4th of a core (1000m is 1 vCPU)
    memory: 48M  # 48 megabytes
Example HorizontalPodAutoscaler
Now apply golang-hello-world-web-hpa.yaml, which creates the HorizontalPodAutoscaler (HPA).
# apply the HPA into the cluster
$ kubectl apply -f https://raw.githubusercontent.com/fabianlee/alpine-apache-benchmark/main/kubernetes-hpa/golang-hello-world-web-hpa.yaml
horizontalpodautoscaler.autoscaling/golang-hello-world-web-hpa created

# min pods=1, max pods=3 when CPU utilization is high
$ kubectl get hpa golang-hello-world-web-hpa
NAME                         REFERENCE                           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
golang-hello-world-web-hpa   Deployment/golang-hello-world-web   2%/25%    1         3         1          42s
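If you want more detail than this one-line summary, describing the HPA shows its current metrics, conditions, and any recent scaling events (output not shown here).

# show HPA conditions, current metrics, and recent scaling events
$ kubectl describe hpa golang-hello-world-web-hpa -n default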
Features of our HPA
Before continuing, there are several features of the golang-hello-world-web-hpa.yaml definition we should describe in more detail.
- It uses the ‘autoscaling/v2’ API group, the only valid group for these features in K8s 1.26+ (the v2beta APIs have been removed)
- It uses ‘minReplicas’ to define the minimum number of replicas
- It uses ‘maxReplicas’ to define the maximum number of replicas
- It scales the deployment replica count according to CPU utilization, using 25% average utilization (measured against the container's CPU request) as the trigger; a worked example of this calculation follows the manifest below
- It overrides the default scale down policy to:
- give a 15 second stabilization window before performing any action
- only scale down 1 pod at a time
- force a wait of 20 seconds before allowing another scale-down event
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: golang-hello-world-web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: golang-hello-world-web
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 25
  behavior:
    #scaleUp:
    scaleDown:
      stabilizationWindowSeconds: 15   # seconds to wait before scaling down
      policies:
      - type: Pods
        value: 1                       # scale down 1 pod at a time
        periodSeconds: 20              # seconds to wait before another scale down is allowed
      selectPolicy: Max
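To make the 25% trigger concrete, the HPA controller uses the standard formula desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), where utilization is measured against the pod's CPU request (100m here). The numbers below are hypothetical, just to illustrate the arithmetic.

# hypothetical example: 1 replica currently using 80m CPU against its 100m request = 80% utilization
# target utilization is 25%, so:
#   desiredReplicas = ceil(1 * 80 / 25) = ceil(3.2) = 4
# which is then capped at maxReplicas, giving 3 replicas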
Load test the Deployment
Finally, let’s test the HPA’s ability to scale the replica count of this Deployment up and down by placing load on the web service using the ‘ab‘ Apache Benchmark utility.
Instead of exposing golang-hello-world-web-service externally via an Ingress, let’s just deploy another container that has the Apache Benchmark utility and run the load test against the internal IP of the Service. I wrote ‘alpine-apache-benchmark‘ for this purpose.
# fetch internal IP and port of service
service_name=golang-hello-world-web-service
service_IP=$(kubectl get service $service_name --no-headers -o=custom-columns=IP:.spec.clusterIP)
service_port=$(kubectl get service $service_name --no-headers -o=custom-columns=PORT:.spec.ports[0].port)

# smoke test of curl to service
$ kubectl run -i --rm --tty load-generator --image=ghcr.io/fabianlee/alpine-apache-benchmark:1.0.2 --restart=Never --command curl -- http://$service_IP:$service_port/myhello/
Hello, World request 20 GET /myhello/ Host: 10.43.227.130:8080
pod "load-generator" deleted

# run load test that fetches the service 60K times, simulating 20 simultaneous users
$ kubectl run -i --rm --tty load-generator --image=ghcr.io/fabianlee/alpine-apache-benchmark:1.0.2 --restart=Never --command ab -- -n60000 -c20 http://$service_IP:$service_port/myhello/
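While the load test runs, it can be useful to watch the HPA and pods from a second terminal; the ‘-w’ flag streams updates as they happen. The ‘app=golang-hello-world-web’ label selector below is an assumption based on the Deployment name and may differ in your manifest.

# watch the HPA targets and replica count change in real time
$ kubectl get hpa golang-hello-world-web-hpa -w

# watch pods being added and removed (label selector assumed; check your Deployment labels)
$ kubectl get pods -l app=golang-hello-world-web -w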
This load test should place sufficient CPU load on the deployment to trigger a scale-up event, and you should now see 3 replicas instead of just 1.
# run load test again if replica count does not increase
$ kubectl get deployment golang-hello-world-web
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
golang-hello-world-web   3/3     3            3           31m

# events on this deployment confirm scale up events occurred
$ kubectl get events --namespace default --field-selector involvedObject.kind=Deployment,involvedObject.name=golang-hello-world-web --sort-by='.lastTimestamp'
LAST SEEN   TYPE     REASON              OBJECT                              MESSAGE
32m         Normal   ScalingReplicaSet   deployment/golang-hello-world-web   Scaled up replica set golang-hello-world-web-7b6cb7b7cb to 1
112s        Normal   ScalingReplicaSet   deployment/golang-hello-world-web   Scaled up replica set golang-hello-world-web-7b6cb7b7cb to 3 from 1

# wait for a minute
sleep 60

# scale down events start taking replicas back down to 1
$ kubectl get events --namespace default --field-selector involvedObject.kind=Deployment,involvedObject.name=golang-hello-world-web --sort-by='.lastTimestamp'
LAST SEEN   TYPE     REASON              OBJECT                              MESSAGE
32m         Normal   ScalingReplicaSet   deployment/golang-hello-world-web   Scaled up replica set golang-hello-world-web-7b6cb7b7cb to 1
112s        Normal   ScalingReplicaSet   deployment/golang-hello-world-web   Scaled up replica set golang-hello-world-web-7b6cb7b7cb to 3 from 1
82s         Normal   ScalingReplicaSet   deployment/golang-hello-world-web   Scaled down replica set golang-hello-world-web-7b6cb7b7cb to 2 from 3
52s         Normal   ScalingReplicaSet   deployment/golang-hello-world-web   Scaled down replica set golang-hello-world-web-7b6cb7b7cb to 1 from 2

# replicas will eventually get back down to original 1
$ kubectl get deployment golang-hello-world-web
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
golang-hello-world-web   1/1     1            1           35m
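If you want to clean up when you are done, deleting with the same manifests used above removes the HPA, Deployment, and Service.

# remove the HPA
$ kubectl delete -f https://raw.githubusercontent.com/fabianlee/alpine-apache-benchmark/main/kubernetes-hpa/golang-hello-world-web-hpa.yaml

# remove the Deployment and Service
$ kubectl delete -f https://raw.githubusercontent.com/fabianlee/alpine-apache-benchmark/main/kubernetes-hpa/golang-hello-world-web.yaml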
REFERENCES
Kubernetes metrics-server installation
kubernetes sig-autoscaling, HPA
google docs, kubernetes request and limits on cpu/mem
docker-ab jib, source code for Alpine image with ab
github fabianlee, alpine-apache-benchmark source and github action to build OCI
github fabianlee, docker-golang-hello-world-web small goLang web server