The HorizontalPodAutoscaler (HPA) allows you to dynamically scale the replica count of your Deployment based on criteria such as memory or CPU utilization, which makes it a great way to handle spikes in utilization while still keeping your cluster size and infrastructure costs under control.
In order for HPA to evaluate CPU and memory utilization and take action, you need the metrics-server enabled in your cluster as a prerequisite.
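If you are not sure whether the metrics-server is available, a quick check like the one below will tell you (this assumes it is deployed into the kube-system namespace, which is typical but can vary by distribution).

# check for a metrics-server deployment (namespace may vary by distribution)
$ kubectl get deployment metrics-server -n kube-system

# if resource metrics are being served, 'kubectl top' returns per-pod CPU/memory usage
$ kubectl top pods -n default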
Example Deployment/Service
The first thing to apply to your Kubernetes cluster is the sample golang-hello-world-web Service and Deployment. This is a tiny containerized web server I wrote in GoLang (source).
# apply into Kubernetes cluster
$ kubectl apply -f https://raw.githubusercontent.com/fabianlee/alpine-apache-benchmark/main/kubernetes-hpa/golang-hello-world-web.yaml
service/golang-hello-world-web-service created
deployment.apps/golang-hello-world-web created

# wait for deployment to be ready
$ kubectl rollout status deployment golang-hello-world-web -n default --timeout=90s
deployment "golang-hello-world-web" successfully rolled out

# Deployment has '1' replica
$ kubectl get deployment golang-hello-world-web
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
golang-hello-world-web   1/1     1            1           66s

# and exposed via Service
$ kubectl get service golang-hello-world-web-service
NAME                             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
golang-hello-world-web-service   ClusterIP   10.43.227.130   <none>        8080/TCP   46s
Notice this deployment has an explicit ‘spec.template.spec.containers[0].resources’ block which sets CPU and memory requests and limits. This makes the HPA evaluation of “high” utilization explicit, because the HPA measures utilization as a percentage of these requests.
# spec.template.spec.containers[0]
resources:
  requests:
    cpu: 100m    # 1/10th of a core (1000m is 1 vCPU)
    memory: 32M  # 32 megabytes
  limits:
    cpu: 250m    # 1/4th of a core (1000m is 1 vCPU)
    memory: 48M  # 48 megabytes
Example HorizontalPodAutoscaler
Now apply golang-hello-world-web-hpa.yaml, which creates the HorizontalPodAutoscaler (HPA).
# apply the HPA into the cluster
$ kubectl apply -f https://raw.githubusercontent.com/fabianlee/alpine-apache-benchmark/main/kubernetes-hpa/golang-hello-world-web-hpa.yaml
horizontalpodautoscaler.autoscaling/golang-hello-world-web-hpa created

# min pods=1, max pods=3 when CPU utilization is high
$ kubectl get hpa golang-hello-world-web-hpa
NAME                         REFERENCE                           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
golang-hello-world-web-hpa   Deployment/golang-hello-world-web   2%/25%    1         3         1          42s
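If you want more detail than this one-line summary, describing the HPA shows its current metrics, conditions, and any recent scaling events (output not shown here).

# show HPA conditions, current metrics, and recent scaling events
$ kubectl describe hpa golang-hello-world-web-hpa -n default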
Features of our HPA
Before continuing, there are several features of the golang-hello-world-web-hpa.yaml definition we should describe in more detail.
- It uses the ‘autoscaling/v2’ API group, the only valid group for these features in K8s 1.26+ (the v2beta APIs have been removed)
- It uses ‘minReplicas’ to define the minimum number of replicas
- It uses ‘maxReplicas’ to define the maximum number of replicas
- It scales the deployment replica count according to CPU utilization, using 25% average utilization (measured against the container's CPU request) as the trigger; a worked example of this calculation follows the manifest below
- It overrides the default scale down policy to:
- give a 15 second stabilization window before performing any action
- only scale down 1 pod at a time
- force a wait of 20 seconds before allowing another scale-down event
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: golang-hello-world-web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: golang-hello-world-web
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 25
  behavior:
    #scaleUp:
    scaleDown:
      stabilizationWindowSeconds: 15   # seconds to wait before scaling down
      policies:
      - type: Pods
        value: 1                       # scale down 1 pod at a time
        periodSeconds: 20              # seconds to wait before another scale down is allowed
      selectPolicy: Max
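To make the 25% trigger concrete, the HPA controller uses the standard formula desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), where utilization is measured against the pod's CPU request (100m here). The numbers below are hypothetical, just to illustrate the arithmetic.

# hypothetical example: 1 replica currently using 80m CPU against its 100m request = 80% utilization
# target utilization is 25%, so:
#   desiredReplicas = ceil(1 * 80 / 25) = ceil(3.2) = 4
# which is then capped at maxReplicas, giving 3 replicas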
Load test the Deployment
Finally, let’s test the HPA’s ability to scale the replica count of this Deployment up and down by placing load on the web service using the ‘ab‘ Apache Benchmark utility.
Instead of exposing golang-hello-world-web-service externally via an Ingress, let’s just deploy another container that has the Apache Benchmark utility and run the load test against the internal IP of the Service. I wrote ‘alpine-apache-benchmark‘ for this purpose.
# fetch internal IP and port of service
service_name=golang-hello-world-web-service
service_IP=$(kubectl get service $service_name --no-headers -o=custom-columns=IP:.spec.clusterIP)
service_port=$(kubectl get service $service_name --no-headers -o=custom-columns=PORT:.spec.ports[0].port)

# smoke test of curl to service
$ kubectl run -i --rm --tty load-generator --image=ghcr.io/fabianlee/alpine-apache-benchmark:1.0.2 --restart=Never --command curl -- http://$service_IP:$service_port/myhello/
Hello, World request 20 GET /myhello/ Host: 10.43.227.130:8080
pod "load-generator" deleted

# run load test that fetches the service 60K times, simulating 20 simultaneous users
$ kubectl run -i --rm --tty load-generator --image=ghcr.io/fabianlee/alpine-apache-benchmark:1.0.2 --restart=Never --command ab -- -n60000 -c20 http://$service_IP:$service_port/myhello/
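While the load test runs, it can be useful to watch the HPA and pods from a second terminal; the ‘-w’ flag streams updates as they happen. The ‘app=golang-hello-world-web’ label selector below is an assumption based on the Deployment name and may differ in your manifest.

# watch the HPA targets and replica count change in real time
$ kubectl get hpa golang-hello-world-web-hpa -w

# watch pods being added and removed (label selector assumed; check your Deployment labels)
$ kubectl get pods -l app=golang-hello-world-web -w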
This load test should place sufficient CPU load on the deployment to trigger a scale-up event, and you should now see 3 replicas instead of just 1.
# run load test again if replica count does not increase
$ kubectl get deployment golang-hello-world-web
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
golang-hello-world-web   3/3     3            3           31m

# events on this deployment confirm scale up events occurred
$ kubectl get events --namespace default --field-selector involvedObject.kind=Deployment,involvedObject.name=golang-hello-world-web --sort-by='.lastTimestamp'
LAST SEEN   TYPE     REASON              OBJECT                              MESSAGE
32m         Normal   ScalingReplicaSet   deployment/golang-hello-world-web   Scaled up replica set golang-hello-world-web-7b6cb7b7cb to 1
112s        Normal   ScalingReplicaSet   deployment/golang-hello-world-web   Scaled up replica set golang-hello-world-web-7b6cb7b7cb to 3 from 1

# wait for a minute
sleep 60

# scale down events start taking replicas back down to 1
$ kubectl get events --namespace default --field-selector involvedObject.kind=Deployment,involvedObject.name=golang-hello-world-web --sort-by='.lastTimestamp'
LAST SEEN   TYPE     REASON              OBJECT                              MESSAGE
32m         Normal   ScalingReplicaSet   deployment/golang-hello-world-web   Scaled up replica set golang-hello-world-web-7b6cb7b7cb to 1
112s        Normal   ScalingReplicaSet   deployment/golang-hello-world-web   Scaled up replica set golang-hello-world-web-7b6cb7b7cb to 3 from 1
82s         Normal   ScalingReplicaSet   deployment/golang-hello-world-web   Scaled down replica set golang-hello-world-web-7b6cb7b7cb to 2 from 3
52s         Normal   ScalingReplicaSet   deployment/golang-hello-world-web   Scaled down replica set golang-hello-world-web-7b6cb7b7cb to 1 from 2

# replicas will eventually get back down to original 1
$ kubectl get deployment golang-hello-world-web
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
golang-hello-world-web   1/1     1            1           35m
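If you want to clean up when you are done, deleting with the same manifests used above removes the HPA, Deployment, and Service.

# remove the HPA
$ kubectl delete -f https://raw.githubusercontent.com/fabianlee/alpine-apache-benchmark/main/kubernetes-hpa/golang-hello-world-web-hpa.yaml

# remove the Deployment and Service
$ kubectl delete -f https://raw.githubusercontent.com/fabianlee/alpine-apache-benchmark/main/kubernetes-hpa/golang-hello-world-web.yaml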
REFERENCES
Kubernetes metrics-server installation
kubernetes sig-autoscaling, HPA
google docs, kubernetes request and limits on cpu/mem
docker-ab jib, source code for Alpine image with ab
github fabianlee, alpine-apache-benchmark source and github action to build OCI
github fabianlee, docker-golang-hello-world-web small goLang web server