Kubernetes: HorizontalPodAutoscaler evaluation based on Prometheus metric

The HorizontalPodAutoscaler (HPA) allows you to dynamically scale the replica count of your Deployment based on basic CPU/memory resource metrics from the metrics-server.  If you want scaling based on more advanced scenarios and you are already using the Prometheus stack, the prometheus-adapter provides this enhancement.

The prometheus-adapter takes basic Prometheus metrics, and then synthesizes custom API metrics which can be used as a HorizontalPodAutoscaler trigger.

As an example, in addition to the basic Prometheus metrics (deployment counts, CPU, memory), this would allow you to take any metric you expose via Prometheus (e.g. incoming queue size, database row count, process/thread count, average blocking time, Java GC) and use it to trigger a HorizontalPodAutoscaler.

Prerequisite installation and validation

Before configuring the prometheus-adapter, you need a Kubernetes cluster with both the metrics-server and the kube-prometheus stack installed and validated:

metrics-server installation

Per the official docs, you can install using the latest manifest.

# check for installation of metrics-server
kubectl get deployment,service,serviceaccount metrics-server -n kube-system

# if it does not exist, install using manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

metrics-server validation

# check status of metrics-server
kubectl rollout status deployment -n kube-system metrics-server --timeout=90s
kubectl get deployment/metrics-server -n kube-system

# API group 'metrics.k8s.io/v1beta1' should have an entry if metrics-server healthy
kubectl api-versions | grep "^metrics.k8s.io"

# basic metrics should be available
sudo apt install -y jq
kubectl get --raw /apis/metrics.k8s.io/v1beta1
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods | jq
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq

# 'kubectl top' should report back metrics
kubectl top pods
kubectl top nodes

If these basic calls fail, then there is an issue with your cluster or metrics-server configuration.
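
In that case, a reasonable first place to look (a sketch, not an exhaustive troubleshooting guide) is the APIService registration and the metrics-server logs:

# the metrics.k8s.io APIService should report AVAILABLE=True
kubectl get apiservice v1beta1.metrics.k8s.io

# check metrics-server logs for kubelet scrape or TLS errors
kubectl logs -n kube-system deployment/metrics-server --tail=50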

If you need to customize the metrics-server, its container args can be modified to set preferred kubelet address types and allow insecure TLS.
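
As an example for a lab or test cluster (the flag names come from the metrics-server documentation, verify them against your installed version), the Deployment args could be patched like below:

# EXAMPLE ONLY: allow insecure kubelet TLS and prefer node InternalIP (lab/test clusters)
kubectl patch deployment metrics-server -n kube-system --type='json' -p='[
  {"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"},
  {"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname"}
]'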

kube-prometheus stack installation

In its simplest form (as described in the official docs), you install the monitoring stack using helm.

# set variables
prom_release_name=prom-stack
prom_release_ns=prom
prom_service_name=prom-stack-kube-prometheus-prometheus

# add helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# validate helm repo was added
helm repo list

helm repo update prometheus-community

# create the namespace
kubectl create ns $prom_release_ns

# install monitoring stack
helm install --namespace $prom_release_ns $prom_release_name prometheus-community/kube-prometheus-stack

# check status of prometheus stack release
kubectl --namespace $prom_release_ns get pods -l "release=$prom_release_name"

# check status of helm install
helm status $prom_release_name -n $prom_release_ns
# check values used during helm installation
helm get values $prom_release_name -n $prom_release_ns

If you need to customize, see the chart’s values.yaml, which can then be passed to ‘helm install/upgrade’ using the ‘-f’ flag.
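
For example (a minimal sketch), you could dump the default chart values to a local file, edit what you need, and feed it back to an upgrade:

# dump default chart values, customize locally, then upgrade with the edited file
helm show values prometheus-community/kube-prometheus-stack > prom-stack-values.yaml
# ... edit prom-stack-values.yaml ...
helm upgrade --install $prom_release_name prometheus-community/kube-prometheus-stack -n $prom_release_ns -f prom-stack-values.yaml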

kube-prometheus stack validation

Validate the presence of the basic components below.

$ kubectl get deployments,ds,statefulset -n $prom_release_ns
NAME                                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prom-stack-kube-prometheus-operator   1/1     1            1           2d18h
deployment.apps/prom-stack-kube-state-metrics         1/1     1            1           2d18h
deployment.apps/prom-stack-grafana                    1/1     1            1           2d18h

NAME                                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/prom-stack-prometheus-node-exporter   3         3         3       3            3           kubernetes.io/os=linux   2d18h

NAME                                                                    READY   AGE
statefulset.apps/alertmanager-prom-stack-kube-prometheus-alertmanager   1/1     2d18h
statefulset.apps/prometheus-prom-stack-kube-prometheus-prometheus       1/1     2d18h

# this service IP:port will be used later by the prometheus-adapter
$ kubectl get service $prom_service_name -n $prom_release_ns
NAME                                    TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
prom-stack-kube-prometheus-prometheus   ClusterIP   10.43.235.50   <none>        9090/TCP,8080/TCP   2d18h

Installing the prometheus-adapter

With the metrics-server validated as returning values via the API and the basic Prometheus monitoring stack in place, we can now focus on the prometheus-adapter.  This is the piece responsible for consuming Prometheus metrics and synthesizing them into custom metrics exposed via the API.

Take note: the custom metrics synthesized by the prometheus-adapter are NOT stored in Prometheus or queried via PromQL.  They can only be pulled from the Kubernetes API, using a client such as kubectl, and it is these API-exposed custom metrics that the HorizontalPodAutoscaler evaluates.
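
As a sketch of what that looks like from the client side once the adapter is installed and registered in the following sections (the metric name 'request_count_per_min' is the example built later in this article), custom metrics are read from an aggregated API path rather than queried with PromQL:

# list which custom metrics the adapter currently exposes
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta2 | jq '.resources[].name'

# read a single pod-level custom metric; path format is
# /apis/custom.metrics.k8s.io/<version>/namespaces/<namespace>/pods/*/<metric-name>
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta2/namespaces/default/pods/*/request_count_per_min" | jq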

Installing prometheus-adapter

# add helm repo, will say "already exists" if Prometheus installed with same repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# validate helm repo was added
helm repo list

# set variables for prometheus-adapter
adapter_release_name=adapter-release
adapter_release_ns=prom
adapter_deployment_name=adapter-release-prometheus-adapter

# set variables for Prometheus service
prom_release_ns=prom
prom_service_name=prom-stack-kube-prometheus-prometheus
# get Prometheus service connection values
prom_service_IP=$(kubectl get service $prom_service_name -n $prom_release_ns -o=jsonpath='{.spec.clusterIP}')
prom_service_port=$(kubectl get service $prom_service_name -n $prom_release_ns -o=jsonpath='{.spec.ports[?(@.name=="http-web")].port}')
echo "connect to $prom_service_name service at $prom_service_IP : $prom_service_port"

# install helm chart with basic set of values
helm install $adapter_release_name prometheus-community/prometheus-adapter --namespace=$adapter_release_ns --set prometheus.url=http://$prom_service_name.$prom_release_ns.svc --set prometheus.port=$prom_service_port

# view helm chart installation status
helm history $adapter_release_name -n $adapter_release_ns --max=1

# view values used in latest installation
helm get values $adapter_release_name -n $adapter_release_ns

# check status of deployment
kubectl rollout status deployment -n $adapter_release_ns $adapter_deployment_name --timeout=90s
kubectl get deployment -n $adapter_release_ns $adapter_deployment_name

Register API

You will not be able to use kubectl to query “custom.metrics.k8s.io/v1beta2” until you register the prometheus-adapter as a custom metrics APIService with the API aggregator.  To do this, apply the following manifest.

# make sure jq utility is installed for json parsing
sudo apt install -y jq

# this will fail until API registration is done
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta2 | jq

# register custom API service
cat <<EOF | kubectl apply -f -
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta2.custom.metrics.k8s.io
spec:
  group: custom.metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: adapter-release-prometheus-adapter
    namespace: prom
  version: v1beta2
  versionPriority: 100
EOF

# validate registration
kubectl api-versions | grep "^custom.metrics.k8s.io/v1beta2"
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta2 | jq

This APIService can also be applied using ‘kubectl apply -f https://raw.githubusercontent.com/fabianlee/k3s-cluster-kvm/main/roles/prometheus-adapter/templates/api-service.yaml’
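
Whichever way the APIService is applied, a quick sanity check is to confirm the aggregator reports the backing adapter service as available:

# the APIService should report AVAILABLE=True once the adapter service is reachable
kubectl get apiservice v1beta2.custom.metrics.k8s.io

# on failure, 'describe' shows the reason in the status conditions
kubectl describe apiservice v1beta2.custom.metrics.k8s.io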

Create custom rule for scraping Prometheus metrics

The prometheus-adapter comes with a basic rule set for taking Prometheus metrics and exposing them as custom API metrics, but if we want more control over which of our custom Prometheus metrics get synthesized, we need to add a custom rule.

This can be done by passing a custom values file to helm.  Let’s create a custom values file that looks for any raw Prometheus metric ending with ‘_promtotal’ (a counter), and creates a custom API metric with its rate of change over 2 minutes, suffixed with “_per_min”.

Using the concrete example from the following section, a deployment might have 3 replicas of a web server, each exposing a raw Prometheus metric named “request_count_promtotal” indicating how many HTTP requests have been processed.  The custom rule below takes that absolute counter, calculates its rate of change over a 2 minute window, and sums that rate across all replicas.

This value would be exposed via the API as the custom metric “request_count_per_min”, and be responsible for scaling up the replica count of a deployment during high load.

cat << 'EOF' > helm-values.yaml
rules:
  custom:
  - seriesQuery: '{namespace!="",__name__!~"^container_.*"}'
    resources:
      template: "<<.Resource>>"
    name:
      matches: "^(.*)_promtotal"
      as: "${1}_per_min"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
EOF

This file can also be fetched using ‘wget https://raw.githubusercontent.com/fabianlee/k3s-cluster-kvm/main/roles/prometheus-adapter/templates/helm-values.yaml’
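
To make the rule more concrete: for the example metric 'request_count_promtotal' used later in this article, the templated metricsQuery expands to roughly the PromQL below (a sketch, the adapter fills in the exact label matchers and group-by at query time).  Once the example Deployment is running, it can be tested directly against the Prometheus /api/v1/query endpoint:

# approximate expansion of the metricsQuery template for 'request_count_promtotal'
#   sum(rate(request_count_promtotal{namespace="default"}[2m])) by (pod)
(kubectl run -i --rm promql-test --image=ghcr.io/fabianlee/alpine-apache-benchmark:1.0.2 --restart=Never --command curl -- -fs "http://$prom_service_IP:$prom_service_port/api/v1/query" --data-urlencode 'query=sum(rate(request_count_promtotal{namespace="default"}[2m])) by (pod)') | jq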

Update prometheus-adapter with new custom rule

Pass this file in as a custom values file to ‘helm upgrade’ in order to update.

# show values file containing custom rule
cat helm-values.yaml

# update helm chart with rules from values file
helm upgrade $adapter_release_name prometheus-community/prometheus-adapter --namespace=$adapter_release_ns --set prometheus.url=http://$prom_service_name.$prom_release_ns.svc --set prometheus.port=$prom_service_port --values=./helm-values.yaml

# view values used in latest update, including custom rule for '_promtotal'
helm get values $adapter_release_name -n $adapter_release_ns

# view out-of-the-box rules as well as our newly added one
kubectl get configmap -n $adapter_release_ns $adapter_deployment_name -o yaml

# check status of deployment
kubectl get deployment -n $adapter_release_ns $adapter_deployment_name

# restart deployment to make sure custom rule added
kubectl rollout restart deployment -n $adapter_release_ns $adapter_deployment_name
kubectl rollout status deployment -n $adapter_release_ns $adapter_deployment_name --timeout=120s

# problem if 'unable to update' found in logs, probably bad connection to Prometheus service
# debug using container args, --v=8
kubectl logs deployment/$adapter_deployment_name -n $adapter_release_ns | grep 'unable to update'

# API group 'custom.metrics.k8s.io' should have an entry
kubectl api-versions | grep "^custom.metrics.k8s.io"

# validate custom pod metrics can be pulled via API
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta2 | jq | grep 'pods/'

HorizontalPodAutoscaler using custom API metrics

The HorizontalPodAutoscaler is capable of scaling based on custom API metrics.  So we will use the custom API metrics (which are synthesized from the raw Prometheus metrics) to drive the scaling decisions of the HPA.

Example Deployment/Service that exposes Prometheus metrics

Apply into your Kubernetes cluster the sample golang-hello-world-web Service and Deployment.  This is a tiny containerized web server I wrote in Go (source).

# apply into Kubernetes cluster
$ kubectl apply -f https://raw.githubusercontent.com/fabianlee/alpine-apache-benchmark/main/kubernetes-hpa/golang-hello-world-web.yaml
service/golang-hello-world-web-service created
deployment.apps/golang-hello-world-web created

# wait for deployment to be ready
$ kubectl rollout status deployment golang-hello-world-web -n default --timeout=90s
deployment "golang-hello-world-web" successfully rolled out

# Deployment has '1' replica
$ kubectl get deployment golang-hello-world-web
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
golang-hello-world-web   1/1     1            1           66s

# and exposed via Service
$ kubectl get service golang-hello-world-web-service
NAME                             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
golang-hello-world-web-service   ClusterIP   10.43.227.130   <none>        8080/TCP   46s

# set hello service variables for use later
hello_ns=default
hello_deployment_name=golang-hello-world-web
hello_service_name=golang-hello-world-web-service
hello_service_IP=$(kubectl get service $hello_service_name -n $hello_ns -o=jsonpath='{.spec.clusterIP}')
hello_service_port=$(kubectl get service $hello_service_name -n $hello_ns -o=jsonpath='{.spec.ports[?(@.name=="http")].port}')
echo "connect to $hello_service_name service at $hello_service_IP : $hello_service_port" 

Validate that raw Prometheus metrics are exposed

Each pod in the ‘golang-hello-world-web’ deployment exposes Prometheus formatted metrics at the standard “/metrics” endpoint.  One of these metrics is “request_count_promtotal”, which is an absolute counter of how many HTTP requests have been served.

# smoke test of curl to simple web server pod, run multiple times to drive traffic
for i in $(seq 1 10); do kubectl exec -it deployment/$hello_deployment_name -- wget http://$hello_service_IP:$hello_service_port/myhello/ -O - ; done

# show prometheus /metrics endpoint, look for 'request_count_promtotal'
(kubectl exec -it deployment/$hello_deployment_name -- wget http://$hello_service_IP:$hello_service_port/metrics -O -) | grep request_count_promtotal

Validate that /metrics values are persisted to Prometheus

Prove that the metric exposed by the container at /metrics is being ingested and persisted by Prometheus by pulling the metric ‘request_count_promtotal’ directly from the Prometheus API using its /api/v1/query endpoint.

# curl to Prometheus /api/v1/query endpoint to validate 'request_count_promtotal'
$ (kubectl run -i --rm load-generator --image=ghcr.io/fabianlee/alpine-apache-benchmark:1.0.2 --restart=Never --command curl -- -fs http://$prom_service_IP:$prom_service_port/api/v1/query --data-urlencode "query=request_count_promtotal{pod=~'golang-hello-world-web-.*'}") | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "request_count_promtotal",
          "endpoint": "http",
          "instance": "10.42.2.11:8080",
          "job": "golang-hello-world-web-service",
          "namespace": "default",
          "pod": "golang-hello-world-web-7d468d488c-6kzcz",
          "service": "golang-hello-world-web-service"
        },
        "value": [
          1694304971.051,
          "18"
        ]
      }
    ]
  }
}

Validate that Prometheus metrics are synthesized into custom API metrics by prometheus-adapter

We need to validate that our custom rule is capturing the ‘request_count_promtotal’ metric and exposing it as the custom API metric ‘request_count_per_min’.

$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta2 | jq | grep 'pods/' | grep request_count
      "name": "pods/request_count_per_min",
      "name": "pods/request_count_promtotal",

This proves our pod level Prometheus key ‘request_count_promtotal’ is being processed by the custom rule, and its rate can be found as the custom API metric ‘request_count_per_min’.

We should also be able to query down to the pod level with a selector filter and get the exact value of the custom metric.

# use jq utility to parse out values of each pod 'request_count_per_min'
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta2/namespaces/default/pods/*/request_count_per_min?selector=app%3D$hello_deployment_name" | jq '.items[] | select (.metric.name=="request_count_per_min").value'

It can take 60-120 seconds for the values to be updated, since Prometheus scrapes the /metrics endpoint and the prometheus-adapter queries Prometheus at set intervals.  If the value being returned is “0”, then place some load on the deployment, wait 60 seconds and try again.

# place simple load on the deployment
for i in $(seq 1 40); do kubectl exec -it deployment/$hello_deployment_name -- wget http://$hello_service_IP:$hello_service_port/myhello/ -O - ; done

# wait 30 seconds, try pulling value again
sleep 30
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta2/namespaces/default/pods/*/request_count_per_min?selector=app%3D$hello_deployment_name" | jq '.items[] | select (.metric.name=="request_count_per_min").value'

Apply HorizontalPodAutoscaler that triggers based on custom API metrics

Now create an HPA that triggers scaling when the custom API pod metric ‘request_count_per_min’ reports a rate higher than 1 request every 2 seconds.

# apply
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: golang-hello-world-web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: golang-hello-world-web
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: request_count_per_min
      target:
        type: AverageValue
        averageValue: 500m # 500 milli-requests/second = 1 request every 2 seconds
  behavior: 
    scaleDown:
      stabilizationWindowSeconds: 20 # seconds to wait before adjusting, avoids flapping
      policies:
      - type: Pods
        value: 1  # number of pods to scale down at one time
        periodSeconds: 20 # seconds before each scale down
      selectPolicy: Max
EOF

# validate creation of HPA
kubectl get hpa golang-hello-world-web-hpa

This HPA can also be applied using: ‘kubectl apply -f https://raw.githubusercontent.com/fabianlee/k3s-cluster-kvm/main/roles/prometheus-adapter/files/hpa.yaml’
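
Beyond ‘kubectl get hpa’, a describe (shown below) is a quick way to see the current metric value versus the target, as well as the scaling events the HPA has taken:

# shows current vs target value, AbleToScale/ScalingActive conditions, and recent events
kubectl describe hpa golang-hello-world-web-hpa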

Load Testing deployment to show scaling based on custom API metrics

In order to see the HPA triggered by the ‘request_count_per_min’ metric, we need to place a higher load on the pods in the service.  The easiest way to do this is to use a container that has the Apache Benchmark utility; since we can go straight to the internal cluster IP of the Service, we do not have to consider the various Ingress options.

# run load test that fetches the service 200 times, simulating 5 simultaneous users
kubectl run -i --rm --tty load-generator --image=ghcr.io/fabianlee/alpine-apache-benchmark:1.0.2 --restart=Never --command ab -- -n200 -c5 http://$hello_service_IP:$hello_service_port/myhello/

# show prometheus /metrics endpoint, look for 'request_count_promtotal'
(kubectl exec -it deployment/golang-hello-world-web -- wget http://localhost:8080/metrics -O -) | grep request_count_promtotal

# wait 30 seconds
sleep 30

# At prometheus-adapter level, view 'request_count_per_min' counter
# > 500m will scale HPA
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta2/namespaces/default/pods/*/request_count_per_min?selector=app%3D$hello_deployment_name" | jq '.items[] | select (.metric.name=="request_count_per_min").value'

# the target column will report the rate it sees, if it goes over 500m, then scaling occurs
kubectl get hpa golang-hello-world-web-hpa

# and the replica count will increase, but not more than maxReplicas (5)
kubectl get deployment golang-hello-world-web

# within 60 seconds the rate ('request_count_per_min') will start decreasing
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta2/namespaces/default/pods/*/request_count_per_min?selector=app%3D$hello_deployment_name" | jq '.items[] | select (.metric.name=="request_count_per_min").value'

# and that will cause scaling down of the deployment every 20 seconds by 1 pod, until it reaches minReplicas=1
while true; do kubectl get deployment golang-hello-world-web; sleep 5; done
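
Alternatively (a convenience rather than part of the walkthrough), watching the HPA itself shows the observed metric value and replica count change in real time:

# watch the HPA target value and replica count converge back to minReplicas (Ctrl-C to stop)
kubectl get hpa golang-hello-world-web-hpa --watch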


REFERENCES

kubernetes.io, types of metric apis and explanation

kubernetes.io, scaling on external metric

kubernetes.io, resource metrics pipeline

github, metrics-server source

stackoverflow, getting custom metrics with kubectl raw

blog.px.dev, custom metrics server explanation

github pixie-io, custom metrics server source

github kubernetes-sigs, Prometheus Adapter for custom metrics

github kubernetes-sigs, Prometheus Adapter walkthrough

Cezar Romaniuc, Kubernetes HPA with custom metrics from Prometheus

github luxas, custom metrics server

github issue, troubleshooting setup of prometheus-adapter


Sudip Sengupta, autoscale with prometheus-adapter and custom metrics

sysdig.com, scaling HPA with KEDA

promlabs, PromQL cheat sheet

fabianlee.org, Prometheus installed on K3s