GKE: terraform lifecycle ‘ignore_changes’ to manage external changes to GKE cluster

As much as Terraform aims to be the absolute system of record for the resources it creates, valid external processes often help manage those same resources.

Here are some examples of legitimate external changes:

  • Other company-approved Terraform scripts applying labeling to resources in order to track ownership and costs
  • Security teams modifying IAM roles and memberships based on principles of least-privilege
  • Hyperscaler vendor auto-upgrades of components such as Kubernetes node pools

These scenarios can be addressed by using the lifecycle meta-argument ignore_changes and explicitly providing a list of the attributes that Terraform should leave alone.

For this article, let’s dive into the specifics of managing a Google GKE cluster and how ignore_changes can be used.

Potential issues if ignore_changes is not used

Let’s assume that you created a GKE cluster using the “google_container_cluster” resource.  Here is my full main.tf as an example.

If you did not define any ignore_changes attributes, the following issues could occur during the months and years of ongoing maintenance of this cluster:

  • Your organization starts a mandatory labeling initiative of resources in order to track ownership, support, and chargebacks.  These labels keep getting removed by your terraform script, which is unaware of their purpose.
  • You introduce Anthos Service Mesh, which adds a cluster “mesh_id” label that enables the metrics dashboard.  This label keeps getting removed by your terraform script that is unaware of its purpose or existence.
  • The GKE node pool instance count is manually increased during the holidays because of high traffic loads.  The terraform script keeps scaling it back down, causing customer performance issues.
  • The GKE node pool has auto-upgrade enabled, which is supposed to reduce manual maintenance, yet it keeps getting downgraded by your Terraform scripts to older, unsupported versions.

Master control plane upgrades are already understood

As part of the value-add of the platform, Google automatically upgrades the GKE master control plane portion of the Kubernetes cluster.

The ‘min_master_version’ attribute of the google_container_cluster terraform resource was designed for this purpose.  It sets only a lower bound on the control plane version, so background upgrades to a newer version do not force a change in the terraform plan.

Therefore, there is no need to include this attribute in the ‘ignore_changes’ list.
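
As a minimal sketch of what this looks like in configuration, the fragment below sets a minimum control plane version and deliberately has no lifecycle block.  The resource name "primary" and the three variables are assumptions made for illustration and are not taken from the article's main.tf.

```hcl
# Hypothetical variables; supply your own values.
variable "cluster_name" { type = string }
variable "location" { type = string }
variable "min_master_version" { type = string }

resource "google_container_cluster" "primary" {
  name               = var.cluster_name
  location           = var.location
  initial_node_count = 1

  # Lower bound only: GKE may upgrade the control plane beyond this
  # version in the background without producing a plan difference.
  min_master_version = var.min_master_version

  # No ignore_changes entry is needed for control plane upgrades.
}
```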

Ignore cluster label changes

External services may be required to set labels at the GKE cluster level.  This can be part of an ownership/chargeback initiative, or even for services such as Anthos Service Mesh that append a “mesh_id”.

If you do not use ignore_changes on “resource_labels”, your terraform scripts will remove these additional labels.  With ignore_changes set on resource_labels, terraform will ignore any additional labels.
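
The setting described above could look roughly like the following sketch.  The resource name "primary", the variables, and the "managed_by" label are hypothetical placeholders, not the contents of the article's main.tf.

```hcl
# Hypothetical variables; supply your own values.
variable "cluster_name" { type = string }
variable "location" { type = string }

resource "google_container_cluster" "primary" {
  name               = var.cluster_name
  location           = var.location
  initial_node_count = 1

  # Labels applied when the cluster is created (illustrative value).
  resource_labels = {
    managed_by = "terraform"
  }

  lifecycle {
    # Do not plan an update when labels are added or edited
    # outside of this configuration.
    ignore_changes = [
      resource_labels,
    ]
  }
}
```

One trade-off to keep in mind: ignoring the whole resource_labels attribute also means that later edits to the labels in this configuration will not be applied to the existing cluster.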

Below is an example of manually changing labels and seeing that it has no effect on the terraform plan.

project_id=$(gcloud config get project)
project_number=$(gcloud projects list --filter="id=$project_id" --format="value(projectNumber)")

# setup variables for cluster name and location (region or zone)
gcloud container clusters list
cluster_name=xxxxxx
location_flag="--zone=xxxx" # OR --region=xxxx

# show current labels
resource_labels=$(gcloud container clusters describe $cluster_name $location_flag --format="value(resourceLabels)" | sed 's/;/,/g')
echo "current resourceLabel: $resource_labels"

# add label
gcloud container clusters update $cluster_name $location_flag --update-labels="color=red,$resource_labels"

$ terraform plan
...
No changes. Your infrastructure matches the configuration.
...

Ignore node pool instance count scaling

Company policy may allow node pool instance counts to be adjusted manually during periods of unexpected high load, or scaled down to save costs during low-traffic months.  Use ignore_changes on the ‘initial_node_count’ and ‘node_count’ attributes of the google_container_node_pool resource so that terraform does not revert these changes.
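
A sketch of a node pool that tolerates manual resizing might look like this.  The resource name "primary_nodes", the variables, and the starting count of 1 are assumptions for illustration only.

```hcl
# Hypothetical variables; supply your own values.
variable "cluster_name" { type = string }
variable "location" { type = string }
variable "node_pool_name" { type = string }

resource "google_container_node_pool" "primary_nodes" {
  name     = var.node_pool_name
  cluster  = var.cluster_name
  location = var.location

  # Starting size; later manual resizes are left alone.
  node_count = 1

  lifecycle {
    # Do not scale the pool back when its size is changed
    # outside of this configuration.
    ignore_changes = [
      initial_node_count,
      node_count,
    ]
  }
}
```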

Below is an example of manually changing node pool instance counts and seeing that it has no effect on the terraform plan.

# setup variable for node pool name
gcloud container node-pools list --cluster $cluster_name $location_flag
node_pool_name=xxxxx

# get current count (currentNodeCount is the cluster-wide total,
# so this assumes a single-zone cluster with a single node pool)
current_node_count=$(gcloud container clusters describe $cluster_name $location_flag --format="value(currentNodeCount)")

# increase by 1
((current_node_count++))
gcloud container clusters resize $cluster_name $location_flag --node-pool $node_pool_name --num-nodes $current_node_count --quiet

$ terraform plan
...
No changes. Your infrastructure matches the configuration.
...

Ignore node pool version changes

If your GKE cluster has auto-upgrade enabled for the node pool, then Google will perform upgrades during valid maintenance windows.  To keep these upgrades from showing up as changes in terraform, include “version” in the ignore_changes of the google_container_node_pool resource.
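
As a rough sketch, a node pool with auto-upgrade enabled and its version ignored could be written as follows.  The resource name "primary_nodes" and all variables are hypothetical, and pinning an initial version through var.node_version is an assumption rather than something the article's main.tf is known to do.

```hcl
# Hypothetical variables; supply your own values.
variable "cluster_name" { type = string }
variable "location" { type = string }
variable "node_pool_name" { type = string }
variable "node_version" { type = string }

resource "google_container_node_pool" "primary_nodes" {
  name       = var.node_pool_name
  cluster    = var.cluster_name
  location   = var.location
  node_count = 1

  # Version used at creation time only; see ignore_changes below.
  version = var.node_version

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  lifecycle {
    # Do not roll nodes back after GKE upgrades them.
    ignore_changes = [
      version,
    ]
  }
}
```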

Below is an example of manually upgrading the node pool and seeing that it has no effect on the terraform plan.

# check if 'autoUpgrade' set to true
gcloud container clusters describe $cluster_name $location_flag --format="value(nodePools.management)"

# available node pool versions
gcloud container get-server-config --format="yaml(validNodeVersions)" $location_flag
node_version="1.xx.yy-gke.zz"

# upgrade node pool
gcloud container clusters upgrade $cluster_name $location_flag --node-pool $node_pool_name --cluster-version $node_version --quiet

$ terraform plan
...
No changes. Your infrastructure matches the configuration.
...

REFERENCES

fabianlee github, project code for this article

hashicorp ref, lifecycle ignore_changes

Dave Storey, how and when to ignore lifecycle changes in terraform

hashicorp ref, Manage Resource lifecycle

stackoverflow, example scenarios why you would use ignore_changes

NOTES

Forcing an upgrade of the master control plane

# variables for cluster name and location (region or zone)
gcloud container clusters list
cluster_name=xxxxxx
location_flag="--zone=xxxx" # OR --region=xxxx

# variable for new control plane version
gcloud container get-server-config --format="yaml(validMasterVersions)" $location_flag
new_version=1.xx.y-gke.zzzz

# do control plane upgrade
gcloud container clusters upgrade $cluster_name $location_flag --cluster-version="$new_version" --quiet