python – Kuberhealthy Check – check NodeGroup labels are matching

This is quite a specific script I have written for Kubernetes cluster monitoring, specifically of the Nodes. I need to ensure that the Nodes within a NodeGroup all have the same labels and label values; otherwise they will not be scaled out evenly in the version of Kubernetes we are using (Cluster Autoscaler wants the values to be the same in order to treat the Nodes equally).

The env var IN_CLUSTER lets me set whether the script runs from my local machine (where it can read the kubectl config) or as a container within the cluster (where it relies on RBAC permissions).

The script I have written works and does what I need: it gets a list of Nodes in the cluster and groups them into their relevant NodeGroup (there are four NodeGroups: core, general, observability and pci). It then takes the first Node in each NodeGroup as a benchmark and compares the labels of every other Node in that NodeGroup against it to ensure they match.
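The comparison step can be illustrated without a cluster. The following is only a minimal sketch under stated assumptions: it works on plain dicts of labels rather than Node objects, and `find_label_mismatches` and the sample labels are hypothetical names that do not appear in the script. Unlike the script, it compares the union of label keys, so a label that is missing on either side is also reported.

```python
def find_label_mismatches(labels_by_node, ignored_labels=()):
    """Compare every node's labels against the first node's labels.

    labels_by_node: dict mapping node name -> dict of label key -> value.
    Returns a list of (node_name, label, benchmark_value, node_value) tuples.
    """
    names = list(labels_by_node)
    if len(names) < 2:
        return []  # zero or one node: nothing to compare against

    benchmark = labels_by_node[names[0]]
    mismatches = []
    for name in names[1:]:
        labels = labels_by_node[name]
        # Union of keys, so a label missing on either side is also reported.
        for key in sorted(set(benchmark) | set(labels)):
            if key in ignored_labels:
                continue
            if benchmark.get(key) != labels.get(key):
                mismatches.append((name, key, benchmark.get(key), labels.get(key)))
    return mismatches


# Hypothetical sample data, for illustration only.
sample = {
    "node-a": {"nodegroup-name": "core", "tier": "gold", "kubernetes.io/hostname": "a"},
    "node-b": {"nodegroup-name": "core", "tier": "silver", "kubernetes.io/hostname": "b"},
}
print(find_label_mismatches(sample, ignored_labels=("kubernetes.io/hostname",)))
```

Keeping the comparison as a pure function like this would make it testable on its own, with the reporting to Kuberhealthy done once by the caller.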

The script uses the Kubernetes client to retrieve the list of Nodes. It also uses the Kuberhealthy client, purely to report the check result (success or failure) to the Kuberhealthy master.

I do not like the fact that the four NodeGroups are hardcoded in the script, but I can’t think how to achieve what I want with an array stored as an env var.
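One possible direction, as a rough sketch rather than a definitive answer: store the names as a comma-separated string in an env var and split it, and group Nodes by whatever value the `nodegroup-name` label holds so that no per-group variable is needed. The env var name `NODE_GROUPS` and both helper functions are assumptions made for illustration, and the grouping works on plain dicts of labels rather than Node objects.

```python
import os
from collections import defaultdict


def node_groups_from_env(var_name="NODE_GROUPS",
                         default="core,general,observability,pci"):
    """Read a comma-separated list of NodeGroup names from an env var.

    NODE_GROUPS is a hypothetical variable name; the default mirrors the
    four groups currently hardcoded in the script.
    """
    raw = os.getenv(var_name, default)
    # Tolerate stray spaces and empty entries such as "core, pci ,".
    return [name.strip() for name in raw.split(",") if name.strip()]


def group_by_label(labels_by_node, group_label="nodegroup-name"):
    """Group node names by the value of their group label.

    labels_by_node: dict mapping node name -> dict of label key -> value.
    Nodes without the group label are skipped.
    """
    groups = defaultdict(list)
    for node_name, labels in labels_by_node.items():
        group = labels.get(group_label)
        if group is not None:
            groups[group].append(node_name)
    return dict(groups)
```

With grouping driven by the label value, the env var is only needed if the check should be restricted to, or should insist on, a known set of NodeGroups.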

The script is intended to be simple and run top to bottom. I’m not sure it makes sense to have the if __name__ == '__main__' guard, as it’s never going to be imported as a module.

```python
from kubernetes import client, config
from kh_client import *
import os

# requires cluster role with permissions list, get nodes!
# needs refactoring, for time being have kept it as a 'top to bottom' script

def main():
    if os.getenv('IN_CLUSTER') == "TRUE":
        config.load_incluster_config()
    else:
        config.load_kube_config()
    
    try:
        api_instance = client.CoreV1Api()
        node_list = api_instance.list_node()
    except client.exceptions.ApiException:
        print("401 Unauthorised. Please check you are authenticated for the target cluster / have set the IN_CLUSTER env var.")
        exit(2)

    node_group_core = []
    node_group_general = []
    node_group_observability = []
    node_group_pci = []

    # print("%s\t\t%s" % ("NAME", "LABELS"))
    # this needs changing but difficult to do with an env_var
    for node in node_list.items:
        if node.metadata.labels.get('nodegroup-name') == "core":
            node_group_core.append(node)
        if node.metadata.labels.get('nodegroup-name') == "general":
            node_group_general.append(node)
        if node.metadata.labels.get('nodegroup-name') == "observability":
            node_group_observability.append(node)
        if node.metadata.labels.get('nodegroup-name') == "pci":
            node_group_pci.append(node)

    check_node_group_labels(node_group_core)
    check_node_group_labels(node_group_general)
    check_node_group_labels(node_group_observability)
    check_node_group_labels(node_group_pci)

    # everything has checked successfully, report success. 
    print("Reporting Success.")
    try:
        report_success()
    except Exception as e:
        print(f"Error when reporting success: {e}")

def check_node_group_labels(node_group):
    # ignored labels taken from https://github.com/kubernetes/autoscaler/blob/3a69f118d95cd653bf101aecc0ea5e00bf7ba370/cluster-autoscaler/processors/nodegroupset/aws_nodegroups.go#L26
    # this can be refactored
    ignored_labels = ( "alpha.eksctl.io/instance-id", 
                       "alpha.eksctl.io/nodegroup-name", 
                       "eks.amazonaws.com/nodegroup", 
                       "k8s.amazonaws.com/eniConfig",
                       "lifecycle",
                       # labels i've added
                       "topology.kubernetes.io/zone",
                       "kubernetes.io/hostname",
                       "failure-domain.beta.kubernetes.io/zone" )

    node_group_labels = []
    for l in node_group[0].metadata.labels:
        if l not in ignored_labels:
            node_group_labels.append(l)

    print(f"There are {len(node_group)} nodes in {node_group[0].metadata.labels.get('nodegroup-name')}")

    for label in node_group_labels:
        # compare against the 'benchmark' label, any difference means a mismatch as far as CAS is concerned
        # print(label)
        benchmark_label = node_group[0].metadata.labels.get(label)
        # print("benchmark label: ", benchmark_label)
        for node in node_group[1:]:
            # print("node label", node.metadata.labels.get(label))
            if node.metadata.labels.get(label) != benchmark_label:
                print("Reporting Failure.")
                try:
                    report_failure([f"Warning! label mismatch detected, for nodegroup and node {node.metadata.name}, benchmark value: {benchmark_label}, this node value: {node.metadata.labels.get(label)}"])
                except Exception as e:
                    print(f"Error when reporting failure: {e}")

if __name__ == '__main__':
    main()
```