  1. OpenShift Bugs
  2. OCPBUGS-8311

mapi_machinehealthcheck_short_circuit not properly reconciling causing MachineHealthCheckUnterminatedShortCircuit alert to fire


Details

    • Important
    • No
    • Rejected
    • False

    Description

      This is a clone of issue OCPBUGS-8286. The following is the description of the original issue:

      Description of problem:

      mapi_machinehealthcheck_short_circuit does not properly reconcile its state when a MachineHealthCheck that is failing because of unhealthy Machines is then removed.
      
      When running two MachineSets (called blue and green, only one of which has running Machines at any given time), each with a MachineAutoscaler and a MachineHealthCheck, mapi_machinehealthcheck_short_circuit continues to report 1 for a MachineHealthCheck that was removed as part of a switch from blue to green.
      
      $ oc get machineset | egrep 'blue|green'
      housiocp4-wvqbx-worker-blue-us-east-2a    0         0                             2d17h
      housiocp4-wvqbx-worker-green-us-east-2a   1         1         1       1           2d17h
      
      $ oc get machineautoscaler
      NAME                      REF KIND     REF NAME                                   MIN   MAX   AGE
      worker-green-us-east-1a   MachineSet   housiocp4-wvqbx-worker-green-us-east-2a   1     4     2d17h
      
      $ oc get machinehealthcheck
      NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
      machine-api-termination-handler   100%           0                  0
      worker-green-us-east-1a           40%            1                  1
      
            {
              "name": "machine-health-check-unterminated-short-circuit",
              "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-machine-api-machine-api-operator-prometheus-rules-ccb650d9-6fc4-422b-90bb-70452f4aff8f.yaml",
              "rules": [
                { 
                  "state": "firing",
                  "name": "MachineHealthCheckUnterminatedShortCircuit",
                  "query": "mapi_machinehealthcheck_short_circuit == 1",
                  "duration": 1800,
                  "labels": {
                    "severity": "warning"
                  },
                  "annotations": {
                    "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n",
                    "summary": "machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes"
                  },
                  "alerts": [
                    { 
                      "labels": {
                        "alertname": "MachineHealthCheckUnterminatedShortCircuit",
                        "container": "kube-rbac-proxy-mhc-mtrc",
                        "endpoint": "mhc-mtrc",
                        "exported_namespace": "openshift-machine-api",
                        "instance": "10.128.0.58:8444",
                        "job": "machine-api-controllers",
                        "name": "worker-blue-us-east-1a",
                        "namespace": "openshift-machine-api",
                        "pod": "machine-api-controllers-779dcb8769-8gcn6",
                        "service": "machine-api-controllers",
                        "severity": "warning"
                      },
                      "annotations": {
                        "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n",
                        "summary": "machine health check worker-blue-us-east-1a has been disabled by short circuit for more than 30 minutes"
                      },
                      "state": "firing",
                      "activeAt": "2022-12-09T15:59:25.1287541Z",
                      "value": "1e+00"
                    }
                  ],
                  "health": "ok",
                  "evaluationTime": 0.000648129,
                  "lastEvaluation": "2022-12-12T09:35:55.140174009Z",
                  "type": "alerting"
                }
              ],
              "interval": 30,
              "limit": 0,
              "evaluationTime": 0.000661589,
              "lastEvaluation": "2022-12-12T09:35:55.140165629Z"
            },
      
      As shown above, worker-blue-us-east-1a no longer exists; worker-green-us-east-1a is now the active MachineHealthCheck. worker-blue-us-east-1a did exist before the switch to green happened and was actually reporting some unhealthy Machines. Since it is now gone, mapi_machinehealthcheck_short_circuit should reconcile accordingly; otherwise this is a false-positive alert.
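
      The mismatch can also be checked mechanically. The following is a minimal sketch, assuming the rule group JSON above has been saved and the MachineHealthCheck names are taken from `oc get machinehealthcheck`. The function name `stale_alert_names` is hypothetical, and the inline sample data is trimmed from the outputs above.

```python
import json


def stale_alert_names(rule_group, existing_mhc_names):
    """Return `name` labels of firing alerts that match no existing MachineHealthCheck."""
    existing = set(existing_mhc_names)
    stale = []
    for rule in rule_group.get("rules", []):
        for alert in rule.get("alerts", []):
            name = alert.get("labels", {}).get("name")
            # Only firing alerts that carry a name label are of interest here.
            if alert.get("state") == "firing" and name is not None and name not in existing:
                stale.append(name)
    return stale


# Trimmed copy of the rule group shown in this report (most fields omitted).
rule_group = json.loads("""
{
  "name": "machine-health-check-unterminated-short-circuit",
  "rules": [
    {
      "name": "MachineHealthCheckUnterminatedShortCircuit",
      "query": "mapi_machinehealthcheck_short_circuit == 1",
      "alerts": [
        {"labels": {"name": "worker-blue-us-east-1a"}, "state": "firing"}
      ]
    }
  ]
}
""")

# Names taken from the `oc get machinehealthcheck` output in this report.
existing = ["machine-api-termination-handler", "worker-green-us-east-1a"]
print(stale_alert_names(rule_group, existing))
```

      With the data from this report, the only name returned is worker-blue-us-east-1a, which is the MachineHealthCheck that was removed.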
      
      

      Version-Release number of selected component (if applicable):

      OpenShift Container Platform 4.12.0-rc.3 (but is also seen on previous versions)

      How reproducible:

      - Always

      Steps to Reproduce:

      1. Set up OpenShift Container Platform 4, for example on AWS
      2. Create blue and green MachineSets, each with a MachineAutoscaler and a MachineHealthCheck
      3. Have active Machines for blue only
      4. Trigger unhealthy Machines in the blue MachineSet
      5. Switch to the green MachineSet by removing the blue MachineHealthCheck and MachineAutoscaler and setting the replicas of the blue MachineSet to 0
      6. Create the green MachineHealthCheck and MachineAutoscaler and scale the green MachineSet to 1
      7. Observe how mapi_machinehealthcheck_short_circuit continues to report an unhealthy state for the blue MachineHealthCheck, which no longer exists.

      Actual results:

      mapi_machinehealthcheck_short_circuit keeps reporting the problematic MachineHealthCheck even though the faulty MachineHealthCheck no longer exists.

      Expected results:

      mapi_machinehealthcheck_short_circuit should properly reconcile its state and drop MachineHealthChecks that have been removed at the OpenShift Container Platform level.
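
      To illustrate the expected behaviour, here is a minimal sketch of a labelled gauge whose series is deleted once the corresponding object is gone. This is not the machine-api controller code and not the Prometheus client API: `GaugeVec` and `reconcile` are hypothetical stand-ins written only for this illustration.

```python
class GaugeVec:
    """Tiny in-memory stand-in for a labelled gauge (hypothetical, for illustration only)."""

    def __init__(self):
        self._series = {}

    def set(self, name, namespace, value):
        self._series[(name, namespace)] = value

    def delete(self, name, namespace):
        # Returns True if a series existed and was removed.
        return self._series.pop((name, namespace), None) is not None

    def collect(self):
        return dict(self._series)


def reconcile(gauge, mhc_exists, name, namespace, short_circuited=False):
    """Update the gauge for one MachineHealthCheck, dropping its series if it was removed."""
    if not mhc_exists:
        # The object is gone: remove the series instead of leaving the last value behind.
        gauge.delete(name, namespace)
        return
    gauge.set(name, namespace, 1 if short_circuited else 0)


short_circuit = GaugeVec()

# Blue MachineHealthCheck exists and is short-circuited.
reconcile(short_circuit, True, "worker-blue-us-east-1a", "openshift-machine-api", True)
before = short_circuit.collect()

# Blue MachineHealthCheck is removed, green is created and healthy.
reconcile(short_circuit, False, "worker-blue-us-east-1a", "openshift-machine-api")
reconcile(short_circuit, True, "worker-green-us-east-1a", "openshift-machine-api", False)
after = short_circuit.collect()

print(before)
print(after)
```

      In this sketch, the blue series disappears as soon as the removal is reconciled, so a query such as `mapi_machinehealthcheck_short_circuit == 1` would no longer match it.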

      Additional info:

      This looks similar to the issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=2013528 and https://bugzilla.redhat.com/show_bug.cgi?id=2047702 (although https://bugzilla.redhat.com/show_bug.cgi?id=2047702 may not be very relevant).

            People

              joelspeed Joel Speed
              openshift-crt-jira-prow OpenShift Prow Bot
              Zhaohua Sun Zhaohua Sun