Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 4.13
Affects Version/s: 4.13
Component/s: Cloud Compute / MachineHealthCheck
Labels:

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
Important
Regression:
None

Target Backport Versions:
None
Target Version:

4.13.0
Release Blocker:
Rejected
Sprint:
CLOUD Sprint 229, CLOUD Sprint 230
sprint_count:
2

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Priority Data:
PX Impact Score:

Release Note Status:
Done
Release Note Type:
Bug Fix
Release Note Text:

* Previously, when a machine health check exceeded the `maxUnhealthy` threshold and generated an alert, the metric was not reset when the cluster became healthy enough to reconcile machine health checks successfully, and the alert continued to fire. With this release, the logic that determines when to trigger an alert is improved so that the alert now clears when the cluster is healthy.
(link:https://issues.redhat.com/browse/OCPBUGS-4725[*OCPBUGS-4725*])


Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

The mapi_machinehealthcheck_short_circuit metric does not reconcile its state when a MachineHealthCheck that is failing because of unhealthy Machines is subsequently removed.

With two MachineSets (called blue and green, of which only one has running Machines at any given time), each with a MachineAutoscaler and a MachineHealthCheck, mapi_machinehealthcheck_short_circuit continues to report 1 for a MachineHealthCheck that was removed as part of a switch from blue to green.

$ oc get machineset | egrep 'blue|green'
housiocp4-wvqbx-worker-blue-us-east-2a    0         0                             2d17h
housiocp4-wvqbx-worker-green-us-east-2a   1         1         1       1           2d17h

$ oc get machineautoscaler
NAME                      REF KIND     REF NAME                                   MIN   MAX   AGE
worker-green-us-east-1a   MachineSet   housiocp4-wvqbx-worker-green-us-east-2a   1     4     2d17h

$ oc get machinehealthcheck
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           0                  0
worker-green-us-east-1a           40%            1                  1

      {
        "name": "machine-health-check-unterminated-short-circuit",
        "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-machine-api-machine-api-operator-prometheus-rules-ccb650d9-6fc4-422b-90bb-70452f4aff8f.yaml",
        "rules": [
          { 
            "state": "firing",
            "name": "MachineHealthCheckUnterminatedShortCircuit",
            "query": "mapi_machinehealthcheck_short_circuit == 1",
            "duration": 1800,
            "labels": {
              "severity": "warning"
            },
            "annotations": {
              "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n",
              "summary": "machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes"
            },
            "alerts": [
              { 
                "labels": {
                  "alertname": "MachineHealthCheckUnterminatedShortCircuit",
                  "container": "kube-rbac-proxy-mhc-mtrc",
                  "endpoint": "mhc-mtrc",
                  "exported_namespace": "openshift-machine-api",
                  "instance": "10.128.0.58:8444",
                  "job": "machine-api-controllers",
                  "name": "worker-blue-us-east-1a",
                  "namespace": "openshift-machine-api",
                  "pod": "machine-api-controllers-779dcb8769-8gcn6",
                  "service": "machine-api-controllers",
                  "severity": "warning"
                },
                "annotations": {
                  "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n",
                  "summary": "machine health check worker-blue-us-east-1a has been disabled by short circuit for more than 30 minutes"
                },
                "state": "firing",
                "activeAt": "2022-12-09T15:59:25.1287541Z",
                "value": "1e+00"
              }
            ],
            "health": "ok",
            "evaluationTime": 0.000648129,
            "lastEvaluation": "2022-12-12T09:35:55.140174009Z",
            "type": "alerting"
          }
        ],
        "interval": 30,
        "limit": 0,
        "evaluationTime": 0.000661589,
        "lastEvaluation": "2022-12-12T09:35:55.140165629Z"
      },

As shown above, worker-blue-us-east-1a no longer exists; worker-green-us-east-1a is now the active MachineHealthCheck. worker-blue-us-east-1a existed before the switch to green and was reporting unhealthy Machines at that time. Since it has been removed, mapi_machinehealthcheck_short_circuit should reconcile accordingly; otherwise the alert is a false positive.
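The mismatch described above can be checked mechanically. The following is a minimal Python sketch, not part of the original report: it takes a trimmed copy of the rule group shown above and the MachineHealthCheck names from the `oc get machinehealthcheck` output, and lists firing alerts whose `name` label refers to a MachineHealthCheck that no longer exists. The function name `stale_mhc_alerts` and the constants are hypothetical.

```python
import json


def stale_mhc_alerts(rule_group, existing_mhcs):
    """Return `name` labels of firing alerts whose MachineHealthCheck is gone."""
    existing = set(existing_mhcs)
    stale = []
    for rule in rule_group.get("rules", []):
        if rule.get("type") != "alerting":
            continue
        for alert in rule.get("alerts", []):
            name = alert.get("labels", {}).get("name")
            if alert.get("state") == "firing" and name is not None and name not in existing:
                stale.append(name)
    return stale


# Trimmed copy of the rule group from the report (unrelated fields dropped).
RULE_GROUP = json.loads("""
{
  "name": "machine-health-check-unterminated-short-circuit",
  "rules": [
    {
      "state": "firing",
      "name": "MachineHealthCheckUnterminatedShortCircuit",
      "query": "mapi_machinehealthcheck_short_circuit == 1",
      "type": "alerting",
      "alerts": [
        {
          "labels": {
            "alertname": "MachineHealthCheckUnterminatedShortCircuit",
            "name": "worker-blue-us-east-1a",
            "namespace": "openshift-machine-api"
          },
          "state": "firing"
        }
      ]
    }
  ]
}
""")

# Names taken from the `oc get machinehealthcheck` output in the report.
EXISTING = ["machine-api-termination-handler", "worker-green-us-east-1a"]

if __name__ == "__main__":
    print(stale_mhc_alerts(RULE_GROUP, EXISTING))
```

With the data from this report, the only stale entry is worker-blue-us-east-1a, which matches the observation that the alert fires for a MachineHealthCheck that has already been removed.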

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.0-rc.3 (but also seen on previous versions)

How reproducible:

- Always

Steps to Reproduce:

1. Set up OpenShift Container Platform 4, for example on AWS
2. Create blue and green MachineSets, each with a MachineAutoscaler and a MachineHealthCheck
3. Have active Machines for blue only
4. Trigger unhealthy Machines in the blue MachineSet
5. Switch to the green MachineSet by removing the blue MachineHealthCheck and MachineAutoscaler and setting the replicas of the blue MachineSet to 0
6. Create the green MachineHealthCheck and MachineAutoscaler, and scale the green MachineSet to 1
7. Observe that mapi_machinehealthcheck_short_circuit continues to report an unhealthy state for the blue MachineHealthCheck, which no longer exists.

Actual results:

mapi_machinehealthcheck_short_circuit keeps reporting the problematic MachineHealthCheck even though that MachineHealthCheck no longer exists.

Expected results:

mapi_machinehealthcheck_short_circuit should reconcile its state and stop reporting MachineHealthChecks that have been removed from OpenShift Container Platform.
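The difference between the actual and expected results can be illustrated with a toy model of a labelled gauge. This is only a sketch under stated assumptions, not the machine-api-operator code: `ShortCircuitGauge` and `run_switch` are hypothetical names, and the report does not say whether the real fix deletes the stale series or resets it to 0. The model assumes that once a MachineHealthCheck is deleted nothing updates its series again, so a value of 1 persists unless it is cleaned up explicitly.

```python
class ShortCircuitGauge:
    """Toy stand-in for a labelled gauge; series are keyed by (namespace, name)."""

    def __init__(self):
        self.series = {}

    def set(self, namespace, name, value):
        self.series[(namespace, name)] = value

    def delete(self, namespace, name):
        # Dropping the series means a `metric == 1` query can no longer match it.
        self.series.pop((namespace, name), None)

    def firing(self):
        """Names whose series would match `mapi_machinehealthcheck_short_circuit == 1`."""
        return sorted(name for (_, name), value in self.series.items() if value == 1)


def run_switch(cleanup_on_delete):
    """Replay the blue-to-green switch from the reproduction steps."""
    gauge = ShortCircuitGauge()
    namespace = "openshift-machine-api"

    # The blue MachineHealthCheck short-circuits because of unhealthy Machines.
    gauge.set(namespace, "worker-blue-us-east-1a", 1)

    # The blue MachineHealthCheck is removed; it is never reconciled again.
    if cleanup_on_delete:
        gauge.delete(namespace, "worker-blue-us-east-1a")

    # The green MachineHealthCheck is created and is healthy.
    gauge.set(namespace, "worker-green-us-east-1a", 0)
    return gauge.firing()


if __name__ == "__main__":
    print("without cleanup:", run_switch(cleanup_on_delete=False))
    print("with cleanup:   ", run_switch(cleanup_on_delete=True))
```

Without cleanup the blue series stays at 1 and the alert condition keeps matching, which corresponds to the actual results; with cleanup nothing matches, which corresponds to the expected results.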

Additional info:

This looks similar to the issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=2013528 and https://bugzilla.redhat.com/show_bug.cgi?id=2047702 (although the latter may not be very relevant).

blocks

OCPBUGS-8286 mapi_machinehealthcheck_short_circuit not properly reconciling causing MachineHealthCheckUnterminatedShortCircuit alert to fire

Closed

is cloned by

OCPBUGS-8286 mapi_machinehealthcheck_short_circuit not properly reconciling causing MachineHealthCheckUnterminatedShortCircuit alert to fire

Closed

links to

[OCPBUGS-4725]: Short circuit misfiring

MachineHealthCheckUnterminatedShortCircuit firing for already removed MachineHealthCheck in OpenShift Container Platform 4

openshift/machine-api-operator#1109: [release-4.12] OCPBUGS-4725: Short circuit misfiring

Assignee: Daniel Odvarka (Inactive)

Reporter: Simon Reber

QA Contact: Zhaohua Sun

Doc Contact: Jeana Routh

Need Info From: None

Votes: 0

Watchers: 5

Created: 2022/12/12 9:48 AM

Updated: 2025/09/12 9:09 PM

Resolved: 2023/05/17 10:38 PM
