-
Bug
-
Resolution: Done
-
Major
-
None
-
4.12.z
-
Important
-
No
-
Rejected
-
False
-
This is a clone of issue OCPBUGS-8286. The following is the description of the original issue:
—
Description of problem:
mapi_machinehealthcheck_short_circuit is not properly reconciling the state, when a MachineHealthCheck is failing because of unhealthy Machines but then is removed. When doing two MachineSet (called blue and green and only one has running Machines at a specific point in time) with MachineAutoscaler and MachineHealthCheck, the mapi_machinehealthcheck_short_circuit will continue to report 1 for MachineHealth that actually was removed because of a switch from blue to green. $ oc get machineset | egrep 'blue|green' housiocp4-wvqbx-worker-blue-us-east-2a 0 0 2d17h housiocp4-wvqbx-worker-green-us-east-2a 1 1 1 1 2d17h $ oc get machineautoscaler NAME REF KIND REF NAME MIN MAX AGE worker-green-us-east-1a MachineSet housiocp4-wvqbx-worker-green-us-east-2a 1 4 2d17h $ oc get machinehealthcheck NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY machine-api-termination-handler 100% 0 0 worker-green-us-east-1a 40% 1 1 { "name": "machine-health-check-unterminated-short-circuit", "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-machine-api-machine-api-operator-prometheus-rules-ccb650d9-6fc4-422b-90bb-70452f4aff8f.yaml", "rules": [ { "state": "firing", "name": "MachineHealthCheckUnterminatedShortCircuit", "query": "mapi_machinehealthcheck_short_circuit == 1", "duration": 1800, "labels": { "severity": "warning" }, "annotations": { "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n", "summary": "machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes" }, "alerts": [ { "labels": { "alertname": "MachineHealthCheckUnterminatedShortCircuit", "container": "kube-rbac-proxy-mhc-mtrc", "endpoint": "mhc-mtrc", "exported_namespace": "openshift-machine-api", "instance": "10.128.0.58:8444", "job": "machine-api-controllers", "name": "worker-blue-us-east-1a", "namespace": "openshift-machine-api", "pod": "machine-api-controllers-779dcb8769-8gcn6", "service": "machine-api-controllers", "severity": "warning" }, "annotations": { "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n", "summary": "machine health check worker-blue-us-east-1a has been disabled by short circuit for more than 30 minutes" }, "state": "firing", "activeAt": "2022-12-09T15:59:25.1287541Z", "value": "1e+00" } ], "health": "ok", "evaluationTime": 0.000648129, "lastEvaluation": "2022-12-12T09:35:55.140174009Z", "type": "alerting" } ], "interval": 30, "limit": 0, "evaluationTime": 0.000661589, "lastEvaluation": "2022-12-12T09:35:55.140165629Z" }, As we can see above, worker-blue-us-east-1a is no longer available and active but rather worker-green-us-east-1a. But worker-blue-us-east-1a was there before the switch to green has happen and was actuall reporting some unhealthy Machines. But since it's now gone, mapi_machinehealthcheck_short_circuit should properly reconcile as otherwise this is a false/positive alert.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.12.0-rc.3 (but is also seen on previous version)
How reproducible:
- Always
Steps to Reproduce:
1. Setup OpenShift Container Platform 4 on AWS for example 2. Create blue and green MachineSet with MachineAutoScaler and MachineHealthCheck 3. Have active Machines for blue only 4. Trigger unhealthy Machines in blue MachineSet 5. Switch to green MachineSet, by removing MachineHealthCheck, MachineAutoscaler and setting replicate of blue MachineSet to 0 6. Create green MachineHealthCheck, MachineAutoscaler and scale geen MachineSet to 1 7. Observe how mapi_machinehealthcheck_short_circuit continues to report unhealthy state for blue MachineHealthCheck which no longer exists.
Actual results:
mapi_machinehealthcheck_short_circuit reporting problematic MachineHealthCheck even though the faulty MachineHealthCheck does no longer exist.
Expected results:
mapi_machinehealthcheck_short_circuit to properly reconcile it's state and remove MachineHealthChecks that have been removed on OpenShift Container Platform level
Additional info:
It kind of looks like similar to the issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=2013528 respectively https://bugzilla.redhat.com/show_bug.cgi?id=2047702 (although https://bugzilla.redhat.com/show_bug.cgi?id=2047702 may not be super relevant)
- clones
-
OCPBUGS-8286 mapi_machinehealthcheck_short_circuit not properly reconciling causing MachineHealthCheckUnterminatedShortCircuit alert to fire
- Closed
- is blocked by
-
OCPBUGS-8286 mapi_machinehealthcheck_short_circuit not properly reconciling causing MachineHealthCheckUnterminatedShortCircuit alert to fire
- Closed
- links to