-
Bug
-
Resolution: Done
-
Normal
-
None
-
4.13.0
-
Quality / Stability / Reliability
-
False
-
-
1
-
Moderate
-
None
-
None
-
None
-
MCO Sprint 231
-
1
-
+
-
None
-
Bug Fix
-
The MCDRebootError alert now stays latched past 15 minutes and no longer clears automatically.
-
None
-
None
-
None
-
None
Description of problem:
When a node fails to reboot, an MCDRebootError alert is raised. The alert clears after 15 minutes, even if the node was never rebooted.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2022-12-22-120609   True        False         26m     Cluster version is 4.13.0-0.nightly-2022-12-22-120609
How reproducible:
Always
Steps to Reproduce:
1. On a worker node, run the following commands to break the reboot process:
$ mount -o remount,rw /usr
$ mv /usr/bin/systemd-run /usr/bin/systemd-run2
2. Create any MachineConfig (MC). For example, this one:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test
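To see what the example MachineConfig writes to /etc/test, the base64 payload in the `source` data URL can be decoded locally. The following is a minimal sketch; the helper name `decode_data_url` is hypothetical and only the Python standard library is used.

```python
import base64

# Data URL copied verbatim from the MachineConfig in step 2.
SOURCE = (
    "data:text/plain;charset=utf;base64,"
    "c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK"
)


def decode_data_url(url: str) -> str:
    """Return the text payload of a base64-encoded data URL (hypothetical helper)."""
    header, _, payload = url.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    return base64.b64decode(payload).decode("utf-8")


if __name__ == "__main__":
    # Print the file contents that the MachineConfig would place at /etc/test.
    print(decode_data_url(SOURCE), end="")
```

The file contents are irrelevant to the bug itself; any MachineConfig that forces a reboot should trigger the failure once systemd-run has been moved away.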
Actual results:
An MCDRebootError alert fires, but it clears after 15 minutes.
Expected results:
The alert should not clear after 15 minutes. It should remain active until the node is rebooted.
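The difference between the actual and expected behavior can be illustrated with a small simulation. This is only a sketch under an assumption: that the alert clears because its condition looks at failures within a 15-minute window rather than at the counter's absolute value. It is not the actual MCO alerting rule, and the names `counter_at`, `windowed_rule`, and `latched_rule` are hypothetical.

```python
# ASSUMPTION: the look-back window equals the 15 minutes mentioned in the report.
WINDOW_MINUTES = 15


def counter_at(minute, failure_minutes):
    """Value of a monotonically increasing reboot-failure counter at `minute`."""
    return sum(1 for m in failure_minutes if m <= minute)


def windowed_rule(minute, failure_minutes):
    """Fires only while the counter increased within the look-back window."""
    before = counter_at(minute - WINDOW_MINUTES, failure_minutes)
    return counter_at(minute, failure_minutes) - before > 0


def latched_rule(minute, failure_minutes, rebooted_minute=None):
    """Fires from the first failure until the node actually reboots."""
    if rebooted_minute is not None and minute >= rebooted_minute:
        return False
    return counter_at(minute, failure_minutes) > 0


if __name__ == "__main__":
    failures = [0]  # a single failed reboot at minute 0
    for minute in (0, 10, 20, 60):
        print(minute, windowed_rule(minute, failures), latched_rule(minute, failures))
```

With a single failure at minute 0, the windowed condition stops firing once the window has passed, while the latched condition keeps firing until a reboot is recorded, which matches the expected result described above.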
Additional info:
This PR seems to have introduced the behavior: https://github.com/openshift/machine-config-operator/pull/3406#discussion_r1030481908
- relates to
-
MCO-1 Observability Infrastructure and Enhanced metrics in MCO
-
- Closed