Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.13.0
Component/s: Machine Config Operator
Labels:
- mco-triaged

Test Coverage:

+
Severity:
Moderate
Regression:
None
Sprint:
MCO Sprint 231
sprint_count:
1
Blocked:
False
Blocked Reason:
None
Release Note Text:
The MCDRebootError alert now stays latched past 15 minutes and no longer clears automatically.
Release Note Type:
Bug Fix
Target Version:

4.13.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

When a node fails to reboot, an MCDRebootError alert is raised. The alert clears after 15 minutes, even though the machine was never rebooted.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2022-12-22-120609   True        False         26m     Cluster version is 4.13.0-0.nightly-2022-12-22-120609

How reproducible:

Always

Steps to Reproduce:

1. Run these commands on a worker node to break the reboot process.

$ mount -o remount,rw /usr
$ mv /usr/bin/systemd-run /usr/bin/systemd-run2

2. Create any MC. For example, this one:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test
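
To check what the example MachineConfig above would write to /etc/test, the base64 payload of the data URL can be decoded locally. This is a minimal sketch, assuming a `base64` that accepts `-d` (as in GNU coreutils); the variable names are illustrative and not part of the report.

```shell
# Base64 payload copied from the "source:" data URL in the MachineConfig.
payload='c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK'

# Decode it; command substitution strips the trailing newline.
decoded="$(printf '%s' "$payload" | base64 -d)"

# Print the file contents that would end up in /etc/test.
printf '%s\n' "$decoded"
```

The payload decodes to three `server <host> maxdelay 0.4 offline` lines for foo, bar and baz.example.net. The contents are incidental, since the step says any MC will do.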

Actual results:

An MCDRebootError alert fires, but it clears after 15 minutes.

Expected results:

The alert should not clear after 15 minutes. It should keep firing until the node is rebooted.
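
The difference between the actual and expected results can be illustrated with a toy model. The report does not show the alert rule itself, so this is only a sketch under assumptions: a single reboot failure at minute 0, a hypothetical "windowed" check that looks back 15 minutes, and a hypothetical "latched" check that holds until a reboot is recorded. None of the names below come from the MCO code.

```shell
# Toy model only (not the real MCO alert rule).
failure_minute=0   # assumed: the reboot failed at minute 0
window=15          # the 15 minutes mentioned in the report

# Windowed check: fires only while the failure lies inside the look-back window.
windowed_firing() {
  now=$1
  [ $((now - failure_minute)) -lt "$window" ]
}

# Latched check: fires from the failure until a reboot is recorded.
# $2 is the minute of the reboot; an empty string means "not rebooted yet".
latched_firing() {
  now=$1
  rebooted_at=$2
  [ "$now" -ge "$failure_minute" ] && { [ -z "$rebooted_at" ] || [ "$now" -lt "$rebooted_at" ]; }
}

# Compare both checks at a few points in time, with the node never rebooted.
for now in 5 14 15 30; do
  if windowed_firing "$now"; then w=firing; else w=cleared; fi
  if latched_firing "$now" ""; then l=firing; else l=cleared; fi
  echo "minute $now: windowed=$w latched=$l"
done
```

In this model the windowed check clears at minute 15 although the node was never rebooted, which matches the reported symptom, while the latched check keeps firing, which matches the expected result.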

Additional info:

This PR appears to have introduced the behavior:
https://github.com/openshift/machine-config-operator/pull/3406#discussion_r1030481908

relates to

MCO-1 Observability Infrastructure and Enhanced metrics in MCO

Closed

links to

openshift/machine-config-operator#3507: OCPBUGS-5497: MCDRebootError alarm disappears after 15 minutes

Assignee: David Joshy

Reporter: Sergio Regidor de la Rosa

QA Contact: Sergio Regidor de la Rosa

Votes: 0

Watchers: 5

Created: 2023/01/09 12:38 PM

Updated: 2023/05/24 9:41 AM

Resolved: 2023/05/17 10:32 PM
