Bug
Resolution: Can't Do
Normal
4.8
Quality / Stability / Reliability
MCO Sprint 235, MCO Sprint 236
Description of problem:
Delay in triggering the MCDDrainError alert.
Our current understanding is that the MCDDrainError alert should fire if any pod (infrastructure or application) on a node cannot be evicted due to a PDB. From the code below we understand that the alert fires on the mcd_drain_err metric reported by the machine-config-daemon pod of that node, and its message points at that pod's logs:
~~~
- name: mcd-drain-error
  rules:
  - alert: MCDDrainError
    annotations:
      message: 'Drain failed on {{ $labels.node }} , updates may be blocked. For
        more details: oc logs -f -n {{ $labels.namespace }} {{ $labels.pod }}
        -c machine-config-daemon'
    expr: |
      mcd_drain_err > 0
    labels:
      severity: warning
~~~
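Note that the rule above has no `for:` clause, so once `mcd_drain_err` is non-zero the alert should go straight to firing on the next rule evaluation; there is no built-in pending delay. A minimal sketch of that expected behavior (a simplified model of the evaluation semantics, not Prometheus code):

```python
def first_firing_eval(samples, threshold=0):
    """Return the index of the first evaluation cycle at which
    mcd_drain_err > threshold, i.e. when MCDDrainError should start
    firing given that the rule has no `for:` (pending) duration."""
    for i, value in enumerate(samples):
        if value > threshold:
            return i
    return None  # metric never exceeded the threshold

# mcd_drain_err values sampled at successive rule evaluations
print(first_firing_eval([0, 0, 3, 3]))  # -> 2: fires at the first non-zero sample
```

Under this model, the week-long delay we observed would mean the metric itself stayed at zero for most of the week, rather than the alert rule delaying.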
Issue:
For the customer: even after "Draining failed" appeared in the machine-config-daemon logs, the MCDDrainError alert did not fire.
For us (during reproduction): I left my cluster in the same state, with the MCP stuck due to a PDB, for one week. I saw various "Drain failed" log entries throughout, but the alert only appeared on the last day of the week.
Version-Release number of selected component (if applicable):
4.8.43
How reproducible:
Can be reproduced.
Steps to Reproduce:
1. Taint a worker node, say worker-A.
2. Create an application with: a nodeSelector (worker-A), a toleration (for worker-A's taint), and a PDB.
3. Create a test MachineConfig to roll out the worker MCP:
~~~
name: 99-worker-test6
labels:
  machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,test-6
        filesystem: root
        mode: 0644
        path: /etc/test6
~~~
4. Check the machine-config-daemon logs of the node.
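For reference, the PDB from step 2 might look like the following (name and selector are illustrative, not the exact manifest used in the reproduction):

~~~
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: test-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: test-app
~~~

With minAvailable: 1 and a single replica pinned to worker-A, the eviction can never succeed, so the drain keeps failing.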
Actual results:
The MCDDrainError alert does not appear!
Expected results:
The MCDDrainError alert should fire as soon as the drain fails due to the PDB.
Additional info:
The alert instantly appeared when an infrastructure pod was facing the PDB issue.