Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-997

Delay in triggering MCDDrainError alert

    • Quality / Stability / Reliability
    • False
    • Hide

      None

      Show
      None
    • 1
    • None
    • None
    • None
    • None
    • None
    • MCO Sprint 235, MCO Sprint 236
    • 2
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      Description of problem:

      Delay in triggering MCDDrainError alert
      
      Current understanding is MCDDrainError alert should pop up if any pod (infrastructure pod or application) on the node is not getting removed due to PDB. From below code we understnd that alert is looking at logs of machine-config-daemon pod of that node.
      
      ~~~
      - name: mcd-drain-error
            rules:
            - alert: MCDDrainError
              annotations:
                message: 'Drain failed on {{ $labels.node }} , updates may be blocked. For
                  more details:  oc logs -f -n {{ $labels.namespace }} {{ $labels.pod }}
                  -c machine-config-daemon'
              expr: |
                mcd_drain_err > 0  
              labels:
                severity: warning
      ~~~
      
      Issue:
      
      For Customer : Even after "Draining failed" in  machine-config-daemon MCDDrainError alert  did not appear.
      
      For us (during reprod): I left my cluster in same state where MCP was stuck due to pdb for One week I saw various  "Drain failed" logs .. but alert appeared on last day of the week

      Version-Release number of selected component (if applicable):

      4.8.43

      How reproducible:

      can be reproduced:

      Steps to Reproduce:

      1. Taint a worker node say worker-A
      2. Create application add : Nodeselector (worker-A), tolration (worker-A), PDB 
      3. Create a test Machine Config to rollout woker MCP
      ~~~
      name: 99-worker-test6
        labels:
          machineconfiguration.openshift.io/role: worker
      spec:
        config:
          ignition:
            version: 3.2.0
          storage:
            files:
            - contents:
                source: data:,test-6
              filesystem: root
              mode: 0644
              path: /etc/test6
      ~~~
      4. check machine-config-daemon logs of the node

      Actual results:

      MCDDrainError do not appear!

      Expected results:

      MCDDrainError should appear as soon as Drain fails due to PDB

      Additional info:

      Alert intantly appered with an infrastructure pod was facing PDB issue.

      kcs: https://access.redhat.com/solutions/6974995

              djoshy David Joshy
              rhn-support-anisal Apurva Nisal
              Rio Liu Rio Liu
              None
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Created:
                Updated:
                Resolved: