OpenShift Bugs / OCPBUGS-57218

KubeDaemonSetMisScheduled and KubeDaemonSetRolloutStuck Firing due to NodeMaintenance


    • Bug
    • Resolution: Unresolved
    • 4.18.z
    • Monitoring
    • Quality / Stability / Reliability

      Description of problem:

      
      When a node is put into maintenance, for example with the Node Maintenance Operator, the KubeDaemonSetMisScheduled and KubeDaemonSetRolloutStuck alerts begin firing for all platform ("openshift-*") DaemonSets.
      
      On a cluster with many Red Hat operators installed, this can reach 20 or more alerts of each type (40 or more in total).
      
      https://github.com/openshift/cluster-monitoring-operator/blob/release-4.18/assets/control-plane/prometheus-rule.yaml#L116
      
      https://github.com/openshift/cluster-monitoring-operator/blob/release-4.18/assets/control-plane/prometheus-rule.yaml#L167
      
      When a node is put into maintenance, the operator adds a new taint to it:
      {"effect":"NoSchedule","key":"medik8s.io/drain"}
      
      For alerting purposes, the rules should take into account that a node can be in maintenance / cordoned / SchedulingDisabled while it is still in the Ready state.
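
The condition proposed above can be sketched in Python. This is a minimal illustration, assuming node objects in the JSON shape returned by `oc get node -o json`. The `medik8s.io/drain` key is the taint quoted in this report, and `node.kubernetes.io/unschedulable` is the taint Kubernetes adds to cordoned nodes. The helper names are hypothetical, and this is not how the monitoring stack evaluates the alerts today.

```python
from typing import Any, Mapping

# Taint key quoted in this report, plus the taint Kubernetes adds to cordoned nodes.
MAINTENANCE_TAINT_KEYS = {"medik8s.io/drain", "node.kubernetes.io/unschedulable"}


def is_ready(node: Mapping[str, Any]) -> bool:
    """Return True if the node has a Ready condition with status "True"."""
    conditions = node.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def in_maintenance(node: Mapping[str, Any]) -> bool:
    """Return True if the node is cordoned or carries a maintenance taint."""
    spec = node.get("spec", {})
    if spec.get("unschedulable", False):
        return True
    taints = spec.get("taints") or []
    return any(t.get("key") in MAINTENANCE_TAINT_KEYS for t in taints)


def should_suppress_daemonset_alerts(node: Mapping[str, Any]) -> bool:
    """Hypothetical policy: ignore Ready nodes that are deliberately unschedulable."""
    return is_ready(node) and in_maintenance(node)


# Example node: Ready, but tainted by the maintenance operator (assumed shape).
node = {
    "spec": {"taints": [{"effect": "NoSchedule", "key": "medik8s.io/drain"}]},
    "status": {"conditions": [{"type": "Ready", "status": "True"}]},
}
print(should_suppress_daemonset_alerts(node))  # prints True
```

A node that is Ready and untainted, or one that is not Ready at all, yields False, so alerts for genuine scheduling problems would still fire under this policy.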
      
      

      Version-Release number of selected component (if applicable):

      4.18.15
      

      How reproducible:

      Always
      

      Steps to Reproduce:

          1. Start with an OpenShift 4.18 cluster.
          2. Install the Node Maintenance Operator.
          3. Put a node into maintenance:
      
      apiVersion: nodemaintenance.medik8s.io/v1beta1
      kind: NodeMaintenance
      metadata:
        name: nodemaintenance-cr
      spec:
        nodeName: worker-0.ocp418shared.tamlab.brq2.redhat.com
        reason: "node maint"
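
To repeat the reproduction against a different node, the manifest above can be generated rather than hand-edited. The following is a small sketch that builds the same NodeMaintenance object as a Python dict and prints it as JSON; the function name and its default for `name` are hypothetical, while the field names and values come from the manifest in this report.

```python
import json


def node_maintenance_manifest(
    node_name: str, reason: str, name: str = "nodemaintenance-cr"
) -> dict:
    """Build a NodeMaintenance object with the same fields as the manifest above."""
    return {
        "apiVersion": "nodemaintenance.medik8s.io/v1beta1",
        "kind": "NodeMaintenance",
        "metadata": {"name": name},
        "spec": {"nodeName": node_name, "reason": reason},
    }


manifest = node_maintenance_manifest(
    "worker-0.ocp418shared.tamlab.brq2.redhat.com", "node maint"
)
print(json.dumps(manifest, indent=2))
```

Because `oc apply` accepts JSON as well as YAML, the printed output can be piped to `oc apply -f -`.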
      

      Actual results:

      Many extra alerts fire that provide no actionable information.
      

      Expected results:

      
      

      Additional info:

      
      

              jfajersk@redhat.com Jan Fajerski
              rhn-support-mrobson Matt Robson
              Junqi Zhao