OpenShift Bugs / OCPBUGS-5089

4.8 node-exporter daemonset does not filter nodeReadyCount with kubernetes.io/os=linux nodeSelector

    • Type: Bug
    • Resolution: Done
    • Priority: Undefined
    • Affects Version: 4.8.z
    • Component: Monitoring
      Description of problem:
      This bug was found while debugging https://issues.redhat.com/browse/OCPQE-13200.

      Deployed 4.8.0-0.nightly-2022-11-30-073158 with the aos-4_8/ipi-on-aws/versioned-installer-ovn-winc-ci template, which creates a cluster with 3 Linux masters, 3 Linux workers and 2 Windows workers. In this bug, ip-10-0-149-219.us-east-2.compute.internal and ip-10-0-158-129.us-east-2.compute.internal are the Windows workers (they carry the kubernetes.io/os=windows label, not kubernetes.io/os=linux).

      # oc get node --show-labels
      NAME                                         STATUS   ROLES    AGE     VERSION                             LABELS
      ip-10-0-139-166.us-east-2.compute.internal   Ready    worker   4h35m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-139-166,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
      ip-10-0-143-178.us-east-2.compute.internal   Ready    master   4h47m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-143-178,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
      ip-10-0-149-219.us-east-2.compute.internal   Ready    worker   3h51m   v1.21.11-rc.0.1506+5cc9227e4695d1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5a.large,beta.kubernetes.io/os=windows,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ec2amaz-2hcbpla,kubernetes.io/os=windows,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5a.large,node.kubernetes.io/windows-build=10.0.17763,node.openshift.io/os_id=Windows,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
      ip-10-0-158-129.us-east-2.compute.internal   Ready    worker   3h45m   v1.21.11-rc.0.1506+5cc9227e4695d1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5a.large,beta.kubernetes.io/os=windows,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ec2amaz-golrucd,kubernetes.io/os=windows,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5a.large,node.kubernetes.io/windows-build=10.0.17763,node.openshift.io/os_id=Windows,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
      ip-10-0-175-105.us-east-2.compute.internal   Ready    worker   4h35m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-175-105,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
      ip-10-0-188-67.us-east-2.compute.internal    Ready    master   4h43m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-188-67,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
      ip-10-0-192-42.us-east-2.compute.internal    Ready    worker   4h35m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-192-42,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
      ip-10-0-210-137.us-east-2.compute.internal   Ready    master   4h43m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-210-137,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
      
      # oc get node -l kubernetes.io/os=linux
      NAME                                         STATUS   ROLES    AGE     VERSION
      ip-10-0-139-166.us-east-2.compute.internal   Ready    worker   4h31m   v1.21.14+a17bdb3
      ip-10-0-143-178.us-east-2.compute.internal   Ready    master   4h43m   v1.21.14+a17bdb3
      ip-10-0-175-105.us-east-2.compute.internal   Ready    worker   4h31m   v1.21.14+a17bdb3
      ip-10-0-188-67.us-east-2.compute.internal    Ready    master   4h39m   v1.21.14+a17bdb3
      ip-10-0-192-42.us-east-2.compute.internal    Ready    worker   4h31m   v1.21.14+a17bdb3
      ip-10-0-210-137.us-east-2.compute.internal   Ready    master   4h40m   v1.21.14+a17bdb3
      
      # oc get node -l kubernetes.io/os=windows
      NAME                                         STATUS   ROLES    AGE     VERSION
      ip-10-0-149-219.us-east-2.compute.internal   Ready    worker   3h48m   v1.21.11-rc.0.1506+5cc9227e4695d1
      ip-10-0-158-129.us-east-2.compute.internal   Ready    worker   3h41m   v1.21.11-rc.0.1506+5cc9227e4695d1

      The monitoring ClusterOperator is degraded with "expected 8 ready pods for "node-exporter" daemonset, got 6":

      # oc get co monitoring -oyaml
      ...
        - lastTransitionTime: "2022-12-21T03:08:47Z"
          message: 'Failed to rollout the stack. Error: running task Updating node-exporter
            failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object
            failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter:
            expected 8 ready pods for "node-exporter" daemonset, got 6 '
          reason: UpdatingnodeExporterFailed
          status: "True"
          type: Degraded
        extension: null
      

      The same errors appear in the CMO logs:

      # oc -n openshift-monitoring logs -c cluster-monitoring-operator cluster-monitoring-operator-7fd77f4b87-pnfm9 | grep "reconciling node-exporter DaemonSet failed" | tail
      I1221 07:30:52.343230       1 operator.go:503] ClusterOperator reconciliation failed (attempt 55), retrying. Err: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6 
      E1221 07:30:52.343253       1 operator.go:402] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6 
      I1221 07:35:54.713045       1 operator.go:503] ClusterOperator reconciliation failed (attempt 56), retrying. Err: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6 
      E1221 07:35:54.713064       1 operator.go:402] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6 
      

      node-exporter pods run only on nodes labeled kubernetes.io/os=linux:

      # oc -n openshift-monitoring get ds
      NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
      node-exporter   6         6         6       6            6           kubernetes.io/os=linux   4h33m
      # oc -n openshift-monitoring get pod -o wide | grep node-exporter
      node-exporter-2tkxv                            2/2     Running   0          5h35m   10.0.188.67    ip-10-0-188-67.us-east-2.compute.internal    <none>           <none>
      node-exporter-hbn65                            2/2     Running   0          5h31m   10.0.175.105   ip-10-0-175-105.us-east-2.compute.internal   <none>           <none>
      node-exporter-prn9h                            2/2     Running   0          5h35m   10.0.143.178   ip-10-0-143-178.us-east-2.compute.internal   <none>           <none>
      node-exporter-q4tsw                            2/2     Running   0          5h31m   10.0.192.42    ip-10-0-192-42.us-east-2.compute.internal    <none>           <none>
      node-exporter-qx7dc                            2/2     Running   0          5h31m   10.0.139.166   ip-10-0-139-166.us-east-2.compute.internal   <none>           <none>
      node-exporter-zrsnx                            2/2     Running   0          5h35m   10.0.210.137   ip-10-0-210-137.us-east-2.compute.internal   <none>           <none>
      
      # oc -n openshift-monitoring get ds node-exporter -oyaml
      ...
      status:
        currentNumberScheduled: 6
        desiredNumberScheduled: 6
        numberAvailable: 6
        numberMisscheduled: 0
        numberReady: 6
        observedGeneration: 1
        updatedNumberScheduled: 6

      The reason CMO reports monitoring as degraded is that 4.8 counts every Ready node toward nodeReadyCount, regardless of whether it has the kubernetes.io/os=linux label, so the expected pod count (8) includes the two Windows workers that node-exporter can never be scheduled on:

      https://github.com/openshift/cluster-monitoring-operator/blob/release-4.8/pkg/client/client.go#L951-L959

      The issue is fixed in 4.9+, where nodeReadyCount only counts nodes that match the daemonset's nodeSelector:

      https://github.com/openshift/cluster-monitoring-operator/blob/release-4.9/pkg/client/client.go#L1052-L1061
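
      For illustration, here is a minimal Go sketch of the 4.9-style check (this is not the actual CMO code; the client setup, variable names and messages are my own): the node list is restricted to the daemonset's nodeSelector before counting Ready nodes, so the two Windows workers no longer inflate the expected pod count.

      package main

      import (
          "context"
          "fmt"

          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/labels"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Assumes a kubeconfig at the default location (~/.kube/config).
          config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(config)
          ctx := context.Background()

          ds, err := client.AppsV1().DaemonSets("openshift-monitoring").Get(ctx, "node-exporter", metav1.GetOptions{})
          if err != nil {
              panic(err)
          }

          // 4.9-style: restrict the node list to the daemonset's nodeSelector
          // (kubernetes.io/os=linux for node-exporter), so Windows nodes are ignored.
          selector := labels.SelectorFromSet(labels.Set(ds.Spec.Template.Spec.NodeSelector)).String()
          nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{LabelSelector: selector})
          if err != nil {
              panic(err)
          }

          nodeReadyCount := 0
          for _, node := range nodes.Items {
              for _, cond := range node.Status.Conditions {
                  if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
                      nodeReadyCount++
                      break
                  }
              }
          }

          // The 4.8 logic effectively compared NumberReady against the count of all
          // Ready nodes (8 on this cluster), which can never match the 6 pods the
          // daemonset schedules on Linux nodes only.
          if ds.Status.NumberReady != int32(nodeReadyCount) {
              fmt.Printf("rollout not complete: expected %d ready pods, got %d\n", nodeReadyCount, ds.Status.NumberReady)
              return
          }
          fmt.Println("node-exporter daemonset rollout is complete")
      }

      On this cluster the filtered count is 6 (3 Linux masters + 3 Linux workers), which matches the daemonset's numberReady, so a check like this would pass instead of reporting the 8-vs-6 mismatch.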

      Version-Release number of selected component (if applicable):

      4.8.0-0.nightly-2022-11-30-073158, deployed with the aos-4_8/ipi-on-aws/versioned-installer-ovn-winc-ci template (3 Linux masters, 3 Linux workers, 2 Windows workers).

      How reproducible:

      Deploy OCP 4.8 with both Linux and Windows workers.

      Steps to Reproduce:

      1. See the description above.
      

      Actual results:

      The monitoring ClusterOperator is degraded with:
      waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6

      Expected results:

      Monitoring should not be degraded.

      Additional info:

      If we don't want to fix this in 4.8, this bug can be closed.

              Assignee: Jayapriya Pai (janantha@redhat.com)
              Reporter: Junqi Zhao (juzhao@redhat.com)