- Bug
- Resolution: Done
- 4.9.z
Description of problem:
The bug was found while debugging https://issues.redhat.com/browse/OCPQE-13200
Deployed 4.8.0-0.nightly-2022-11-30-073158 with the aos-4_8/ipi-on-aws/versioned-installer-ovn-winc-ci template; the template created a cluster with 3 Linux masters, 3 Linux workers and 2 Windows workers. ip-10-0-149-219.us-east-2.compute.internal and ip-10-0-158-129.us-east-2.compute.internal are the Windows workers in this bug (they carry the kubernetes.io/os=windows label, not kubernetes.io/os=linux).
# oc get node --show-labels
NAME                                         STATUS   ROLES    AGE     VERSION                             LABELS
ip-10-0-139-166.us-east-2.compute.internal   Ready    worker   4h35m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-139-166,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-143-178.us-east-2.compute.internal   Ready    master   4h47m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-143-178,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-149-219.us-east-2.compute.internal   Ready    worker   3h51m   v1.21.11-rc.0.1506+5cc9227e4695d1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5a.large,beta.kubernetes.io/os=windows,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ec2amaz-2hcbpla,kubernetes.io/os=windows,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5a.large,node.kubernetes.io/windows-build=10.0.17763,node.openshift.io/os_id=Windows,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-158-129.us-east-2.compute.internal   Ready    worker   3h45m   v1.21.11-rc.0.1506+5cc9227e4695d1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5a.large,beta.kubernetes.io/os=windows,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ec2amaz-golrucd,kubernetes.io/os=windows,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5a.large,node.kubernetes.io/windows-build=10.0.17763,node.openshift.io/os_id=Windows,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-175-105.us-east-2.compute.internal   Ready    worker   4h35m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-175-105,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
ip-10-0-188-67.us-east-2.compute.internal    Ready    master   4h43m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-188-67,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
ip-10-0-192-42.us-east-2.compute.internal    Ready    worker   4h35m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-192-42,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
ip-10-0-210-137.us-east-2.compute.internal   Ready    master   4h43m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-210-137,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c

# oc get node -l kubernetes.io/os=linux
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-139-166.us-east-2.compute.internal   Ready    worker   4h31m   v1.21.14+a17bdb3
ip-10-0-143-178.us-east-2.compute.internal   Ready    master   4h43m   v1.21.14+a17bdb3
ip-10-0-175-105.us-east-2.compute.internal   Ready    worker   4h31m   v1.21.14+a17bdb3
ip-10-0-188-67.us-east-2.compute.internal    Ready    master   4h39m   v1.21.14+a17bdb3
ip-10-0-192-42.us-east-2.compute.internal    Ready    worker   4h31m   v1.21.14+a17bdb3
ip-10-0-210-137.us-east-2.compute.internal   Ready    master   4h40m   v1.21.14+a17bdb3

# oc get node -l kubernetes.io/os=windows
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-149-219.us-east-2.compute.internal   Ready    worker   3h48m   v1.21.11-rc.0.1506+5cc9227e4695d1
ip-10-0-158-129.us-east-2.compute.internal   Ready    worker   3h41m   v1.21.11-rc.0.1506+5cc9227e4695d1
Monitoring is degraded with "expected 8 ready pods for "node-exporter" daemonset, got 6":
# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2022-12-21T03:08:47Z"
    message: 'Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6 '
    reason: UpdatingnodeExporterFailed
    status: "True"
    type: Degraded
  extension: null
The same errors appear in the CMO logs:
# oc -n openshift-monitoring logs -c cluster-monitoring-operator cluster-monitoring-operator-7fd77f4b87-pnfm9 | grep "reconciling node-exporter DaemonSet failed" | tail
I1221 07:30:52.343230 1 operator.go:503] ClusterOperator reconciliation failed (attempt 55), retrying. Err: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6
E1221 07:30:52.343253 1 operator.go:402] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6
I1221 07:35:54.713045 1 operator.go:503] ClusterOperator reconciliation failed (attempt 56), retrying. Err: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6
E1221 07:35:54.713064 1 operator.go:402] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6
node-exporter pods run only on kubernetes.io/os=linux nodes:
# oc -n openshift-monitoring get ds
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         6       6            6           kubernetes.io/os=linux   4h33m
# oc -n openshift-monitoring get pod -o wide | grep node-exporter
node-exporter-2tkxv   2/2   Running   0   5h35m   10.0.188.67    ip-10-0-188-67.us-east-2.compute.internal    <none>   <none>
node-exporter-hbn65   2/2   Running   0   5h31m   10.0.175.105   ip-10-0-175-105.us-east-2.compute.internal   <none>   <none>
node-exporter-prn9h   2/2   Running   0   5h35m   10.0.143.178   ip-10-0-143-178.us-east-2.compute.internal   <none>   <none>
node-exporter-q4tsw   2/2   Running   0   5h31m   10.0.192.42    ip-10-0-192-42.us-east-2.compute.internal    <none>   <none>
node-exporter-qx7dc   2/2   Running   0   5h31m   10.0.139.166   ip-10-0-139-166.us-east-2.compute.internal   <none>   <none>
node-exporter-zrsnx   2/2   Running   0   5h35m   10.0.210.137   ip-10-0-210-137.us-east-2.compute.internal   <none>   <none>

# oc -n openshift-monitoring get ds node-exporter -oyaml
...
status:
  currentNumberScheduled: 6
  desiredNumberScheduled: 6
  numberAvailable: 6
  numberMisscheduled: 0
  numberReady: 6
  observedGeneration: 1
  updatedNumberScheduled: 6
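As a cross-check (illustrative command, not part of the original report), the selector shown in the NODE SELECTOR column above can also be read directly from the DaemonSet spec:

# Print the nodeSelector used by the node-exporter DaemonSet pod template
oc -n openshift-monitoring get ds node-exporter -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'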
The reason CMO reports monitoring as degraded is that 4.8 counts every Ready node toward nodeReadyCount, regardless of whether it carries the kubernetes.io/os=linux label, so the expected pod count (8) includes the two Windows workers that the DaemonSet can never schedule onto.
The issue is fixed in 4.9+.
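For illustration only (these commands are not part of the original report and assume the jq CLI is available alongside oc), the mismatch can be reproduced from the node list: counting every Ready node yields the 8 pods that 4.8 CMO expects, while counting only Ready nodes matching the DaemonSet's kubernetes.io/os=linux selector yields the 6 pods that actually run:

# 4.8-style count: every Ready node, regardless of OS label (8 on this cluster)
oc get nodes -o json | jq '[.items[] | select(any(.status.conditions[]; .type=="Ready" and .status=="True"))] | length'

# 4.9+-style count: only Ready nodes matching the node-exporter nodeSelector (6 on this cluster)
oc get nodes -l kubernetes.io/os=linux -o json | jq '[.items[] | select(any(.status.conditions[]; .type=="Ready" and .status=="True"))] | length'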
Version-Release number of selected component (if applicable):
Deployed 4.8.0-0.nightly-2022-11-30-073158 with the aos-4_8/ipi-on-aws/versioned-installer-ovn-winc-ci template; the template created a cluster with 3 Linux masters, 3 Linux workers and 2 Windows workers.
How reproducible:
Deploy OCP 4.8 with both Linux and Windows workers.
Steps to Reproduce:
1. See the description of problem above.
Actual results:
Monitoring is degraded with "waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6".
Expected results:
Monitoring should not be degraded.
Additional info:
If we don't want to fix this in 4.8, we can close this bug.
- blocks
  - OCPBUGS-5089 4.8 node-exporter daemonset does not filter nodeReadyCount with kubernetes.io/os=linux nodeSelector (Closed)
- clones
  - OCPBUGS-5089 4.8 node-exporter daemonset does not filter nodeReadyCount with kubernetes.io/os=linux nodeSelector (Closed)