-
Bug
-
Resolution: Done
-
Critical
-
4.15.0
-
Important
-
No
-
False
-
Description of problem:
The following pre-submit jobs for Local Zones have been permanently failing since August:
- e2e-aws-ovn-localzones: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-aws-ovn-localzones?buildId=1716457254460329984
- e2e-aws-ovn-shared-vpc-localzones: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-localzones
Investigating, we can see common failures in the tests '[sig-network] can collect <poller_name> poller pod logs', which cause most of the jobs to not complete correctly.
Exploring the code, I can see that the test was added recently, around August, which matches when the failures started.
Pods must tolerate the taint "node-role.kubernetes.io/edge" to run on instances located in Local Zones ("edge nodes"). I am not sure if I am looking in the correct place, but the deployment seems to tolerate only the master taint: https://github.com/openshift/origin/blob/master/pkg/monitortests/network/disruptionpodnetwork/host-network-target-deployment.yaml#L42
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
always
Steps to Reproduce:
Trigger the job:
1. Open a PR on installer.
2. Run the job.
3. Check the failed tests '[sig-network] can collect <poller_name> poller pod logs'.
Example of a blocked 4.15 feature PR (Wavelength Zones): https://github.com/openshift/installer/pull/7369#issuecomment-1783699175
Actual results:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/7590/pull-ci-openshift-installer-master-e2e-aws-ovn-localzones/1715075142427611136
{ 1 pods lacked sampler output: [pod-network-to-pod-network-disruption-poller-d94fb55db-9qfpz]}
E1018 22:06:34.773866 1 disruption_backend_sampler.go:496] not finished writing all samples (1 remaining), but we're told to close
E1018 22:06:34.774669 1 disruption_backend_sampler.go:496] not finished writing all samples (1 remaining), but we're told to close
Expected results:
Should the monitor pods be scheduled on edge nodes? How can we track job failures for new monitor tests?
Additional info:
Edge nodes have NoSchedule taints applied by default. To run monitor pods on those nodes, the pods need to tolerate the taint "node-role.kubernetes.io/edge". See the enhancement for more information: https://github.com/openshift/enhancements/blob/master/enhancements/installer/aws-custom-edge-machineset-local-zones.md#user-workload-deployments
Looking at the must-gather of job 1716457254460329984, you can see that the monitor pods were not scheduled due to the missing tolerations:
$ grep -rni pod-network-to-pod-network-disruption-poller-7c97cd5d7-t2mn2 \
    1716457254460329984-must-gather/09abb0d6fc08ee340563e6e11f5ceafb42fb371e50ab6acee6764031062525b7/namespaces/openshift-kube-scheduler/pods/ \
    | awk -F'] "' '{print$2}' | sort | uniq -c
215 Unable to schedule pod; no fit; waiting" pod="e2e-pod-network-disruption-test-59s5d/pod-network-to-pod-network-disruption-poller-7c97cd5d7-t2mn2" err="0/7 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/edge: }, 6 node(s) didn't match pod anti-affinity rules. preemption: 0/7 nodes are available: 1 Preemption is not helpful for scheduling, 6 No preemption victims found for incoming pod.."
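For reference, a toleration matching the taint reported by the scheduler could look like the fragment below. This is a minimal sketch, assuming the edge taint uses the NoSchedule effect mentioned above and that the fragment is placed in the pod template spec of the monitor deployment; it is not necessarily the change that was merged in origin.

```yaml
# Hypothetical fragment for a Deployment's pod template (spec.template.spec).
tolerations:
  # Tolerate the edge taint regardless of its value, so the pod
  # can be scheduled on Local Zone ("edge") nodes.
  - key: node-role.kubernetes.io/edge
    operator: Exists
    effect: NoSchedule
```

Using "operator: Exists" matches the taint whatever its value is, which fits the empty value shown in the scheduler message.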
- blocks
-
SPLAT-1125 [aws] Add support to AWS Wavelength - Day 0 Fully automated
- Closed
-
SPLAT-1218 [aws] Add support to AWS Wavelength - Day 0 BYO VPC
- Closed
-
OCPBUGS-23042 Monitor tests are failing in Local Zone jobs (edge nodes)
- MODIFIED
- is cloned by
-
OCPBUGS-23042 Monitor tests are failing in Local Zone jobs (edge nodes)
- MODIFIED
- is related to
-
SPLAT-657 AWS Local Zones - Phase II - IPI automation - Installer support to create resources in Local Zone for edge pool
- Closed
- relates to
-
SPLAT-1225 [aws][local-zones][CI] Investigate jobs failing
- Closed
- links to