Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 4.14
Affects Version/s: 4.14
Component/s: Test Framework
Labels:

Severity:
Important
Regression:
No
Blocked:
False
Blocked Reason:

None
Release Note Text:
N/A
Release Note Type:
Release Note Not Required
Target Version:

4.14.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue ~~OCPBUGS-22703~~. The following is the description of the original issue:
—
Description of problem:

The following pre-submit jobs for Local Zones have been failing permanently since August:
- e2e-aws-ovn-localzones: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-aws-ovn-localzones?buildId=1716457254460329984
- e2e-aws-ovn-shared-vpc-localzones: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-localzones

Investigating the runs, we can see common failures in the tests '[sig-network] can collect <poller_name> poller pod logs', which cause most of the jobs to not complete successfully.

Exploring the code, I can see these tests were added recently, around August, which matches when the failures started.

Running pods on instances located in a Local Zone ("edge nodes") requires tolerating the taint "node-role.kubernetes.io/edge". I am not sure if I am looking in the correct place, but the deployment seems to tolerate only the master taint: https://github.com/openshift/origin/blob/master/pkg/monitortests/network/disruptionpodnetwork/host-network-target-deployment.yaml#L42

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

always

Steps to Reproduce:

trigger the job:
1. open a PR on installer
2. run the job
3. check the failed tests '[sig-network] can collect <poller_name> poller pod logs'

Example of 4.15 blocked feature PR (Wavelength Zones): https://github.com/openshift/installer/pull/7369#issuecomment-1783699175

Actual results:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/7590/pull-ci-openshift-installer-master-e2e-aws-ovn-localzones/1715075142427611136
{  1 pods lacked sampler output: [pod-network-to-pod-network-disruption-poller-d94fb55db-9qfpz]}

E1018 22:06:34.773866       1 disruption_backend_sampler.go:496] not finished writing all samples (1 remaining), but we're told to close
E1018 22:06:34.774669       1 disruption_backend_sampler.go:496] not finished writing all samples (1 remaining), but we're told to close

Expected results:

Should monitor pods be scheduled on edge nodes?
How can we track job failures for new monitor tests?

Additional info:

Edge nodes have NoSchedule taints applied by default. To run monitor pods on those nodes, the pods need to tolerate the taint "node-role.kubernetes.io/edge".

See the enhancement for more information: https://github.com/openshift/enhancements/blob/master/enhancements/installer/aws-custom-edge-machineset-local-zones.md#user-workload-deployments
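
As a rough illustration of the toleration described above, a pod spec fragment along these lines would allow a pod to be scheduled onto edge nodes. This is a sketch using standard Kubernetes toleration syntax, assuming the taint key node-role.kubernetes.io/edge with the NoSchedule effect and no specific value; it is not necessarily the exact change made in the linked origin PR.

```yaml
# Hypothetical fragment of a Deployment's pod template (spec.template.spec).
# Assumption: the edge taint has key node-role.kubernetes.io/edge and effect
# NoSchedule; "operator: Exists" matches the taint regardless of its value.
tolerations:
- key: node-role.kubernetes.io/edge
  operator: Exists
  effect: NoSchedule
```

A toleration only permits scheduling on tainted nodes; it does not force pods onto them, so any existing tolerations and affinity rules in the deployment would still apply.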

Looking at the must-gather of job 1716457254460329984, you can see that the monitor pods were not scheduled due to the missing tolerations:

$ grep -rni pod-network-to-pod-network-disruption-poller-7c97cd5d7-t2mn2 \
  1716457254460329984-must-gather/09abb0d6fc08ee340563e6e11f5ceafb42fb371e50ab6acee6764031062525b7/namespaces/openshift-kube-scheduler/pods/ \
  | awk -F'] "' '{print$2}' | sort | uniq -c
    215 Unable to schedule pod; no fit; waiting" pod="e2e-pod-network-disruption-test-59s5d/pod-network-to-pod-network-disruption-poller-7c97cd5d7-t2mn2" 
err="0/7 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/edge: }, 
6 node(s) didn't match pod anti-affinity rules. preemption: 0/7 nodes are available: 
1 Preemption is not helpful for scheduling, 6 No preemption victims found for incoming pod.."

clones

OCPBUGS-22703 Monitor tests are failing in Local Zone jobs (edge nodes)

Closed

is blocked by

OCPBUGS-22703 Monitor tests are failing in Local Zone jobs (edge nodes)

Closed

links to

openshift/origin#28387: [release-4.14] OCPBUGS-23042: tolerate AWS edge nodes on monitor tests

Assignee: Marco Braga

Reporter: OpenShift Prow Bot

Contributors: Devan Goodwin, W. Trevor King

Votes: 0

Watchers: 3

Created: 2023/11/08 3:11 AM

Updated: 2025/01/02 6:11 AM

Resolved: 2025/01/02 6:11 AM
