Loading...

XML

Word

Printable

Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 4.15.0
Affects Version/s: 4.15.0
Component/s: Test Framework
Labels:

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Story Points:
None
Severity:
Important
Regression:
No

Target Backport Versions:
None
Target Version:

4.15.0
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

The following pre submit jobs for Local Zones are perm failing since August:
- e2e-aws-ovn-localzones: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-aws-ovn-localzones?buildId=1716457254460329984
- e2e-aws-ovn-shared-vpc-localzones: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-localzones

Investigating we can see common failures in tests '[sig-network] can collect <poller_name> poller pod logs', leading the most of jobs to not completed correctly for those failures.

Exploring the code I can see it was recently added, near August and matches with when the failures started.

It is required to tolerate the label "node-role.kubernetes.io/edge" to run pods on instances located in Local Zone ("edge nodes"). I am not sure if I am looking in the correct place, but it seems it is tolerating only master labels: https://github.com/openshift/origin/blob/master/pkg/monitortests/network/disruptionpodnetwork/host-network-target-deployment.yaml#L42

Version-Release number of selected component (if applicable):

4.15.0

How reproducible:

always

Steps to Reproduce:

trigger the job:
1. open a PR on installer
2. run the job
3. check failed tests '[sig-network] can collect <poller_name> poller pod logs' 

Example of 4.15 blocked feature PR (Wavelength Zones): https://github.com/openshift/installer/pull/7369#issuecomment-1783699175

Actual results:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/7590/pull-ci-openshift-installer-master-e2e-aws-ovn-localzones/1715075142427611136
{  1 pods lacked sampler output: [pod-network-to-pod-network-disruption-poller-d94fb55db-9qfpz]}

E1018 22:06:34.773866       1 disruption_backend_sampler.go:496] not finished writing all samples (1 remaining), but we're told to close
E1018 22:06:34.774669       1 disruption_backend_sampler.go:496] not finished writing all samples (1 remaining), but we're told to close

Expected results:

Monitor jobs be scheduled in edge nodes?
How we can track job failures for new monitor tests?

Additional info:

Edge nodes have NoSchedule taints applied by default, to run monitor pods in those nodes you need to tolerate the label "node-role.kubernetes.io/edge"

See the enhancement for more informaation: https://github.com/openshift/enhancements/blob/master/enhancements/installer/aws-custom-edge-machineset-local-zones.md#user-workload-deployments

Looking the must-gather of job 1716457254460329984, you can see the monitor pods not scheduled due the missing tolerations:

$ grep -rni pod-network-to-pod-network-disruption-poller-7c97cd5d7-t2mn2 \
  1716457254460329984-must-gather/09abb0d6fc08ee340563e6e11f5ceafb42fb371e50ab6acee6764031062525b7/namespaces/openshift-kube-scheduler/pods/ \
  | awk -F'] "' '{print$2}' | sort | uniq -c
    215 Unable to schedule pod; no fit; waiting" pod="e2e-pod-network-disruption-test-59s5d/pod-network-to-pod-network-disruption-poller-7c97cd5d7-t2mn2" 
err="0/7 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/edge: }, 
6 node(s) didn't match pod anti-affinity rules. preemption: 0/7 nodes are available: 
1 Preemption is not helpful for scheduling, 6 No preemption victims found for incoming pod.."

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

openshift-tests-junits-20231030.tar.bz
2023/10/31 4:37 PM
4.52 MB
Marco Braga

blocks

SPLAT-1125 [aws] Add support to AWS Wavelength - Day 0 Fully automated

Closed

SPLAT-1218 [aws] Add support to AWS Wavelength - Day 0 BYO VPC

Closed

OCPBUGS-23042 Monitor tests are failing in Local Zone jobs (edge nodes)

Closed

is cloned by

OCPBUGS-23042 Monitor tests are failing in Local Zone jobs (edge nodes)

Closed

is related to

SPLAT-657 AWS Local Zones - Phase II - IPI automation - Installer support to create resources in Local Zone for edge pool

Closed

relates to

SPLAT-1225 [aws][local-zones][CI] Investigate jobs failing

Closed

links to

openshift/origin#28363: OCPBUGS-22703: tolerate AWS edge nodes on monitor tests

(1 relates to, 1 links to)

Assignee:: Marco Braga

Reporter:: Marco Braga

Need Info From:: None

Contributors:: None

QA Contact:: None

Doc Contact:: None

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Created:: 2023/10/30 6:24 PM

Updated:: 2025/07/25 5:29 AM

Resolved:: 2023/11/10 5:17 PM

Details

Description

Attachments

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates