- Task
- Resolution: Done
- Major
- Quality / Stability / Reliability
- OpenShift SPLAT - Sprint 275
User Story:
As an OpenShift engineer, I want to report and investigate CI jobs that are randomly OOMKilled by Prow, impacting feature readiness, so that we can increase velocity and confidence in features proposed to upstream cloud-provider-aws.
Description:
- The e2e job is frequently OOMKilled. Refs: <https://kubernetes.slack.com/archives/C7J9RP96G/p1754511876402269?thread_ts=1754505741.634999&cid=C7J9RP96G>.
- cloud-provider-aws (upstream) CI information https://github.com/kubernetes/cloud-provider-aws/blob/master/docs/development.md#ci-test-infrastructure
- CI monitoring dashboard (filtered to the period when the job got stuck):
- Jobs overview dashboard https://monitoring-eks.prow.k8s.io/d/53g2x7OZz/jobs?orgId=1&var-org=kubernetes&var-repo=cloud-provider-aws&var-job=All&from=1754492403103&to=1754494500102
- Job examples:
- Failed job:
- https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/cloud-provider-aws/1158/pull-cloud-provider-aws-e2e/1953110200760143872
- https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?orgId=1&var-org=kubernetes&var-repo=cloud-provider-aws&var-job=pull-cloud-provider-aws-e2e&var-build=All&from=1754433557755&to=1754438416037
- Dashboard snapshots:
- All job runs: Screenshot From 2025-08-06 21-05-59.png
- Failed step time frame (using the limits above): Screenshot From 2025-08-06 21-06-13.png
- Succeeded job:
- https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/cloud-provider-aws/1158/pull-cloud-provider-aws-e2e/1952866070154973184
- https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?orgId=1&var-org=kubernetes&var-repo=cloud-provider-aws&var-job=pull-cloud-provider-aws-e2e&var-build=All&from=1754433606980&to=1754438510368
- Dashboard snapshots:
- All job runs: Screenshot From 2025-08-06 21-03-32.png
Acceptance Criteria:
- Open an upstream issue reporting the problem
- Open a PR proposing an increase to the job's resource limits
- Check for room to optimize the step (e.g., use a pre-built kops binary instead of downloading it every time)
- Open a PR updating the CI section of the upstream development document to reference the Grafana dashboard
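For the resource-limit PR, a sketch of what the change might look like in the Prow presubmit definition (the actual job config lives in kubernetes/test-infra; the image and memory values below are placeholder assumptions, not measured requirements, and should be sized from the build dashboards linked above):

```yaml
# Hypothetical excerpt of the pull-cloud-provider-aws-e2e presubmit
# in kubernetes/test-infra. Memory values are placeholders; derive the
# real numbers from observed peak usage on the Grafana builds dashboard.
presubmits:
  kubernetes/cloud-provider-aws:
    - name: pull-cloud-provider-aws-e2e
      spec:
        containers:
          - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest  # placeholder image
            resources:
              requests:
                memory: "4Gi"   # placeholder: raise from the current request
              limits:
                memory: "6Gi"   # placeholder: headroom above observed peak
```

Requests affect scheduling on the build cluster, while the limit is what triggers the OOMKill, so both should be adjusted together.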
Other Information:
Issue created by splat-bot.