OCP Technical Release Team / TRT-55

Investigate regression in alerts firing on Azure: KubePersistentVolumeErrors


    • Type: Story
    • Priority: Critical
    • Resolution: Done

      TRT has identified a regression in this test, which is currently blocking payload promotion for BOTH 4.10 nightlies and 4.10 CI.

      The test failure message:

      alert KubePersistentVolumeErrors fired for 210 seconds with labels: {container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", persistentvolume="pvc-125ea951-e4b5-45d5-a096-559520654a9b", phase="Failed", service="kube-state-metrics", severity="warning"}
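      For context on what fired: this alert appears to come from the upstream kubernetes-mixin rules shipped with the cluster monitoring stack. A sketch of how to pull the in-cluster definition, plus the rough upstream expression (the exact rule carried in a given 4.10 payload may differ):

      # Dump the in-cluster rule (exact resource layout in openshift-monitoring is an assumption here):
      oc get prometheusrules -n openshift-monitoring -o yaml | grep -B2 -A8 'alert: KubePersistentVolumeErrors'

      # Upstream kubernetes-mixin defines the alert roughly as:
      #   expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0
      #   for: 5m
      # i.e. it fires when kube-state-metrics reports a PersistentVolume stuck in the
      # Failed (or Pending) phase, which matches the phase="Failed" label above.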
      

      It appears to always be this same container: kube-rbac-proxy-main (note this is a label on the kube-state-metrics scrape target, not necessarily a pod consuming the PV).

      A sample prow job failing the nightly payload:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade/1460468178256662528

      Testgrid indicates we started tanking yesterday (Nov 15): https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade&show-stale-tests=

      A sample prow job failing the CI payload:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1460525680675524608

      Testgrid also shows the problem may have started yesterday: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&show-stale-tests=

      CI search indicates the rate at which this is occurring has picked up over the last two days as well:
      https://search.ci.openshift.org/?search=KubePersistentVolumeErrors&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

      TRT needs debugging assistance from the storage or monitoring teams. Key questions we're wondering (a rough debugging sketch follows the list):

      1. What is the PVC for?
      2. What does the alert mean?
      3. Where is the alert defined?
      4. What is the impact on monitoring?
      5. Why is the referenced PVC no longer present?
      6. Why are there PVCs created at all? (They don't seem to be mounted in any pods.)
      7. Do you know of any PRs that merged in the last few days that may be causing this?
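
      A minimal debugging sketch for whoever picks this up (the PV name is copied from the sample failure above and will differ per run; commands are standard oc):

      # Inspect the failed PV and look for the PVC that spawned it
      # (dynamically provisioned PVs are named pvc-<uid of the claim>):
      oc get pv pvc-125ea951-e4b5-45d5-a096-559520654a9b -o yaml
      oc describe pv pvc-125ea951-e4b5-45d5-a096-559520654a9b
      oc get pvc --all-namespaces | grep 125ea951

      # List any PVs currently sitting in a Failed phase:
      oc get pv | grep -i failed

      # Or query the metric the alert is built on (via the console metrics UI or thanos-querier):
      #   kube_persistentvolume_status_phase{phase="Failed"} > 0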

              Assignee: Unassigned
              Reporter: Devan Goodwin (rhn-engineering-dgoodwin)