Issue Type: Story
Resolution: Done
Priority: Critical
TRT has identified a regression in this test which is currently blocking payload promotion for BOTH 4.10 nightly and 4.10 CI payloads.
The test failure message:
alert KubePersistentVolumeErrors fired for 210 seconds with labels: {container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", persistentvolume="pvc-125ea951-e4b5-45d5-a096-559520654a9b", phase="Failed", service="kube-state-metrics", severity="warning"}
It always appears to be the same container (per the container label on the alert): kube-rbac-proxy-main
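For reference, the upstream kubernetes-mixin version of KubePersistentVolumeErrors fires when kube-state-metrics reports a PersistentVolume stuck in the Failed or Pending phase (via the kube_persistentvolume_status_phase metric); assuming the OpenShift monitoring stack ships a similar rule, one quick way to see which volumes are currently reported as Failed is to query the monitoring stack directly. A minimal sketch, with the route host and token below being placeholders rather than values from this cluster:

    import requests

    # Placeholders: swap in the Thanos/Prometheus route and a token with
    # cluster-monitoring-view access on the cluster under investigation.
    PROM_URL = "https://thanos-querier-openshift-monitoring.apps.example.com"
    TOKEN = "sha256~REDACTED"

    # kube-state-metrics exposes one series per PV per phase; the value is 1
    # for the phase the volume is currently in.
    query = 'kube_persistentvolume_status_phase{job="kube-state-metrics",phase="Failed"} == 1'

    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": query},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # CI clusters often use self-signed certs
    )
    resp.raise_for_status()

    for result in resp.json()["data"]["result"]:
        labels = result["metric"]
        print(labels.get("persistentvolume"), labels.get("phase"))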
A sample prow job failing the nightly payload:
Testgrid indicates we started tanking yesterday (Nov 15): https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade&show-stale-tests=
A sample prow job failing the CI payload:
Testgrid also shows the problem may have started yesterday: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&show-stale-tests=
Search also indicates that the rate at which this is occurring has picked up in the last two days:
https://search.ci.openshift.org/?search=KubePersistentVolumeErrors&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
TRT needs debugging assistance from the storage or monitoring teams. Key questions we're wondering about (a rough sketch for pulling the alert definition follows the list):
- What is the PVC for?
- What does the alert mean?
- Where is the alert defined?
- What is the impact on monitoring?
- Why is the referenced PVC no longer present?
- Why are PVCs being created at all (they don't seem to be mounted in any pods)?
- Do you know of any PRs that merged in the last few days that may be causing this?
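To partially answer the "what does the alert mean / where is it defined" questions on our side, the alert rule itself can be pulled from the Prometheus rules API and inspected for its expression, duration, and source group. A rough sketch, reusing the same placeholder route/token as above and assuming that endpoint exposes /api/v1/rules:

    import requests

    PROM_URL = "https://thanos-querier-openshift-monitoring.apps.example.com"  # placeholder
    TOKEN = "sha256~REDACTED"  # placeholder

    # /api/v1/rules lists every rule group the query endpoint knows about,
    # including the alerting rules shipped by the monitoring operator.
    resp = requests.get(
        f"{PROM_URL}/api/v1/rules",
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,
    )
    resp.raise_for_status()

    for group in resp.json()["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("name") == "KubePersistentVolumeErrors":
                print("group:", group["name"], "file:", group["file"])
                print("expr:", rule["query"])
                print("for:", rule.get("duration"), "labels:", rule.get("labels"))

That would at least tell us which rule file/group the alert comes from and what expression it evaluates, which should narrow down which team owns it.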