OCP Technical Release Team / TRT-55

Investigate regression in alerts firing on Azure: KubePersistentVolumeErrors


    • Type: Story
    • Priority: Critical
    • Resolution: Done

      TRT has identified a regression in this test, which is currently blocking payload promotion for BOTH 4.10 nightlies and 4.10 CI.

      The test failure message:

      alert KubePersistentVolumeErrors fired for 210 seconds with labels: {container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", persistentvolume="pvc-125ea951-e4b5-45d5-a096-559520654a9b", phase="Failed", service="kube-state-metrics", severity="warning"}
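      For context on what fired: this alert appears to come from the upstream kubernetes-mixin rules shipped with the cluster monitoring stack. A sketch of how to pull the in-cluster definition, plus the rough upstream expression (the exact rule carried in a given 4.10 payload may differ):

      # Dump the in-cluster rule (exact resource layout in openshift-monitoring is an assumption here):
      oc get prometheusrules -n openshift-monitoring -o yaml | grep -B2 -A8 'alert: KubePersistentVolumeErrors'

      # Upstream kubernetes-mixin defines the alert roughly as:
      #   expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0
      #   for: 5m
      # i.e. it fires when kube-state-metrics reports a PersistentVolume stuck in the
      # Failed (or Pending) phase, which matches the phase="Failed" label above.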
      

      It appears to always be this same container: kube-rbac-proxy-main (note this is a label on the kube-state-metrics scrape target, not necessarily a pod consuming the PV).

      A sample prow job failing the nightly payload:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade/1460468178256662528

      Testgrid indicates we started tanking yesterday (Nov 15): https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade&show-stale-tests=

      A sample prow job failing the CI payload:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1460525680675524608

      Testgrid also shows the problem may have started yesterday: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&show-stale-tests=

      CI search indicates the rate at which this is occurring has picked up over the last two days as well:
      https://search.ci.openshift.org/?search=KubePersistentVolumeErrors&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

      TRT needs debugging assistance from the storage or monitoring teams. Key questions we're wondering (a rough debugging sketch follows the list):

      1. What is the PVC for?
      2. What does the alert mean?
      3. Where is the alert defined?
      4. What is the impact on monitoring?
      5. Why is the referenced PVC no longer present?
      6. Why are there PVCs created at all? (They don't seem to be mounted in any pods.)
      7. Do you know of any PRs that merged in the last few days that may be causing this?
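
      A minimal debugging sketch for whoever picks this up (the PV name is copied from the sample failure above and will differ per run; commands are standard oc):

      # Inspect the failed PV and look for the PVC that spawned it
      # (dynamically provisioned PVs are named pvc-<uid of the claim>):
      oc get pv pvc-125ea951-e4b5-45d5-a096-559520654a9b -o yaml
      oc describe pv pvc-125ea951-e4b5-45d5-a096-559520654a9b
      oc get pvc --all-namespaces | grep 125ea951

      # List any PVs currently sitting in a Failed phase:
      oc get pv | grep -i failed

      # Or query the metric the alert is built on (via the console metrics UI or thanos-querier):
      #   kube_persistentvolume_status_phase{phase="Failed"} > 0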

              Assignee: Unassigned
              Reporter: Devan Goodwin (rhn-engineering-dgoodwin)