OCPBUGS-54951: Multiple pods exiting an excessive amount of times on techpreview serial


    • Severity: Important
    • Sprint: MCO Sprint 269, MCO Sprint 270
    • Release Note Type: Release Note Not Required
    • Status: In Progress

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [sig-architecture] platform pods in ns/openshift-cluster-node-tuning-operator should not exit an excessive amount of times

      Extreme regression detected.
      Fisher's Exact probability of a regression: 100.00%.
      Test pass rate dropped from 100.00% to 78.95%.

      Sample (being evaluated) Release: 4.19
      Start Time: 2025-04-07T00:00:00Z
      End Time: 2025-04-14T08:00:00Z
      Success Rate: 78.95%
      Successes: 15
      Failures: 4
      Flakes: 0

      Base (historical) Release: 4.18
      Start Time: 2025-03-15T00:00:00Z
      End Time: 2025-04-14T08:00:00Z
      Success Rate: 100.00%
      Successes: 63
      Failures: 0
      Flakes: 0
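
      For reference, the regression probability can be reproduced approximately from the counts above with a standard Fisher's Exact test. A minimal sketch in Python, assuming scipy is available; the exact Component Readiness formulation may differ:

      from scipy.stats import fisher_exact

      # Contingency table built from the counts above.
      # rows: [4.19 sample, 4.18 base]; columns: [successes, failures]
      table = [[15, 4],
               [63, 0]]

      # alternative="less" asks whether the sample's odds of success are lower
      # than the base's, i.e. whether the pass rate regressed.
      _, p_value = fisher_exact(table, alternative="less")

      print(f"p-value: {p_value:.6f}")
      # One plausible reading of the reported "probability of a regression" is 1 - p.
      print(f"1 - p: {(1 - p_value) * 100:.2f}%")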

      View the test details report for additional context.

      This is just one report of many affected pods.

      [sig-architecture] platform pods in ns/openshift-cluster-csi-drivers should not exit an excessive amount of times
      [sig-architecture] platform pods in ns/openshift-cluster-node-tuning-operator should not exit an excessive amount of times
      [sig-architecture] platform pods in ns/openshift-dns should not exit an excessive amount of times
      [sig-architecture] platform pods in ns/openshift-e2e-loki should not exit an excessive amount of times
      [sig-architecture] platform pods in ns/openshift-image-registry should not exit an excessive amount of times
      [sig-architecture] platform pods in ns/openshift-ingress-canary should not exit an excessive amount of times
      [sig-architecture] platform pods in ns/openshift-insights should not exit an excessive amount of times
      [sig-architecture] platform pods in ns/openshift-machine-config-operator should not exit an excessive amount of times
      [sig-architecture] platform pods in ns/openshift-monitoring should not exit an excessive amount of times
      [sig-architecture] platform pods in ns/openshift-multus should not exit an excessive amount of times
      [sig-architecture] platform pods in ns/openshift-network-operator should not exit an excessive amount of times
      [sig-architecture] platform pods in ns/openshift-ovn-kubernetes should not exit an excessive amount of times

      Viewing the main 4.19 board, opening the regressed tests table (top right), and filtering on "excessive" shows these failures. They are all techpreview serial jobs.

      Each failed test reports:

      namespace/openshift-cluster-csi-drivers node/ip-10-0-118-86.us-west-1.compute.internal pod/aws-ebs-csi-driver-node-ltqf7 uid/8f00a4fb-a131-44dd-9889-d86fcfd4fd12 container/csi-driver restarted 4 times at:
      non-zero exit at 2025-04-13 16:27:13.364043829 +0000 UTC m=+5548.621212562: cause/ContainerStatusUnknown code/137 reason/ContainerExit The container could not be located when the pod was deleted.  The container used to be Running
      non-zero exit at 2025-04-13 16:30:11.801567727 +0000 UTC m=+5727.058736510: cause/ContainerStatusUnknown code/137 reason/ContainerExit The container could not be located when the pod was deleted.  The container used to be Running
      non-zero exit at 2025-04-13 17:11:10.301761644 +0000 UTC m=+8185.558930387: cause/ContainerStatusUnknown code/137 reason/ContainerExit The container could not be located when the pod was deleted.  The container used to be Running
      non-zero exit at 2025-04-13 17:14:20.099871246 +0000 UTC m=+8375.357039979: cause/ContainerStatusUnknown code/137 reason/ContainerExit The container could not be located when the pod was deleted.  The container used to be Running
      
      namespace/openshift-cluster-csi-drivers node/ip-10-0-118-86.us-west-1.compute.internal pod/aws-ebs-csi-driver-node-ltqf7 uid/8f00a4fb-a131-44dd-9889-d86fcfd4fd12 container/csi-liveness-probe restarted 4 times at:
      non-zero exit at 2025-04-13 16:27:13.364046359 +0000 UTC m=+5548.621215092: cause/ContainerStatusUnknown code/137 reason/ContainerExit The container could not be located when the pod was deleted.  The container used to be Running
      non-zero exit at 2025-04-13 16:30:11.801570427 +0000 UTC m=+5727.058739160: cause/ContainerStatusUnknown code/137 reason/ContainerExit The container could not be located when the pod was deleted.  The container used to be Running
      non-zero exit at 2025-04-13 17:11:10.301764824 +0000 UTC m=+8185.558933557: cause/ContainerStatusUnknown code/137 reason/ContainerExit The container could not be located when the pod was deleted.  The container used to be Running
      non-zero exit at 2025-04-13 17:14:20.099873196 +0000 UTC m=+8375.357041929: cause/ContainerStatusUnknown code/137 reason/ContainerExit The container could not be located when the pod was deleted.  The container used to be Running
      

      It is unclear why these containers are restarting. The first failure was April 12 at 9:16pm UTC; since then, 4 of 7 runs have hit this.
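
      To check whether the same containers keep exiting outside of an e2e run, one quick way is to dump each pod's last terminated state. A minimal sketch using the kubernetes Python client, assuming kubeconfig access to an affected cluster (namespace taken from the report above):

      from kubernetes import client, config

      config.load_kube_config()  # or config.load_incluster_config() from inside the cluster
      v1 = client.CoreV1Api()

      namespace = "openshift-cluster-csi-drivers"  # any namespace from the list above
      for pod in v1.list_namespaced_pod(namespace).items:
          for cs in (pod.status.container_statuses or []):
              term = cs.last_state.terminated if cs.last_state else None
              if cs.restart_count and term:
                  print(f"{pod.metadata.name}/{cs.name}: restarts={cs.restart_count} "
                        f"exit={term.exit_code} reason={term.reason} finished={term.finished_at}")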

        rh-ee-ijanssen Isabella Janssen
        rhn-engineering-dgoodwin Devan Goodwin
        Prachiti Talgulkar
        Votes: 1
        Watchers: 11
