  1. OpenShift Pipelines
  2. SRVKP-8365

Business-critical processes are stuck due to failed pipeline runs

      Before this change, when a TaskRun failed to create a PVC because a Resource Quota either had a conflicting concurrent update or was exhausted, the TaskRun was marked as failed. After this change, the TaskRun instead remains in the pending state and the PVC creation is retried.
    • Bug Fix
      The customer set disableCopiedCSV, ran etcd defragmentation, removed orphaned machineconfigs, and updated Pipelines to 1.19; none of these steps improved the situation.

      The customer is running PipelineRuns with multiple volumeClaimTemplate entries that use the same StorageClass, with a ClusterResourceQuota applied to that StorageClass.


      Form Initiator: anowak@redhat.com

      Customer Name: SUVA

      Business Impact:

      • The pipeline failures are disrupting core business operations. A critical nightly data processing job, which prepares worklogs for the following day, is failing approximately once per day.
      • This failure directly blocks up to 500 employees from performing their duties, leading to significant financial costs in terms of idle staff.
      • End customers are also impacted through potential service unavailability, false-positive alerts from failed E2E tests, and the inconvenience of having to manually restart failed builds.
      • SUVA is the Swiss National Accident Insurance Fund, so these failures delay the processing of accident insurance claims for the majority of Swiss citizens.

      Escalation Ticket: https://access.redhat.com/watchlist/internal/watchlist/87121

      Description:

      Running a pipeline in OpenShift Pipelines with multiple PVCs results in an error that causes the PipelineRun to fail.
      Here is an example log entry from a failed PipelineRun:

      {"severity":"info","timestamp":"2025-05-05T11:45:06.307Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:377","message":"Event(v1.ObjectReference{Kind:\"PipelineRun\", Namespace:\"ssp-e2e-checker-prod\", Name:\"ssp-e2e-frontend-checker-continuous-test-89hmw\", UID:\"9e6c25f1-b8ad-455c-954c-20c9d318c265\", APIVersion:\"tekton.dev/v1\", ResourceVersion:\"1711766705\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' 1 error occurred:\n\t* PVC creation error: failed to create PVC pvc-7587279d4e: Operation cannot be fulfilled on clusterresourcequotas.quota.openshift.io \"suva-csi-request-storage\": the object has been modified; please apply your changes to the latest version and try again\n\n","commit":"c6d38c9d267c4776bbe8ee68af59f47dba5f7a07"}
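To find how often this error occurs, controller log lines like the one above can be scanned for the quota conflict message. The following is a minimal sketch, assuming the controller emits one JSON object per line with `timestamp` and `message` fields as in the quoted entry. The regular expressions are derived from that single log entry only, and the function name `find_quota_conflict` is hypothetical.

```python
import json
import re

# Assumed pattern, derived from the single log message quoted above; other
# controller versions may word the error differently.
CONFLICT_RE = re.compile(
    r"failed to create PVC (?P<pvc>\S+): Operation cannot be fulfilled on "
    r'clusterresourcequotas\.quota\.openshift\.io "(?P<quota>[^"]+)": '
    r"the object has been modified"
)
RUN_RE = re.compile(
    r'Kind:"PipelineRun", Namespace:"(?P<ns>[^"]+)", Name:"(?P<name>[^"]+)"'
)


def find_quota_conflict(log_line):
    """Return details of a PVC/quota conflict event, or None if the line is not one."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(entry, dict):
        return None
    message = entry.get("message", "")
    conflict = CONFLICT_RE.search(message)
    if conflict is None:
        return None
    run = RUN_RE.search(message)
    return {
        "timestamp": entry.get("timestamp"),
        "namespace": run.group("ns") if run else None,
        "pipelinerun": run.group("name") if run else None,
        "pvc": conflict.group("pvc"),
        "quota": conflict.group("quota"),
    }
```

Applied to each line of the controller log, this yields the affected PipelineRun, PVC, and quota per occurrence, which can help confirm the reported failure rate.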
      

      After this failure, subsequent pipelines that depend on the failed one are stuck, and employees cannot proceed with processing insurance claims.

      There is no workaround for this issue, so a quick fix for the bug is requested.

      Engineering assumed that this issue might be resolved by SRVKP-7593, which shipped with OpenShift Pipelines 1.19, but this is not the case. Waiting another couple of months for a fix is not acceptable to the customer, because repairing the failed pipeline runs requires significant effort.
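The behavior described in the release note for this issue (retry the PVC creation and keep the run pending instead of failing it) can be illustrated with a small sketch. This is not the Tekton controller's code, which is written in Go. The exception classes, the function name `create_pvc_with_retry`, and the backoff parameters are all hypothetical and chosen only for illustration.

```python
import time


class QuotaConflict(Exception):
    """Hypothetical stand-in for a conflict caused by a concurrent quota update."""


class QuotaExceeded(Exception):
    """Hypothetical stand-in for a rejection because the quota is exhausted."""


def create_pvc_with_retry(create_pvc, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call create_pvc() until it succeeds or max_attempts is reached.

    Returns ("created", attempts) on success and ("pending", attempts) if every
    attempt hit a retryable quota error, so the caller can requeue the run
    instead of marking it failed. Any other exception propagates unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            create_pvc()
            return ("created", attempt)
        except (QuotaConflict, QuotaExceeded):
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    return ("pending", max_attempts)
```

The key design point is that quota conflicts and quota exhaustion are treated as transient: the caller receives a "pending" result rather than an error, while unrelated failures still surface immediately.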

              vdemeest Vincent Demeester
              priority-request Priority Request