  1. OpenShift Pipelines
  2. SRVKP-4427

remote task resolution does not retry on transient errors


    • This fix addresses the lack of retries on transient Kubernetes errors during remote resolution of tasks and pipelines.
    • Bug Fix
    • Pipelines Sprint Pioneers 2, Pipelines Sprint Pioneers 3, Pipelines Sprint Pioneers 4, Pipelines Sprint Pioneers 5, Pipelines Sprint Pioneers 6, Pipelines Sprint Pioneers 7, Pipelines Sprint Pioneers 8
    • Important

      A few Slack threads exist; the most active is https://redhat-internal.slack.com/archives/C04PZ7H0VA8/p1713252235282299

       

      On both sides of remote resolution (the core controller and the resolver), Kubernetes errors that are typically transient were being treated as permanent Knative errors. No further reconcile attempts were made, which led to avoidable failures.

      I've been collaborating with rh-ee-kbaig and sashture from OpenShift Pipelines.

      We have upstream PRs https://github.com/tektoncd/pipeline/pull/7894 and https://github.com/tektoncd/pipeline/pull/7893 up for this.

      The core server-side logging also does not account for bundle-based task names correctly. If we can sort out that fix as part of our changes, we will; otherwise, we'll open something separate for it.

      An example log snippet from the core controller
       Pipeline rh-acs-tenant/operator-on-pull-request-bwqxj can't be Run; it contains Tasks that don't exist: Couldn't retrieve Task "": retryable error validating referenced object source-build: Internal error occurred: failed calling webhook "validation.webhook.pipeline.tekton.dev": failed to call webhook: Post "https://tekton-pipelines-webhook.openshift-pipelines.svc:443/resource-validation?timeout=10s": context deadline exceeded
       
      Accompanying log snippet from the resolver
       {"level":"error","ts":"2024-04-17T10:50:05.866Z","logger":"controller","caller":"controller/controller.go:566","msg":"Reconcile error","commit":"f0a1d64","knative.dev/traceid":"b893d6a6-2eb7-4a53-b502-1348803a7085","knative.dev/key":"rh-acs-tenant/bundles-780a1fe396cb0f8c702b34e9289fc770","duration":"10.3628985s","error":"error updating resource request \"rh-acs-tenant/bundles-780a1fe396cb0f8c702b34e9289fc770\" with data: Internal error occurred: failed calling webhook \"webhook.pipeline.tekton.dev\": failed to call webhook: Post \"https://tekton-pipelines-webhook.openshift-pipelines.svc:443/defaulting?timeout=10s\": context deadline exceeded","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\t/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\t/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\t/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:491"}

              gmontero@redhat.com Gabe Montero
              dbaez@redhat.com Danny Baez
              Khurram Baig, Savita .