  1. OpenShift Bugs
  2. OCPBUGS-6771

Operator installation/upgrade fails with "Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline"


    • Moderate
    • Grumpy 241, Happy 242, INKEY$ (OPRUN 243)
    • 3
    • Rejected
    • False
      * Previously, if an Operator installation or upgrade took longer than 10 minutes, the operation could fail with the following error:
      +
      [source,text]
      ====
      Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline
      ====
      +
      This issue occurred because Operator Lifecycle Manager (OLM) had a bundle unpacking job that was configured with a timeout of 600 seconds. Bundle unpack jobs could fail due to network or configuration issues in the cluster that might be transient or resolved with user intervention. With this bug fix, OLM automates the recreation of failed unpack jobs indefinitely by default.
      +
      This update also adds the optional `operatorframework.io/bundle-unpack-min-retry-interval` annotation for Operator groups to configure a minimum interval to wait before attempting to recreate the failed job. (link:https://issues.redhat.com/browse/OCPBUGS-6771[*OCPBUGS-6771*])
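
As a rough illustration of the annotation described above, the following Python sketch builds an OperatorGroup manifest that carries `operatorframework.io/bundle-unpack-min-retry-interval`. The OperatorGroup name `example-og`, the target namespace, and the `5m` value are placeholders chosen for this example; in particular, the duration format is an assumption and should be checked against the OLM documentation for your release.

```python
import json

# Annotation key quoted from the release note text.
RETRY_ANNOTATION = "operatorframework.io/bundle-unpack-min-retry-interval"


def operator_group_manifest(name, namespace, min_retry_interval):
    """Return an OperatorGroup manifest (as a dict) with the retry annotation set."""
    return {
        "apiVersion": "operators.coreos.com/v1",
        "kind": "OperatorGroup",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {RETRY_ANNOTATION: min_retry_interval},
        },
        "spec": {"targetNamespaces": [namespace]},
    }


# Hypothetical values: "example-og" and "5m" are illustrative only.
manifest = operator_group_manifest("example-og", "openshift-logging", "5m")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `oc apply -f <file>`, after replacing the placeholder values with ones that match the cluster.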
    • Bug Fix
    • Done

      Description of problem:

      Operator installation/upgrade fails stating: "Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline"

      Version-Release number of selected component (if applicable):

      4.10

      How reproducible:

      oc -n openshift-marketplace get job 14359dfdd866df54d278e75b42202a5af9ce0cefdf416216dd11e09e4dec25e -o yaml
      apiVersion: batch/v1
      kind: Job
      metadata:
        creationTimestamp: "2022-08-04T12:54:19Z"
        generation: 1
        labels:
          controller-uid: e236f157-ab03-4153-b095-b6b1a97ef3c8
          job-name: 14359dfdd866df54d278e75b42202a5af9ce0cefdf416216dd11e09e4dec25e
        name: 14359dfdd866df54d278e75b42202a5af9ce0cefdf416216dd11e09e4dec25e
        namespace: openshift-marketplace
        ownerReferences:
        - apiVersion: v1
          blockOwnerDeletion: false
          controller: false
          kind: ConfigMap
          name: 14359dfdd866df54d278e75b42202a5af9ce0cefdf416216dd11e09e4dec25e
          uid: 2d6d332d-e680-4828-b97f-e6024b34575b
        resourceVersion: "1299311475"
        uid: e236f157-ab03-4153-b095-b6b1a97ef3c8
      spec:
        activeDeadlineSeconds: 600
        backoffLimit: 3
        completionMode: NonIndexed
        completions: 1
        parallelism: 1
        selector:
          matchLabels:
            controller-uid: e236f157-ab03-4153-b095-b6b1a97ef3c8
        suspend: false
        template:
          metadata:
            creationTimestamp: null
            labels:
              controller-uid: e236f157-ab03-4153-b095-b6b1a97ef3c8
              job-name: 14359dfdd866df54d278e75b42202a5af9ce0cefdf416216dd11e09e4dec25e
            name: 14359dfdd866df54d278e75b42202a5af9ce0cefdf416216dd11e09e4dec25e
          spec:
            containers:
            - command:
              - opm
              - alpha
              - bundle
              - extract
              - -m
              - /bundle/
              - -n
              - openshift-marketplace
              - -c
              - 14359dfdd866df54d278e75b42202a5af9ce0cefdf416216dd11e09e4dec25e
              - -z
              env:
              - name: CONTAINER_IMAGE
                value: registry.redhat.io/openshift-logging/cluster-logging-operator-bundle@sha256:d19c4b7b67a70b46b6b3ac43b2f285cc19c52f2795c8dfbea4315bd06e7485ca
              image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8de7a35f7ca26e678b8e3d8bf5fa6aa80b84287413247dc031a785d0d139698c
              imagePullPolicy: IfNotPresent
              name: extract
              resources:
                requests:
                  cpu: 10m
                  memory: 50Mi
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
              - mountPath: /bundle
                name: bundle
            dnsPolicy: ClusterFirst
            initContainers:
            - command:
              - /bin/cp
              - -Rv
              - /bin/cpb
              - /util/cpb
              image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cc477d763835d8c874b050223261dde5bcd73429f0cb55aa7f7cde3df892ce0f
              imagePullPolicy: IfNotPresent
              name: util
              resources:
                requests:
                  cpu: 10m
                  memory: 50Mi
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
              - mountPath: /util
                name: util
            - command:
              - /util/cpb
              - /bundle
              image: registry.redhat.io/openshift-logging/cluster-logging-operator-bundle@sha256:d19c4b7b67a70b46b6b3ac43b2f285cc19c52f2795c8dfbea4315bd06e7485ca
              imagePullPolicy: Always
              name: pull
              resources:
                requests:
                  cpu: 10m
                  memory: 50Mi
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
              - mountPath: /bundle
                name: bundle
              - mountPath: /util
                name: util
            restartPolicy: Never
            schedulerName: default-scheduler
            securityContext: {}
            terminationGracePeriodSeconds: 30
            volumes:
            - emptyDir: {}
              name: bundle
            - emptyDir: {}
              name: util
      status:
        conditions:
        - lastProbeTime: "2022-08-04T13:04:19Z"
          lastTransitionTime: "2022-08-04T13:04:19Z"
          message: Job was active longer than specified deadline
          reason: DeadlineExceeded
          status: "True"
          type: Failed
        failed: 1
        startTime: "2022-08-04T12:54:19Z"
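
The timestamps in the Job status above can be checked against `activeDeadlineSeconds`. This is a small Python sketch of that arithmetic, using only values copied from the output; the helper name `active_seconds` is made up for this example.

```python
from datetime import datetime, timezone

# Timestamp layout used in the Job status (UTC, "Z" suffix).
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def active_seconds(start_time, failed_time):
    """Seconds between the Job start and the transition to Failed."""
    start = datetime.strptime(start_time, TIME_FORMAT).replace(tzinfo=timezone.utc)
    failed = datetime.strptime(failed_time, TIME_FORMAT).replace(tzinfo=timezone.utc)
    return int((failed - start).total_seconds())


# Values taken from status.startTime, status.conditions[0].lastTransitionTime,
# and spec.activeDeadlineSeconds of the Job.
elapsed = active_seconds("2022-08-04T12:54:19Z", "2022-08-04T13:04:19Z")
deadline = 600
print(elapsed, elapsed >= deadline)  # prints: 600 True
```

The Job was marked Failed exactly 600 seconds after it started, which matches the configured deadline.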
      
      
      oc -n openshift-logging get installplan install-qzrfp -o yaml
      apiVersion: operators.coreos.com/v1alpha1
      kind: InstallPlan
      metadata:
        creationTimestamp: "2022-08-04T12:54:19Z"
        generateName: install-
        generation: 1
        labels:
          operators.coreos.com/cluster-logging.openshift-logging: ""
        name: install-qzrfp
        namespace: openshift-logging
        ownerReferences:
        - apiVersion: operators.coreos.com/v1alpha1
          blockOwnerDeletion: false
          controller: false
          kind: Subscription
          name: cluster-logging-subscription
          uid: 48580ca3-bd57-449e-84ec-84efc8c8035d
        resourceVersion: "1299311512"
        uid: cd93ba60-b8db-448f-9239-1c8b15059eef
      spec:
        approval: Automatic
        approved: true
        clusterServiceVersionNames:
        - cluster-logging.5.4.4
        generation: 26
      status:
        bundleLookups:
        - catalogSourceRef:
            name: redhat-operators
            namespace: openshift-marketplace
          conditions:
          - message: bundle contents have not yet been persisted to installplan status
            reason: BundleNotUnpacked
            status: "True"
            type: BundleLookupNotPersisted
          - lastTransitionTime: "2022-08-04T12:54:19Z"
            message: 'unpack job not completed: Unpack pod(openshift-marketplace/14359dfdd866df54d278e75b42202a5af9ce0cefdf416216dd11e09e4d5l7rv)
              container(pull) is pending. Reason: ImagePullBackOff, Message: Back-off pulling
              image "registry.redhat.io/openshift-logging/cluster-logging-operator-bundle@sha256:d19c4b7b67a70b46b6b3ac43b2f285cc19c52f2795c8dfbea4315bd06e7485ca"'
            reason: JobIncomplete
            status: "True"
            type: BundleLookupPending
          - lastTransitionTime: "2022-08-04T13:04:20Z"
            message: Job was active longer than specified deadline
            reason: DeadlineExceeded
            status: "True"
            type: BundleLookupFailed
          identifier: cluster-logging.5.4.4
          path: registry.redhat.io/openshift-logging/cluster-logging-operator-bundle@sha256:d19c4b7b67a70b46b6b3ac43b2f285cc19c52f2795c8dfbea4315bd06e7485ca
          properties: '{"properties":[{"type":"olm.package","value":{"packageName":"cluster-logging","version":"5.4.4"}},{"type":"olm.maxOpenShiftVersion","value":"4.11"},{"type":"olm.gvk","value":{"group":"logging.openshift.io","kind":"ClusterLogForwarder","version":"v1"}},{"type":"olm.gvk","value":{"group":"logging.openshift.io","kind":"ClusterLogging","version":"v1"}}]}'
          replaces: cluster-logging.5.4.3
        catalogSources: []
        conditions:
        - lastTransitionTime: "2022-08-04T13:04:20Z"
          lastUpdateTime: "2022-08-04T13:04:20Z"
          message: 'Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job
            was active longer than specified deadline'
          reason: InstallCheckFailed
          status: "False"
          type: Installed
        phase: Failed
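
When many clusters or namespaces need to be checked, the failed lookups can be picked out of the InstallPlan status programmatically. The following Python sketch assumes the InstallPlan has been loaded as a dictionary, for example from `oc get installplan <name> -o json`; the function name `failed_bundle_lookups` and the trimmed sample data are illustrative only.

```python
def failed_bundle_lookups(install_plan):
    """Return (identifier, reason, message) for each failed bundle lookup."""
    failures = []
    for lookup in install_plan.get("status", {}).get("bundleLookups", []):
        for cond in lookup.get("conditions", []):
            if cond.get("type") == "BundleLookupFailed" and cond.get("status") == "True":
                failures.append(
                    (lookup.get("identifier"), cond.get("reason"), cond.get("message"))
                )
    return failures


# Trimmed-down copy of the InstallPlan status shown above.
sample = {
    "status": {
        "bundleLookups": [
            {
                "identifier": "cluster-logging.5.4.4",
                "conditions": [
                    {"type": "BundleLookupPending", "status": "True",
                     "reason": "JobIncomplete"},
                    {"type": "BundleLookupFailed", "status": "True",
                     "reason": "DeadlineExceeded",
                     "message": "Job was active longer than specified deadline"},
                ],
            }
        ],
        "phase": "Failed",
    }
}

for identifier, reason, message in failed_bundle_lookups(sample):
    print(f"{identifier}: {reason} ({message})")
```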
      
      The solution from https://access.redhat.com/solutions/6459071 works and eventually lets the Operator upgrade complete. However, applying it by hand is tedious when more than 10 OpenShift Container Platform 4 clusters are affected. We therefore request further investigation of the root cause and a more robust overall process.

      Steps to Reproduce:

      Seen often when upgrading Operators

      Actual results:

      The Operator upgrade fails, and the steps from https://access.redhat.com/solutions/6459071 need to be applied to resume and eventually complete the upgrade.
      

      Expected results:

      The Operator upgrade should complete as expected without hitting this problem, even under certain resource or networking constraints. The timeout should be large enough to cope with many different situations and conditions; otherwise, the failure should report what is causing the problem.

      Additional info:

      https://access.redhat.com/solutions/6459071
      
      More than 100 cases have used the above article to resolve this issue, and a large number of people are affected.

            ankithom Ankita Thomas
            rhn-support-jkaur Jaspreet Kaur (Inactive)
            Xia Zhao Xia Zhao
            Alex Dellapenta Alex Dellapenta
            Daniel Messer