OpenShift Bugs / OCPBUGS-28845
OLM continuously reconciles some CSVs with NeedsReinstall due to webhooks not installed


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version/s: 4.15
    • Component/s: OLM
    • Severity: Moderate

      Description of problem:

      Some Operator CSVs are continuously reconciled after being detected as NeedsReinstall. Reconciliation then fails because OLM attempts to create, rather than apply, resources that already exist. The result is rapidly flapping status in the dashboard, high CPU load on the OLM pod, and CsvAbnormalOver30Min and CsvAbnormalFailedOver2Min alerts firing for every affected operator, with reason NeedsReinstall or InstallComponentFailed (semi-randomly, depending on which phase of the loop OLM was in when the alert fired).
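
      For anyone triaging, the loop is visible directly from the CLI. A minimal sketch (the namespace and object kinds are the ones from this report; adjust for your cluster):

          # Watch affected CSVs cycle through Failed, Pending, and InstallReady
          $ oc get csv -n openshift-operators -w

          # The NeedsReinstall / InstallComponentFailed reasons surface as events
          $ oc get events -n openshift-operators \
              --field-selector involvedObject.kind=ClusterServiceVersion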

      Version-Release number of selected component (if applicable):

      OKD 4.15.0-0.okd-2024-01-27-070424

      How reproducible:

      Always, with certain operators. Since this is OKD, here are the results for the OLM-managed operators on my cluster (a command for checking whether a given CSV defines webhooks follows the list):

      - ArgoCD (OperatorHub.io catalog) **Not exhibiting**
      - DevWorkspace Operator (custom devworkspace catalog) **Exhibiting**
      - Eclipse Che (OKD Community Operators catalog) **Exhibiting**
      - Grafana Operator (OperatorHub.io catalog) **Not exhibiting**
      - KubeVirt Hyperconverged (OperatorHub.io catalog) **Exhibiting**
      - Crunchy Postgres (OperatorHub.io catalog) **Not exhibiting**
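
      The exhibiting/non-exhibiting split above appears to track whether the CSV declares webhooks. A quick check, assuming the v1alpha1 CSV schema's spec.webhookdefinitions field (non-empty output means the CSV defines webhooks):

          $ oc get csv -n openshift-operators eclipse-che.v7.80.0 \
              -o jsonpath='{.spec.webhookdefinitions[*].generateName}'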

      Steps to Reproduce:

          1. Install OKD or OpenShift 4.15
          2. Install operators from OperatorHub or via the OLM APIs (a sample Subscription sketch follows these steps)
          3. Install a mix of operators that have webhook definitions in their CSV and those that don't
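
      For step 2, a minimal Subscription sketch; the channel and catalog source names here are assumptions, so substitute whatever your catalog actually serves:

          $ cat <<EOF | oc apply -f -
          apiVersion: operators.coreos.com/v1alpha1
          kind: Subscription
          metadata:
            name: devworkspace-operator
            namespace: openshift-operators
          spec:
            channel: fast                        # assumed channel name
            name: devworkspace-operator
            source: community-operators          # assumed catalog source name
            sourceNamespace: openshift-marketplace
          EOF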
          

      Actual results:

      Some operators transition to Succeeded status and stay there while others loop continuously through Failed, Pending, and InstallReady.
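
      The flapping also shows up in the CSV's status conditions. For example (CSV name taken from this cluster; the jsonpath assumes the v1alpha1 status.conditions layout):

          $ oc get csv -n openshift-operators devworkspace-operator.v0.25.0 \
              -o jsonpath='{range .status.conditions[*]}{.lastTransitionTime}{"\t"}{.phase}{"\t"}{.reason}{"\n"}{end}'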

      Expected results:

      All operators install and OLM stops reconciling.

      Additional info:

      Since this is OKD, I'll just attach a must-gather here:
      https://s3.jharmison.com/public/must-gather-cleaned.tgz
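
      (If it helps to reproduce the collection, a must-gather of this kind is normally generated with the standard tooling, e.g. `oc adm must-gather --dest-dir=./must-gather`, and then archived for upload.)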

      This cluster is public, and I can let anyone log in and poke around if you think it would be helpful, or I can go collect any extra logs or anything else you'd like. I only know enough about OLM internals to be dangerous.

      Here is a snippet of the OLM logs during this time period:

      2024-01-31T21:07:33.737245535Z {"level":"error","ts":"2024-01-31T21:07:33Z","logger":"controllers.operator","msg":"Could not update Operator status","request":{"name":"eclipse-che.openshift-operators"},"error":"Operation cannot be fulfilled on operators.operators.coreos.com \"eclipse-che.openshift-operators\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators.(*OperatorReconciler).Reconcile\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/operator_controller.go:157\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226"}
      2024-01-31T21:07:33.946396443Z E0131 21:07:33.946362       1 queueinformer_operator.go:319] sync {"update" "openshift-operators/eclipse-che.v7.80.0"} failed: rolebindings.rbac.authorization.k8s.io "che-operator-service-auth-reader" already exists
      2024-01-31T21:07:34.219414659Z time="2024-01-31T21:07:34Z" level=info msg="scheduling ClusterServiceVersion for install" csv=devworkspace-operator.v0.25.0 id=ENH8l namespace=openshift-operators phase=Pending
      2024-01-31T21:07:34.219573625Z I0131 21:07:34.219544       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"devworkspace-operator.v0.25.0", UID:"4d42c03f-837b-4008-ad59-00fbb6f13c87", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"863852735", FieldPath:""}): type: 'Normal' reason: 'AllRequirementsMet' all requirements found, attempting install
      2024-01-31T21:07:35.062259438Z time="2024-01-31T21:07:35Z" level=warning msg="needs reinstall: webhooks not installed" csv=kubevirt-hyperconverged-operator.v1.10.1 id=ZGSEA namespace=kubevirt-hyperconverged phase=Failed strategy=deployment
      2024-01-31T21:07:35.062363864Z I0131 21:07:35.062286       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"kubevirt-hyperconverged", Name:"kubevirt-hyperconverged-operator.v1.10.1", UID:"7d9ddf57-8d63-4a8d-a20f-86a1884709aa", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"863852737", FieldPath:""}): type: 'Normal' reason: 'NeedsReinstall' webhooks not installed
      2024-01-31T21:07:35.121364568Z I0131 21:07:35.121328       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"eclipse-che.v7.80.0", UID:"efdefaa8-1ba4-4fb5-ae6e-05fc6c9a051a", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"863852772", FieldPath:""}): type: 'Normal' reason: 'NeedsReinstall' calculated deployment install is bad
      2024-01-31T21:07:35.683718848Z time="2024-01-31T21:07:35Z" level=warning msg="reusing existing cert devworkspace-controller-manager-service-cert"
      2024-01-31T21:07:35.793138602Z time="2024-01-31T21:07:35Z" level=warning msg="could not create auth reader role binding devworkspace-controller-manager-service-auth-reader"
      2024-01-31T21:07:35.793329438Z I0131 21:07:35.793304       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"devworkspace-operator.v0.25.0", UID:"4d42c03f-837b-4008-ad59-00fbb6f13c87", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"863852784", FieldPath:""}): type: 'Warning' reason: 'InstallComponentFailed' install strategy failed: rolebindings.rbac.authorization.k8s.io "devworkspace-controller-manager-service-auth-reader" already exists
      2024-01-31T21:07:35.793669069Z {"level":"error","ts":"2024-01-31T21:07:35Z","logger":"controllers.operator","msg":"Could not update Operator status","request":{"name":"devworkspace-operator.openshift-operators"},"error":"Operation cannot be fulfilled on operators.operators.coreos.com \"devworkspace-operator.openshift-operators\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators.(*OperatorReconciler).Reconcile\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/operator_controller.go:157\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226"}
      2024-01-31T21:07:35.961587314Z E0131 21:07:35.961556       1 queueinformer_operator.go:319] sync {"update" "openshift-operators/devworkspace-operator.v0.25.0"} failed: rolebindings.rbac.authorization.k8s.io "devworkspace-controller-manager-service-auth-reader" already exists
      2024-01-31T21:07:36.418866061Z time="2024-01-31T21:07:36Z" level=info msg="scheduling ClusterServiceVersion for install" csv=eclipse-che.v7.80.0 id=OijjL namespace=openshift-operators phase=Pending
      2024-01-31T21:07:36.418932412Z I0131 21:07:36.418903       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"eclipse-che.v7.80.0", UID:"efdefaa8-1ba4-4fb5-ae6e-05fc6c9a051a", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"863852808", FieldPath:""}): type: 'Normal' reason: 'AllRequirementsMet' all requirements found, attempting install
      2024-01-31T21:07:37.222110856Z time="2024-01-31T21:07:37Z" level=info msg="scheduling ClusterServiceVersion for install" csv=kubevirt-hyperconverged-operator.v1.10.1 id=K1PkY namespace=kubevirt-hyperconverged phase=Pending
      2024-01-31T21:07:37.222218355Z I0131 21:07:37.222184       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"kubevirt-hyperconverged", Name:"kubevirt-hyperconverged-operator.v1.10.1", UID:"7d9ddf57-8d63-4a8d-a20f-86a1884709aa", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"863852806", FieldPath:""}): type: 'Normal' reason: 'AllRequirementsMet' all requirements found, attempting install
      2024-01-31T21:07:37.530752857Z time="2024-01-31T21:07:37Z" level=warning msg="needs reinstall: missing deployment with name=devworkspace-controller-manager" csv=devworkspace-operator.v0.25.0 id=bJrRG namespace=openshift-operators phase=Failed strategy=deployment
      2024-01-31T21:07:37.530927320Z I0131 21:07:37.530899       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"devworkspace-operator.v0.25.0", UID:"4d42c03f-837b-4008-ad59-00fbb6f13c87", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"863852843", FieldPath:""}): type: 'Normal' reason: 'NeedsReinstall' installing: missing deployment with name=devworkspace-controller-manager
      2024-01-31T21:07:37.944687196Z time="2024-01-31T21:07:37Z" level=warning msg="reusing existing cert che-operator-service-cert"
      2024-01-31T21:07:38.057979414Z time="2024-01-31T21:07:38Z" level=warning msg="could not create auth reader role binding che-operator-service-auth-reader"
      2024-01-31T21:07:38.058233728Z I0131 21:07:38.058078       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"eclipse-che.v7.80.0", UID:"efdefaa8-1ba4-4fb5-ae6e-05fc6c9a051a", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"863852863", FieldPath:""}): type: 'Warning' reason: 'InstallComponentFailed' install strategy failed: rolebindings.rbac.authorization.k8s.io "che-operator-service-auth-reader" already exists
      2024-01-31T21:07:38.238593871Z E0131 21:07:38.238566       1 queueinformer_operator.go:319] sync {"update" "openshift-operators/eclipse-che.v7.80.0"} failed: rolebindings.rbac.authorization.k8s.io "che-operator-service-auth-reader" already exists
      2024-01-31T21:07:38.519712490Z time="2024-01-31T21:07:38Z" level=info msg="scheduling ClusterServiceVersion for install" csv=devworkspace-operator.v0.25.0 id=AMYTI namespace=openshift-operators phase=Pending
      2024-01-31T21:07:38.519875885Z I0131 21:07:38.519846       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"devworkspace-operator.v0.25.0", UID:"4d42c03f-837b-4008-ad59-00fbb6f13c87", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"863852895", FieldPath:""}): type: 'Normal' reason: 'AllRequirementsMet' all requirements found, attempting install
      2024-01-31T21:07:38.570067396Z time="2024-01-31T21:07:38Z" level=info msg="No api or webhook descs to add CA to"
      2024-01-31T21:07:38.626582597Z time="2024-01-31T21:07:38Z" level=warning msg="reusing existing cert hco-webhook-service-cert"
      2024-01-31T21:07:38.739072314Z time="2024-01-31T21:07:38Z" level=warning msg="could not create auth reader role binding hco-webhook-service-auth-reader"
      2024-01-31T21:07:38.739547134Z I0131 21:07:38.739508       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"kubevirt-hyperconverged", Name:"kubevirt-hyperconverged-operator.v1.10.1", UID:"7d9ddf57-8d63-4a8d-a20f-86a1884709aa", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"863852893", FieldPath:""}): type: 'Warning' reason: 'InstallComponentFailed' install strategy failed: rolebindings.rbac.authorization.k8s.io "hco-webhook-service-auth-reader" already exists
      2024-01-31T21:07:39.117949608Z I0131 21:07:39.117908       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"eclipse-che.v7.80.0", UID:"efdefaa8-1ba4-4fb5-ae6e-05fc6c9a051a", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"863852934", FieldPath:""}): type: 'Normal' reason: 'NeedsReinstall' calculated deployment install is bad
      2024-01-31T21:07:39.124653014Z E0131 21:07:39.124612       1 queueinformer_operator.go:319] sync {"update" "kubevirt-hyperconverged/kubevirt-hyperconverged-operator.v1.10.1"} failed: rolebindings.rbac.authorization.k8s.io "hco-webhook-service-auth-reader" already exists
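
      A snippet like the above can be pulled from the OLM pod directly (deployment and namespace names as on a stock OpenShift/OKD install):

          $ oc logs -n openshift-operator-lifecycle-manager deploy/olm-operator \
              | grep -E 'needs reinstall|could not create|already exists'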

        Assignee: Daniel Franz (rh-ee-dfranz)
        Reporter: James Harmison (jharmison) (Inactive)
        Votes: 6
        Watchers: 17