OpenShift Bugs / OCPBUGS-2030

Draining nodes may lead to short webhook service interruptions


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • Affects Version/s: 4.11
    • Component/s: OLM
    • Quality / Stability / Reliability
    • Rejected

      Description of problem:

      This follows up on https://bugzilla.redhat.com/show_bug.cgi?id=2083396 now that we have fully identified its root cause.
      
      OK, this is fully reproducible and it's basically a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2048441
      
      Deleting the passive virt-operator pod causes its deployment to flicker a bit:
      
      oc delete pods -n openshift-cnv virt-operator-6f9c546f6d-4z7j7
      
      
      stirabos@t14s:~$ oc get deployment -n openshift-cnv virt-operator --watch
      NAME            READY   UP-TO-DATE   AVAILABLE   AGE
      virt-operator   2/2     2            2           69m
      virt-operator   1/2     1            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   2/2     2            2           74m
      
      and this is clearly detected by the OLM:
      stirabos@t14s:~$ oc logs --follow -n openshift-operator-lifecycle-manager  $(oc get pods -n openshift-operator-lifecycle-manager -lapp=olm-operator -o name)
      ...
      time="2022-09-02T14:11:57Z" level=warning msg="unhealthy component: waiting for deployment virt-operator to become ready: deployment \"virt-operator\" not available: Deployment does not have minimum availability." csv=kubevirt-hyperconverged-operator.v4.10.4 id=vvjS6 namespace=openshift-cnv phase=Succeeded strategy=deployment
      ...
      I0902 14:12:00.197421       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.10.4", UID:"d6f3c907-7df5-4c56-ab08-75e5e504d869", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"135920", FieldPath:""}): type: 'Normal' reason: 'NeedsReinstall' installing: waiting for deployment virt-operator to become ready: deployment "virt-operator" not available: Deployment does not have minimum availability.
      ...
      time="2022-09-02T14:12:00Z" level=info msg="scheduling ClusterServiceVersion for install" csv=kubevirt-hyperconverged-operator.v4.10.4 id=eb1Cv namespace=openshift-cnv phase=Pending
      I0902 14:12:00.927419       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.10.4", UID:"d6f3c907-7df5-4c56-ab08-75e5e504d869", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"135971", FieldPath:""}): type: 'Normal' reason: 'AllRequirementsMet' all requirements found, attempting install
      time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
      time="2022-09-02T14:12:01Z" level=warning msg="reusing existing cert hco-webhook-service-cert"
      time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
      time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
      time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
      ...
      I0902 14:12:07.150972       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.10.4", UID:"d6f3c907-7df5-4c56-ab08-75e5e504d869", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"136025", FieldPath:""}): type: 'Normal' reason: 'InstallSucceeded' waiting for install components to report healthy
      time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"
      time="2022-09-02T14:12:07Z" level=warning msg="reusing existing cert hco-webhook-service-cert"
      time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"
      time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"
      time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"
      
      Please notice that, as per https://bugzilla.redhat.com/show_bug.cgi?id=2048441#c24, each time OLM tries to reconcile an existing service it deletes and recreates it.
      See: https://github.com/operator-framework/operator-lifecycle-manager/blob/0fa6f2930dfd00c43e8e99c821f73f392da26378/pkg/controller/install/certresources.go#L255-L274
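
      For context, the linked lines follow roughly the pattern below. This is a simplified sketch written against plain client-go, not the actual OLM source; the function name and signature are made up for illustration.

      // Sketch of the reconcile pattern in the linked certresources.go lines:
      // when the Service already exists it is not updated in place; it is
      // deleted and then created again, so for a short window the webhook has
      // no backing Service.
      package sketch

      import (
          "context"
          "fmt"

          corev1 "k8s.io/api/core/v1"
          apierrors "k8s.io/apimachinery/pkg/api/errors"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
      )

      func reconcileWebhookService(ctx context.Context, c kubernetes.Interface, desired *corev1.Service) error {
          existing, err := c.CoreV1().Services(desired.Namespace).Get(ctx, desired.Name, metav1.GetOptions{})
          if err == nil {
              // The Service is already there: carry over its owner references ...
              desired.SetOwnerReferences(existing.GetOwnerReferences())
              // ... then delete it; its ClusterIP and endpoints disappear here.
              if err := c.CoreV1().Services(desired.Namespace).Delete(ctx, desired.Name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
                  return fmt.Errorf("could not delete existing service %s: %w", desired.Name, err)
              }
          } else if !apierrors.IsNotFound(err) {
              return err
          }
          // Create the Service again from scratch: between Delete and Create the
          // webhook service simply does not exist, hence the short disruption.
          _, err = c.CoreV1().Services(desired.Namespace).Create(ctx, desired, metav1.CreateOptions{})
          return err
      }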
      
      So:
      1. deleting one of the replicas of virt-operator will cause the virt-operator deployment to report ready=1/2
      2. this should be handled by the deployment controller alone, but it still triggers a reconciliation loop in OLM that first marks the CSV phase=Pending and then InstallSucceeded again at the end.
      3. we see `reusing existing cert hco-webhook-service-cert` twice in the logs, so we are sure that OLM executed installCertRequirementsForDeployment twice, and we know that this deletes and recreates the service (twice); one way to confirm this is sketched right after this list
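
      One way to confirm point 3 from the outside is to compare the Service's UID before and after deleting a virt-operator replica: a patch/update keeps the UID, while delete-and-recreate produces a new one. A minimal sketch, assuming a standard local kubeconfig:

      package main

      import (
          "context"
          "fmt"
          "os"
          "path/filepath"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Build a client from the default kubeconfig (assumption: $HOME/.kube/config).
          cfg, err := clientcmd.BuildConfigFromFlags("", filepath.Join(os.Getenv("HOME"), ".kube", "config"))
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          // Run this once before and once after deleting a virt-operator replica:
          // a changed UID means the Service object was deleted and recreated.
          svc, err := client.CoreV1().Services("openshift-cnv").Get(context.TODO(), "hco-webhook-service", metav1.GetOptions{})
          if err != nil {
              panic(err)
          }
          fmt.Printf("uid=%s created=%s\n", svc.UID, svc.CreationTimestamp)
      }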
      
      Please notice that this is true for all of the OLM-managed services, and we have 4 of them:
      stirabos@t14s:~$ oc get services -n openshift-cnv -l=operators.coreos.com/kubevirt-hyperconverged.openshift-cnv
      NAME                                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
      hco-webhook-service                     ClusterIP   172.30.214.154   <none>        4343/TCP   20m
      hostpath-provisioner-operator-service   ClusterIP   172.30.76.194    <none>        9443/TCP   20m
      node-maintenance-operator-service       ClusterIP   172.30.252.218   <none>        443/TCP    20m
      ssp-operator-service                    ClusterIP   172.30.30.108    <none>        9443/TCP   20m

      Version-Release number of selected component (if applicable):

      OCP >= 4.6

      How reproducible:

      100%

      Steps to Reproduce:

      1. deploy OpenShift Virtualization
      2. manually delete one of the replicas of virt-operator; this will cause the virt-operator deployment to report ready=1/2
      3. the deployment controller will kick in to recover, but this still causes a reconciliation loop in OLM that first marks the CSV phase=Pending and then InstallSucceeded again at the end
      4. we see `reusing existing cert hco-webhook-service-cert` twice in the logs, so we are sure that OLM executed installCertRequirementsForDeployment twice, and we know that this deletes and recreates the service (twice)

      Actual results:

      on node drains, or whenever one of the replicas of one of the OLM-managed operators gets killed, all the services for the OLM-managed webhooks from the same CSV are deleted and recreated twice, with a small service disruption

      Expected results:

      OLM does not delete and recreate the services for its managed webhooks when it is not needed
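
      For illustration only (this is not the actual OLM change), the expected behaviour corresponds to an update-in-place reconcile that keeps the existing Service object, and with it its UID and ClusterIP, across reconciliations:

      // Sketch of the expected behaviour: only create the Service when it is
      // missing, otherwise update the existing object in place; no Delete call,
      // so the webhook endpoint never disappears.
      package sketch

      import (
          "context"

          corev1 "k8s.io/api/core/v1"
          apierrors "k8s.io/apimachinery/pkg/api/errors"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
      )

      func reconcileWebhookServiceInPlace(ctx context.Context, c kubernetes.Interface, desired *corev1.Service) error {
          existing, err := c.CoreV1().Services(desired.Namespace).Get(ctx, desired.Name, metav1.GetOptions{})
          if apierrors.IsNotFound(err) {
              // Nothing to preserve: create it once.
              _, err = c.CoreV1().Services(desired.Namespace).Create(ctx, desired, metav1.CreateOptions{})
              return err
          }
          if err != nil {
              return err
          }
          // Overwrite only what the operator owns (labels, ports, selector) and
          // keep the rest of the existing object, including ClusterIP and UID.
          existing.Labels = desired.Labels
          existing.Spec.Ports = desired.Spec.Ports
          existing.Spec.Selector = desired.Spec.Selector
          _, err = c.CoreV1().Services(desired.Namespace).Update(ctx, existing, metav1.UpdateOptions{})
          return err
      }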

      Additional info:

      see: https://bugzilla.redhat.com/show_bug.cgi?id=2083396

              Assignee: Per Goncalves da Silva (pegoncal@redhat.com)
              Reporter: Simone Tiraboschi (stirabos)
              QA Contact: Kui Wang
              Votes: 0
              Watchers: 3