Description of problem:
This follows up on https://bugzilla.redhat.com/show_bug.cgi?id=2083396 after we fully identified its root cause.
The issue is fully reproducible and is basically a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2048441

Deleting the passive virt-operator pod causes its deployment to flicker a bit:

oc delete pods -n openshift-cnv virt-operator-6f9c546f6d-4z7j7

stirabos@t14s:~$ oc get deployment -n openshift-cnv virt-operator --watch
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
virt-operator   2/2     2            2           69m
virt-operator   1/2     1            1           73m
virt-operator   1/2     2            1           73m
virt-operator   1/2     2            1           73m
virt-operator   1/2     2            1           73m
virt-operator   1/2     2            1           73m
virt-operator   1/2     2            1           73m
virt-operator   1/2     2            1           73m
virt-operator   1/2     2            1           73m
virt-operator   1/2     2            1           73m
virt-operator   1/2     2            1           73m
virt-operator   2/2     2            2           74m

and this is clearly detected by the OLM:

stirabos@t14s:~$ oc logs --follow -n openshift-operator-lifecycle-manager $(oc get pods -n openshift-operator-lifecycle-manager -lapp=olm-operator -o name)
...
time="2022-09-02T14:11:57Z" level=warning msg="unhealthy component: waiting for deployment virt-operator to become ready: deployment \"virt-operator\" not available: Deployment does not have minimum availability." csv=kubevirt-hyperconverged-operator.v4.10.4 id=vvjS6 namespace=openshift-cnv phase=Succeeded strategy=deployment
...
I0902 14:12:00.197421 1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.10.4", UID:"d6f3c907-7df5-4c56-ab08-75e5e504d869", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"135920", FieldPath:""}): type: 'Normal' reason: 'NeedsReinstall' installing: waiting for deployment virt-operator to become ready: deployment "virt-operator" not available: Deployment does not have minimum availability.
...
time="2022-09-02T14:12:00Z" level=info msg="scheduling ClusterServiceVersion for install" csv=kubevirt-hyperconverged-operator.v4.10.4 id=eb1Cv namespace=openshift-cnv phase=Pending
I0902 14:12:00.927419 1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.10.4", UID:"d6f3c907-7df5-4c56-ab08-75e5e504d869", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"135971", FieldPath:""}): type: 'Normal' reason: 'AllRequirementsMet' all requirements found, attempting install
time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
time="2022-09-02T14:12:01Z" level=warning msg="reusing existing cert hco-webhook-service-cert"
time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
...
I0902 14:12:07.150972 1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.10.4", UID:"d6f3c907-7df5-4c56-ab08-75e5e504d869", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"136025", FieldPath:""}): type: 'Normal' reason: 'InstallSucceeded' waiting for install components to report healthy
time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"
time="2022-09-02T14:12:07Z" level=warning msg="reusing existing cert hco-webhook-service-cert"
time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"
time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"
time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"

Please note that, as per https://bugzilla.redhat.com/show_bug.cgi?id=2048441#c24, each time the OLM reconciles an existing service, it deletes and recreates it.
See: https://github.com/operator-framework/operator-lifecycle-manager/blob/0fa6f2930dfd00c43e8e99c821f73f392da26378/pkg/controller/install/certresources.go#L255-L274

So:
1. Deleting one of the replicas of virt-operator causes the virt-operator deployment to report ready=1/2.
2. This should be handled by the deployment controller alone, but it still triggers a reconciliation loop in the OLM, which first marks phase=Pending on the CSV and then InstallSucceeded again at the end.
3. We see `reusing existing cert hco-webhook-service-cert` twice in the logs, so we are sure that the OLM executed installCertRequirementsForDeployment twice, and we know that this deletes and recreates the service (twice).

Please note that this is true for all the OLM-managed services, and we have 4 of them:

stirabos@t14s:~$ oc get services -n openshift-cnv -l=operators.coreos.com/kubevirt-hyperconverged.openshift-cnv
NAME                                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
hco-webhook-service                     ClusterIP   172.30.214.154   <none>        4343/TCP   20m
hostpath-provisioner-operator-service   ClusterIP   172.30.76.194    <none>        9443/TCP   20m
node-maintenance-operator-service       ClusterIP   172.30.252.218   <none>        443/TCP    20m
ssp-operator-service                    ClusterIP   172.30.30.108    <none>        9443/TCP   20m
Version-Release number of selected component (if applicable):
OCP >= 4.6
How reproducible:
100%
Steps to Reproduce:
1. Deploy OpenShift Virtualization.
2. Manually delete one of the replicas of virt-operator; this causes the virt-operator deployment to report ready=1/2.
3. The deployment controller kicks in to recover; this triggers a reconciliation loop in the OLM, which first marks phase=Pending on the CSV and then InstallSucceeded again at the end.
4. We see `reusing existing cert hco-webhook-service-cert` twice in the logs, so we are sure that the OLM executed installCertRequirementsForDeployment twice, and we know that this deletes and recreates the service (twice).
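
The log check in the last step can be automated. The following is a minimal sketch, assuming the olm-operator logs have already been captured as a string (for example from `oc logs`); the function name countCertReuse is hypothetical, and the sample lines are copied from the log excerpt in the description.

```go
package main

import (
	"fmt"
	"strings"
)

// countCertReuse counts the log lines that contain the
// "reusing existing cert <secret>" warning. Each hit is taken as one run of
// the cert installation path for that secret (an inference from this report,
// not a documented OLM guarantee).
func countCertReuse(logs, secret string) int {
	marker := "reusing existing cert " + secret
	n := 0
	for _, line := range strings.Split(logs, "\n") {
		if strings.Contains(line, marker) {
			n++
		}
	}
	return n
}

func main() {
	logs := `time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
time="2022-09-02T14:12:01Z" level=warning msg="reusing existing cert hco-webhook-service-cert"
time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"
time="2022-09-02T14:12:07Z" level=warning msg="reusing existing cert hco-webhook-service-cert"`

	fmt.Println(countCertReuse(logs, "hco-webhook-service-cert")) // prints 2
}
```

A count of 2 after a single pod deletion matches the two delete-and-recreate cycles described in the steps.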
Actual results:
On node drains, or whenever one of the replicas of an OLM-managed operator gets killed, all the services for the OLM-managed webhooks from the same CSV are deleted and recreated twice, causing a small service disruption.
Expected results:
The OLM does not delete and recreate the services for its managed webhooks when this is not needed.
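
As an illustration of the expected behaviour only, and not of how OLM is or should be implemented, a reconciler could compare the existing service with the desired one and touch it only when they differ. The sketch below uses hypothetical, heavily simplified types (ServiceSpec, Action, planService) instead of the real Kubernetes API objects; the selector value is made up, and the port is taken from the service list in the description.

```go
package main

import "fmt"

// ServiceSpec is a hypothetical, trimmed-down stand-in for the fields that
// matter when comparing the desired service with the existing one.
type ServiceSpec struct {
	Port     int
	Selector string
}

// Action describes what the reconciler would do for one service.
type Action string

const (
	Create Action = "create"
	Update Action = "update"
	Noop   Action = "noop"
)

// planService picks the least disruptive action: create the service if it is
// missing, leave it alone if it already matches, update it otherwise.
// It never plans a delete followed by a create.
func planService(existing *ServiceSpec, desired ServiceSpec) Action {
	switch {
	case existing == nil:
		return Create
	case *existing == desired:
		return Noop
	default:
		return Update
	}
}

func main() {
	// Selector value is an assumption made for the example.
	desired := ServiceSpec{Port: 4343, Selector: "name=hyperconverged-cluster-webhook"}

	same := desired
	fmt.Println(planService(&same, desired)) // prints noop

	stale := ServiceSpec{Port: 443, Selector: desired.Selector}
	fmt.Println(planService(&stale, desired)) // prints update

	fmt.Println(planService(nil, desired)) // prints create
}
```

With a check of this kind, a virt-operator replica restart would leave the unchanged webhook services untouched.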
Additional info:
see: https://bugzilla.redhat.com/show_bug.cgi?id=2083396