OpenShift Bugs / OCPBUGS-2030

Draining nodes may lead to short webhook service interruptions


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • Affects Version/s: 4.11
    • Component/s: OLM
    • Quality / Stability / Reliability
    • Rejected

      Description of problem:

      This follows up on https://bugzilla.redhat.com/show_bug.cgi?id=2083396 now that we have fully identified its root cause.
      
      OK, this is fully reproducible and it's basically a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2048441
      
      Deleting the passive virt-operator pod causes its deployment to flicker a bit:
      
      oc delete pods -n openshift-cnv virt-operator-6f9c546f6d-4z7j7
      
      
      stirabos@t14s:~$ oc get deployment -n openshift-cnv virt-operator --watch
      NAME            READY   UP-TO-DATE   AVAILABLE   AGE
      virt-operator   2/2     2            2           69m
      virt-operator   1/2     1            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   1/2     2            1           73m
      virt-operator   2/2     2            2           74m
      
      and this is clearly detected by the OLM:
      stirabos@t14s:~$ oc logs --follow -n openshift-operator-lifecycle-manager  $(oc get pods -n openshift-operator-lifecycle-manager -lapp=olm-operator -o name)
      ...
      time="2022-09-02T14:11:57Z" level=warning msg="unhealthy component: waiting for deployment virt-operator to become ready: deployment \"virt-operator\" not available: Deployment does not have minimum availability." csv=kubevirt-hyperconverged-operator.v4.10.4 id=vvjS6 namespace=openshift-cnv phase=Succeeded strategy=deployment
      ...
      I0902 14:12:00.197421       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.10.4", UID:"d6f3c907-7df5-4c56-ab08-75e5e504d869", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"135920", FieldPath:""}): type: 'Normal' reason: 'NeedsReinstall' installing: waiting for deployment virt-operator to become ready: deployment "virt-operator" not available: Deployment does not have minimum availability.
      ...
      time="2022-09-02T14:12:00Z" level=info msg="scheduling ClusterServiceVersion for install" csv=kubevirt-hyperconverged-operator.v4.10.4 id=eb1Cv namespace=openshift-cnv phase=Pending
      I0902 14:12:00.927419       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.10.4", UID:"d6f3c907-7df5-4c56-ab08-75e5e504d869", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"135971", FieldPath:""}): type: 'Normal' reason: 'AllRequirementsMet' all requirements found, attempting install
      time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
      time="2022-09-02T14:12:01Z" level=warning msg="reusing existing cert hco-webhook-service-cert"
      time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
      time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
      time="2022-09-02T14:12:01Z" level=info msg="No api or webhook descs to add CA to"
      ...
      I0902 14:12:07.150972       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.10.4", UID:"d6f3c907-7df5-4c56-ab08-75e5e504d869", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"136025", FieldPath:""}): type: 'Normal' reason: 'InstallSucceeded' waiting for install components to report healthy
      time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"
      time="2022-09-02T14:12:07Z" level=warning msg="reusing existing cert hco-webhook-service-cert"
      time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"
      time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"
      time="2022-09-02T14:12:07Z" level=info msg="No api or webhook descs to add CA to"
      
      Please notice that, as per https://bugzilla.redhat.com/show_bug.cgi?id=2048441#c24, each time OLM tries to reconcile an existing service it deletes and recreates it.
      See: https://github.com/operator-framework/operator-lifecycle-manager/blob/0fa6f2930dfd00c43e8e99c821f73f392da26378/pkg/controller/install/certresources.go#L255-L274
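
      For context, the linked lines follow roughly the pattern below. This is a simplified sketch written against plain client-go, not the actual OLM source; the function name and signature are made up for illustration.

      // Sketch of the reconcile pattern in the linked certresources.go lines:
      // when the Service already exists it is not updated in place; it is
      // deleted and then created again, so for a short window the webhook has
      // no backing Service.
      package sketch

      import (
          "context"
          "fmt"

          corev1 "k8s.io/api/core/v1"
          apierrors "k8s.io/apimachinery/pkg/api/errors"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
      )

      func reconcileWebhookService(ctx context.Context, c kubernetes.Interface, desired *corev1.Service) error {
          existing, err := c.CoreV1().Services(desired.Namespace).Get(ctx, desired.Name, metav1.GetOptions{})
          if err == nil {
              // The Service is already there: carry over its owner references ...
              desired.SetOwnerReferences(existing.GetOwnerReferences())
              // ... then delete it; its ClusterIP and endpoints disappear here.
              if err := c.CoreV1().Services(desired.Namespace).Delete(ctx, desired.Name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
                  return fmt.Errorf("could not delete existing service %s: %w", desired.Name, err)
              }
          } else if !apierrors.IsNotFound(err) {
              return err
          }
          // Create the Service again from scratch: between Delete and Create the
          // webhook service simply does not exist, hence the short disruption.
          _, err = c.CoreV1().Services(desired.Namespace).Create(ctx, desired, metav1.CreateOptions{})
          return err
      }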
      
      So:
      1. deleting one of the replicas of virt-operator will cause the virt-operator deployment to report ready=1/2
      2. this should be handled by the deployment controller alone, but it still triggers a reconciliation loop in OLM that first marks the CSV phase=Pending and then InstallSucceeded again at the end.
      3. we see `reusing existing cert hco-webhook-service-cert` twice in the logs, so we are sure that OLM executed installCertRequirementsForDeployment twice, and we know that this deletes and recreates the service (twice); one way to confirm this is sketched right after this list
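
      One way to confirm point 3 from the outside is to compare the Service's UID before and after deleting a virt-operator replica: a patch/update keeps the UID, while delete-and-recreate produces a new one. A minimal sketch, assuming a standard local kubeconfig:

      package main

      import (
          "context"
          "fmt"
          "os"
          "path/filepath"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Build a client from the default kubeconfig (assumption: $HOME/.kube/config).
          cfg, err := clientcmd.BuildConfigFromFlags("", filepath.Join(os.Getenv("HOME"), ".kube", "config"))
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          // Run this once before and once after deleting a virt-operator replica:
          // a changed UID means the Service object was deleted and recreated.
          svc, err := client.CoreV1().Services("openshift-cnv").Get(context.TODO(), "hco-webhook-service", metav1.GetOptions{})
          if err != nil {
              panic(err)
          }
          fmt.Printf("uid=%s created=%s\n", svc.UID, svc.CreationTimestamp)
      }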
      
      Please notice that this is true for all of the OLM-managed services, and we have 4 of them:
      stirabos@t14s:~$ oc get services -n openshift-cnv -l=operators.coreos.com/kubevirt-hyperconverged.openshift-cnv
      NAME                                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
      hco-webhook-service                     ClusterIP   172.30.214.154   <none>        4343/TCP   20m
      hostpath-provisioner-operator-service   ClusterIP   172.30.76.194    <none>        9443/TCP   20m
      node-maintenance-operator-service       ClusterIP   172.30.252.218   <none>        443/TCP    20m
      ssp-operator-service                    ClusterIP   172.30.30.108    <none>        9443/TCP   20m

      Version-Release number of selected component (if applicable):

      OCP >= 4.6

      How reproducible:

      100%

      Steps to Reproduce:

      1. deploy OpenShift Virtualization
      2. manually delete one of the replicas of virt-operator; this will cause the virt-operator deployment to report ready=1/2
      3. the deployment controller will kick in to recover, but this still causes a reconciliation loop in OLM that first marks the CSV phase=Pending and then InstallSucceeded again at the end
      4. we see `reusing existing cert hco-webhook-service-cert` twice in the logs, so we are sure that OLM executed installCertRequirementsForDeployment twice, and we know that this deletes and recreates the service (twice)

      Actual results:

      on node drains, or whenever one of the replicas of one of the OLM-managed operators gets killed, all the services for the OLM-managed webhooks from the same CSV are deleted and recreated twice, with a small service disruption

      Expected results:

      OLM does not delete and recreate the services for its managed webhooks when it is not needed
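
      For illustration only (this is not the actual OLM change), the expected behaviour corresponds to an update-in-place reconcile that keeps the existing Service object, and with it its UID and ClusterIP, across reconciliations:

      // Sketch of the expected behaviour: only create the Service when it is
      // missing, otherwise update the existing object in place; no Delete call,
      // so the webhook endpoint never disappears.
      package sketch

      import (
          "context"

          corev1 "k8s.io/api/core/v1"
          apierrors "k8s.io/apimachinery/pkg/api/errors"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
      )

      func reconcileWebhookServiceInPlace(ctx context.Context, c kubernetes.Interface, desired *corev1.Service) error {
          existing, err := c.CoreV1().Services(desired.Namespace).Get(ctx, desired.Name, metav1.GetOptions{})
          if apierrors.IsNotFound(err) {
              // Nothing to preserve: create it once.
              _, err = c.CoreV1().Services(desired.Namespace).Create(ctx, desired, metav1.CreateOptions{})
              return err
          }
          if err != nil {
              return err
          }
          // Overwrite only what the operator owns (labels, ports, selector) and
          // keep the rest of the existing object, including ClusterIP and UID.
          existing.Labels = desired.Labels
          existing.Spec.Ports = desired.Spec.Ports
          existing.Spec.Selector = desired.Spec.Selector
          _, err = c.CoreV1().Services(desired.Namespace).Update(ctx, existing, metav1.UpdateOptions{})
          return err
      }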

      Additional info:

      see: https://bugzilla.redhat.com/show_bug.cgi?id=2083396

              Assignee: Per Goncalves da Silva (pegoncal@redhat.com)
              Reporter: Simone Tiraboschi (stirabos)
              QA Contact: Kui Wang
              Votes: 0
              Watchers: 3