Type: Bug
Resolution: Done
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.13.0
Component/s: service-ca
Labels: None
Regression: None
Blocked: False
Blocked Reason: None
Target Version: 4.13.0
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

The service-ca controller start func seems to return the "stopped" error (see the old pod's log under Actual results) as soon as its context is cancelled, which seems to happen the moment the first signal is received: https://github.com/openshift/service-ca-operator/blob/42088528ef8a6a4b8c99b0f558246b8025584056/pkg/controller/starter.go#L24

That apparently triggers os.Exit(1) immediately: https://github.com/openshift/service-ca-operator/blob/42088528ef8a6a4b8c99b0f55824[…]om/openshift/library-go/pkg/controller/controllercmd/builder.go

The lock release doesn't happen until the periodic renew tick breaks out: https://github.com/openshift/service-ca-operator/blob/42088528ef8a6a4b8c99b0f55824[…]/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go

So it seems unlikely that the call to le.release() is reached before the call to os.Exit(1) in the other goroutine, which means the old pod exits while still holding the lease.

Version-Release number of selected component (if applicable):

4.13.0

How reproducible:

~always

Steps to Reproduce:

1. oc delete -n openshift-service-ca pod <service-ca pod>

Actual results:

the old pod logs show:

W1103 09:59:14.370594       1 builder.go:106] graceful termination failed, controllers failed with error: stopped

When a new pod comes up to replace it, it has to wait (about 2.5 minutes in the log below) before acquiring the leader lock:

I1103 16:46:00.166173       1 leaderelection.go:248] attempting to acquire leader lease openshift-service-ca/service-ca-controller-lock...
 .... waiting ....
I1103 16:48:30.004187       1 leaderelection.go:258] successfully acquired lease openshift-service-ca/service-ca-controller-lock

Expected results:

The new pod can acquire the leader lease without waiting for the old pod's lease to expire.

Additional info:

blocks: OCPBUGS-3279 Service-ca controller exits immediately with an error on sigterm (Closed)
causes: USHIFT-542 On reboot the service-ca and OpenShift router takes a very long time >4-5min to start again. (Closed)
is cloned by: OCPBUGS-3279 Service-ca controller exits immediately with an error on sigterm (Closed)
links to: openshift/service-ca-operator#202: OCPBUGS-3195: Return nil from start funcs after context is cancelled.

Assignee: Stanislav Láznička (Inactive)
Reporter: Ben Luddy
QA Contact: Giriyamma Karagere Ramaswamy (Inactive)
Votes: 0
Watchers: 5
Created: 2022/11/03 4:40 PM
Updated: 2023/05/17 10:32 PM
Resolved: 2023/05/17 10:32 PM
