Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: None
Affects Version/s: 4.14
Component/s: Networking / ovn-kubernetes
Labels:
- SDN:Security

Test Coverage:

Severity:
Important
Regression:
No
Sprint:
SDN Sprint 256, SDN Sprint 257
sprint_count:
2
Blocked:
False
Blocked Reason:

None
Release Note Text:
N/A
Release Note Type:
Release Note Not Required
Release Note Status:
Done
Target Version:

4.17.0
Escape Reason:
Escape Impact:
Corrective Measures:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Priority Data:
PX Review Complete:
PX Technical Impact:
PX Technical Impact Notes:

Description of problem:

The ovnkube-controller CertificateStore fails to start when the current certificate is invalid.

Version-Release number of selected component (if applicable):

4.14.22

How reproducible:

Easily. It is unknown how the zero-length certificate was created (possibly a failed rotation).

Steps to Reproduce:

1. Truncate the current certificate file in ovnkube-node-certs to zero bytes.

Actual results:

ovnkube-controller goes into a crash loop indefinitely.

Expected results:

ovnkube-controller should re-create the invalid certificate / not crash forever.

Additional info:


Affected Platforms:

Log:

omc logs -n openshift-ovn-kubernetes ovnkube-node-jf2jv ovnkube-controller -p
2024-06-21T01:11:10.146874333Z F0621 01:11:10.146857  304586 ovnkube.go:136] failed to start the node certificate manager: failed to initialize the certificate manager: could not convert data from "/etc/ovn/ovnkube-node-certs/ovnkube-client-current.pem" into cert/key pair: tls: failed to find any PEM data in certificate input

omc get pods -n openshift-ovn-kubernetes -o wide
ovnkube-node-jf2jv                       7/8     CrashLoopBackOff   65         3h    10.x.xx.xx   xxx-xxx-xxx-master-0         <none>           <none>

-rw-------. 1 root root    0 Jun 15 03:16 ovnkube-client-2024-06-15-03-16-45.pem
lrwxrwxrwx. 1 root root   66 Jun 15 03:16 ovnkube-client-current.pem -> /etc/ovn/ovnkube-node-certs/ovnkube-client-2024-06-15-03-16-45.pem

The issue in this case was that the current cert symlink pointed at an empty file. We are unsure why the rotation failed in this way; however, it took out the master node of a production FIS cluster. We should recover from a failed startup of the cert-manager code, even if the recovery is as simple as wiping all the certs and starting again, instead of calling log.exit.

We should consider adding some recovery code here:
https://github.com/openshift/ovn-kubernetes/blob/bdc67edd064afc3519c4780f43c1c3837cd4143f/go-controller/cmd/ovnkube/ovnkube.go#L290

https://github.com/openshift/ovn-kubernetes/blob/bdc67edd064afc3519c4780f43c1c3837cd4143f/go-controller/pkg/util/kube.go#L300

Also, --cert-duration=24h might be a bit excessive on a production system. I may be naive, but I don't see any advantage to rotating certs every single day.

is triggering

CORENET-966 Corrective Measure for OCPBUGS-36195: ovnkube-controller crash loop because of bad certificate

To Do

links to

Downstream merge

KCS - ovnkube-node pod is in CrashLoopBackOff state

RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update

Upstream fix

Assignee: Patryk Diak

Reporter: Tim Dawson

QA Contact: Huiran Wang

Votes: 0

Watchers: 8

Created: 2024/06/26 5:23 AM

Updated: 2025/03/11 9:13 PM

Resolved: 2024/10/01 5:39 PM
