-
Bug
-
Resolution: Done-Errata
-
Major
-
None
-
4.14
-
+
-
Important
-
No
-
SDN Sprint 256, SDN Sprint 257
-
2
-
False
-
-
N/A
-
Release Note Not Required
-
Done
06/26 See the linked KB for the workaround
Description of problem:
The ovnkube-controller CertificateStore fails to start when the current certificate is invalid.
Version-Release number of selected component (if applicable):
4.14.22
How reproducible:
Easily. It is unknown how the zero-length certificate was created (possibly a failed rotation).
Steps to Reproduce:
1. Truncate the current certificate file in ovnkube-node-certs to zero bytes.
Actual results:
ovnkube-controller goes into a crash loop forever.
Expected results:
ovnkube-controller should re-create the invalid certificate / not crash forever.
Additional info:
Affected Platforms:
Log:
omc logs -n openshift-ovn-kubernetes ovnkube-node-jf2jv ovnkube-controller -p
2024-06-21T01:11:10.146874333Z F0621 01:11:10.146857 304586 ovnkube.go:136] failed to start the node certificate manager: failed to initialize the certificate manager: could not convert data from "/etc/ovn/ovnkube-node-certs/ovnkube-client-current.pem" into cert/key pair: tls: failed to find any PEM data in certificate input

omc get pods -n openshift-ovn-kubernetes -o wide
ovnkube-node-jf2jv 7/8 CrashLoopBackOff 65 3h 10.x.xx.xx xxx-xxx-xxx-master-0 <none> <none>

-rw-------. 1 root root 0 Jun 15 03:16 ovnkube-client-2024-06-15-03-16-45.pem
lrwxrwxrwx. 1 root root 66 Jun 15 03:16 ovnkube-client-current.pem -> /etc/ovn/ovnkube-node-certs/ovnkube-client-2024-06-15-03-16-45.pem
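The innermost error in the log ("tls: failed to find any PEM data in certificate input") comes from Go's crypto/tls package. A minimal sketch of how a zero-length file triggers it, assuming the file is parsed as a combined cert/key bundle as the "cert/key pair" wording suggests (checkPEMPair is a hypothetical helper name, not code from ovn-kubernetes):

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// checkPEMPair is a hypothetical helper: it tries to parse data as a PEM
// bundle that holds both the certificate and the private key, and returns
// the parse error (nil when the bundle is usable).
func checkPEMPair(data []byte) error {
	_, err := tls.X509KeyPair(data, data)
	return err
}

func main() {
	// Simulate the contents of a zero-byte ovnkube-client-current.pem.
	empty := []byte{}
	if err := checkPEMPair(empty); err != nil {
		fmt.Println("invalid current certificate:", err)
	}
}
```

Any caller that treats this error as fatal will fail the same way on every restart, because the empty file is still on disk.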
The issue in this case was that the current cert symlink pointed at an empty file. We are unsure why the rotation failed in this way; however, this took out the master node of a production FIS cluster. We should recover from a failed startup of the cert-manager code, even if the recovery is as simple as wiping all the certs and starting again, instead of calling log.exit.
We should consider adding some recovery code here:
https://github.com/openshift/ovn-kubernetes/blob/bdc67edd064afc3519c4780f43c1c3837cd4143f/go-controller/cmd/ovnkube/ovnkube.go#L290
Also, --cert-duration=24h might be a bit excessive on a production system. I may be naive, but I don't see any advantage to rotating certs every single day.
- links to
-
RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update