OpenShift Bugs / OCPBUGS-36195

ovnkube-controller crash loop because of bad certificate

    • 06/26 See the linked KB for the workaround

      Description of problem:

      The ovnkube-controller CertificateStore fails to start when the current certificate is invalid.

      Version-Release number of selected component (if applicable):

      4.14.22

      How reproducible:

      Easily. It is unknown how the zero-length certificate was created (possibly a failed rotation).

      Steps to Reproduce:

      1. Truncate the current certificate file in ovnkube-node-certs (the target of ovnkube-client-current.pem) to zero bytes.

      Actual results:

      ovnkube-controller goes into a crash loop forever.

      Expected results:

      ovnkube-controller should re-create the invalid certificate / not crash forever.

      Additional info:

      Log:

      omc logs -n openshift-ovn-kubernetes ovnkube-node-jf2jv ovnkube-controller -p
      2024-06-21T01:11:10.146874333Z F0621 01:11:10.146857  304586 ovnkube.go:136] failed to start the node certificate manager: failed to initialize the certificate manager: could not convert data from "/etc/ovn/ovnkube-node-certs/ovnkube-client-current.pem" into cert/key pair: tls: failed to find any PEM data in certificate input
      
      omc get pods -n openshift-ovn-kubernetes -o wide
      ovnkube-node-jf2jv                       7/8     CrashLoopBackOff   65         3h    10.x.xx.xx   xxx-xxx-xxx-master-0         <none>           <none>
      
      -rw-------. 1 root root    0 Jun 15 03:16 ovnkube-client-2024-06-15-03-16-45.pem
      lrwxrwxrwx. 1 root root   66 Jun 15 03:16 ovnkube-client-current.pem -> /etc/ovn/ovnkube-node-certs/ovnkube-client-2024-06-15-03-16-45.pem
      
      

      The issue in this case was that the current cert symlink pointed at an empty file. We are unsure why the rotation failed in this way; however, this took out the master node of a production FIS cluster. We should recover from a failed startup of the cert-manager code, even if the recovery is as simple as wiping all the certs and starting again, instead of calling log.exit.

      We should consider adding some recovery code here:
      https://github.com/openshift/ovn-kubernetes/blob/bdc67edd064afc3519c4780f43c1c3837cd4143f/go-controller/cmd/ovnkube/ovnkube.go#L290

      https://github.com/openshift/ovn-kubernetes/blob/bdc67edd064afc3519c4780f43c1c3837cd4143f/go-controller/pkg/util/kube.go#L300

      Also, --cert-duration=24h might be a bit excessive on a production system. I may be naive, but I don't see any advantage to rotating certs every single day.

              pdiak@redhat.com Patryk Diak
              rhn-support-tidawson Timothy Dawson
              Huiran Wang Huiran Wang