
SDN-4460: Investigate CA rotation for ovnkube-node certificates


    • Type: Story
    • Priority: Major
    • Resolution: Unresolved

      Description of problem:

      A long-lived cluster updating into 4.16.0-ec.1 was bitten by the Engineering Candidate's rotation of the api-int CA once that CA is a month or more old (details on the early rotation in API-1687). After manually updating /var/lib/kubelet/kubeconfig to include the new CA (an update that OCPBUGS-25821 is working on automating), multus pods still complained about untrusted api-int:

      $ oc -n openshift-multus logs multus-pz7zp | grep api-int | tail -n5
      E0119 19:33:52.983918    3194 reflector.go:148] k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dbuild0-gstfj-m-2.c.openshift-ci-build-farm.internal&resourceVersion=4723865081": tls: failed to verify certificate: x509: certificate signed by unknown authority
      2024-01-19T19:33:55Z [error] Multus: [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5/f79ff01a-71c2-4f02-b48b-8c23c9e875ce]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-machine-api/pods/cluster-autoscaler-default-f8dd547c7-dg9t5?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
      2024-01-19T19:33:55Z [verbose] ADD finished CNI request ContainerID:"b554f8edca8ea7672119c1aa71a69e0368fefeb5f8ae2c2659f822b7fa8d3f62" Netns:"/var/run/netns/36923fe0-e28d-422f-8213-233086527baa" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-machine-api;K8S_POD_NAME=cluster-autoscaler-default-f8dd547c7-dg9t5;K8S_POD_INFRA_CONTAINER_ID=b554f8edca8ea7672119c1aa71a69e0368fefeb5f8ae2c2659f822b7fa8d3f62;K8S_POD_UID=f79ff01a-71c2-4f02-b48b-8c23c9e875ce" Path:"", result: "", err: error configuring pod [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5] networking: Multus: [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5/f79ff01a-71c2-4f02-b48b-8c23c9e875ce]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-machine-api/pods/cluster-autoscaler-default-f8dd547c7-dg9t5?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
      2024-01-19T19:34:00Z [error] Multus: [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj/769153af-350b-492b-9589-ede2574aea85]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-storage-version-migrator/pods/migrator-558d4d48b9-ggjpj?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
      2024-01-19T19:34:00Z [verbose] ADD finished CNI request ContainerID:"cfd0b8ca596411f1e26ae058fc9f015d6edeac407668420c023ff459860423eb" Netns:"/var/run/netns/bc7fbf17-c049-4241-a7dc-7e27acd3c8af" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-storage-version-migrator;K8S_POD_NAME=migrator-558d4d48b9-ggjpj;K8S_POD_INFRA_CONTAINER_ID=cfd0b8ca596411f1e26ae058fc9f015d6edeac407668420c023ff459860423eb;K8S_POD_UID=769153af-350b-492b-9589-ede2574aea85" Path:"", result: "", err: error configuring pod [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj] networking: Multus: [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj/769153af-350b-492b-9589-ede2574aea85]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-storage-version-migrator/pods/migrator-558d4d48b9-ggjpj?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
      

      The multus pod needed a delete/replace, and after that it recovered:

      $ oc --as system:admin -n openshift-multus delete pod multus-pz7zp
      pod "multus-pz7zp" deleted
      $ oc -n openshift-multus get -o wide pods | grep 'NAME\|build0-gstfj-m-2.c.openshift-ci-build-farm.internal'
      NAME                                           READY   STATUS              RESTARTS      AGE     IP               NODE                                                              NOMINATED NODE   READINESS GATES
      multus-additional-cni-plugins-wrdtt            1/1     Running             1             28h     10.0.0.3         build0-gstfj-m-2.c.openshift-ci-build-farm.internal               <none>           <none>
      multus-admission-controller-74d794678b-9s7kl   2/2     Running             0             27h     10.129.0.36      build0-gstfj-m-2.c.openshift-ci-build-farm.internal               <none>           <none>
      multus-hxmkz                                   1/1     Running             0             11s     10.0.0.3         build0-gstfj-m-2.c.openshift-ci-build-farm.internal               <none>           <none>
      network-metrics-daemon-dczvs                   2/2     Running             2             28h     10.129.0.4       build0-gstfj-m-2.c.openshift-ci-build-farm.internal               <none>           <none>
      $ oc -n openshift-multus logs multus-hxmkz | grep -c api-int
      0
      

      That need for multus-pod deletion should be automated, to reduce the number of manual touches required when the api-int CA rolls.
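
      Until that automation exists, an interim manual sweep might look something like the sketch below. This is only a sketch: the app=multus label selector and the exact error string are assumptions based on the DaemonSet and the logs above, not something this report verified.

      # Find multus pods currently failing api-int TLS verification and delete
      # them so the DaemonSet recreates them against the refreshed kubeconfig.
      for pod in $(oc -n openshift-multus get pods -l app=multus -o name); do
        if oc -n openshift-multus logs "${pod}" --tail=50 2>/dev/null \
            | grep -q 'certificate signed by unknown authority'; then
          oc -n openshift-multus delete "${pod}"
        fi
      done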

      Version-Release number of selected component:

      Seen in 4.16.0-ec.1.

      How reproducible:

      Several multus pods on this cluster were bitten, but others were not, including some on clusters whose old kubeconfigs did not contain the new CA. I'm not clear on what the trigger is; perhaps some clients escape immediate trouble because they hold existing api-int connections that were established back when the servers still used the old CA. But deleting a multus pod on a cluster whose /var/lib/kubelet/kubeconfig has not yet been updated will likely reproduce the breakage, at least until OCPBUGS-25821 is fixed.

      Steps to Reproduce:

      Not entirely clear, but something like:

      1. Install 4.16.0-ec.1.
      2. Wait a month or more for the Kube API server operator to decide to roll the CA signing api-int.
      3. Delete a multus pod, so the replacement comes up broken on api-int trust.
      4. Manually update /var/lib/kubelet/kubeconfig (see the trust check sketched below).
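
      To confirm a node is in the broken window between steps 2 and 4 (the serving CA has rolled but the node's kubeconfig has not been updated), the certificate currently served on api-int can be verified against the CA bundle the kubeconfig still carries. A rough sketch, assuming the kubeconfig embeds the bundle as certificate-authority-data (if it references a CA file instead, point -CAfile at that file) and that openssl is available on the node:

      # Extract the CA bundle the kubelet kubeconfig currently trusts.
      $ grep certificate-authority-data /var/lib/kubelet/kubeconfig \
          | awk '{print $2}' | base64 -d > /tmp/kubelet-ca-bundle.crt
      # Verify the chain api-int is serving right now against that bundle;
      # anything other than "Verify return code: 0 (ok)" means a freshly created
      # multus pod on this node should hit the x509 errors shown above.
      $ echo | openssl s_client -connect api-int.build02.gcp.ci.openshift.org:6443 \
          -CAfile /tmp/kubelet-ca-bundle.crt 2>/dev/null | grep 'Verify return code'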

      Actual results:

      Multus still fails to trust api-int until the broken pod is deleted or its container otherwise restarts and picks up the updated kubeconfig.

      Expected results:

      The multus pod automatically picks up the updated kubeconfig, without needing a manual pod deletion.

      Additional info:

      One possible implementation would be a liveness probe that fails on api-int trust issues, triggering the kubelet to roll the multus container; the replacement container would then come up and load the fresh kubeconfig.
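
      A minimal sketch of what such a probe's exec command could run is below. The paths, the startup-time CA copy, and the reliance on curl's TLS-failure exit code are all assumptions for illustration, not the current multus manifest; wiring the script into the DaemonSet's livenessProbe is omitted.

      # Assumed setup: at container start, an entrypoint wrapper copies the CA it
      # loaded from the kubeconfig to /run/multus/startup-ca.crt. The probe then
      # re-verifies api-int against that startup-time copy, so it keeps passing
      # while the loaded CA is still valid and starts failing once the serving CA
      # rolls out from under the running container.
      APISERVER="$(grep -m1 'server:' /var/lib/kubelet/kubeconfig | awk '{print $2}')"
      # curl exits 60 on a TLS verification failure (HTTP status codes still exit
      # 0 without --fail), which fails the probe; once failureThreshold is hit the
      # kubelet restarts the container, and the replacement re-reads the refreshed
      # kubeconfig from disk.
      exec curl --silent --output /dev/null --cacert /run/multus/startup-ca.crt "${APISERVER}/healthz"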

              Assignee: Patryk Diak (pdiak@redhat.com)
              Reporter: W. Trevor King
              QA Contact: Weibin Liang