Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-31807

api-int Certificate Authority rotation during 4.14.17 to 4.15.3 update

    XMLWordPrintable

Details

    • Critical
    • Yes
    • Proposed
    • False
    • Hide

      None

      Show
      None
    • Hide
      Previously, clusters created before OpenShift Container Platform 4.7 will have had signer keys for api-int endpoint updated unexpectedly when users upgraded to OpenShift Container Platform 4.15 due to the installer deleting the SecretTypeTLS and then recreating the secret with kubernetes.io/tls type. With this release, the issue is resolved by the installer changing the secret type without deleting the secret.
      Show
      Previously, clusters created before OpenShift Container Platform 4.7 will have had signer keys for api-int endpoint updated unexpectedly when users upgraded to OpenShift Container Platform 4.15 due to the installer deleting the SecretTypeTLS and then recreating the secret with kubernetes.io/tls type. With this release, the issue is resolved by the installer changing the secret type without deleting the secret.
    • Bug Fix
    • Done

    Description

      This is a clone of issue OCPBUGS-31384. The following is the description of the original issue:

      Description of problem:

      In a cluster updating from 4.5.11 through many intermediate versions to 4.14.17 and on to 4.15.3 (initiated 2024-03-18T07:33:11Z), multus pods are sad about api-int X.509:

      $ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver/core/events.yaml <hivei01ue1.inspect.local.5020316083985214391.gz | yaml2json | jq -r '[.items[] | select(.reason == "FailedCreatePodSandBox")][0].message'
      (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-928-ip-10-164-221-242.ec2.internal_openshift-kube-apiserver_9e87f20b-471a-447e-9679-edce26b4ef78_0(8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c): error adding pod openshift-kube-apiserver_installer-928-ip-10-164-221-242.ec2.internal to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c Netns:/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78 Path: StdinData:[REDACTED]} ContainerID:"8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c" Netns:"/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78" Path:"" ERRORED: error configuring pod [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal] networking: Multus: [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal/9e87f20b-471a-447e-9679-edce26b4ef78]: error waiting for pod: Get "https://api-int.REDACTED:6443/api/v1/namespaces/openshift-kube-apiserver/pods/installer-928-ip-10-164-221-242.ec2.internal?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
      

      Version-Release number of selected component (if applicable)

      4.15.3, so we have 4.15.2's OCPBUGS-30304 but not 4.15.5's OCPBUGS-30237.

      How reproducible

      Seen in two clusters after updating from 4.14 to 4.15.3.

      Steps to Reproduce

      Unclear.

      Actual results

      Sad multus pods.

      Expected results

      Happy cluster.

      Additional info

      $ openssl s_client -showcerts -connect api-int.REDACTED:6443 < /dev/null
      ...
      Certificate chain
       0 s:CN = api-int.REDACTED
         i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
         a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
         v:NotBefore: Mar 25 19:35:55 2024 GMT; NotAfter: Apr 24 19:35:56 2024 GMT
      ...
       1 s:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
         i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
         a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
         v:NotBefore: Mar 18 07:33:47 2024 GMT; NotAfter: Mar 16 07:33:48 2034 GMT
      ...
      

      So that's created seconds after the update was initiated. We have inspect logs for some namespaces, but they don't go back quite that far, because the machine-config roll at the end of the update into 4.15.3 rolled all the pods:

      $ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-6cbfdd467c-4ctq7/kube-apiserver-operator/kube-apiserver-operator/logs/current.log <hivei01ue1.inspect.local.5020316083985214391.gz | head -n2
      2024-03-18T08:22:05.058253904Z I0318 08:22:05.056255       1 cmd.go:241] Using service-serving-cert provided certificates
      2024-03-18T08:22:05.058253904Z I0318 08:22:05.056351       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
      

      We were able to recover individual nodes via:

      1. oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig  from any machine with an admin kubeconfig
      2. copy to all nodes as /etc/kubernetes/kubeconfig
      3. on each node rm /var/lib/kubelet/kubeconfig
      4. restart each node
      5. approve each kubelet CSR
      6. delete the node's multus-* pod.

      Attachments

        Issue Links

          Activity

            People

              vrutkovs@redhat.com Vadim Rutkovsky
              openshift-crt-jira-prow OpenShift Prow Bot
              Ke Wang Ke Wang
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: