OpenShift Bugs · OCPBUGS-31384

api-int Certificate Authority rotation during 4.14.17 to 4.15.3 update

Release Note Text:
* Previously, clusters that were created before {product-title} 4.7 had several secrets of type `SecretTypeTLS`. Upon upgrading to {product-title} 4.16, these secrets were deleted and re-created with the type `kubernetes.io/tls`, and this removal could cause a race condition in which the contents of the secrets were lost. With this release, the secret type change happens automatically, and clusters created before {product-title} 4.7 can upgrade to 4.16 without risking the loss of these secrets' contents. (link:https://issues.redhat.com/browse/OCPBUGS-31384[*OCPBUGS-31384*])

Release Note Type: Bug Fix
Status: Done
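A cluster's exposure can be checked before updating by listing any secrets still carrying the legacy `SecretTypeTLS` type instead of `kubernetes.io/tls`. A minimal sketch, assuming the JSON shape produced by `oc get secrets -A -o json`; the inline sample data is illustrative, not from this cluster:

```python
import json

def legacy_tls_secrets(secret_list):
    """Return namespace/name for every secret whose type is still the
    legacy 'SecretTypeTLS' value rather than 'kubernetes.io/tls'."""
    return [
        f"{s['metadata']['namespace']}/{s['metadata']['name']}"
        for s in secret_list["items"]
        if s.get("type") == "SecretTypeTLS"
    ]

# Illustrative sample mimicking `oc get secrets -A -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"namespace": "openshift-kube-apiserver-operator",
                "name": "loadbalancer-serving-signer"},
   "type": "SecretTypeTLS"},
  {"metadata": {"namespace": "openshift-kube-apiserver",
                "name": "serving-cert"},
   "type": "kubernetes.io/tls"}
]}
""")

print(legacy_tls_secrets(sample))
```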

      Description of problem:

      In a cluster updating from 4.5.11 through many intermediate versions to 4.14.17 and on to 4.15.3 (initiated 2024-03-18T07:33:11Z), multus pods are sad about api-int X.509:

      $ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver/core/events.yaml <hivei01ue1.inspect.local.5020316083985214391.gz | yaml2json | jq -r '[.items[] | select(.reason == "FailedCreatePodSandBox")][0].message'
      (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-928-ip-10-164-221-242.ec2.internal_openshift-kube-apiserver_9e87f20b-471a-447e-9679-edce26b4ef78_0(8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c): error adding pod openshift-kube-apiserver_installer-928-ip-10-164-221-242.ec2.internal to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c Netns:/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78 Path: StdinData:[REDACTED]} ContainerID:"8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c" Netns:"/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78" Path:"" ERRORED: error configuring pod [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal] networking: Multus: [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal/9e87f20b-471a-447e-9679-edce26b4ef78]: error waiting for pod: Get "https://api-int.REDACTED:6443/api/v1/namespaces/openshift-kube-apiserver/pods/installer-928-ip-10-164-221-242.ec2.internal?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
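The yaml2json/jq filter above can also be sketched in plain Python, assuming the events have first been converted to JSON; the inline sample is illustrative, not the real event stream:

```python
import json

def first_sandbox_failure(events):
    """Return the message of the first event whose reason is
    FailedCreatePodSandBox, or None if there is none."""
    for ev in events["items"]:
        if ev.get("reason") == "FailedCreatePodSandBox":
            return ev["message"]
    return None

# Illustrative sample standing in for the converted events.yaml.
sample = json.loads("""
{"items": [
  {"reason": "Pulled",
   "message": "Container image already present on machine"},
  {"reason": "FailedCreatePodSandBox",
   "message": "Failed to create pod sandbox: ... tls: failed to verify certificate: x509: certificate signed by unknown authority"}
]}
""")

print(first_sandbox_failure(sample))
```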
      

      Version-Release number of selected component (if applicable)

      4.15.3, so we have 4.15.2's OCPBUGS-30304 but not 4.15.5's OCPBUGS-30237.

      How reproducible

      Seen in two clusters after updating from 4.14 to 4.15.3.

      Steps to Reproduce

      Unclear.

      Actual results

      Sad multus pods.

      Expected results

      Happy cluster.

      Additional info

      $ openssl s_client -showcerts -connect api-int.REDACTED:6443 < /dev/null
      ...
      Certificate chain
       0 s:CN = api-int.REDACTED
         i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
         a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
         v:NotBefore: Mar 25 19:35:55 2024 GMT; NotAfter: Apr 24 19:35:56 2024 GMT
      ...
       1 s:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
         i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
         a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
         v:NotBefore: Mar 18 07:33:47 2024 GMT; NotAfter: Mar 16 07:33:48 2034 GMT
      ...
      
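Two quick checks on the chain above: the `@1710747228` suffix in the signer name is a Unix timestamp, and the validity windows show a roughly 30-day serving cert under a roughly 10-year signer. A sketch, with the dates copied from the openssl output:

```python
from datetime import datetime, timezone

# Decode the signer-name suffix (seconds since the Unix epoch).
created = datetime.fromtimestamp(1710747228, tz=timezone.utc)
print(created.isoformat())  # 2024-03-18T07:33:48+00:00, seconds after the update began

# Compare the NotBefore/NotAfter windows from the openssl output.
fmt = "%b %d %H:%M:%S %Y"
leaf = datetime.strptime("Apr 24 19:35:56 2024", fmt) - datetime.strptime("Mar 25 19:35:55 2024", fmt)
signer = datetime.strptime("Mar 16 07:33:48 2034", fmt) - datetime.strptime("Mar 18 07:33:47 2024", fmt)
print(leaf.days, signer.days)  # 30-day serving cert, ~10-year signer
```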

      So that's created seconds after the update was initiated. We have inspect logs for some namespaces, but they don't go back quite that far, because the machine-config roll at the end of the update into 4.15.3 rolled all the pods:

      $ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-6cbfdd467c-4ctq7/kube-apiserver-operator/kube-apiserver-operator/logs/current.log <hivei01ue1.inspect.local.5020316083985214391.gz | head -n2
      2024-03-18T08:22:05.058253904Z I0318 08:22:05.056255       1 cmd.go:241] Using service-serving-cert provided certificates
      2024-03-18T08:22:05.058253904Z I0318 08:22:05.056351       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
      

      We were able to recover individual nodes via:

1. Run oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig from any machine with an admin kubeconfig.
2. Copy bootstrap.kubeconfig to all nodes as /etc/kubernetes/kubeconfig.
3. On each node, rm /var/lib/kubelet/kubeconfig.
4. Restart each node.
5. Approve each kubelet CSR.
6. Delete the node's multus-* pod.
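The recovery steps above can be sketched as a script. This is a dry-run sketch that only prints the commands it would run; the node list, the core SSH user, and the placeholder names in angle brackets are assumptions to be replaced for a real cluster:

```shell
#!/bin/sh
# Dry-run sketch of the per-node recovery procedure above.
# NODES and the SSH user are placeholder assumptions.
NODES="<node-1> <node-2>"
CMDS=""
run() { CMDS="${CMDS}+ $*
"; }

# 1. Generate a fresh bootstrap kubeconfig on a machine with an admin kubeconfig.
run "oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig"

for node in $NODES; do
  # 2.-3. Install it as /etc/kubernetes/kubeconfig and drop the stale kubelet copy.
  run "scp bootstrap.kubeconfig core@$node:/etc/kubernetes/kubeconfig"
  run "ssh core@$node sudo rm /var/lib/kubelet/kubeconfig"
  # 4. Restart the node so the kubelet re-bootstraps against the new CA.
  run "ssh core@$node sudo systemctl reboot"
done

# 5. Approve each pending kubelet CSR once the nodes come back.
run "oc get csr"
run "oc adm certificate approve <csr-name>"

# 6. Delete each node's multus-* pod so it restarts with working networking.
run "oc -n openshift-multus delete pod <multus-pod-on-that-node>"

printf '%s' "$CMDS"
```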


Errata Tool added a comment -

Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

For information on the advisory (Critical: OpenShift Container Platform 4.16.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2024:0041

OpenShift Jira Bot added a comment -

Hi vrutkovs@redhat.com,

Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the bug to Verified.

Ke Wang added a comment - edited

I repeated this another 3 times with 4.16 and got the expected results, so I am moving it to Verified.

Vadim Rutkovsky added a comment -

Verification looks good to me, thanks!

Assignee: Vadim Rutkovsky (vrutkovs@redhat.com)
Reporter: W. Trevor King (trking)
QA Contact: Ke Wang
