OCPBUGS-31384

api-int Certificate Authority rotation during 4.14.17 to 4.15.3 update


    • Critical
    • Yes
    • Proposed
    • False
      * Previously, clusters that were created before {product-title} 4.7 had several secrets of type `SecretTypeTLS`. Upon upgrading to {product-title} 4.16, these secrets are deleted and re-created with the type `kubernetes.io/tls`. This removal could cause a race condition and the contents of the secrets could be lost. With this release, the secret type change now happens automatically and clusters created before {product-title} 4.7 can upgrade to 4.16 without risking losing the contents of these secrets. (link:https://issues.redhat.com/browse/OCPBUGS-31384[*OCPBUGS-31384*])
    • Bug Fix
    • Done

      Description of problem:

      In a cluster updating from 4.5.11 through many intermediate versions to 4.14.17 and on to 4.15.3 (initiated 2024-03-18T07:33:11Z), multus pods are sad about api-int X.509:

      $ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver/core/events.yaml <hivei01ue1.inspect.local.5020316083985214391.gz | yaml2json | jq -r '[.items[] | select(.reason == "FailedCreatePodSandBox")][0].message'
      (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-928-ip-10-164-221-242.ec2.internal_openshift-kube-apiserver_9e87f20b-471a-447e-9679-edce26b4ef78_0(8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c): error adding pod openshift-kube-apiserver_installer-928-ip-10-164-221-242.ec2.internal to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c Netns:/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78 Path: StdinData:[REDACTED]} ContainerID:"8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c" Netns:"/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78" Path:"" ERRORED: error configuring pod [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal] networking: Multus: [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal/9e87f20b-471a-447e-9679-edce26b4ef78]: error waiting for pod: Get "https://api-int.REDACTED:6443/api/v1/namespaces/openshift-kube-apiserver/pods/installer-928-ip-10-164-221-242.ec2.internal?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
      

      Version-Release number of selected component (if applicable):

      4.15.3, so we have 4.15.2's OCPBUGS-30304 but not 4.15.5's OCPBUGS-30237.

      How reproducible:

      Seen in two clusters after updating from 4.14 to 4.15.3.

      Steps to Reproduce:

      Unclear.

      Actual results:

      Sad multus pods.

      Expected results:

      Happy cluster.

      Additional info:

      $ openssl s_client -showcerts -connect api-int.REDACTED:6443 < /dev/null
      ...
      Certificate chain
       0 s:CN = api-int.REDACTED
         i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
         a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
         v:NotBefore: Mar 25 19:35:55 2024 GMT; NotAfter: Apr 24 19:35:56 2024 GMT
      ...
       1 s:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
         i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
         a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
         v:NotBefore: Mar 18 07:33:47 2024 GMT; NotAfter: Mar 16 07:33:48 2034 GMT
      ...
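
The `@1710747228` suffix on the signer CN looks like a Unix timestamp. That is an assumption on my part, but it lines up with the certificate's NotBefore/NotAfter seconds. A minimal sketch for decoding it and comparing it with the update initiation time from the description, assuming GNU `date` (BSD/macOS `date` takes different flags):

```shell
# Assumption: the numeric suffix of the signer CN is a Unix epoch timestamp.
signer_epoch=1710747228              # from ...loadbalancer-serving-signer@1710747228
update_start='2024-03-18T07:33:11Z'  # update initiation time from the description

# GNU date: -d @EPOCH converts an epoch, -d STRING parses an ISO 8601 timestamp.
signer_created="$(date -u -d "@${signer_epoch}" +%Y-%m-%dT%H:%M:%SZ)"
update_epoch="$(date -u -d "${update_start}" +%s)"

echo "signer created: ${signer_created}"
echo "seconds after update start: $((signer_epoch - update_epoch))"
```

With those inputs this prints `signer created: 2024-03-18T07:33:48Z` and `seconds after update start: 37`.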
      

      So the loadbalancer-serving-signer CA has a NotBefore of Mar 18 07:33:47 2024 GMT, seconds after the update was initiated (2024-03-18T07:33:11Z). We have inspect logs for some namespaces, but they don't go back quite that far, because the machine-config roll at the end of the update into 4.15.3 rolled all the pods:

      $ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-6cbfdd467c-4ctq7/kube-apiserver-operator/kube-apiserver-operator/logs/current.log <hivei01ue1.inspect.local.5020316083985214391.gz | head -n2
      2024-03-18T08:22:05.058253904Z I0318 08:22:05.056255       1 cmd.go:241] Using service-serving-cert provided certificates
      2024-03-18T08:22:05.058253904Z I0318 08:22:05.056351       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
      

      We were able to recover individual nodes via:

      1. Run oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig from any machine with an admin kubeconfig.
      2. Copy bootstrap.kubeconfig to every node as /etc/kubernetes/kubeconfig.
      3. On each node, rm /var/lib/kubelet/kubeconfig.
      4. Restart each node.
      5. Approve each kubelet CSR.
      6. Delete the node's multus-* pod.
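
The recovery steps above could be scripted roughly as follows. This is a dry-run sketch, not the exact procedure used on the affected clusters. The SSH user `core`, the `/tmp` staging path, the `openshift-multus` namespace, and the helper names `run`, `make_bootstrap_kubeconfig`, `recover_node`, and `finish_node` are all assumptions for illustration. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
set -eu

DRY_RUN="${DRY_RUN:-1}"       # 1 = print commands only, 0 = execute them
SSH_USER="${SSH_USER:-core}"  # assumption: nodes accept SSH as the "core" user

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: generate a fresh bootstrap kubeconfig (needs an admin kubeconfig).
make_bootstrap_kubeconfig() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' 'oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig'
  else
    oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig
  fi
}

# Steps 2-4: install the bootstrap kubeconfig, drop the stale kubelet
# kubeconfig, and restart the node.
recover_node() {
  local node="$1"
  run scp bootstrap.kubeconfig "${SSH_USER}@${node}:/tmp/bootstrap.kubeconfig"
  run ssh "${SSH_USER}@${node}" sudo cp /tmp/bootstrap.kubeconfig /etc/kubernetes/kubeconfig
  run ssh "${SSH_USER}@${node}" sudo rm /var/lib/kubelet/kubeconfig
  run ssh "${SSH_USER}@${node}" sudo systemctl reboot
}

# Steps 5-6: once the node is back, approve its kubelet CSR and delete its
# multus pod. Both names are looked up by the operator, for example with
# "oc get csr" and "oc -n openshift-multus get pods -o wide".
finish_node() {
  local csr="$1" multus_pod="$2"
  run oc adm certificate approve "$csr"
  run oc -n openshift-multus delete pod "$multus_pod"
}
```

A typical sequence would be `make_bootstrap_kubeconfig`, then `recover_node <node>` for each node, then `finish_node <csr-name> <multus-pod-name>` after the node has rebooted and its CSR shows up as pending. Passing the CSR and pod names explicitly, rather than approving everything pending, keeps a human in the loop for each approval.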

            vrutkovs@redhat.com Vadim Rutkovsky
            trking W. Trevor King
            Ke Wang Ke Wang