Bug
Resolution: Done-Errata
Critical
4.15
Critical
Yes
Proposed
False
Bug Fix
Done
Description of problem:
In a cluster updating from 4.5.11 through many intermediate versions to 4.14.17 and on to 4.15.3 (initiated 2024-03-18T07:33:11Z), multus pods are sad about the api-int X.509 certificate:
$ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver/core/events.yaml <hivei01ue1.inspect.local.5020316083985214391.gz | yaml2json | jq -r '[.items[] | select(.reason == "FailedCreatePodSandBox")][0].message'
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-928-ip-10-164-221-242.ec2.internal_openshift-kube-apiserver_9e87f20b-471a-447e-9679-edce26b4ef78_0(8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c): error adding pod openshift-kube-apiserver_installer-928-ip-10-164-221-242.ec2.internal to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c Netns:/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78 Path: StdinData:[REDACTED]} ContainerID:"8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c" Netns:"/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78" Path:"" ERRORED: error configuring pod [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal] networking: Multus: [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal/9e87f20b-471a-447e-9679-edce26b4ef78]: error waiting for pod: Get "https://api-int.REDACTED:6443/api/v1/namespaces/openshift-kube-apiserver/pods/installer-928-ip-10-164-221-242.ec2.internal?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
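For comparison, the same query against a live cluster (rather than the inspect archive) would look roughly like this; it assumes the kubeconfig in use can still reach the API, which is not a given while api-int verification is broken:
$ oc -n openshift-kube-apiserver get events -o json | jq -r '[.items[] | select(.reason == "FailedCreatePodSandBox")][0].message'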
Version-Release number of selected component (if applicable):
4.15.3, so we have 4.15.2's fix for OCPBUGS-30304 but not 4.15.5's fix for OCPBUGS-30237.
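As a sketch of how to confirm which versions a cluster has actually been through, the ClusterVersion update history can be dumped directly (standard object shape, not output captured from these clusters):
$ oc get clusterversion version -o jsonpath='{range .status.history[*]}{.version}{"\t"}{.state}{"\t"}{.startedTime}{"\n"}{end}'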
How reproducible:
Seen in two clusters after updating from 4.14 to 4.15.3.
Steps to Reproduce:
Unclear.
Actual results:
Sad multus pods.
Expected results:
Happy cluster.
Additional info:
$ openssl s_client -showcerts -connect api-int.REDACTED:6443 < /dev/null
...
Certificate chain
 0 s:CN = api-int.REDACTED
   i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Mar 25 19:35:55 2024 GMT; NotAfter: Apr 24 19:35:56 2024 GMT
...
 1 s:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
   i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Mar 18 07:33:47 2024 GMT; NotAfter: Mar 16 07:33:48 2034 GMT
...
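The @1710747228 suffix on the new signer's CN is a Unix timestamp; converting it (GNU date, as a quick sketch) puts the signer's creation within a minute of the update being initiated at 2024-03-18T07:33:11Z:
$ date -u -d @1710747228
Mon Mar 18 07:33:48 UTC 2024
The backing secret's creation time can be checked the same way; the secret name here is inferred from the signer CN above, not confirmed against this cluster:
$ oc -n openshift-kube-apiserver-operator get secret loadbalancer-serving-signer -o jsonpath='{.metadata.creationTimestamp}'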
So the new signer was created seconds after the update was initiated. We have inspect logs for some namespaces, but they don't go back quite that far, because the machine-config roll at the end of the update into 4.15.3 rolled all the pods:
$ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-6cbfdd467c-4ctq7/kube-apiserver-operator/kube-apiserver-operator/logs/current.log <hivei01ue1.inspect.local.5020316083985214391.gz | head -n2
2024-03-18T08:22:05.058253904Z I0318 08:22:05.056255 1 cmd.go:241] Using service-serving-cert provided certificates
2024-03-18T08:22:05.058253904Z I0318 08:22:05.056351 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
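As a sketch of corroborating that roll on a live cluster, pod start times in the operator namespace can be listed directly (the custom-columns fields are the standard pod fields, not output captured from this archive):
$ oc -n openshift-kube-apiserver-operator get pods -o custom-columns=NAME:.metadata.name,STARTED:.status.startTime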
We were able to recover individual nodes via the following steps (a consolidated sketch follows after this list):
- oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig, run from any machine with an admin kubeconfig
- copy the resulting bootstrap.kubeconfig to every node as /etc/kubernetes/kubeconfig
- on each node, rm /var/lib/kubelet/kubeconfig
- restart each node
- approve each kubelet CSR
- delete the node's multus-* pod
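A rough consolidation of those steps, purely as a sketch: it assumes direct SSH access to the nodes as the core user, uses the usual openshift-multus namespace and app=multus pod label rather than anything verified in this cluster, and <recovered-node> is a placeholder for each node as it comes back.
$ oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig
$ for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
      # stage the fresh bootstrap kubeconfig and force the kubelet to re-bootstrap
      scp bootstrap.kubeconfig core@"${node}":/tmp/kubeconfig
      ssh core@"${node}" 'sudo mv /tmp/kubeconfig /etc/kubernetes/kubeconfig && sudo rm -f /var/lib/kubelet/kubeconfig && sudo systemctl reboot'
  done
# once a node is back, approve its pending kubelet CSRs (filtering to CSRs with no status yet) ...
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs -r oc adm certificate approve
# ... and delete that node's multus pod so it picks up the regenerated credentials
$ oc -n openshift-multus delete pods -l app=multus --field-selector spec.nodeName=<recovered-node>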
blocks:
- OCPBUGS-31807 api-int Certificate Authority rotation during 4.14.17 to 4.15.3 update (Closed)
is cloned by:
- OCPBUGS-31807 api-int Certificate Authority rotation during 4.14.17 to 4.15.3 update (Closed)
relates to:
- OCPBUGS-30304 cert-syncer is forcibly changing secret type without retaining content (Closed)
- API-1687 Impact cert issues after 4.14 to 4.15 upgrade (Review)
links to:
- RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update