Type: Story
Priority: Major
Resolution: Unresolved
Description of problem:
A long-lived cluster updating into 4.16.0-ec.1 was bitten by the Engineering Candidate's month-or-more-old api-int CA rotation (see API-1687 for details on the early rotation). After manually updating /var/lib/kubelet/kubeconfig to include the new CA (which OCPBUGS-25821 is working on automating), multus pods still complained about an untrusted api-int:
$ oc -n openshift-multus logs multus-pz7zp | grep api-int | tail -n5
E0119 19:33:52.983918 3194 reflector.go:148] k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dbuild0-gstfj-m-2.c.openshift-ci-build-farm.internal&resourceVersion=4723865081": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-01-19T19:33:55Z [error] Multus: [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5/f79ff01a-71c2-4f02-b48b-8c23c9e875ce]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-machine-api/pods/cluster-autoscaler-default-f8dd547c7-dg9t5?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-01-19T19:33:55Z [verbose] ADD finished CNI request ContainerID:"b554f8edca8ea7672119c1aa71a69e0368fefeb5f8ae2c2659f822b7fa8d3f62" Netns:"/var/run/netns/36923fe0-e28d-422f-8213-233086527baa" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-machine-api;K8S_POD_NAME=cluster-autoscaler-default-f8dd547c7-dg9t5;K8S_POD_INFRA_CONTAINER_ID=b554f8edca8ea7672119c1aa71a69e0368fefeb5f8ae2c2659f822b7fa8d3f62;K8S_POD_UID=f79ff01a-71c2-4f02-b48b-8c23c9e875ce" Path:"", result: "", err: error configuring pod [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5] networking: Multus: [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5/f79ff01a-71c2-4f02-b48b-8c23c9e875ce]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-machine-api/pods/cluster-autoscaler-default-f8dd547c7-dg9t5?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-01-19T19:34:00Z [error] Multus: [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj/769153af-350b-492b-9589-ede2574aea85]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-storage-version-migrator/pods/migrator-558d4d48b9-ggjpj?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-01-19T19:34:00Z [verbose] ADD finished CNI request ContainerID:"cfd0b8ca596411f1e26ae058fc9f015d6edeac407668420c023ff459860423eb" Netns:"/var/run/netns/bc7fbf17-c049-4241-a7dc-7e27acd3c8af" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-storage-version-migrator;K8S_POD_NAME=migrator-558d4d48b9-ggjpj;K8S_POD_INFRA_CONTAINER_ID=cfd0b8ca596411f1e26ae058fc9f015d6edeac407668420c023ff459860423eb;K8S_POD_UID=769153af-350b-492b-9589-ede2574aea85" Path:"", result: "", err: error configuring pod [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj] networking: Multus: [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj/769153af-350b-492b-9589-ede2574aea85]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-storage-version-migrator/pods/migrator-558d4d48b9-ggjpj?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
The multus pod needed a delete/replace, and after that it recovered:
$ oc --as system:admin -n openshift-multus delete pod multus-pz7zp
pod "multus-pz7zp" deleted
$ oc -n openshift-multus get -o wide pods | grep 'NAME\|build0-gstfj-m-2.c.openshift-ci-build-farm.internal'
NAME                                           READY   STATUS    RESTARTS   AGE   IP            NODE                                                  NOMINATED NODE   READINESS GATES
multus-additional-cni-plugins-wrdtt            1/1     Running   1          28h   10.0.0.3      build0-gstfj-m-2.c.openshift-ci-build-farm.internal   <none>           <none>
multus-admission-controller-74d794678b-9s7kl   2/2     Running   0          27h   10.129.0.36   build0-gstfj-m-2.c.openshift-ci-build-farm.internal   <none>           <none>
multus-hxmkz                                   1/1     Running   0          11s   10.0.0.3      build0-gstfj-m-2.c.openshift-ci-build-farm.internal   <none>           <none>
network-metrics-daemon-dczvs                   2/2     Running   2          28h   10.129.0.4    build0-gstfj-m-2.c.openshift-ci-build-farm.internal   <none>           <none>
$ oc -n openshift-multus logs multus-hxmkz | grep -c api-int
0
That need for multus pod deletion should be automated, to reduce the number of manual touches required when the api-int CA rolls.
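To confirm whether a node's kubelet kubeconfig trusts the current api-int serving certificate, something like the following sketch should work (assuming the kubeconfig embeds certificate-authority-data; adjust if it references a CA file instead):

$ grep certificate-authority-data /var/lib/kubelet/kubeconfig | awk '{print $2}' | base64 -d > /tmp/kubelet-ca.crt
$ openssl s_client -connect api-int.build02.gcp.ci.openshift.org:6443 -CAfile /tmp/kubelet-ca.crt </dev/null 2>/dev/null | grep 'Verify return code'

A 'Verify return code: 0 (ok)' means the kubeconfig already trusts the new CA; anything else matches the x509 failure in the multus logs above.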
Version-Release number of selected component:
Seen in 4.16.0-ec.1.
How reproducible:
Several multus pods on this cluster were bitten, but others were not, including some on clusters whose old kubeconfigs did not contain the new CA. I'm not clear on what the trigger is; perhaps some clients escape immediate trouble by holding existing api-int connections established back when the servers used the old CA. But deleting a multus pod on a cluster whose /var/lib/kubelet/kubeconfig has not yet been updated will likely reproduce the breakage, at least until OCPBUGS-25821 is fixed.
Steps to Reproduce:
Not entirely clear, but something like (a condensed command sketch follows the list):
- Install 4.16.0-ec.1.
- Wait a month or more for the Kube API server operator to decide to roll the CA signing api-int.
- Delete a multus pod, so the replacement comes up broken on api-int trust.
- Manually update /var/lib/kubelet/kubeconfig.
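Condensed from the commands shown in the description, the pod-deletion and verification steps look something like this (the node and pod names are placeholders):

$ oc -n openshift-multus get pods -o wide | grep <node-name>
$ oc --as system:admin -n openshift-multus delete pod <multus-pod>
$ oc -n openshift-multus logs <replacement-multus-pod> | grep -c api-int

A nonzero grep count from the replacement pod means it came up with the stale CA and cannot trust api-int.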
Actual results:
Multus still fails to trust api-int until the broken pod is deleted, or the container otherwise restarts and notices the updated kubeconfig.
Expected results:
The multus pod automatically picks up the updated kubeconfig, with no manual pod deletion required.
Additional info:
One possible implementation would be a liveness probe failing on api-int trust issues, triggering the kubelet to roll the multus container, and the replacement multus container to come up and load the fresh kubeconfig.
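As a sketch of what such a probe command could run inside the multus container (the CA-snapshot path and the probe wiring are assumptions for illustration, not multus's actual layout):

#!/bin/sh
# Hypothetical liveness check: fail when the CA bundle this container loaded
# at startup no longer verifies the api-int serving certificate, so the
# kubelet restarts the container and the replacement re-reads the rotated
# kubeconfig. The api-int URL is from this report; the CA path is assumed.
APISERVER="https://api-int.build02.gcp.ci.openshift.org:6443"
CA="/run/multus/startup-ca.crt"  # snapshot of the CA taken at container start
# curl exits nonzero on TLS verification failure, which the kubelet treats as
# a failed probe; /healthz is typically reachable without authentication.
exec curl --silent --show-error --cacert "$CA" "${APISERVER}/healthz" >/dev/null

Wired up as an exec livenessProbe on the multus container, a few consecutive failures would trigger the same delete/replace cycle that was performed manually above, without human intervention.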
clones:
- OCPBUGS-27429: Handle kubeconfig changes like CA rotation (Closed)

relates to:
- OCPBUGS-28742: ovnkube-node doesn't refresh certificates after node was suspended for 30 days (Closed)