Bug
Resolution: Unresolved
Undefined
4.18.0
Important
No
5
Rejected
False
Release Note Not Required
In Progress
deads reported in this thread that the static pod controller appears to sometimes roll out revisions whose pods do not show up in a reasonable timeframe, which occasionally causes this test to fail (source job):
[sig-node] static pods should start after being created { static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 6 on node: "ip-10-0-56-130.us-east-2.compute.internal" didn't show up, waited: 3m30s}
David suspects that this actually happens far more often than the test failures indicate; however, this test should be a good resource for finding affected runs.
Test details indicate this fails up to 10% of the time on some job variants. The most commonly affected component appears to be kube-controller-manager, but apiserver and etcd both appear at times. Use the test details link to find more job runs.
The Slack thread has more details from both deads@redhat.com and tjungblu@redhat.com.
The suspicion is that fixing this could improve install times and reliability.
–
jpoulin's notes: this issue is seen much more prevalently in the single-node jobs, for some reason. When combined with the cloned bug related to etcd, this is seen in close to 30% of single-node upgrade runs.
Digging deeper into the logs, I see:
I1018 19:11:06.671321 1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"8e24d779-45f6-467c-b730-6925dc516f21", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready" to "RevisionControllerDegraded: Internal error occurred: resource quota evaluation timed out\nMissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-controller-manager\" in namespace: \"openshift-kube-controller-manager\" for revision: 6 on node: \"ip-10-0-56-130.us-east-2.compute.internal\" didn't show up, waited: 3m30s\nNodeControllerDegraded: All master nodes are ready\nInstallerControllerDegraded: missing required resources: secrets: localhost-recovery-client-token-7",Progressing message changed from "NodeInstallerProgressing: 1 node is at revision 5; 0 nodes have achieved new revision 6" to "NodeInstallerProgressing: 1 node is at revision 5; 0 nodes have achieved new revision 7",Available message changed from "StaticPodsAvailable: 1 nodes are active; 1 node is at revision 5; 0 nodes have achieved new revision 6" to "StaticPodsAvailable: 1 nodes are active; 1 node is at revision 5; 0 nodes have achieved new revision 7"
I believe this line could be the cause of such a delay:
missing required resources: secrets: localhost-recovery-client-token-7"
Is this secret needed for the deployment of the static pod? If so, would it be possible to check if it is available before scheduling the pod?
I tried scanning openshift/kubernetes for localhost-recovery-client, since I expected it to be part of the manifest, but I found myself out of my depth: I don't know whether this is even the true source of the relevant manifest.
relates to: OCPBUGS-36867 Static pod controller pods sometimes fail to start [etcd] - ASSIGNED
links to: RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update