OCPBUGS-43631

Static pod controller pods sometimes fail to start [kube-controller-manager]

      deads reported in this thread that the static pod controller appears to sometimes deploy pods that do not show up in a reasonable timeframe, which occasionally triggers this test to fail (source job):

      [sig-node] static pods should start after being created 
      
      {  static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 6 on node: "ip-10-0-56-130.us-east-2.compute.internal" didn't show up, waited: 3m30s}
      
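      For context, the check that times out here boils down to waiting for the operand's mirror pod to show up with the expected revision. Below is a minimal client-go sketch of that wait; it is not the actual controller or test code. The pod name pattern and the "revision" label key are assumptions, and the 3m30s timeout is taken from the failure message above.

      // Minimal sketch, assuming the mirror pod carries its target revision as a "revision" label.
      package staticpodwait

      import (
          "context"
          "time"

          apierrors "k8s.io/apimachinery/pkg/api/errors"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/util/wait"
          "k8s.io/client-go/kubernetes"
      )

      // waitForStaticPodRevision polls until the static (mirror) pod appears with the expected
      // revision label, or gives up after 3m30s -- the same wait reported in the failure message.
      func waitForStaticPodRevision(ctx context.Context, client kubernetes.Interface, ns, podName, revision string) error {
          return wait.PollUntilContextTimeout(ctx, 5*time.Second, 3*time.Minute+30*time.Second, true,
              func(ctx context.Context) (bool, error) {
                  pod, err := client.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
                  if apierrors.IsNotFound(err) {
                      return false, nil // mirror pod not created yet; keep waiting
                  }
                  if err != nil {
                      return false, err
                  }
                  // Assumed label key: the rendered static pod manifest labels the pod with its revision.
                  return pod.Labels["revision"] == revision, nil
              })
      }

      For the failing run above, that would amount to namespace "openshift-kube-controller-manager", a pod named for the node (assumed naming), and revision "6".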

      David suspects that this actually happens far more often than the test failures indicate; however, this test should be a useful resource for finding affected runs.

      The test details page indicates this fails up to 10% of the time on some job variants. The most commonly affected component appears to be kube-controller-manager, but apiserver and etcd also show up at times. Use the test details link to find more job runs.

      Slack thread has more details from both deads@redhat.com and tjungblu@redhat.com.

      The suspicion is that fixing this could improve install times and reliability.

      jpoulin's notes: For some reason, this issue is seen much more prevalently in the single-node jobs. When combined with the cloned bug related to etcd, it is seen in close to 30% of single-node upgrade runs.

      Digging deeper into the logs, I see:

      I1018 19:11:06.671321       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"8e24d779-45f6-467c-b730-6925dc516f21", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready" to "RevisionControllerDegraded: Internal error occurred: resource quota evaluation timed out\nMissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-controller-manager\" in namespace: \"openshift-kube-controller-manager\" for revision: 6 on node: \"ip-10-0-56-130.us-east-2.compute.internal\" didn't show up, waited: 3m30s\nNodeControllerDegraded: All master nodes are ready\nInstallerControllerDegraded: missing required resources: secrets: localhost-recovery-client-token-7",Progressing message changed from "NodeInstallerProgressing: 1 node is at revision 5; 0 nodes have achieved new revision 6" to "NodeInstallerProgressing: 1 node is at revision 5; 0 nodes have achieved new revision 7",Available message changed from "StaticPodsAvailable: 1 nodes are active; 1 node is at revision 5; 0 nodes have achieved new revision 6" to "StaticPodsAvailable: 1 nodes are active; 1 node is at revision 5; 0 nodes have achieved new revision 7"
      

      I believe this line could be the cause of such a delay:

      missing required resources: secrets: localhost-recovery-client-token-7

      Is this secret needed for the deployment of the static pod? If so, would it be possible to check if it is available before scheduling the pod?
      I tried scanning openshift/kubernetes for localhost-recovery-client, since I expected it to be part of the manifest, but I quickly found myself out of my depth: I don't know whether that is even the true source of the relevant manifest.
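
      To make that suggestion concrete, here is a hedged sketch of the kind of precondition check the question implies: confirm the revisioned secret exists before creating the installer pod for that revision. This is not the operator's actual code; the namespace and secret name come from the log line above, and the function and its placement are hypothetical.

      package installerprecheck

      import (
          "context"
          "fmt"

          apierrors "k8s.io/apimachinery/pkg/api/errors"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
      )

      // requiredSecretsExist returns an error naming any secret that is not yet present,
      // so a controller could requeue instead of launching an installer pod that will fail.
      func requiredSecretsExist(ctx context.Context, client kubernetes.Interface, ns string, names ...string) error {
          var missing []string
          for _, name := range names {
              _, err := client.CoreV1().Secrets(ns).Get(ctx, name, metav1.GetOptions{})
              if apierrors.IsNotFound(err) {
                  missing = append(missing, name)
                  continue
              }
              if err != nil {
                  return err
              }
          }
          if len(missing) > 0 {
              return fmt.Errorf("missing required resources: secrets: %v", missing)
          }
          return nil
      }

      // Hypothetical usage for the failing revision above:
      //   err := requiredSecretsExist(ctx, client, "openshift-kube-controller-manager",
      //       "localhost-recovery-client-token-7")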

            fkrepins@redhat.com Filip Krepinsky
            jpoulin Jeremy Poulin
            ying zhou ying zhou