OpenShift Bugs / OCPBUGS-38662

clusteroperator/kube-controller-manager blips Degraded=True during upgrade test


    • Bug
    • Resolution: Unresolved
    • Normal
    • 4.18
    • Quality / Stability / Reliability

      Description of problem:

          In an effort to ensure that HA components are not degraded by design during normal e2e test runs or upgrades, we are collecting all operators that blip Degraded=True during any payload job run.
      
      This card captures the kube-controller-manager operator, which blips Degraded=True during upgrade runs.
      
      Example Job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-azure-ovn-upgrade/1843275894844559360
      
      Reasons associated with the blip: NodeController_MasterNodesReady and NodeController_MasterNodesReady::StaticPods_Error 
      
      For now, we have put an exception in the test, but teams are expected to take action to fix these blips and remove the exceptions after the fixes go in.
      
      See the linked issue for more explanation of the effort.
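
      For reference, the condition that blips can be read straight off the ClusterOperator resource. The following is a minimal sketch (not part of the test suite) that prints the current Degraded condition and its reason for kube-controller-manager; it assumes the openshift/client-go config clientset and a KUBECONFIG environment variable pointing at the cluster:

      package main

      import (
          "context"
          "fmt"
          "os"

          configv1 "github.com/openshift/api/config/v1"
          configclient "github.com/openshift/client-go/config/clientset/versioned"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Build a REST config from the kubeconfig named by $KUBECONFIG (assumed for illustration).
          cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
          if err != nil {
              panic(err)
          }
          client := configclient.NewForConfigOrDie(cfg)

          co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "kube-controller-manager", metav1.GetOptions{})
          if err != nil {
              panic(err)
          }

          // A "blip" is this condition briefly reporting True with a reason such as
          // NodeController_MasterNodesReady or StaticPods_Error.
          for _, cond := range co.Status.Conditions {
              if cond.Type == configv1.OperatorDegraded {
                  fmt.Printf("Degraded=%s reason=%s since=%s\n%s\n",
                      cond.Status, cond.Reason, cond.LastTransitionTime.Time, cond.Message)
              }
          }
      }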

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

          

      Expected results:

          

      Additional info:

      Found a new reason, job example:

      [Monitor:legacy-cvo-invariants][bz-kube-controller-manager] clusteroperator/kube-controller-manager should not change condition/Degraded (2h17m38s)
      {  2 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:
      
      Oct 20 16:02:29.174 E clusteroperator/kube-controller-manager condition/Degraded reason/StaticPods_Error status/True StaticPodsDegraded: pod/kube-controller-manager-ip-10-0-106-117.us-east-2.compute.internal container "cluster-policy-controller" is waiting: ContainerCreating: \nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-106-117.us-east-2.compute.internal container "kube-controller-manager" is waiting: ContainerCreating: \nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-106-117.us-east-2.compute.internal container "kube-controller-manager-cert-syncer" is waiting: ContainerCreating: \nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-106-117.us-east-2.compute.internal container "kube-controller-manager-recovery-controller" is waiting: ContainerCreating:
      Oct 20 16:02:29.174 - 93s   E clusteroperator/kube-controller-manager condition/Degraded reason/StaticPods_Error status/True StaticPodsDegraded: pod/kube-controller-manager-ip-10-0-106-117.us-east-2.compute.internal container "cluster-policy-controller" is waiting: ContainerCreating: \nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-106-117.us-east-2.compute.internal container "kube-controller-manager" is waiting: ContainerCreating: \nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-106-117.us-east-2.compute.internal container "kube-controller-manager-cert-syncer" is waiting: ContainerCreating: \nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-106-117.us-east-2.compute.internal container "kube-controller-manager-recovery-controller" is waiting: ContainerCreating:
      
      10 unwelcome but acceptable clusteroperator state transitions during e2e test run.  These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:
      
      Oct 20 16:04:02.860 W clusteroperator/kube-controller-manager condition/Degraded reason/AsExpected status/False StaticPodsDegraded: pod/kube-controller-manager-ip-10-0-106-117.us-east-2.compute.internal container "kube-controller-manager" started at 2025-10-20 15:59:24 +0000 UTC is still not ready\nNodeControllerDegraded: All master nodes are ready (exception: Degraded=False is the happy case)
      Oct 20 16:40:25.962 E clusteroperator/kube-controller-manager condition/Degraded reason/NodeController_MasterNodesReady status/True NodeControllerDegraded: The master nodes not ready: node "ip-10-0-106-117.us-east-2.compute.internal" not ready since 2025-10-20 16:40:04 +0000 UTC because KubeletNotReady (container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: no CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?) (exception: https://issues.redhat.com/browse/OCPBUGS-38662)
      Oct 20 16:40:25.962 - 4s    E clusteroperator/kube-controller-manager condition/Degraded reason/NodeController_MasterNodesReady status/True NodeControllerDegraded: The master nodes not ready: node "ip-10-0-106-117.us-east-2.compute.internal" not ready since 2025-10-20 16:40:04 +0000 UTC because KubeletNotReady (container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: no CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?) (exception: https://issues.redhat.com/browse/OCPBUGS-38662)
      Oct 20 16:40:30.645 W clusteroperator/kube-controller-manager condition/Degraded reason/AsExpected status/False NodeControllerDegraded: All master nodes are ready (exception: Degraded=False is the happy case)
      Oct 20 16:47:52.366 E clusteroperator/kube-controller-manager condition/Degraded reason/NodeController_MasterNodesReady status/True NodeControllerDegraded: The master nodes not ready: node "ip-10-0-81-174.us-east-2.compute.internal" not ready since 2025-10-20 16:47:31 +0000 UTC because KubeletNotReady (container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: no CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?) (exception: https://issues.redhat.com/browse/OCPBUGS-38662)
      Oct 20 16:47:52.366 - 5s    E clusteroperator/kube-controller-manager condition/Degraded reason/NodeController_MasterNodesReady status/True NodeControllerDegraded: The master nodes not ready: node "ip-10-0-81-174.us-east-2.compute.internal" not ready since 2025-10-20 16:47:31 +0000 UTC because KubeletNotReady (container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: no CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?) (exception: https://issues.redhat.com/browse/OCPBUGS-38662)
      Oct 20 16:47:57.539 W clusteroperator/kube-controller-manager condition/Degraded reason/AsExpected status/False NodeControllerDegraded: All master nodes are ready (exception: Degraded=False is the happy case)
      Oct 20 17:11:11.914 E clusteroperator/kube-controller-manager condition/Degraded reason/NodeController_MasterNodesReady status/True NodeControllerDegraded: The master nodes not ready: node "ip-10-0-36-166.us-east-2.compute.internal" not ready since 2025-10-20 17:10:56 +0000 UTC because KubeletNotReady (container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: no CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?) (exception: https://issues.redhat.com/browse/OCPBUGS-38662)
      Oct 20 17:11:11.914 - 10s   E clusteroperator/kube-controller-manager condition/Degraded reason/NodeController_MasterNodesReady status/True NodeControllerDegraded: The master nodes not ready: node "ip-10-0-36-166.us-east-2.compute.internal" not ready since 2025-10-20 17:10:56 +0000 UTC because KubeletNotReady (container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: no CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?) (exception: https://issues.redhat.com/browse/OCPBUGS-38662)
      Oct 20 17:11:22.264 W clusteroperator/kube-controller-manager condition/Degraded reason/AsExpected status/False NodeControllerDegraded: All master nodes are ready (exception: Degraded=False is the happy case)
      }
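
      The exception referenced in the entries above is keyed to this bug and to the Degraded reasons listed in the description. The following is only a hypothetical illustration of what that exception amounts to, not the actual matcher in the openshift/origin monitor tests:

      package exceptions

      // conditionBlip is a simplified view of one clusteroperator state transition
      // as reported by the monitor (the names here are illustrative, not the real types).
      type conditionBlip struct {
          Operator  string // e.g. "kube-controller-manager"
          Condition string // e.g. "Degraded"
          Status    string // e.g. "True"
          Reason    string // e.g. "NodeController_MasterNodesReady"
      }

      // allowedUntilFixed reports whether a blip is tolerated under the
      // OCPBUGS-38662 exception instead of failing the test-case.
      func allowedUntilFixed(b conditionBlip) bool {
          if b.Operator != "kube-controller-manager" || b.Condition != "Degraded" || b.Status != "True" {
              return false
          }
          switch b.Reason {
          case "NodeController_MasterNodesReady",
              "NodeController_MasterNodesReady::StaticPods_Error":
              return true
          }
          return false
      }

      The StaticPods_Error transitions quoted at the top of this section fall outside that reason set, which is why they are reported as unexpected rather than covered by the exception.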

              aos-workloads-staff Workloads Team Bot Account
              kenzhang@redhat.com Ken Zhang
              Votes: 0
              Watchers: 6
