Type: Bug
Resolution: Not a Bug
Priority: Major
Fix Version/s: None
Affects Version/s: 4.16, 4.18
Component/s: Machine Config Operator
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
Rejected
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

In a cpou upgrade scenario (from 4.16 to 4.18) with a paused infra mcp, the mco becomes degraded because it expects the rendered infra mc to have been generated by the 4.17 controller version.

cpou upgrade

Version-Release number of selected component (if applicable):

4.16 to 4.18 cpou upgrade

How reproducible:

Every time

Steps to Reproduce:

    1. Install a 4.16 cluster (in my test it is Azure with IPSEC)
    2. Create infra machinesets with 3 infra nodes and move some infra components (monitoring, ingress, registry) to the infra nodes
    3. Do cpou upgrade from 4.16 to 4.18
      3.1 Pause the worker and infra mcp
      3.2 Start the upgrade to 4.17
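
Step 3.1 can be sketched as follows. This is a minimal illustration, not the exact commands used in the failed test job: the helper name `pause_pool_command` is hypothetical, and it assumes pools are paused by setting `spec.paused` with an `oc patch` merge patch. The function only builds the command line; it does not contact a cluster.

```python
import json
import shlex


def pause_pool_command(pool, paused=True):
    """Build an `oc patch` argv that sets spec.paused on a MachineConfigPool.

    Hypothetical helper for illustration; it returns the command instead of
    running it, so it can be inspected without cluster access.
    """
    patch = json.dumps({"spec": {"paused": paused}})
    return [
        "oc", "patch", f"machineconfigpool/{pool}",
        "--type", "merge",
        "--patch", patch,
    ]


# Pools paused in step 3.1 of this report.
for pool_name in ("worker", "infra"):
    print(shlex.join(pause_pool_command(pool_name)))
```

Calling `pause_pool_command("infra", paused=False)` builds the matching unpause command.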

Actual results:

    The upgrade to 4.17 failed because the mco is degraded.

Expected results:

    The master nodes are upgraded to 4.17. The worker and infra nodes should stay with 4.16 because they are paused.

Additional info:

    In a test without infra mcp, the cpou upgrade works well.
    OTA functional qe has a test case with a custom mcp that has the worker label; in that case the cpou upgrade works well.
    In my failed test, the infra mcp does not have the worker label; it only has an infra label.

Failed test job

oc adm upgrade status showed that one operator is degraded

= Control Plane =
Assessment:      Stalled
Target Version:  4.17.0-0.nightly-2024-11-21-052346 (from 4.16.23)
Completion:      97% (32 operators updated, 1 updating, 0 waiting)
Duration:        4h4m (Est. Time Remaining: N/A; estimate duration was 1h35m)
Operator Status: 32 Healthy, 1 Available but degraded

The degraded operator is mco. It is stuck because it expects the 4.17 version of the infra mc. However, the infra mcp is paused and therefore will not upgrade to 4.17.

= Update Health =
Message: Cluster Operator machine-config is degraded (RequiredPoolsFailed)
  Since:       58m9s
  Level:       Error
  Impact:      API Availability
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusteroperators.config.openshift.io: machine-config
  Description: Unable to apply 4.17.0-0.nightly-2024-11-21-052346: error during syncRequiredMachineConfigPools: [context deadline exceeded, MachineConfigPool infra has not progressed to latest configuration: controller version mismatch for rendered-infra-6c171d9d397c09f3d4b0b81d46df2c05 expected 39e1cd3c3b04229c48988be1fb7f99b95856aff3 has 4bb3364914c4dbcdfcc08b0914f402cdd38f014f: <unknown>, retrying]
Message: Cluster Version version is failing to proceed with the update (ClusterOperatorDegraded)
  Since:       3m58s
  Level:       Warning
  Impact:      Update Stalled
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusterversions.config.openshift.io: version
  Description: Cluster operator machine-config is degraded
Message: Outdated nodes in a paused pool 'infra' will not be updated
  Since:       -
  Level:       Warning
  Impact:      Update Stalled
  Reference:   https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-operator-issues.html#troubleshooting-disabling-autoreboot-mco_troubleshooting-operator-issues
  Resources:
    machineconfigpools.machineconfiguration.openshift.io: infra
  Description: Pool is paused, which stops all changes to the nodes in the pool, including updates. The nodes will not be updated until the pool is unpaused by the administrator.
Message: Outdated nodes in a paused pool 'worker' will not be updated
  Since:       -
  Level:       Warning
  Impact:      Update Stalled
  Reference:   https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-operator-issues.html#troubleshooting-disabling-autoreboot-mco_troubleshooting-operator-issues
  Resources:
    machineconfigpools.machineconfiguration.openshift.io: worker
  Description: Pool is paused, which stops all changes to the nodes in the pool, including updates. The nodes will not be updated until the pool is unpaused by the administrator.
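
The key detail in the output above is the `controller version mismatch` fragment of the machine-config Description. As a rough sketch, the pool name, rendered config name, and both controller hashes can be pulled out of that message with a regular expression. The pattern is written against the exact wording quoted in this report and is an assumption; other MCO versions may phrase the message differently.

```python
import re

# Pattern for the mismatch fragment as it appears in this report.
# Assumption: the wording is stable enough to match literally.
MISMATCH_RE = re.compile(
    r"MachineConfigPool (?P<pool>\S+) has not progressed to latest configuration: "
    r"controller version mismatch for (?P<rendered>\S+) "
    r"expected (?P<expected>[0-9a-f]{40}) has (?P<actual>[0-9a-f]{40})"
)


def parse_mismatch(description):
    """Return pool, rendered config and both controller hashes, or None."""
    match = MISMATCH_RE.search(description)
    return match.groupdict() if match else None


# Description text copied from the `oc adm upgrade status` output above.
description = (
    "Unable to apply 4.17.0-0.nightly-2024-11-21-052346: error during "
    "syncRequiredMachineConfigPools: [context deadline exceeded, "
    "MachineConfigPool infra has not progressed to latest configuration: "
    "controller version mismatch for rendered-infra-6c171d9d397c09f3d4b0b81d46df2c05 "
    "expected 39e1cd3c3b04229c48988be1fb7f99b95856aff3 "
    "has 4bb3364914c4dbcdfcc08b0914f402cdd38f014f: <unknown>, retrying]"
)

info = parse_mismatch(description)
for key, value in info.items():
    print(f"{key}: {value}")
```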

The rendered infra mc (name, generating controller version, Ignition version, age):

rendered-infra-37d5ea50ae2274a6829c836c74ef0ca7    39e1cd3c3b04229c48988be1fb7f99b95856aff3   3.4.0             3h15m
rendered-infra-6c171d9d397c09f3d4b0b81d46df2c05    4bb3364914c4dbcdfcc08b0914f402cdd38f014f   3.4.0             5h2m
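
To make the comparison explicit, the sketch below checks each listed rendered config against the controller hash that the mco error message expects. It assumes the first two columns are the config name and the generating controller hash; the helper name `controller_by_config` is hypothetical.

```python
# Rows copied from the listing above. Assumption: column 1 is the config
# name and column 2 is the hash of the controller that generated it.
MC_LISTING = """\
rendered-infra-37d5ea50ae2274a6829c836c74ef0ca7    39e1cd3c3b04229c48988be1fb7f99b95856aff3   3.4.0             3h15m
rendered-infra-6c171d9d397c09f3d4b0b81d46df2c05    4bb3364914c4dbcdfcc08b0914f402cdd38f014f   3.4.0             5h2m
"""

# Controller hash the mco error message says it expects.
EXPECTED = "39e1cd3c3b04229c48988be1fb7f99b95856aff3"


def controller_by_config(listing):
    """Map each rendered config name to its generating controller hash."""
    result = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            result[fields[0]] = fields[1]
    return result


controllers = controller_by_config(MC_LISTING)
matching = [name for name, ctrl in controllers.items() if ctrl == EXPECTED]
stale = [name for name, ctrl in controllers.items() if ctrl != EXPECTED]

print("generated by expected controller:", matching)
print("generated by another controller:", stale)
```

In this data, the config named in the error message (`rendered-infra-6c171d9d397c09f3d4b0b81d46df2c05`) is the one generated by the other controller hash, while the more recently created config carries the expected hash.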

A possible workaround for customers who need to do the cpou upgrade with an infra mcp:

I ran a test with only the worker mcp paused and the infra mcp NOT paused. The infra mcp was upgraded together with the master mcp, and the cpou upgrade job finally succeeded.

is blocked by

MCO-1459 Impact statement request for OCPBUGS-45045 cpou upgrade with infra mcp paused failed as mco expects a newer version of infra mc

Closed

Assignee:: Yu Qi Zhang

Reporter:: Qiujie Li

Need Info From:: None

Contributors:: None

QA Contact:: Sergio Regidor de la Rosa

Doc Contact:: None

Votes:: 0

Watchers:: 7

Created:: 2024/11/26 10:29 AM

Updated:: 2025/07/18 1:31 PM

Resolved:: 2024/12/10 9:59 PM
