Loading...

XML

Word

Printable

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.17.z
Affects Version/s: 4.13
Component/s: Machine Config Operator
Labels:

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Story Points:
None
Severity:
Important
Regression:
None

Target Backport Versions:
None
Target Version:

4.17.0
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

This is a clone of issue ~~OCPBUGS-37534~~. The following is the description of the original issue:
—
Description of problem:

Prow jobs upgrading from 4.9 to 4.16 are failing when they upgrade from 4.12 to 4.13.

Nodes become NotReady when MCO tries to apply the new 4.13 configuration to the MCPs.

The failing job is: periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.9-azure-ipi-f28

We have reproduced the issue and we found an ordering cycle error in the journal log

Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 systemd-journald.service[838]: Runtime Journal (/run/log/journal/960b04f10e4f44d98453ce5faae27e84) is 8.0M, max 641.9M, 633.9M free.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found ordering cycle on network-online.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on node-valid-hostname.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on ovs-configuration.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on firstboot-osupdate.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-firstboot.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Job network-online.target/start deleted to break ordering cycle starting with machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: Queued start job for default target Graphical Interface.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: (This warning is only shown for the first unit using IP firewalling.)
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: Deactivated successfully.

Version-Release number of selected component (if applicable):

    Using IPI on Azure, these are the version involved in the current issue upgrading from 4.9 to 4.13:
    
      version: 4.13.0-0.nightly-2024-07-23-154444
      version: 4.12.0-0.nightly-2024-07-23-230744
      version: 4.11.59
      version: 4.10.67
      version: 4.9.59

How reproducible:

    Always

Steps to Reproduce:

    1. Upgrade an IPI on Azure cluster from 4.9 to 4.13. Theoretically, upgrading from 4.12 to 4.13 should be enough, but we reproduced it following the whole path.

Actual results:


    Nodes become not ready
$ oc get nodes
NAME                                                 STATUS                        ROLES    AGE     VERSION
ci-op-g94jvswm-cc71e-998q8-master-0                  Ready                         master   6h14m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-1                  Ready                         master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-2                  NotReady,SchedulingDisabled   master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus1-c7ngb   NotReady,SchedulingDisabled   worker   6h2m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus2-2ppf6   Ready                         worker   6h4m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus3-nqshj   Ready                         worker   6h6m    v1.25.16+306a47e

And in the NotReady nodes we can see the ordering cycle error mentioned in the description of this ticket.

Expected results:

No ordering cycle error should happen and the upgrade should be executed without problems.

Additional info:

clones

OCPBUGS-37534 4.12 -> 4.13 upgrade using IPI on Azure does not work

Closed

is blocked by

OCPBUGS-37534 4.12 -> 4.13 upgrade using IPI on Azure does not work

Closed

links to

openshift/machine-config-operator#4529: [release-4.17] OCPBUGS-38374: Revert "MCD-pull: run after network-online.target in Azure"

RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update

Assignee:: Team MCO

Reporter:: OpenShift Prow Bot

QA Contact:: Sergio Regidor de la Rosa

Need Info From:: None

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Created:: 2024/08/12 6:43 PM

Updated:: 2025/07/21 11:41 PM

Resolved:: 2024/10/01 5:41 PM

Details

Description

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates