Bug
Resolution: Done
Affects version: 4.14
Quality / Stability / Reliability
Severity: Moderate
Sprints: MCO Sprint 242, MCO Sprint 243, MCO Sprint 244
Description of problem:
In pools with On-Cluster Build (OCB) enabled, when a config drift happens because a file's content has been manually changed, the MCP becomes degraded (this is expected):
- lastTransitionTime: "2023-08-31T11:34:33Z"
  message: 'Node sregidor-sr2-2gb5z-worker-a-7tpjd.c.openshift-qe.internal is reporting:
    "unexpected on-disk state validating against quay.io/xxx/xxx@sha256:........................:
    content mismatch for file \"/etc/mco-test-file\""'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
If we fix the drift by restoring the file's original content, the MCP remains degraded, now with this message:
- lastTransitionTime: "2023-08-31T12:24:47Z"
  message: 'Node sregidor-sr2-2gb5z-worker-a-q7wcb.c.openshift-qe.internal is
    reporting: "failed to update OS to quay.io/xxx/xxx@sha256:.......
    : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/xxx/xxx@sha256:........:
    error: Old and new refs are equal: ostree-unverified-registry:quay.io/xxx/xxx@sha256:..............\n:
    exit status 1"'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
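For diagnosis, the underlying rpm-ostree error can also be read straight from the machine-config-daemon logs on the affected node (a sketch; the label selector and grep pattern are just one way to find it):
$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -o wide
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod> -c machine-config-daemon | grep -iE 'rebase|content mismatch'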
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-08-30-191617   True        False         4h18m   Error while reconciling 4.14.0-0.nightly-2023-08-30-191617: the cluster operator monitoring is not available
How reproducible:
Always
Steps to Reproduce:
1. Enable the OCB functionality for the worker pool:
$ oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=
(Create the necessary ConfigMaps and secrets for the OCB functionality to work.)
Wait until the new image is built and the nodes are updated (one way to wait is sketched below).
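For example, the rollout can be watched through the pool's Updated condition (a sketch; the timeout is arbitrary):
$ oc wait mcp/worker --for=condition=Updated=True --timeout=30m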
2. Create an MC to deploy a new file:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: mco-drift-test-file
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,MCO%20test%20file%0A
        path: /etc/mco-test-file
Wait until the new MC is deployed (see the example below).
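For example, assuming the manifest above is saved as mco-drift-test-file.yaml (the file name is illustrative), the MC can be applied and the rollout verified like this:
$ oc apply -f mco-drift-test-file.yaml
$ oc wait mcp/worker --for=condition=Updated=True --timeout=20m
$ oc debug node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat /etc/mco-test-file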
3. Modify the content of the file /etc/mco-test-file, making a backup first:
$ oc debug node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}")
Warning: metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
Starting pod/sregidor-sr2-2gb5z-worker-a-q7wcbcopenshift-qeinternal-debug-sv85v ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.9
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# cd /etc
sh-5.1# cat mco-test-file
MCO test file
sh-5.1# cp mco-test-file mco-test-file-back
sh-5.1# echo -n "1" >> mco-test-file
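The same modification can also be done non-interactively (a sketch; <worker-node> stands for the node name used above):
$ oc debug node/<worker-node> -- chroot /host /bin/sh -c 'cp /etc/mco-test-file /etc/mco-test-file-back && echo -n "1" >> /etc/mco-test-file'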
4. Wait until the MCP reports the config drift issue:
$ oc get mcp worker -o yaml
....
- lastTransitionTime: "2023-08-31T11:34:33Z"
  message: 'Node sregidor-sr2-2gb5z-worker-a-7tpjd.c.openshift-qe.internal is reporting:
    "unexpected on-disk state validating against quay.io/xxx/xxx@sha256:........................:
    content mismatch for file \"/etc/mco-test-file\""'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
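The drift is also visible in the sync annotations that the machine-config-daemon sets on the node (a sketch; <worker-node> is a placeholder):
$ oc get node <worker-node> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}'
$ oc get node <worker-node> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/reason}{"\n"}'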
5. Restore the backup that we made in step 3:
sh-5.1# cp mco-test-file-back mco-test-file
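A non-interactive equivalent, plus a wait for the pool to recover, would look like this (a sketch; in this bug the wait never completes because the pool stays degraded):
$ oc debug node/<worker-node> -- chroot /host cp /etc/mco-test-file-back /etc/mco-test-file
$ oc wait mcp/worker --for=condition=NodeDegraded=False --timeout=15m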
Actual results:
The worker pool is degraded with this message:
- lastTransitionTime: "2023-08-31T12:24:47Z"
  message: 'Node sregidor-sr2-2gb5z-worker-a-q7wcb.c.openshift-qe.internal is
    reporting: "failed to update OS to quay.io/xxx/xxx@sha256:.......
    : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/xxx/xxx@sha256:........:
    error: Old and new refs are equal: ostree-unverified-registry:quay.io/xxx/xxx@sha256:..............\n:
    exit status 1"'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
Expected results:
The worker pool should stop being degraded once the file's original content is restored.
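One possible check for the expected state is that the NodeDegraded condition goes back to "False" (a sketch):
$ oc get mcp worker -o jsonpath='{range .status.conditions[?(@.type=="NodeDegraded")]}{.status}{": "}{.message}{"\n"}{end}'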
Additional info:
There is a link to the must-gather file in the first comment of this issue.