- Bug
- Resolution: Done
- Undefined
- 4.14
- Moderate
- No
- MCO Sprint 242, MCO Sprint 243, MCO Sprint 244
- 3
- False
Description of problem:
In pools with On-Cluster Build (OCB) enabled, when a config drift happens because a file's content has been manually changed, the MCP becomes degraded (this is expected):

  - lastTransitionTime: "2023-08-31T11:34:33Z"
    message: 'Node sregidor-sr2-2gb5z-worker-a-7tpjd.c.openshift-qe.internal is reporting:
      "unexpected on-disk state validating against quay.io/xxx/xxx@sha256:........................:
      content mismatch for file \"/etc/mco-test-file\""'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded

If we fix the drift and restore the original file's content, the MCP stays degraded, now with this message:

  - lastTransitionTime: "2023-08-31T12:24:47Z"
    message: 'Node sregidor-sr2-2gb5z-worker-a-q7wcb.c.openshift-qe.internal is reporting:
      "failed to update OS to quay.io/xxx/xxx@sha256:....... : error running rpm-ostree rebase
      --experimental ostree-unverified-registry:quay.io/xxx/xxx@sha256:........: error: Old and new
      refs are equal: ostree-unverified-registry:quay.io/xxx/xxx@sha256:..............\n: exit status 1"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
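For quick inspection, the NodeDegraded condition can be pulled out on its own; this is a sketch added for convenience (the jsonpath expression is mine, not from the original report):

   $ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")]}{"\n"}'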
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-08-30-191617   True        False         4h18m   Error while reconciling 4.14.0-0.nightly-2023-08-30-191617: the cluster operator monitoring is not available
How reproducible:
Always
Steps to Reproduce:
1. Enable the OCB functionality for the worker pool:

   $ oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=

   (Create the necessary ConfigMaps and secrets for the OCB functionality to work.) Wait until the new image is created and the nodes are updated.

2. Create a MachineConfig to deploy a new file:

   apiVersion: machineconfiguration.openshift.io/v1
   kind: MachineConfig
   metadata:
     labels:
       machineconfiguration.openshift.io/role: worker
     name: mco-drift-test-file
   spec:
     config:
       ignition:
         version: 3.2.0
       storage:
         files:
         - contents:
             source: data:,MCO%20test%20file%0A
           path: /etc/mco-test-file

   Wait until the new MC is deployed.

3. Modify the content of the file /etc/mco-test-file, making a backup first:

   $ oc debug node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}")
   Warning: metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
   Starting pod/sregidor-sr2-2gb5z-worker-a-q7wcbcopenshift-qeinternal-debug-sv85v ...
   To use host binaries, run `chroot /host`
   Pod IP: 10.0.128.9
   If you don't see a command prompt, try pressing enter.
   sh-4.4# chroot /host
   sh-5.1# cd /etc
   sh-5.1# cat mco-test-file
   MCO test file
   sh-5.1# cp mco-test-file mco-test-file-back
   sh-5.1# echo -n "1" >> mco-test-file

4. Wait until the MCP reports the config drift issue:

   $ oc get mcp worker -o yaml
   ....
   - lastTransitionTime: "2023-08-31T11:34:33Z"
     message: 'Node sregidor-sr2-2gb5z-worker-a-7tpjd.c.openshift-qe.internal is reporting:
       "unexpected on-disk state validating against quay.io/xxx/xxx@sha256:........................:
       content mismatch for file \"/etc/mco-test-file\""'
     reason: 1 nodes are reporting degraded status on sync
     status: "True"
     type: NodeDegraded

5. Restore the backup made in step 3 (see the diagnostic sketch after these steps):

   sh-5.1# cp mco-test-file-back mco-test-file
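Diagnostic sketch referenced in step 5 (not part of the original reproduction steps; the namespace, label selector, and container name are assumptions based on a standard MCO deployment): after restoring the file, the machine-config-daemon logs on the affected node can be checked for the failing rpm-ostree rebase with something like:

   $ oc logs -n openshift-machine-config-operator \
       $(oc get pods -n openshift-machine-config-operator \
           -l k8s-app=machine-config-daemon \
           --field-selector spec.nodeName=<affected-node> -o name) \
       -c machine-config-daemon | grep -i rpm-ostree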
Actual results:
The worker pool is degraded with this message:

  - lastTransitionTime: "2023-08-31T12:24:47Z"
    message: 'Node sregidor-sr2-2gb5z-worker-a-q7wcb.c.openshift-qe.internal is reporting:
      "failed to update OS to quay.io/xxx/xxx@sha256:....... : error running rpm-ostree rebase
      --experimental ostree-unverified-registry:quay.io/xxx/xxx@sha256:........: error: Old and new
      refs are equal: ostree-unverified-registry:quay.io/xxx/xxx@sha256:..............\n: exit status 1"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
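To confirm the on-node state, the booted deployment can be inspected from a debug shell (a sketch; <affected-node> is a placeholder for the degraded node, and rpm-ostree status is the standard RHCOS command):

   $ oc debug node/<affected-node> -- chroot /host rpm-ostree status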
Expected results:
The worker pool should stop being degraded once the original file content is restored.
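A sketch of how recovery could be verified once the pool stops being degraded (the timeout values are arbitrary, not from the original report):

   $ oc wait mcp/worker --for=condition=NodeDegraded=False --timeout=600s
   $ oc wait mcp/worker --for=condition=Updated=True --timeout=600s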
Additional info:
There is a link to the must-gather file in the first comment of this issue.