Bug
Resolution: Done
Affects version: 4.14
Quality / Stability / Reliability
Severity: Moderate
Sprints: MCO Sprint 242, MCO Sprint 243, MCO Sprint 244
Description of problem:
In pools with On-Cluster Build (OCB) enabled, when a config drift happens because a file's content has been manually changed, the MCP becomes degraded (this is expected):
- lastTransitionTime: "2023-08-31T11:34:33Z"
  message: 'Node sregidor-sr2-2gb5z-worker-a-7tpjd.c.openshift-qe.internal is reporting:
    "unexpected on-disk state validating against quay.io/xxx/xxx@sha256:........................:
    content mismatch for file \"/etc/mco-test-file\""'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
If we fix the drift by restoring the file's original content, the MCP remains degraded, now with this message:
- lastTransitionTime: "2023-08-31T12:24:47Z"
  message: 'Node sregidor-sr2-2gb5z-worker-a-q7wcb.c.openshift-qe.internal is
    reporting: "failed to update OS to quay.io/xxx/xxx@sha256:.......
    : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/xxx/xxx@sha256:........:
    error: Old and new refs are equal: ostree-unverified-registry:quay.io/xxx/xxx@sha256:..............\n:
    exit status 1"'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
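For diagnosis, the underlying rpm-ostree error can also be read straight from the machine-config-daemon logs on the affected node (a sketch; the label selector and grep pattern are just one way to find it):
$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -o wide
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod> -c machine-config-daemon | grep -iE 'rebase|content mismatch'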
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-08-30-191617   True        False         4h18m   Error while reconciling 4.14.0-0.nightly-2023-08-30-191617: the cluster operator monitoring is not available
How reproducible:
Always
Steps to Reproduce:
1. Enable the OCB functionality for the worker pool:
$ oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=
(Create the necessary ConfigMaps and secrets for the OCB functionality to work.)
Wait until the new image is built and the nodes are updated (one way to wait is sketched below).
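For example, the rollout can be watched through the pool's Updated condition (a sketch; the timeout is arbitrary):
$ oc wait mcp/worker --for=condition=Updated=True --timeout=30m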
2. Create an MC to deploy a new file:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: mco-drift-test-file
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,MCO%20test%20file%0A
        path: /etc/mco-test-file
Wait until the new MC is deployed (see the example below).
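For example, assuming the manifest above is saved as mco-drift-test-file.yaml (the file name is illustrative), the MC can be applied and the rollout verified like this:
$ oc apply -f mco-drift-test-file.yaml
$ oc wait mcp/worker --for=condition=Updated=True --timeout=20m
$ oc debug node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat /etc/mco-test-file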
3. Modify the content of the file /etc/mco-test-file, making a backup first:
$ oc debug node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}")
Warning: metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
Starting pod/sregidor-sr2-2gb5z-worker-a-q7wcbcopenshift-qeinternal-debug-sv85v ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.9
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# cd /etc
sh-5.1# cat mco-test-file
MCO test file
sh-5.1# cp mco-test-file mco-test-file-back
sh-5.1# echo -n "1" >> mco-test-file
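The same modification can also be done non-interactively (a sketch; <worker-node> stands for the node name used above):
$ oc debug node/<worker-node> -- chroot /host /bin/sh -c 'cp /etc/mco-test-file /etc/mco-test-file-back && echo -n "1" >> /etc/mco-test-file'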
4. Wait until the MCP reports the config drift issue:
$ oc get mcp worker -o yaml
....
- lastTransitionTime: "2023-08-31T11:34:33Z"
  message: 'Node sregidor-sr2-2gb5z-worker-a-7tpjd.c.openshift-qe.internal is reporting:
    "unexpected on-disk state validating against quay.io/xxx/xxx@sha256:........................:
    content mismatch for file \"/etc/mco-test-file\""'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
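The drift is also visible in the sync annotations that the machine-config-daemon sets on the node (a sketch; <worker-node> is a placeholder):
$ oc get node <worker-node> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}'
$ oc get node <worker-node> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/reason}{"\n"}'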
5. Restore the backup that we made in step 3:
sh-5.1# cp mco-test-file-back mco-test-file
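A non-interactive equivalent, plus a wait for the pool to recover, would look like this (a sketch; in this bug the wait never completes because the pool stays degraded):
$ oc debug node/<worker-node> -- chroot /host cp /etc/mco-test-file-back /etc/mco-test-file
$ oc wait mcp/worker --for=condition=NodeDegraded=False --timeout=15m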
Actual results:
The worker pool is degraded with this message:
- lastTransitionTime: "2023-08-31T12:24:47Z"
  message: 'Node sregidor-sr2-2gb5z-worker-a-q7wcb.c.openshift-qe.internal is
    reporting: "failed to update OS to quay.io/xxx/xxx@sha256:.......
    : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/xxx/xxx@sha256:........:
    error: Old and new refs are equal: ostree-unverified-registry:quay.io/xxx/xxx@sha256:..............\n:
    exit status 1"'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
Expected results:
The worker pool should stop being degraded once the file's original content is restored.
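One possible check for the expected state is that the NodeDegraded condition goes back to "False" (a sketch):
$ oc get mcp worker -o jsonpath='{range .status.conditions[?(@.type=="NodeDegraded")]}{.status}{": "}{.message}{"\n"}{end}'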
Additional info:
There is a link to the must-gather file in the first comment of this issue.