Bug
Resolution: Duplicate
Affects version: 4.14
Quality / Stability / Reliability
Sprint 239 - Update&Remoting
Description of problem:
OCP 4.14 installation did not complete: the nodes with a disk size larger than 3 TB and a RAID-0 configuration failed to deploy.
Version-Release number of selected component (if applicable):
The first time this issue appeared was in this nightly build: OpenShift 4.14 nightly 2023-05-18 17:38. Since then, the issue has appeared every time in the affected cluster. For reference, the latest dates and OCP 4.14 releases where the deployment worked fine in this cluster are:
- Nightlies: OpenShift 4.14 nightly 2023-04-19 12:56 (job launched on 2023/04/26)
- ECs: OpenShift 4.14.0 ec.0 (job launched on 2023/06/04)
How reproducible:
100%
Steps to Reproduce:
1. Deploy the latest OCP 4.14 in a cluster composed of 3 master nodes and 4 worker nodes, using an IPI installation and the Ansible playbooks from baremetal-deployment. Distributed-CI (DCI) was used to automate the whole installation. The nodes have the following storage configuration: the nodes with a 1.7T disk use RAID-1, and the ones with a 3.5T disk use RAID-0 (these are master-0 and worker-1). A condensed disk-size check is sketched after the lsblk output below.
$ for x in {0..2} ; do echo "===== master-$x =====" ; ssh core@master-$x "lsblk | head -2" 2>/dev/null ; done
===== master-0 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 3.5T 0 disk
===== master-1 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 1.7T 0 disk
===== master-2 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 1.7T 0 disk
$ for x in {0..3} ; do echo "===== worker-$x =====" ; ssh core@worker-$x "lsblk | head -2" 2>/dev/null ; done
===== worker-0 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 1.7T 0 disk
===== worker-1 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 3.5T 0 disk
===== worker-2 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 1.7T 0 disk
===== worker-3 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 1.7T 0 disk
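For a quick per-node view of which machines carry the larger disks (and therefore, in this cluster, the RAID-0 layout), the lsblk loops above can be condensed into a single size check. This is only a sketch, assuming the same passwordless core@<node> SSH access used above and that sda is the installation disk on every node:
$ for n in master-{0..2} worker-{0..3} ; do size=$(ssh core@$n "lsblk -bdno SIZE /dev/sda" 2>/dev/null) ; echo "$n: $((size / 1024**4)) TiB (sda)" ; done
Nodes reporting 3 TiB here are the ones that fail to deploy; nodes reporting 1 TiB deploy correctly.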
Actual results:
Installation did not work, failing in the bootstrap. Only the nodes with a 1.7T disk size and RAID-1 were correctly deployed:
$ oc get nodes
NAME       STATUS   ROLES                  AGE    VERSION
master-1   Ready    control-plane,master   3h7m   v1.27.2+cc041e8
master-2   Ready    control-plane,master   3h8m   v1.27.2+cc041e8
worker-0   Ready    worker                 141m   v1.27.2+cc041e8
worker-2   Ready    worker                 141m   v1.27.2+cc041e8
worker-3   Ready    worker                 141m   v1.27.2+cc041e8
If we check the journal of the affected nodes, we can see the OS installation is recurrently failing and not able to finish:
[core@master-0 ~]$ sudo journalctl
...
Jun 06 17:50:00 master-0 machine-config-daemon[46348]: Staging deployment...done
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Created new deployment /ostree/deploy/rhcos/deploy/bff15671999add901e2d03797aedba737b71f825f022d716182512ad7f627132.0
Jun 06 17:50:00 master-0 rpm-ostree[46377]: bwrap: execvp /usr/bin/true: Permission denied
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: Sanity-checking final rootfs: bwrap(/usr/bin/true): Child process killed by signal 1
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Unlocked sysroot
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Process [pid: 46348 uid: 0 unit: machine-config-daemon-firstboot.service] disconnected from transaction progress
Jun 06 17:50:00 master-0 rpm-ostree[46332]: client(id:cli dbus:1.669 unit:machine-config-daemon-firstboot.service uid:0) vanished; remaining=0
Jun 06 17:50:00 master-0 rpm-ostree[46332]: In idle state; will auto-exit in 64 seconds
Jun 06 17:50:00 master-0 machine-config-daemon[5508]: I0606 17:50:00.217896 5508 update.go:2136] Rolling back applied changes to OS due to error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7b0f6a4e : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7b0f6a4e: error: Sanity-checking final rootfs: bwrap(/usr/bin/true): Child process killed by signal 1
Jun 06 17:50:00 master-0 machine-config-daemon[5508]: : exit status 1
Jun 06 17:50:00 master-0 machine-config-daemon[5508]: I0606 17:50:00.217913 5508 update.go:2010] Running: rpm-ostree cleanup -p
Jun 06 17:50:00 master-0 rpm-ostree[46332]: client(id:cli dbus:1.670 unit:machine-config-daemon-firstboot.service uid:0) added; new total=1
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Loaded sysroot
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Locked sysroot
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Initiated txn Cleanup for client(id:cli dbus:1.670 unit:machine-config-daemon-firstboot.service uid:0): /org/projectatomic/rpmostree1/rhcos
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Process [pid: 46378 uid: 0 unit: machine-config-daemon-firstboot.service] connected to transaction progress
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Pruned container image layers: 0
Jun 06 17:50:01 master-0 rpm-ostree[46332]: Txn Cleanup on /org/projectatomic/rpmostree1/rhcos successful
Jun 06 17:50:01 master-0 rpm-ostree[46332]: Unlocked sysroot
Jun 06 17:50:01 master-0 rpm-ostree[46332]: Process [pid: 46378 uid: 0 unit: machine-config-daemon-firstboot.service] disconnected from transaction progress
Jun 06 17:50:01 master-0 rpm-ostree[46332]: client(id:cli dbus:1.670 unit:machine-config-daemon-firstboot.service uid:0) vanished; remaining=0
Jun 06 17:50:01 master-0 rpm-ostree[46332]: In idle state; will auto-exit in 62 seconds
Jun 06 17:50:01 master-0 machine-config-daemon[5508]: I0606 17:50:01.150816 5508 update.go:1213] Updating files
Jun 06 17:50:01 master-0 machine-config-daemon[5508]: I0606 17:50:01.150845 5508 update.go:1279] Deleting stale data
Jun 06 17:50:01 master-0 machine-config-daemon[5508]: I0606 17:50:01.150850 5508 update.go:2055] Removing SIGTERM protection
Jun 06 17:50:01 master-0 machine-config-daemon[5508]: W0606 17:50:01.150857 5508 firstboot_complete_machineconfig.go:63] error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7b0f6a4e : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7b0f6a4e: error: Sanity-checking final rootfs: bwrap(/usr/bin/true): Child process killed by signal 1
Jun 06 17:50:01 master-0 machine-config-daemon[5508]: : exit status 1
Jun 06 17:50:01 master-0 machine-config-daemon[5508]: I0606 17:50:01.150866 5508 firstboot_complete_machineconfig.go:64] Sleeping 1 minute for retry
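To confirm the failure is not specific to the machine-config-daemon firstboot service, the failing step can be re-run by hand on an affected node. This is only a sketch, using the exact rebase command and image digest taken from the journal above; run as root on master-0 and expect the same bwrap/sanity-check error:
[core@master-0 ~]$ sudo rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7b0f6a4e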
Expected results:
Nodes should pass the OCP installation and join the cluster correctly, without any issue related to storage size or RAID configuration. In fact, after applying RAID-1 to these two nodes, the installation now passes.
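As a minimal check after reconfiguring the RAID level, the previously failing nodes can be queried directly (a sketch, reusing the same oc client as in the output above; a node that still fails installation will simply not be found):
$ oc get nodes master-0 worker-1
$ oc get clusterversion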
Additional info:
Please find attached to this Jira card the Distributed-CI jobs where this issue did not appear and the deployment worked fine:
- Last nightly build working: OpenShift 4.14 nightly 2023-04-19 12:56 - https://www.distributed-ci.io/jobs/26adc07e-c84a-4932-8083-889d47bcc22c/jobStates?sort=date (job launched on 2023/04/26)
- Last EC working: OpenShift 4.14.0 ec.0 - https://www.distributed-ci.io/jobs/b72dcd17-769c-4659-af74-eccdb675c513/jobStates?sort=date (job launched on 2023/06/04)
And here are examples of deployments where this issue was present:
- The first time this issue appeared was in this nightly build: OpenShift 4.14 nightly 2023-05-18 17:38 - https://www.distributed-ci.io/jobs/567d584f-f949-42a5-b971-ca8dd4c99dc7/jobStates (job launched on 2023/05/20)
- Since then, we have never had a correct deployment of OCP 4.14 in this cluster. The last failed job with this issue used OpenShift 4.14 nightly 2023-06-05 11:30: https://www.distributed-ci.io/jobs/321d8bde-7a0c-49a4-96b2-f879c4088a62/jobStates (job launched on 2023/06/06). This is the job from which the logs provided above were extracted.
- We then reviewed the RAID configuration and found that master-0 and worker-1 had a RAID-0 config instead of RAID-1. The config for master-0 was changed and the installation now works fine there, but not on worker-1. This job is based on OpenShift 4.14 nightly 2023-06-05 11:30 - https://www.distributed-ci.io/jobs/c0e2d86d-8182-44db-807d-6cbbeeaaea23/jobStates?sort=date (launched on 2023/06/07)
In the Logs section, you can see the tasks performed on the cluster. In the Files section, you can find some useful log files, such as the journals from all the nodes used, named journal-<node_name>.log; for the last job, where master-0 was fixed but not worker-1, the must_gather.tar.gz of the deployment is also available. An sosreport of this worker-1 has also been created and will be provided in a separate comment.
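For anyone triaging from the attached files, the failure signature can be located directly in the per-node journals. This is a sketch, assuming the journal-<node_name>.log files have been downloaded locally; the error string is the one shown in the Actual results section:
$ grep -n "Sanity-checking final rootfs" journal-master-0.log journal-worker-1.log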