Bug
Resolution: Duplicate
Affects version: 4.14
Quality / Stability / Reliability
Sprint 239 - Update&Remoting
Description of problem:
OCP 4.14 installation did not complete: the nodes with a disk size larger than 3 TB and a RAID-0 configuration failed to deploy.
Version-Release number of selected component (if applicable):
The first time this issue appeared was in this nightly build: OpenShift 4.14 nightly 2023-05-18 17:38. Since then, the issue has appeared every time in the affected cluster. For reference, the latest dates and OCP 4.14 releases where the deployment worked fine in this cluster are:
- Nightlies: OpenShift 4.14 nightly 2023-04-19 12:56 (job launched on 2023/04/26)
- ECs: OpenShift 4.14.0 ec.0 (job launched on 2023/06/04)
How reproducible:
100%
Steps to Reproduce:
1. Deploy the latest OCP 4.14 in a cluster composed of 3 master nodes and 4 worker nodes, using an IPI installation and the Ansible playbooks from baremetal-deployment. Distributed-CI (DCI) was used to automate the whole installation. The nodes have the following storage configuration: the nodes with a 1.7T disk use RAID-1, and the ones with a 3.5T disk use RAID-0 (these are master-0 and worker-1). A condensed disk-size check is sketched after the lsblk output below.
$ for x in {0..2} ; do echo "===== master-$x =====" ; ssh core@master-$x "lsblk | head -2" 2>/dev/null ; done
===== master-0 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 3.5T 0 disk
===== master-1 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 1.7T 0 disk
===== master-2 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 1.7T 0 disk
$ for x in {0..3} ; do echo "===== worker-$x =====" ; ssh core@worker-$x "lsblk | head -2" 2>/dev/null ; done
===== worker-0 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 1.7T 0 disk
===== worker-1 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 3.5T 0 disk
===== worker-2 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 1.7T 0 disk
===== worker-3 =====
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 1.7T 0 disk
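For a quick per-node view of which machines carry the larger disks (and therefore, in this cluster, the RAID-0 layout), the lsblk loops above can be condensed into a single size check. This is only a sketch, assuming the same passwordless core@<node> SSH access used above and that sda is the installation disk on every node:
$ for n in master-{0..2} worker-{0..3} ; do size=$(ssh core@$n "lsblk -bdno SIZE /dev/sda" 2>/dev/null) ; echo "$n: $((size / 1024**4)) TiB (sda)" ; done
Nodes reporting 3 TiB here are the ones that fail to deploy; nodes reporting 1 TiB deploy correctly.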
Actual results:
Installation did not work, failing in the bootstrap. Only the nodes with a 1.7T disk size and RAID-1 were correctly deployed:
$ oc get nodes
NAME       STATUS   ROLES                  AGE    VERSION
master-1   Ready    control-plane,master   3h7m   v1.27.2+cc041e8
master-2   Ready    control-plane,master   3h8m   v1.27.2+cc041e8
worker-0   Ready    worker                 141m   v1.27.2+cc041e8
worker-2   Ready    worker                 141m   v1.27.2+cc041e8
worker-3   Ready    worker                 141m   v1.27.2+cc041e8
If we check the journal of the affected nodes, we can see the OS installation is recurrently failing and not able to finish:
[core@master-0 ~]$ sudo journalctl
...
Jun 06 17:50:00 master-0 machine-config-daemon[46348]: Staging deployment...done
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Created new deployment /ostree/deploy/rhcos/deploy/bff15671999add901e2d03797aedba737b71f825f022d716182512ad7f627132.0
Jun 06 17:50:00 master-0 rpm-ostree[46377]: bwrap: execvp /usr/bin/true: Permission denied
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: Sanity-checking final rootfs: bwrap(/usr/bin/true): Child process killed by signal 1
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Unlocked sysroot
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Process [pid: 46348 uid: 0 unit: machine-config-daemon-firstboot.service] disconnected from transaction progress
Jun 06 17:50:00 master-0 rpm-ostree[46332]: client(id:cli dbus:1.669 unit:machine-config-daemon-firstboot.service uid:0) vanished; remaining=0
Jun 06 17:50:00 master-0 rpm-ostree[46332]: In idle state; will auto-exit in 64 seconds
Jun 06 17:50:00 master-0 machine-config-daemon[5508]: I0606 17:50:00.217896 5508 update.go:2136] Rolling back applied changes to OS due to error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7b0f6a4e : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7b0f6a4e: error: Sanity-checking final rootfs: bwrap(/usr/bin/true): Child process killed by signal 1
Jun 06 17:50:00 master-0 machine-config-daemon[5508]: : exit status 1
Jun 06 17:50:00 master-0 machine-config-daemon[5508]: I0606 17:50:00.217913 5508 update.go:2010] Running: rpm-ostree cleanup -p
Jun 06 17:50:00 master-0 rpm-ostree[46332]: client(id:cli dbus:1.670 unit:machine-config-daemon-firstboot.service uid:0) added; new total=1
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Loaded sysroot
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Locked sysroot
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Initiated txn Cleanup for client(id:cli dbus:1.670 unit:machine-config-daemon-firstboot.service uid:0): /org/projectatomic/rpmostree1/rhcos
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Process [pid: 46378 uid: 0 unit: machine-config-daemon-firstboot.service] connected to transaction progress
Jun 06 17:50:00 master-0 rpm-ostree[46332]: Pruned container image layers: 0
Jun 06 17:50:01 master-0 rpm-ostree[46332]: Txn Cleanup on /org/projectatomic/rpmostree1/rhcos successful
Jun 06 17:50:01 master-0 rpm-ostree[46332]: Unlocked sysroot
Jun 06 17:50:01 master-0 rpm-ostree[46332]: Process [pid: 46378 uid: 0 unit: machine-config-daemon-firstboot.service] disconnected from transaction progress
Jun 06 17:50:01 master-0 rpm-ostree[46332]: client(id:cli dbus:1.670 unit:machine-config-daemon-firstboot.service uid:0) vanished; remaining=0
Jun 06 17:50:01 master-0 rpm-ostree[46332]: In idle state; will auto-exit in 62 seconds
Jun 06 17:50:01 master-0 machine-config-daemon[5508]: I0606 17:50:01.150816 5508 update.go:1213] Updating files
Jun 06 17:50:01 master-0 machine-config-daemon[5508]: I0606 17:50:01.150845 5508 update.go:1279] Deleting stale data
Jun 06 17:50:01 master-0 machine-config-daemon[5508]: I0606 17:50:01.150850 5508 update.go:2055] Removing SIGTERM protection
Jun 06 17:50:01 master-0 machine-config-daemon[5508]: W0606 17:50:01.150857 5508 firstboot_complete_machineconfig.go:63] error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7b0f6a4e : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7b0f6a4e: error: Sanity-checking final rootfs: bwrap(/usr/bin/true): Child process killed by signal 1
Jun 06 17:50:01 master-0 machine-config-daemon[5508]: : exit status 1
Jun 06 17:50:01 master-0 machine-config-daemon[5508]: I0606 17:50:01.150866 5508 firstboot_complete_machineconfig.go:64] Sleeping 1 minute for retry
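To confirm the failure is not specific to the machine-config-daemon firstboot service, the failing step can be re-run by hand on an affected node. This is only a sketch, using the exact rebase command and image digest taken from the journal above; run as root on master-0 and expect the same bwrap/sanity-check error:
[core@master-0 ~]$ sudo rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7b0f6a4e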
Expected results:
Nodes should pass the OCP installation and join the cluster correctly, without any issue related to storage size or RAID configuration. In fact, after applying RAID-1 to these two nodes, the installation now passes.
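As a minimal check after reconfiguring the RAID level, the previously failing nodes can be queried directly (a sketch, reusing the same oc client as in the output above; a node that still fails installation will simply not be found):
$ oc get nodes master-0 worker-1
$ oc get clusterversion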
Additional info:
Please find attached to this Jira card the Distributed-CI jobs where this issue did not appear and the deployment worked fine:
- Last nightly build working: OpenShift 4.14 nightly 2023-04-19 12:56 - https://www.distributed-ci.io/jobs/26adc07e-c84a-4932-8083-889d47bcc22c/jobStates?sort=date (job launched on 2023/04/26)
- Last EC working: OpenShift 4.14.0 ec.0 - https://www.distributed-ci.io/jobs/b72dcd17-769c-4659-af74-eccdb675c513/jobStates?sort=date (job launched on 2023/06/04)
And here are examples of deployments where this issue was present:
- The first time this issue appeared was in this nightly build: OpenShift 4.14 nightly 2023-05-18 17:38 - https://www.distributed-ci.io/jobs/567d584f-f949-42a5-b971-ca8dd4c99dc7/jobStates (job launched on 2023/05/20)
- Since then, we have never had a correct deployment of OCP 4.14 in this cluster. The last failed job with this issue used OpenShift 4.14 nightly 2023-06-05 11:30: https://www.distributed-ci.io/jobs/321d8bde-7a0c-49a4-96b2-f879c4088a62/jobStates (job launched on 2023/06/06). This is the job from which the logs provided above were extracted.
- We then reviewed the RAID configuration and found that master-0 and worker-1 had a RAID-0 config instead of RAID-1. The config for master-0 was changed and the installation now works fine there, but not on worker-1. This job is based on OpenShift 4.14 nightly 2023-06-05 11:30 - https://www.distributed-ci.io/jobs/c0e2d86d-8182-44db-807d-6cbbeeaaea23/jobStates?sort=date (launched on 2023/06/07)
In the Logs section, you can see the tasks performed on the cluster. In the Files section, you can find some useful log files, such as the journals from all the nodes used, named journal-<node_name>.log; for the last job, where master-0 was fixed but not worker-1, the must_gather.tar.gz of the deployment is also available. An sosreport of this worker-1 has also been created and will be provided in a separate comment.
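For anyone triaging from the attached files, the failure signature can be located directly in the per-node journals. This is a sketch, assuming the journal-<node_name>.log files have been downloaded locally; the error string is the one shown in the Actual results section:
$ grep -n "Sanity-checking final rootfs" journal-master-0.log journal-worker-1.log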