OpenShift Bugs / OCPBUGS-14694

OCP 4.14 installation fails on nodes with a >3 TB disk configured with RAID-0

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • Affects Version: 4.14
    • Component: RHCOS
    • Quality / Stability / Reliability
    • Sprint: Sprint 239 - Update&Remoting

      Description of problem:

      OCP 4.14 installation fails: nodes with a >3 TB disk configured with RAID-0 are unable to complete the OS installation and never join the cluster.

      Version-Release number of selected component (if applicable):

      The issue first appeared with the OpenShift 4.14 nightly 2023-05-18 17:38 build. Since then, it has reproduced on every deployment attempt in the affected cluster.
      
      For reference, the most recent dates and OCP 4.14 releases where the deployment worked fine in this cluster are:
      
      - Nightlies: OpenShift 4.14 nightly 2023-04-19 12:56 (job launched on 2023/04/26)
      - ECs: OpenShift 4.14.0 ec.0 (job launched on 2023/06/04)
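      
      The RHCOS build actually booted on a node can be confirmed directly over SSH (a quick check sketched here for completeness, assuming the same core user access used for the lsblk commands below):
      
      $ ssh core@master-0 "rpm-ostree status --booted" 2>/dev/null
      $ ssh core@master-0 "grep VERSION /etc/os-release" 2>/dev/null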

      How reproducible:

      100%

      Steps to Reproduce:

      1. Deploy the latest OCP 4.14 on a cluster composed of 3 master nodes and 4 worker nodes, using an IPI installation with the Ansible playbooks from baremetal-deployment; Distributed-CI (DCI) automates the whole installation. The nodes have the following storage configuration: the nodes with a 1.7T disk use RAID-1, and the ones with a 3.5T disk (master-0 and worker-1) use RAID-0, as shown by the lsblk output below (a per-node size check in bytes is sketched right after it).
      
      $ for x in {0..2} ; do echo "===== master-$x =====" ; ssh core@master-$x "lsblk | head -2" 2>/dev/null ; done
      ===== master-0 =====
      NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
      sda      8:0    0  3.5T  0 disk 
      ===== master-1 =====
      NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
      sda      8:0    0  1.7T  0 disk 
      ===== master-2 =====
      NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
      sda      8:0    0  1.7T  0 disk 
      
      $ for x in {0..3} ; do echo "===== worker-$x =====" ; ssh core@worker-$x "lsblk | head -2" 2>/dev/null ; done
      ===== worker-0 =====
      NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
      sda      8:0    0  1.7T  0 disk 
      ===== worker-1 =====
      NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
      sda      8:0    0  3.5T  0 disk 
      ===== worker-2 =====
      NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
      sda      8:0    0  1.7T  0 disk 
      ===== worker-3 =====
      NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
      sda      8:0    0  1.7T  0 disk  

      Actual results:

      The installation does not complete, failing during the bootstrap phase. Only the nodes with a 1.7T disk and RAID-1 are correctly deployed:
      
      $ oc get nodes
      NAME       STATUS   ROLES                  AGE    VERSION
      master-1   Ready    control-plane,master   3h7m   v1.27.2+cc041e8
      master-2   Ready    control-plane,master   3h8m   v1.27.2+cc041e8
      worker-0   Ready    worker                 141m   v1.27.2+cc041e8
      worker-2   Ready    worker                 141m   v1.27.2+cc041e8
      worker-3   Ready    worker                 141m   v1.27.2+cc041e8
      
      The journal on the affected nodes shows that the OS installation fails repeatedly and is never able to finish:
      
      [core@master-0 ~]$ sudo journalctl
      ...
      Jun 06 17:50:00 master-0 machine-config-daemon[46348]: Staging deployment...done
      Jun 06 17:50:00 master-0 rpm-ostree[46332]: Created new deployment /ostree/deploy/rhcos/deploy/bff15671999add901e2d03797aedba737b71f825f022d716182512ad7f627132.0
      Jun 06 17:50:00 master-0 rpm-ostree[46377]: bwrap: execvp /usr/bin/true: Permission denied
      Jun 06 17:50:00 master-0 rpm-ostree[46332]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: Sanity-checking final rootfs: bwrap(/usr/bin/true): Child process killed by signal 1
      Jun 06 17:50:00 master-0 rpm-ostree[46332]: Unlocked sysroot
      Jun 06 17:50:00 master-0 rpm-ostree[46332]: Process [pid: 46348 uid: 0 unit: machine-config-daemon-firstboot.service] disconnected from transaction progress
      Jun 06 17:50:00 master-0 rpm-ostree[46332]: client(id:cli dbus:1.669 unit:machine-config-daemon-firstboot.service uid:0) vanished; remaining=0
      Jun 06 17:50:00 master-0 rpm-ostree[46332]: In idle state; will auto-exit in 64 seconds
      Jun 06 17:50:00 master-0 machine-config-daemon[5508]: I0606 17:50:00.217896    5508 update.go:2136] Rolling back applied changes to OS due to error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc00
      3a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7b0f6a4e : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7
      b0f6a4e: error: Sanity-checking final rootfs: bwrap(/usr/bin/true): Child process killed by signal 1
      Jun 06 17:50:00 master-0 machine-config-daemon[5508]: : exit status 1
      Jun 06 17:50:00 master-0 machine-config-daemon[5508]: I0606 17:50:00.217913    5508 update.go:2010] Running: rpm-ostree cleanup -p
      Jun 06 17:50:00 master-0 rpm-ostree[46332]: client(id:cli dbus:1.670 unit:machine-config-daemon-firstboot.service uid:0) added; new total=1
      Jun 06 17:50:00 master-0 rpm-ostree[46332]: Loaded sysroot
      Jun 06 17:50:00 master-0 rpm-ostree[46332]: Locked sysroot
      Jun 06 17:50:00 master-0 rpm-ostree[46332]: Initiated txn Cleanup for client(id:cli dbus:1.670 unit:machine-config-daemon-firstboot.service uid:0): /org/projectatomic/rpmostree1/rhcos
      Jun 06 17:50:00 master-0 rpm-ostree[46332]: Process [pid: 46378 uid: 0 unit: machine-config-daemon-firstboot.service] connected to transaction progress
      Jun 06 17:50:00 master-0 rpm-ostree[46332]: Pruned container image layers: 0
      Jun 06 17:50:01 master-0 rpm-ostree[46332]: Txn Cleanup on /org/projectatomic/rpmostree1/rhcos successful
      Jun 06 17:50:01 master-0 rpm-ostree[46332]: Unlocked sysroot
      Jun 06 17:50:01 master-0 rpm-ostree[46332]: Process [pid: 46378 uid: 0 unit: machine-config-daemon-firstboot.service] disconnected from transaction progress
      Jun 06 17:50:01 master-0 rpm-ostree[46332]: client(id:cli dbus:1.670 unit:machine-config-daemon-firstboot.service uid:0) vanished; remaining=0
      Jun 06 17:50:01 master-0 rpm-ostree[46332]: In idle state; will auto-exit in 62 seconds
      Jun 06 17:50:01 master-0 machine-config-daemon[5508]: I0606 17:50:01.150816    5508 update.go:1213] Updating files
      Jun 06 17:50:01 master-0 machine-config-daemon[5508]: I0606 17:50:01.150845    5508 update.go:1279] Deleting stale data
      Jun 06 17:50:01 master-0 machine-config-daemon[5508]: I0606 17:50:01.150850    5508 update.go:2055] Removing SIGTERM protection
      Jun 06 17:50:01 master-0 machine-config-daemon[5508]: W0606 17:50:01.150857    5508 firstboot_complete_machineconfig.go:63] error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd2600
      9267b444af1ae7b67d6a8f27f7b0f6a4e : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ddabe5f03cc003a4c6c1a4223fd26009267b444af1ae7b67d6a8f27f7b0f6a4e: error: Sa
      nity-checking final rootfs: bwrap(/usr/bin/true): Child process killed by signal 1
      Jun 06 17:50:01 master-0 machine-config-daemon[5508]: : exit status 1
      Jun 06 17:50:01 master-0 machine-config-daemon[5508]: I0606 17:50:01.150866    5508 firstboot_complete_machineconfig.go:64] Sleeping 1 minute for retry
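      
      The failing step can also be inspected in isolation on an affected node (a diagnostic sketch, not part of the original report; the machine-config-daemon-firstboot.service unit and the deployment path pattern are taken from the journal above, and rpm-ostreed.service is the rpm-ostree daemon unit):
      
      $ ssh core@master-0 "sudo journalctl -u machine-config-daemon-firstboot.service -u rpm-ostreed.service --no-pager | grep -iE 'bwrap|rebase|error' | tail -20" 2>/dev/null
      $ ssh core@master-0 "sudo ls -lZ /ostree/deploy/rhcos/deploy/*/usr/bin/true" 2>/dev/null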

      Expected results:

      Nodes should complete the OCP installation and join the cluster correctly, with no issues related to disk size or RAID configuration.
      
      In fact, after reconfiguring these two nodes with RAID-1, the installation now passes.

      Additional info:

      For reference, these are the Distributed-CI jobs where this issue did not appear and the deployment worked fine:
      
      - Last nightly build working: OpenShift 4.14 nightly 2023-04-19 12:56 - https://www.distributed-ci.io/jobs/26adc07e-c84a-4932-8083-889d47bcc22c/jobStates?sort=date (job launched on 2023/04/26)
      - Last EC working: OpenShift 4.14.0 ec.0 - https://www.distributed-ci.io/jobs/b72dcd17-769c-4659-af74-eccdb675c513/jobStates?sort=date (job launched on 2023/06/04)
      
      Examples of deployments where the issue was present:
      
      - The first time this issue appeared was in this nightly build: OpenShift 4.14 nightly 2023-05-18 17:38 - https://www.distributed-ci.io/jobs/567d584f-f949-42a5-b971-ca8dd4c99dc7/jobStates (job launched on 2023/05/20)
      - Since then, no OCP 4.14 deployment has succeeded in this cluster. The last failed job with this issue used OpenShift 4.14 nightly 2023-06-05 11:30: https://www.distributed-ci.io/jobs/321d8bde-7a0c-49a4-96b2-f879c4088a62/jobStates (job launched on 2023/06/06). This is the job from which the logs above were extracted.
      - We then reviewed the RAID configuration and found that master-0 and worker-1 were using RAID-0 instead of RAID-1. After master-0 was reconfigured, the installation worked fine on that node, but not on worker-1. This job is based on OpenShift 4.14 nightly 2023-06-05 11:30 - https://www.distributed-ci.io/jobs/c0e2d86d-8182-44db-807d-6cbbeeaaea23/jobStates?sort=date (launched on 2023/06/07)
      
      The Logs section shows the tasks performed on the cluster.
      
      The Files section contains useful log files, such as the journals from all the nodes (journal-<node_name>.log). For the last job, where master-0 was fixed but worker-1 was not, the must_gather.tar.gz of the deployment is also available.
      
      An sosreport from worker-1 has also been collected and will be provided in a separate comment.

              Assignee: Colin Walters (walters@redhat.com)
              Reporter: Ramon Perez (raperez@redhat.com)
              QA Contact: Pedro Jose Amoedo Martinez