OpenShift Bugs / OCPBUGS-10942

[gcp] UPI installation with a separate /var partition leads to one master node mis-using the disk

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Normal
    • Affects Version: 4.13.0
    • Component: RHCOS
    • Severity: Critical

      Description of problem:

      The device sda should be the OS disk and sdb should be the additional disk that holds the /var partition. However, the problem master node does not appear to use sda at all; it put everything on sdb, which triggers the "The root filesystem is too small" warning.
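      For reference, a separate /var partition on a secondary disk is normally configured with a Butane/MachineConfig manifest roughly like the sketch below (file name, role label, and device path here are illustrative; the exact manifest used in this run is not attached):

      variant: openshift
      version: 4.13.0
      metadata:
        name: 98-var-partition
        labels:
          machineconfiguration.openshift.io/role: master
      storage:
        disks:
          - device: /dev/sdb            # the additional disk; assumes sdb is the second attached disk
            partitions:
              - label: var
                start_mib: 0            # use the whole disk for the partition
                size_mib: 0
        filesystems:
          - device: /dev/disk/by-partlabel/var
            path: /var
            format: xfs
            mount_options: [defaults, prjquota]
            with_mount_unit: true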

      Version-Release number of selected component (if applicable):

      4.13.0-0.nightly-2023-03-23-204038

      How reproducible:

      Always

      Steps to Reproduce:

      1. Perform a normal UPI installation on GCP, but attach an additional disk to each master and configure a separate /var partition on it (a rough gcloud sketch follows below).
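      For illustration, the extra per-master disks could be created and attached with commands along these lines (disk names and zones taken from the gcloud listing in "Additional info"; in the actual UPI flow the disks may instead have been declared in the Deployment Manager templates):

      $ gcloud compute disks create jiwei-0328a-fg99t-master-0-1 \
          --zone us-central1-a --size 128GB --type pd-ssd
      $ gcloud compute instances attach-disk jiwei-0328a-fg99t-master-0 \
          --disk jiwei-0328a-fg99t-master-0-1 --zone us-central1-a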

      Actual results:

      One master node hit the "The root filesystem is too small" condition, so the installation failed.

      Expected results:

      Installation should succeed.

      Additional info:

      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       True          39m     Unable to apply 4.13.0-0.nightly-2023-03-23-204038: some cluster operators are not available
      $ oc get nodes
      NAME                                                 STATUS   ROLES                  AGE   VERSION
      jiwei-0328a-fg99t-master-0.c.openshift-qe.internal   Ready    control-plane,master   38m   v1.26.2+dc93b13
      jiwei-0328a-fg99t-master-1.c.openshift-qe.internal   Ready    control-plane,master   37m   v1.26.2+dc93b13
      $ oc get machines -A
      No resources found
      $ oc get co | grep -v 'True        False         False'
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.13.0-0.nightly-2023-03-23-204038   False       False         True       39m     OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.234.161:443/healthz": dial tcp 172.30.234.161:443: connect: connection refused...
      console                                    4.13.0-0.nightly-2023-03-23-204038   False       False         True       32m     RouteHealthAvailable: console route is not admitted
      image-registry                                                                  False       True          True       32m     Available: The deployment does not have available replicas...
      ingress                                                                         False       True          True       32m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
      kube-controller-manager                    4.13.0-0.nightly-2023-03-23-204038   True        False         True       35m     GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
      monitoring                                                                      False       True          True       28m     reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
      network                                    4.13.0-0.nightly-2023-03-23-204038   True        True          False      40m     Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
      $ 
      
      [core@jiwei-0328a-fg99t-int-svc ~]$ ssh -i .ssh/openshift-qe.pem core@10.0.0.5
      Red Hat Enterprise Linux CoreOS 413.92.202303190222-0
        Part of OpenShift 4.13, RHCOS is a Kubernetes native operating system
        managed by the Machine Config Operator (`clusteroperator/machine-config`).

      WARNING: Direct SSH access to machines is not recommended; instead,
      make configuration changes via `machineconfig` objects:
        https://docs.openshift.com/container-platform/4.13/architecture/architecture-rhcos.html

      ---
      ############################################################################
      WARNING: The root filesystem is too small. It is strongly recommended to
      allocate at least 8 GiB of space to allow for upgrades. From June 2021, this
      condition will trigger a failure in some cases. For more information, see:
      https://docs.fedoraproject.org/en-US/fedora-coreos/storage/

      You may delete this warning using:
      sudo rm /etc/motd.d/60-coreos-rootfs-size.motd
      ############################################################################

      Last login: Tue Mar 28 01:50:48 2023 from 10.0.0.2
      [core@jiwei-0328a-fg99t-master-2 ~]$ df -h
      Filesystem      Size  Used Avail Use% Mounted on
      devtmpfs        4.0M     0  4.0M   0% /dev
      tmpfs           7.4G   84K  7.4G   1% /dev/shm
      tmpfs           3.0G   46M  2.9G   2% /run
      /dev/sdb4       3.0G  2.8G  112M  97% /sysroot
      tmpfs           7.4G  4.0K  7.4G   1% /tmp
      /dev/sdb5       125G  1.5G  124G   2% /var
      /dev/sdb3       350M  103M  225M  32% /boot
      tmpfs           1.5G     0  1.5G   0% /run/user/1000
      [core@jiwei-0328a-fg99t-master-2 ~]$ lsblk
      NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
      sda      8:0    0   128G  0 disk 
      sdb      8:16   0   128G  0 disk 
      ├─sdb1   8:17   0     1M  0 part 
      ├─sdb2   8:18   0   127M  0 part 
      ├─sdb3   8:19   0   384M  0 part /boot
      ├─sdb4   8:20   0     3G  0 part /sysroot/ostree/deploy/rhcos/var
      │                                /usr
      │                                /etc
      │                                /
      │                                /sysroot
      └─sdb5   8:21   0 124.5G  0 part /var
      [core@jiwei-0328a-fg99t-master-2 ~]$ sudo crictl ps
      FATA[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory" 
      [core@jiwei-0328a-fg99t-master-2 ~]$ sudo crictl img
      FATA[0000] unable to determine image API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory" 
      [core@jiwei-0328a-fg99t-master-2 ~]$ 
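      Since the affected node laid everything out on sdb while sda was left blank, it may be worth checking how the kernel enumerated the two disks on that boot. On GCP the /dev/sdX ordering is not guaranteed to be stable, while the /dev/disk/by-id/google-* symlinks map to the names the disks were attached with. A quick check from the affected node (sketch; output not captured here):

      $ ls -l /dev/disk/by-id/ | grep google
      $ lsblk -o NAME,SIZE,SERIAL,MOUNTPOINTS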
      
      [core@jiwei-0328a-fg99t-master-0 ~]$ df -h
      Filesystem      Size  Used Avail Use% Mounted on
      devtmpfs        4.0M     0  4.0M   0% /dev
      tmpfs           7.4G     0  7.4G   0% /dev/shm
      tmpfs           3.0G   62M  2.9G   3% /run
      tmpfs           4.0M     0  4.0M   0% /sys/fs/cgroup
      /dev/sda4       128G  3.1G  125G   3% /sysroot
      tmpfs           7.4G   40K  7.4G   1% /tmp
      /dev/sdb1       128G   12G  117G   9% /var
      /dev/sda3       350M  103M  225M  32% /boot
      tmpfs           1.5G     0  1.5G   0% /run/user/1000
      [core@jiwei-0328a-fg99t-master-0 ~]$ lsblk
      NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
      sda      8:0    0   128G  0 disk 
      ├─sda1   8:1    0     1M  0 part 
      ├─sda2   8:2    0   127M  0 part 
      ├─sda3   8:3    0   384M  0 part /boot
      └─sda4   8:4    0 127.5G  0 part /var/lib/kubelet/pods/d4b39dc4-6f59-46f4-9382-ba5fd230a1e8/volume-subpaths/etc/tuned/5
                                       /var/lib/kubelet/pods/d4b39dc4-6f59-46f4-9382-ba5fd230a1e8/volume-subpaths/etc/tuned/4
                                       /var/lib/kubelet/pods/d4b39dc4-6f59-46f4-9382-ba5fd230a1e8/volume-subpaths/etc/tuned/3
                                       /var/lib/kubelet/pods/d4b39dc4-6f59-46f4-9382-ba5fd230a1e8/volume-subpaths/etc/tuned/2
                                       /var/lib/kubelet/pods/d4b39dc4-6f59-46f4-9382-ba5fd230a1e8/volume-subpaths/etc/tuned/1
                                       /sysroot/ostree/deploy/rhcos/var
                                       /usr
                                       /etc
                                       /
                                       /sysroot
      sdb      8:16   0   128G  0 disk 
      └─sdb1   8:17   0   128G  0 part /var/lib/containers/storage/overlay
                                       /var
      [core@jiwei-0328a-fg99t-master-0 ~]$ 
      
      $ gcloud compute instances list --filter='name~jiwei-0328a'
      NAME                         ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP     STATUS
      jiwei-0328a-fg99t-bootstrap  us-central1-a  n1-standard-4               10.0.0.4     34.170.88.5     RUNNING
      jiwei-0328a-fg99t-master-0   us-central1-a  n1-standard-4               10.0.0.6                     RUNNING
      jiwei-0328a-fg99t-int-svc    us-central1-b  n2-standard-2               10.0.0.2     104.198.16.228  RUNNING
      jiwei-0328a-fg99t-master-1   us-central1-b  n1-standard-4               10.0.0.7                     RUNNING
      jiwei-0328a-fg99t-master-2   us-central1-c  n1-standard-4               10.0.0.5                     RUNNING
      $ gcloud compute disks list --filter='name~jiwei-0328a'
      NAME                          LOCATION       LOCATION_SCOPE  SIZE_GB  TYPE         STATUS
      jiwei-0328a-fg99t-bootstrap   us-central1-a  zone            128      pd-standard  READY
      jiwei-0328a-fg99t-master-0    us-central1-a  zone            128      pd-ssd       READY
      jiwei-0328a-fg99t-master-0-1  us-central1-a  zone            128      pd-ssd       READY
      jiwei-0328a-fg99t-int-svc     us-central1-b  zone            200      pd-standard  READY
      jiwei-0328a-fg99t-master-1    us-central1-b  zone            128      pd-ssd       READY
      jiwei-0328a-fg99t-master-1-1  us-central1-b  zone            128      pd-ssd       READY
      jiwei-0328a-fg99t-master-2    us-central1-c  zone            128      pd-ssd       READY
      jiwei-0328a-fg99t-master-2-1  us-central1-c  zone            128      pd-ssd       READY
      $ 
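      To confirm which disk each instance considers the boot disk and in what order the disks are attached, something like the following could be used (sketch; output not captured here):

      $ gcloud compute instances describe jiwei-0328a-fg99t-master-2 \
          --zone us-central1-c --format='yaml(disks)'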
      

        Assignee: Unassigned
        Reporter: Jianli Wei (rhn-support-jiwei)