Multiple Architecture Enablement / MULTIARCH-1648

"no space left on device" issue is seen on latest 4.8 builds


    • Type: Bug
    • Resolution: Done
    • Versions: 4.8.z, 4.8
    • Component: Multi-Arch

      Also seen with 4.8.5. This renders the cluster unusable after two days on Power.

      +++ This bug was initially created as a clone of Bug #1997062 +++

      The issue is seen with the following builds:
      4.9.0-0.nightly-ppc64le-2021-08-17-145337
      4.9.0-0.nightly-ppc64le-2021-08-19-120135

      On the bastion:

      # lscpu
        Architecture: ppc64le
        Byte Order: Little Endian
        CPU(s): 8
        On-line CPU(s) list: 0-7
        Thread(s) per core: 8
        Core(s) per socket: 1
        Socket(s): 1
        NUMA node(s): 1
        Model: 2.3 (pvr 004e 0203)
        Model name: POWER9 (architected), altivec supported
        Hypervisor vendor: pHyp
        Virtualization type: para
        L1d cache: 32K
        L1i cache: 32K
        NUMA node0 CPU(s): 0-7
        Physical sockets: 2
        Physical chips: 1
        Physical cores/chip: 10

      [core@master-0 ~]$ lscpu
      Architecture: ppc64le
      Byte Order: Little Endian
      CPU(s): 8
      On-line CPU(s) list: 0-7
      Thread(s) per core: 8
      Core(s) per socket: 1
      Socket(s): 1
      NUMA node(s): 1
      Model: 2.3 (pvr 004e 0203)
      Model name: POWER9 (architected), altivec supported
      Hypervisor vendor: pHyp
      Virtualization type: para
      L1d cache: 32K
      L1i cache: 32K
      NUMA node0 CPU(s): 0-7

      No workload was deployed on the cluster.

      # oc get nodes
        NAME       STATUS   ROLES    AGE     VERSION
        master-0   Ready    master   6d17h   v1.22.0-rc.0+3dfed96
        master-1   Ready    master   6d17h   v1.22.0-rc.0+3dfed96
        master-2   Ready    master   6d17h   v1.22.0-rc.0+3dfed96
        worker-0   Ready    worker   6d17h   v1.22.0-rc.0+3dfed96
        worker-1   Ready    worker   6d17h   v1.22.0-rc.0+3dfed96

      — Additional comment from Alisha on 2021-08-24 13:10:32 UTC —

      [root@master-0 ~]# df -h
      Filesystem   Size  Used  Avail  Use%  Mounted on
      devtmpfs     7.9G     0   7.9G    0%  /dev
      tmpfs        8.0G  256K   8.0G    1%  /dev/shm
      tmpfs        8.0G  7.9G   151M   99%  /run
      tmpfs        8.0G     0   8.0G    0%  /sys/fs/cgroup
      /dev/sda4    120G   17G   104G   14%  /sysroot
      tmpfs        8.0G   64K   8.0G    1%  /tmp
      /dev/sdb3    364M  233M   109M   69%  /boot
      overlay      8.0G  7.9G   151M   99%  /etc/NetworkManager/systemConnectionsMerged
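
      (For reference, a quick way to confirm which directory under the /run tmpfs is actually consuming the space; this is a generic diagnostic sketch and was not part of the original capture:)

      # Largest consumers under /run (run as root on an affected node)
      du -sh /run/* 2>/dev/null | sort -h | tail -n 5
      # Count the accumulated per-exec pid-files; each non-empty tmpfs file
      # occupies at least one page (64K pages on ppc64le), so a large count
      # eats /run quickly even though every file holds only a few bytes.
      ls /run/crio/exec-pid-dir | wc -l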

      — Additional comment from Alisha on 2021-08-24 13:25:17 UTC —

      Platform is ppc64le.

      OS info :

      On the bastion:

      # cat /etc/redhat-release
        Red Hat Enterprise Linux release 8.4 (Ootpa)

      On the CoreOS nodes:
      [core@master-0 ~]$ cat /etc/redhat-release
      Red Hat Enterprise Linux CoreOS release 4.9

      — Additional comment from Manoj Kumar on 2021-08-29 22:40:16 UTC —

      I did some more digging. Found a tool to snoop on exec(). https://github.com/iovisor/bcc/blob/master/tools/execsnoop.py

      With a compiled version of the tool, I was able to correlate the new processes to the contents of /run/crio/exec-pid-dir:

      [root@rdr-cicd-e6b7-mon01-master-1 execsnoop]# ./execsnoop
      In file included from <built-in>:2:
      In file included from /virtual/include/bcc/bpf.h:12:
      In file included from include/linux/types.h:6:
      In file included from include/uapi/linux/types.h:14:
      In file included from include/uapi/linux/posix_types.h:5:
      In file included from include/linux/stddef.h:5:
      In file included from include/uapi/linux/stddef.h:2:
      In file included from include/linux/compiler_types.h:74:
      include/linux/compiler-clang.h:25:9: warning: '__no_sanitize_address' macro redefined [-Wmacro-redefined]
      #define __no_sanitize_address
      ^
      include/linux/compiler-gcc.h:213:9: note: previous definition is here
      #define __no_sanitize_address __attribute__((no_sanitize_address))
      ^
      1 warning generated.
      PCOMM PID PPID RET ARGS
      ldd 3733034 2553 0 /usr/bin/ldd /usr/bin/crio
      ld64.so.2 3733035 3733034 0 /lib64/ld64.so.2 --verify /usr/bin/crio
      ld64.so.2 3733038 3733037 0 /lib64/ld64.so.2 /usr/bin/crio
      sh 3733039 5068 0 /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
      awk 3733039 5068 0 /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
      sh 3733040 5068 0 /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
      awk 3733040 5068 0 /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
      sh 3733041 5068 0 /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
      awk 3733041 5068 0 /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
      sh 3733042 5313 0 /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
      awk 3733042 5313 0 /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
      sh 3733043 5313 0 /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
      awk 3733043 5313 0 /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
      sh 3733044 5313 0 /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
      awk 3733044 5313 0 /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
      md5sum 3733046 3733045 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
      awk 3733047 3733045 0 /usr/bin/awk {print $1}
      md5sum 3733049 3733048 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      awk 3733050 3733048 0 /usr/bin/awk {print $1}
      sleep 3733051 3709 0 /usr/bin/sleep 1
      md5sum 3733053 3733052 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
      awk 3733054 3733052 0 /usr/bin/awk {print $1}
      md5sum 3733056 3733055 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      awk 3733057 3733055 0 /usr/bin/awk {print $1}
      sleep 3733058 3709 0 /usr/bin/sleep 1
      runc 3733059 2553 0 /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe36d95a00c-82c2-4fbe-b015-b2ab89cf3303 --process /tmp/exec-process-074617211 1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe3
      exe 3733068 3733059 0 /proc/self/exe init
      test 3733070 3733059 0 /usr/bin/test -f /etc/cni/net.d/80-openshift-network.conf
      md5sum 3733077 3733076 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
      awk 3733078 3733076 0 /usr/bin/awk {print $1}
      md5sum 3733080 3733079 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      awk 3733081 3733079 0 /usr/bin/awk {print $1}
      sleep 3733082 3709 0 /usr/bin/sleep 1
      runc 3733083 2553 0 /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace50931519341ee0-e912-4070-a4f8-14d9196f1352 --process /tmp/exec-process-197278878 afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace509315
      exe 3733093 3733083 0 /proc/self/exe init
      bash 3733095 3733083 0 /bin/bash -c set -xe\n\n# Unix sockets are used for health checks to ensure that the pod is reporting readiness of the etcd process\n# in this c
      etcdctl 3733101 3733095 0 /usr/bin/etcdctl --command-timeout=2s --dial-timeout=2s --endpoints=unixs://193.168.200.231:0 endpoint health -w json
      grep 3733102 3733095 0 /usr/bin/grep "health":true
      awk 3733112 3733110 0 /usr/bin/awk {print $1}
      md5sum 3733111 3733110 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
      md5sum 3733114 3733113 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      awk 3733115 3733113 0 /usr/bin/awk {print $1}
      sleep 3733116 3709 0 /usr/bin/sleep 1
      runc 3733117 2553 0 /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664df311a28-07fb-4c11-9bb9-9ccdf005adc2 --process /tmp/exec-process-392734080 612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664
      exe 3733126 3733117 0 /proc/self/exe init
      sh 3733131 3733117 0 /bin/sh -c declare -r health_endpoint="https://localhost:2379/health"\ndeclare -r cert="/var/run/secrets/etcd-client/tls.crt"\ndeclare -r key
      grep 3733138 3733131 0 /usr/bin/grep "health":"true"
      curl 3733137 3733131 0 /usr/bin/curl --max-time 2 --silent --cert /var/run/secrets/etcd-client/tls.crt --key /var/run/secrets/etcd-client/tls.key --cacert /var/run/configmaps/etcd-ca/ca-bundle.crt https://localhost:2379/health
      md5sum 3733141 3733140 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
      awk 3733142 3733140 0 /usr/bin/awk {print $1}
      md5sum 3733144 3733143 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      awk 3733145 3733143 0 /usr/bin/awk {print $1}
      sleep 3733146 3709 0 /usr/bin/sleep 1
      md5sum 3733148 3733147 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
      awk 3733149 3733147 0 /usr/bin/awk {print $1}
      md5sum 3733151 3733150 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      awk 3733152 3733150 0 /usr/bin/awk {print $1}
      sleep 3733153 3709 0 /usr/bin/sleep 1
      md5sum 3733155 3733154 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
      awk 3733156 3733154 0 /usr/bin/awk {print $1}
      md5sum 3733158 3733157 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      awk 3733159 3733157 0 /usr/bin/awk {print $1}
      sleep 3733160 3709 0 /usr/bin/sleep 1
      runc 3733161 2553 0 /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe37b92324c-0714-46c1-b517-3ab9786ca397 --process /tmp/exec-process-050607721 1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe3
      exe 3733169 3733161 0 /proc/self/exe init
      test 3733173 3733161 0 /usr/bin/test -f /etc/cni/net.d/80-openshift-network.conf
      md5sum 3733180 3733179 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
      awk 3733181 3733179 0 /usr/bin/awk {print $1}
      md5sum 3733183 3733182 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      awk 3733184 3733182 0 /usr/bin/awk {print $1}
      sleep 3733185 3709 0 /usr/bin/sleep 1
      runc 3733186 2553 0 /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace5093158e423623-9b1d-45d4-b3f3-63eea92c797b --process /tmp/exec-process-598901766 afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace509315
      exe 3733195 3733186 0 /proc/self/exe init
      bash 3733197 3733186 0 /bin/bash -c set -xe\n\n# Unix sockets are used for health checks to ensure that the pod is reporting readiness of the etcd process\n# in this c
      etcdctl 3733203 3733197 0 /usr/bin/etcdctl --command-timeout=2s --dial-timeout=2s --endpoints=unixs://193.168.200.231:0 endpoint health -w json
      grep 3733204 3733197 0 /usr/bin/grep "health":true
      awk 3733213 3733211 0 /usr/bin/awk {print $1}
      md5sum 3733212 3733211 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
      md5sum 3733215 3733214 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      awk 3733216 3733214 0 /usr/bin/awk {print $1}
      sleep 3733217 3709 0 /usr/bin/sleep 1
      runc 3733218 2553 0 /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664fd437fe6-f5b3-4f39-86b7-efb35ddcaa25 --process /tmp/exec-process-131250605 612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664
      exe 3733226 3733218 0 /proc/self/exe init
      sh 3733230 3733218 0 /bin/sh -c declare -r health_endpoint="https://localhost:2379/health"\ndeclare -r cert="/var/run/secrets/etcd-client/tls.crt"\ndeclare -r key
      grep 3733238 3733230 0 /usr/bin/grep "health":"true"
      curl 3733237 3733230 0 /usr/bin/curl --max-time 2 --silent --cert /var/run/secrets/etcd-client/tls.crt --key /var/run/secrets/etcd-client/tls.key --cacert /var/run/configmaps/etcd-ca/ca-bundle.crt https://localhost:2379/health
      ^CTraceback (most recent call last):
      File "execsnoop.py", line 305, in <module>
      File "bcc/_init_.py", line 1445, in perf_buffer_poll
      KeyboardInterrupt

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
      File "execsnoop.py", line 307, in <module>
      NameError: name 'exit' is not defined
      [3732956] Failed to execute script 'execsnoop' due to unhandled exception!
      [root@rdr-cicd-e6b7-mon01-master-1 execsnoop]# for i in `ls -t /run/crio/exec-pid-dir|head `; do cat /run/crio/exec-pid-dir/$i; echo ' '; done
      3733230
      3733197
      3733173
      3733131
      3733095
      3733070
      3733017
      3732984
      3732958
      3732896
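
      To make the correlation above easier to reproduce, something along these lines can be used (a sketch only; it assumes the execsnoop output was saved to a file named execsnoop.log, which is a hypothetical name):

      # For the ten newest pid-files, print the recorded PID next to the
      # matching execsnoop line (if any), showing which exec created it.
      for f in $(ls -t /run/crio/exec-pid-dir | head); do
          pid=$(cat /run/crio/exec-pid-dir/$f)
          printf '%s -> ' "$pid"
          grep -m1 " $pid " execsnoop.log || echo '(no match)'
      done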

      — Additional comment from Manoj Kumar on 2021-08-30 13:08:10 UTC —

      This is being reported with 4.8.5 as well, i.e. it can potentially be hit by customers who upgrade to the most recent release.

      — Additional comment from Manoj Kumar on 2021-08-30 16:48:09 UTC —

      @prashanth found that this issue was introduced by
      https://github.com/cri-o/cri-o/pull/5136

      And it is fixed/reverted by:
      https://github.com/cri-o/cri-o/pull/5245
      https://github.com/cri-o/cri-o/pull/5262
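
      Until a build containing the revert is available, one possible stop-gap (an untested sketch, not an official workaround) is to periodically prune pid-files whose recorded process has already exited, e.g. from a root shell on each affected node:

      # Delete exec pid-files for processes that no longer exist; the exec
      # session is finished once /proc/<pid> is gone.
      for f in /run/crio/exec-pid-dir/*; do
          pid=$(cat "$f" 2>/dev/null)
          [ -n "$pid" ] && [ ! -d "/proc/$pid" ] && rm -f "$f"
      done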

            Assignee: Dennis Gilmore (dgilmore.fedora)
            Reporter: Manoj Kumar (manoj5)