Type: Bug
Resolution: Done
Fix Version/s: 4.8.z
Affects Version/s: 4.8
Component/s: Multi-Arch
Labels: None

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Story Points: None
Severity: None

Target Version: None
Release Blocker: None
Sprint: None

Release Note Status: None
Release Note Type: None
Release Note Text: None

Also seen with 4.8.5. Renders cluster unusable after 2 days on Power.

+++ This bug was initially created as a clone of Bug #1997062 +++

The issue occurs with the following builds:
4.9.0-0.nightly-ppc64le-2021-08-17-145337
4.9.0-0.nightly-ppc64le-2021-08-19-120135

On bastion:

lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Model: 2.3 (pvr 004e 0203)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
NUMA node0 CPU(s): 0-7
Physical sockets: 2
Physical chips: 1
Physical cores/chip: 10

[core@master-0 ~]$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Model: 2.3 (pvr 004e 0203)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
NUMA node0 CPU(s): 0-7

No workload was deployed on the cluster.

oc get nodes
NAME STATUS ROLES AGE VERSION
master-0 Ready master 6d17h v1.22.0-rc.0+3dfed96
master-1 Ready master 6d17h v1.22.0-rc.0+3dfed96
master-2 Ready master 6d17h v1.22.0-rc.0+3dfed96
worker-0 Ready worker 6d17h v1.22.0-rc.0+3dfed96
worker-1 Ready worker 6d17h v1.22.0-rc.0+3dfed96

--- Additional comment from Alisha on 2021-08-24 13:10:32 UTC ---

[root@master-0 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 7.9G 0 7.9G 0% /dev
tmpfs 8.0G 256K 8.0G 1% /dev/shm
tmpfs 8.0G 7.9G 151M 99% /run
tmpfs 8.0G 0 8.0G 0% /sys/fs/cgroup
/dev/sda4 120G 17G 104G 14% /sysroot
tmpfs 8.0G 64K 8.0G 1% /tmp
/dev/sdb3 364M 233M 109M 69% /boot
overlay 8.0G 7.9G 151M 99% /etc/NetworkManager/systemConnectionsMerged
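
The df output above shows the tmpfs mounted on /run at 99%. One way to narrow down which directory is responsible is to count files and bytes per subdirectory. The following is only a minimal sketch, assuming it is run as root on the affected node; the function names `dir_footprint` and `rank_subdirs` are hypothetical helpers, not part of any tool mentioned in this bug.

```python
import os


def dir_footprint(root):
    """Return (file_count, apparent_bytes, allocated_bytes) for all files under root."""
    count = apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file disappeared between listing and stat
            count += 1
            apparent += st.st_size
            allocated += st.st_blocks * 512  # st_blocks is counted in 512-byte units
    return count, apparent, allocated


def rank_subdirs(parent):
    """List immediate subdirectories of parent, largest file count first."""
    rows = []
    for entry in os.scandir(parent):
        if entry.is_dir(follow_symlinks=False):
            rows.append((entry.path, *dir_footprint(entry.path)))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling `rank_subdirs("/run")` would list the subdirectories with the most files first. Sorting by file count rather than apparent size is a deliberate choice here, since a very large number of tiny files can exhaust a tmpfs even when their summed sizes look small.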

--- Additional comment from Alisha on 2021-08-24 13:25:17 UTC ---

Platform is ppc64le.

OS info:

On bastion:

cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

CoreOS nodes:
[core@master-0 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux CoreOS release 4.9

--- Additional comment from Manoj Kumar on 2021-08-29 22:40:16 UTC ---

I did some more digging and found a tool that snoops on exec(): https://github.com/iovisor/bcc/blob/master/tools/execsnoop.py

With a compiled version of the tool, I was able to correlate the new processes to the contents of /run/crio/exec-pid-dir:

[root@rdr-cicd-e6b7-mon01-master-1 execsnoop]# ./execsnoop
In file included from <built-in>:2:
In file included from /virtual/include/bcc/bpf.h:12:
In file included from include/linux/types.h:6:
In file included from include/uapi/linux/types.h:14:
In file included from include/uapi/linux/posix_types.h:5:
In file included from include/linux/stddef.h:5:
In file included from include/uapi/linux/stddef.h:2:
In file included from include/linux/compiler_types.h:74:
include/linux/compiler-clang.h:25:9: warning: '__no_sanitize_address' macro redefined [-Wmacro-redefined]
#define __no_sanitize_address
^
include/linux/compiler-gcc.h:213:9: note: previous definition is here
#define __no_sanitize_address __attribute__((no_sanitize_address))
^
1 warning generated.
PCOMM PID PPID RET ARGS
ldd 3733034 2553 0 /usr/bin/ldd /usr/bin/crio
ld64.so.2 3733035 3733034 0 /lib64/ld64.so.2 --verify /usr/bin/crio
ld64.so.2 3733038 3733037 0 /lib64/ld64.so.2 /usr/bin/crio
sh 3733039 5068 0 /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk 3733039 5068 0 /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh 3733040 5068 0 /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk 3733040 5068 0 /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh 3733041 5068 0 /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk 3733041 5068 0 /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh 3733042 5313 0 /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk 3733042 5313 0 /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh 3733043 5313 0 /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk 3733043 5313 0 /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh 3733044 5313 0 /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk 3733044 5313 0 /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
md5sum 3733046 3733045 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk 3733047 3733045 0 /usr/bin/awk {print $1}
md5sum 3733049 3733048 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk 3733050 3733048 0 /usr/bin/awk {print $1}
sleep 3733051 3709 0 /usr/bin/sleep 1
md5sum 3733053 3733052 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk 3733054 3733052 0 /usr/bin/awk {print $1}
md5sum 3733056 3733055 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk 3733057 3733055 0 /usr/bin/awk {print $1}
sleep 3733058 3709 0 /usr/bin/sleep 1
runc 3733059 2553 0 /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe36d95a00c-82c2-4fbe-b015-b2ab89cf3303 --process /tmp/exec-process-074617211 1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe3
exe 3733068 3733059 0 /proc/self/exe init
test 3733070 3733059 0 /usr/bin/test -f /etc/cni/net.d/80-openshift-network.conf
md5sum 3733077 3733076 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk 3733078 3733076 0 /usr/bin/awk {print $1}
md5sum 3733080 3733079 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk 3733081 3733079 0 /usr/bin/awk {print $1}
sleep 3733082 3709 0 /usr/bin/sleep 1
runc 3733083 2553 0 /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace50931519341ee0-e912-4070-a4f8-14d9196f1352 --process /tmp/exec-process-197278878 afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace509315
exe 3733093 3733083 0 /proc/self/exe init
bash 3733095 3733083 0 /bin/bash -c set -xe\n\n# Unix sockets are used for health checks to ensure that the pod is reporting readiness of the etcd process\n# in this c
etcdctl 3733101 3733095 0 /usr/bin/etcdctl --command-timeout=2s --dial-timeout=2s --endpoints=unixs://193.168.200.231:0 endpoint health -w json
grep 3733102 3733095 0 /usr/bin/grep "health":true
awk 3733112 3733110 0 /usr/bin/awk {print $1}
md5sum 3733111 3733110 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
md5sum 3733114 3733113 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk 3733115 3733113 0 /usr/bin/awk {print $1}
sleep 3733116 3709 0 /usr/bin/sleep 1
runc 3733117 2553 0 /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664df311a28-07fb-4c11-9bb9-9ccdf005adc2 --process /tmp/exec-process-392734080 612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664
exe 3733126 3733117 0 /proc/self/exe init
sh 3733131 3733117 0 /bin/sh -c declare -r health_endpoint="https://localhost:2379/health"\ndeclare -r cert="/var/run/secrets/etcd-client/tls.crt"\ndeclare -r key
grep 3733138 3733131 0 /usr/bin/grep "health":"true"
curl 3733137 3733131 0 /usr/bin/curl --max-time 2 --silent --cert /var/run/secrets/etcd-client/tls.crt --key /var/run/secrets/etcd-client/tls.key --cacert /var/run/configmaps/etcd-ca/ca-bundle.crt https://localhost:2379/health
md5sum 3733141 3733140 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk 3733142 3733140 0 /usr/bin/awk {print $1}
md5sum 3733144 3733143 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk 3733145 3733143 0 /usr/bin/awk {print $1}
sleep 3733146 3709 0 /usr/bin/sleep 1
md5sum 3733148 3733147 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk 3733149 3733147 0 /usr/bin/awk {print $1}
md5sum 3733151 3733150 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk 3733152 3733150 0 /usr/bin/awk {print $1}
sleep 3733153 3709 0 /usr/bin/sleep 1
md5sum 3733155 3733154 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk 3733156 3733154 0 /usr/bin/awk {print $1}
md5sum 3733158 3733157 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk 3733159 3733157 0 /usr/bin/awk {print $1}
sleep 3733160 3709 0 /usr/bin/sleep 1
runc 3733161 2553 0 /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe37b92324c-0714-46c1-b517-3ab9786ca397 --process /tmp/exec-process-050607721 1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe3
exe 3733169 3733161 0 /proc/self/exe init
test 3733173 3733161 0 /usr/bin/test -f /etc/cni/net.d/80-openshift-network.conf
md5sum 3733180 3733179 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk 3733181 3733179 0 /usr/bin/awk {print $1}
md5sum 3733183 3733182 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk 3733184 3733182 0 /usr/bin/awk {print $1}
sleep 3733185 3709 0 /usr/bin/sleep 1
runc 3733186 2553 0 /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace5093158e423623-9b1d-45d4-b3f3-63eea92c797b --process /tmp/exec-process-598901766 afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace509315
exe 3733195 3733186 0 /proc/self/exe init
bash 3733197 3733186 0 /bin/bash -c set -xe\n\n# Unix sockets are used for health checks to ensure that the pod is reporting readiness of the etcd process\n# in this c
etcdctl 3733203 3733197 0 /usr/bin/etcdctl --command-timeout=2s --dial-timeout=2s --endpoints=unixs://193.168.200.231:0 endpoint health -w json
grep 3733204 3733197 0 /usr/bin/grep "health":true
awk 3733213 3733211 0 /usr/bin/awk {print $1}
md5sum 3733212 3733211 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
md5sum 3733215 3733214 0 /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk 3733216 3733214 0 /usr/bin/awk {print $1}
sleep 3733217 3709 0 /usr/bin/sleep 1
runc 3733218 2553 0 /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664fd437fe6-f5b3-4f39-86b7-efb35ddcaa25 --process /tmp/exec-process-131250605 612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664
exe 3733226 3733218 0 /proc/self/exe init
sh 3733230 3733218 0 /bin/sh -c declare -r health_endpoint="https://localhost:2379/health"\ndeclare -r cert="/var/run/secrets/etcd-client/tls.crt"\ndeclare -r key
grep 3733238 3733230 0 /usr/bin/grep "health":"true"
curl 3733237 3733230 0 /usr/bin/curl --max-time 2 --silent --cert /var/run/secrets/etcd-client/tls.crt --key /var/run/secrets/etcd-client/tls.key --cacert /var/run/configmaps/etcd-ca/ca-bundle.crt https://localhost:2379/health
^CTraceback (most recent call last):
File "execsnoop.py", line 305, in <module>
File "bcc/__init__.py", line 1445, in perf_buffer_poll
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "execsnoop.py", line 307, in <module>
NameError: name 'exit' is not defined
[3732956] Failed to execute script 'execsnoop' due to unhandled exception!
[root@rdr-cicd-e6b7-mon01-master-1 execsnoop]# for i in `ls -t /run/crio/exec-pid-dir|head `; do cat /run/crio/exec-pid-dir/$i; echo ' '; done
3733230
3733197
3733173
3733131
3733095
3733070
3733017
3732984
3732958
3732896
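
The manual comparison above, matching the PIDs stored in /run/crio/exec-pid-dir against the PID column of the execsnoop output, could be automated roughly as follows. This is a hedged sketch rather than the method actually used here: it assumes the execsnoop output was saved as text with the `PCOMM PID PPID RET ARGS` column layout shown above, and the function names are hypothetical.

```python
import os


def parse_execsnoop(lines):
    """Map PID -> (PCOMM, ARGS) from execsnoop lines (PCOMM PID PPID RET ARGS)."""
    procs = {}
    for line in lines:
        parts = line.split(None, 4)
        # Skip the header, compiler warnings, and fragments without numeric PID/PPID.
        if len(parts) < 4 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        args = parts[4] if len(parts) == 5 else ""
        procs[int(parts[1])] = (parts[0], args)  # a later line for the same PID wins
    return procs


def read_pid_files(pid_dir):
    """Map pid-file name -> PID for every readable file in pid_dir."""
    pids = {}
    for name in os.listdir(pid_dir):
        try:
            with open(os.path.join(pid_dir, name)) as fh:
                pids[name] = int(fh.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric content
    return pids


def correlate(pid_dir, lines):
    """Return pid-file name -> (PCOMM, ARGS) for PIDs seen in the execsnoop output."""
    procs = parse_execsnoop(lines)
    return {
        name: procs[pid]
        for name, pid in read_pid_files(pid_dir).items()
        if pid in procs
    }
```

With the output above, the pid file containing 3733070 would map to the `/usr/bin/test -f /etc/cni/net.d/80-openshift-network.conf` probe. Pid files whose PID does not appear in the capture (for example 3732896, which predates it) are simply left out of the result.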

--- Additional comment from Manoj Kumar on 2021-08-30 13:08:10 UTC ---

This is being reported with 4.8.5 as well, so customers who upgrade to the most recent release could hit it.

--- Additional comment from Manoj Kumar on 2021-08-30 16:48:09 UTC ---

@prashanth found that this issue was introduced by
https://github.com/cri-o/cri-o/pull/5136

And it is fixed/reverted by:
https://github.com/cri-o/cri-o/pull/5245
https://github.com/cri-o/cri-o/pull/5262

external trackers

Github cri-o/cri-o/pull/5262

Red Hat Issue Tracker MULTIARCH-1648
