Bug
Resolution: Unresolved
Quality / Stability / Reliability
Description of problem:
High IO for a short duration (5 min) on all worker nodes causes some VirtualMachineInstances to crash.
Version-Release number of selected component (if applicable):
fence-agents-remediation.v0.6.0.yaml
kubevirt-hyperconverged-operator.v4.18.23.yaml
node-healthcheck-operator.v0.10.1.yaml
odf-operator.v4.18.14-rhodf.yaml
mcg-operator.v4.18.14-rhodf.yaml
odf-csi-addons-operator.v4.18.14-rhodf.yaml
cephcsi-operator.v4.18.14-rhodf.yaml
recipe.v4.18.14-rhodf.yaml
ocs-operator.v4.18.14-rhodf.yaml
ocs-client-operator.v4.18.14-rhodf.yaml
odf-prometheus-operator.v4.18.14-rhodf.yaml
rook-ceph-operator.v4.18.14-rhodf.yaml
odf-dependencies.v4.18.14-rhodf.yaml
How reproducible:
Run stress-ng with the below values for 5 minutes on all worker nodes (a rough standalone equivalent is sketched below):
io-block-size: 4k
io-write-bytes: 2g
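For reference, a minimal standalone sketch of the same load, assuming the krkn io-hog image maps these options onto stress-ng's hdd stressor (the exact flags krkn passes internally are an assumption here):

# Assumed mapping of krkn io-hog options to stress-ng hdd stressor flags:
#   io-block-size: 4k                            -> --hdd-write-size 4k
#   io-write-bytes: 2g                           -> --hdd-bytes 2g
#   duration: 300                                -> --timeout 300s
#   io-target-pod-folder on hostPath /root       -> --temp-path /root
#   workers: '' (cpu auto-detection)             -> one hdd worker per CPU
stress-ng --hdd "$(nproc)" --hdd-write-size 4k --hdd-bytes 2g --temp-path /root --timeout 300s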
Steps to Reproduce:
1. Run the io hog test using krkn (https://github.com/krkn-chaos/krkn) with the below config for the io-hog scenario:

   duration: 300
   workers: ''                     # leave it empty '' for node cpu auto-detection
   hog-type: io
   image: quay.io/krkn-chaos/krkn-hog
   namespace: default
   io-block-size: 4k
   io-write-bytes: 2g
   io-target-pod-folder: /hog-data
   # node-name: "worker-0"         # Uncomment to target a specific node by name
   io-target-pod-volume:
     name: node-volume
     hostPath:
       path: /root                 # a path writable by kubelet in the root filesystem of the node
   node-selector: "node-role.kubernetes.io/worker="
   number-of-nodes: ''
   taints: []                      # example: ["node-role.kubernetes.io/master:NoSchedule"]
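To observe the reported failure while the hog is running, the VMI state and related events can be watched; a minimal sketch (the virt-clone-clones namespace and the Stopped reason are taken from the events under "Actual results"):

# Watch VirtualMachineInstances across all namespaces for phase changes / crashes
oc get vmi -A -w

# After the run, list warning events for stopped/crashed VMIs
oc get events -n virt-clone-clones --field-selector type=Warning,reason=Stopped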
Actual results:
openshift-kni-infra   16m   Warning   Unhealthy   pod/keepalived-e10-h27-000-r660          Liveness probe failed: command timed out
virt-clone-clones     14m   Warning   Stopped     virtualmachineinstance/clone-vm-0-108    The VirtualMachineInstance crashed.
Expected results:
VirtualMachineInstances should be able to withstand high IO usage for a short duration.
Additional info:
Below is iostat output from one of the nodes. IO operations were done on sda (sda4), the root disk for the node:
avg-cpu: %user %nice %system %iowait %steal %idle
3.50 0.00 86.68 0.08 0.00 9.74
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0c0n1 0.00 0.00 0.00 0.00 0.00 0.00 18.50 146.00 17.50 48.61 0.03 7.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.80
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 18.50 146.00 0.00 0.00 0.00 7.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.80
rbd0 0.00 0.00 0.00 0.00 0.00 0.00 3.00 523.00 0.00 0.00 1209.83 174.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.63 52.75
rbd1 0.00 0.00 0.00 0.00 0.00 0.00 3.00 12.00 0.00 0.00 679.50 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.04 83.70
rbd10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd12 0.00 0.00 0.00 0.00 0.00 0.00 1.50 522.00 0.00 0.00 2699.33 348.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.05 40.00
rbd13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd19 0.00 0.00 0.00 0.00 0.00 0.00 2.00 42.00 0.00 0.00 250.25 21.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 67.90
rbd2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd22 15.00 524.00 0.00 0.00 39.03 34.93 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.59 63.75
rbd23 0.00 0.00 0.00 0.00 0.00 0.00 1.00 26.00 0.00 0.00 1369.50 26.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.37 100.00
rbd3 0.00 0.00 0.00 0.00 0.00 0.00 0.50 512.00 0.00 0.00 1848.00 1024.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.92 100.00
rbd4 0.00 0.00 0.00 0.00 0.00 0.00 0.50 64.00 0.00 0.00 3047.00 128.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.52 26.15
rbd5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd8 42.50 1564.00 0.00 0.00 9.94 36.80 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.42 36.05
rbd9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 1948.50 34840.00 72.00 3.56 0.07 17.88 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 12.40
sdb 1.50 6.00 0.00 0.00 0.00 4.00 482.50 3730.00 12.00 2.43 0.54 7.73 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.26 55.50
sdc 0.50 2.00 0.00 0.00 0.00 4.00 603.00 4756.00 11.00 1.79 0.56 7.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.34 65.10
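For reference, extended per-device statistics like the above can be collected during the run with something along these lines (the interval is an assumption, not taken from this report):

# Extended device statistics, refreshed every 2 seconds
iostat -xd 2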
Though the stress was run on sda, its utilization is quite low compared to the rbd devices, which show high utilization and high wait times. The disks used for Ceph are sdb and sdc.
The Ceph health check (ceph -s) did not indicate any "slow ops" while stress-ng was run for 5 minutes.
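If this is reproduced again, RBD-side latency could be correlated with the VMI crashes using, for example (run from the Ceph toolbox; the toolbox deployment name below is an assumption for a typical ODF/rook-ceph setup):

# Assumed toolbox deployment name on an ODF/rook-ceph cluster
oc rsh -n openshift-storage deploy/rook-ceph-tools

# Inside the toolbox: overall health and any slow ops
ceph -s
ceph health detail

# Per-image RBD latency/IOPS during the stress window
rbd perf image iostat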