-
Bug
-
Resolution: Done-Errata
-
Major
-
None
-
4.14.0
-
Moderate
-
No
-
OCPNODE Sprint 239 (Blue), OCPNODE Sprint 240 (Blue)
-
2
-
Proposed
-
False
-
Opening this against Node/CRI-O, but the problem seems to be in conmon.
It should affect previous OCP versions too.
Description of problem:
When pods are deleted, conmon leaks broken symbolic links in /var/run/crio. Those symbolic links are never garbage collected, so long-running nodes with a high rate of pod creation and deletion (e.g., jobs, cronjobs...) eventually exhaust the available inodes with broken symlinks. This issue has been reported in https://github.com/okd-project/okd/discussions/1497
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-02-25-165218
sh-4.4# crictl version
Version: 0.1.0
RuntimeName: cri-o
RuntimeVersion: 1.26.1-6.rhaos4.13.git159cc9c.el8
RuntimeApiVersion: v1
sh-4.4# conmon --version
conmon version 2.1.6
commit: d8e2824381519d3bc5944944670225c0b66e6e80
sh-4.4# runc --version
runc version 1.1.4
spec: 1.0.2-dev
go: go1.19.4
libseccomp: 2.5.2
How reproducible:
Always
Steps to Reproduce:
0. Ensure the worker is not getting other pods scheduled. For example, taint the node with
```shell
NODE=node1
oc adm taint nodes $NODE conmon-bug=value:NoSchedule
```
1. Create a debug pod to continuously monitor the broken symlinks:
```shell
oc debug node/$NODE
chroot /host
watch 'find /run/crio -type l ! -readable | wc -l'
```
2. Create a pod (use the proper toleration) in a user project:
```shell
oc new-project my-project
oc create -f pod.yaml
```
3. Delete the created pod with `oc delete ...`
4. Wait for the garbage collection to be triggered by the kubelet (about 6 mins)
5. Check again the number of broken links as in point 1
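The monitoring command in point 1 relies on `! -readable`, which is a GNU find extension. For hosts where that is not available, a minimal sketch of the same count could look like the following. The function name `count_broken_symlinks` is made up for illustration and is not part of any tool mentioned in this report.

```shell
# Count symlinks under a directory whose target no longer exists.
# `test -e` follows the link, so it fails only for dangling links.
# Note: file names containing newlines would be miscounted.
count_broken_symlinks() {
    find "$1" -type l ! -exec test -e {} \; -print | wc -l | tr -d ' '
}

# Example, using the directory from this report:
# count_broken_symlinks /run/crio
```

Running it before and after the pod deletion gives the two numbers to compare in point 5.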
Actual results:
The number of broken symlinks is nondecreasing.
Expected results:
The number of broken symlinks is 0 (or, at the very least, roughly constant over time)
Additional info:
Instead of waiting at point 4, you can just delete the container with `crictl rm` on the node that hosts the pod.

I also tried applying a constant rate of pod creation/deletion, and the number of broken symlinks always increased linearly.

The broken symlink is created by `conmon` during the creation of the container, and that link is not removed when the container lifecycle ends. In the audit records below (x86_64), conmon first attempts an unlink (syscall 87, which fails with exit=-2, i.e. ENOENT) and then creates the symlink under /var/run/crio (syscall 88).
```
type=CWD msg=audit(1677322413.913:1809): cwd="/run/containers/storage/overlay-containers/dbe22e9dee1edf1919e6c592a69c967c62a9867a583e6cb2523e4a4ec07ee938/userdata"
type=SYSCALL msg=audit(1677322413.913:1809): arch=c000003e syscall=87 success=no exit=-2 a0=5646135afd80 a1=33 a2=5646135afd80 a3=2b items=1 ppid=1 pid=1058162 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="conmon" exe="/usr/bin/conmon" subj=system_u:system_r:container_runtime_t:s0 key=(null)
----
time->Sat Feb 25 10:53:33 2023
type=PROCTITLE msg=audit(1677322413.913:1810): proctitle=2F7573722F62696E2F636F6E6D6F6E002D62002F72756E2F636F6E7461696E6572732F73746F726167652F6F7665726C61792D636F6E7461696E6572732F646265323265396465653165646631393139653663353932613639633936376336326139383637613538336536636232353233653461346563303765653933382F75
type=PATH msg=audit(1677322413.913:1810): item=2 name="/var/run/crio/dbe22e9dee1edf1919e6c592a69c967c62a9867a583e6cb2523e4a4ec07ee938" inode=12028801 dev=00:18 mode=0120777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:container_var_run_t:s0 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1677322413.913:1810): item=1 name="/var/run/crio/" inode=32000 dev=00:18 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:container_var_run_t:s0 nametype=PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1677322413.913:1810): item=0 name="/run/containers/storage/overlay-containers/dbe22e9dee1edf1919e6c592a69c967c62a9867a583e6cb2523e4a4ec07ee938/userdata" nametype=UNKNOWN cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1677322413.913:1810): cwd="/run/containers/storage/overlay-containers/dbe22e9dee1edf1919e6c592a69c967c62a9867a583e6cb2523e4a4ec07ee938/userdata"
type=SYSCALL msg=audit(1677322413.913:1810): arch=c000003e syscall=88 success=yes exit=0 a0=5646135b7490 a1=5646135afd80 a2=5646135afd80 a3=2b items=3 ppid=1 pid=1058162 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="conmon" exe="/usr/bin/conmon" subj=system_u:system_r:container_runtime_t:s0 key=(null)
----
time->Sat Feb 25 10:53:34 2023
```
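For anyone who wants to repeat the constant-rate test, here is a rough sketch of a create/delete loop. It is not the exact script used for this report. The function name `churn_pods` and its default values are assumptions, and `pod.yaml` is the manifest from the reproduction steps, assumed to define a pod with a fixed name.

```shell
# Repeatedly create and delete a pod to apply a steady churn.
# Arguments (all optional, defaults are arbitrary):
#   $1 number of cycles, $2 pause in seconds between cycles, $3 manifest path
churn_pods() {
    cycles=${1:-10}
    pause=${2:-5}
    manifest=${3:-pod.yaml}
    i=0
    while [ "$i" -lt "$cycles" ]; do
        oc create -f "$manifest" || return 1
        # By default, `oc delete` waits until the object is removed.
        oc delete -f "$manifest" || return 1
        sleep "$pause"
        i=$((i + 1))
    done
}

# Example: 100 cycles with a 10 second pause between them
# churn_pods 100 10 pod.yaml
```

With the `watch` command from the reproduction steps running in the debug pod, the broken symlink count should be seen growing as the loop runs.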
- clones
-
OCPBUGS-7962 Conmon leaks symbolic links in /var/run/crio when pods are deleted
- Closed