Goal
- In a day-2 task, if /var/lib/containers is mounted separately on a different device using a MachineConfig (a sketch follows this list), the Node object does not reflect the additional storage in Capacity.ephemeral-storage and Allocatable.ephemeral-storage.
- When /var/lib/containers is mounted on a different device while the root filesystem (which includes /var/lib/kubelet) is kept separate, so that filling /var/lib/containers does not also fill /var/lib/kubelet, the ephemeral-storage quota applied at the project level does not respect growth in /var/lib/containers; the pod is evicted only when the ephemeral-storage limit is exceeded within /var/lib/kubelet.
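For context, the separate mount is created with a MachineConfig along these lines. This is a minimal sketch only: the object name, device name, and filesystem type are assumptions, and the actual configuration follows the KCS document referenced under Scenarios.
# oc apply -f - <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 98-var-lib-containers-mount        # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - name: var-lib-containers.mount   # systemd requires the unit name to match the mount path
          enabled: true
          contents: |
            [Unit]
            Before=local-fs.target
            [Mount]
            What=/dev/sdb                  # assumption: the second disk, as in the lsblk output below
            Where=/var/lib/containers
            Type=xfs                       # assumption
            [Install]
            WantedBy=local-fs.target
EOF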
Why is this important?
- The upstream Kubernetes documentation on local ephemeral storage (the two-filesystems configuration section) states that a node with two filesystems is a supported setup.
- This matters when a container writes to a path such as /tmp in its own filesystem that is not backed by an emptyDir volume on the node: that space is consumed from /var/lib/containers, and the separate partition can reach 100% utilization if the node is not monitoring usage and evicting pods. In such cases it is not known whether the node will go NotReady when a separately mounted /var/lib/containers is 100% full. (A sketch of this failure mode follows this list.)
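As a minimal illustration of that failure mode (the pod name, image, and sizes are placeholders, not taken from the case), a pod with no volumes writes /tmp data into its container's writable layer, which lives under /var/lib/containers on the node:
# oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: writable-layer-demo                # hypothetical
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
      # No volumeMounts: anything written to /tmp goes into the container's
      # writable layer on the node, i.e. under /var/lib/containers.
      command: ["sh", "-c", "dd if=/dev/zero of=/tmp/fill.bin bs=1M count=1024 && sleep infinity"]
EOF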
Scenarios
- After mounting /var/lib/containers separately by following the KCS document, I checked the node's ephemeral storage:
# lsblk
sda4 8:4 0 119.5G 0 part /sysroot
sdb x:x 40G part /var/lib/containers
The additional capacity is not reflected in the Node resource:
# oc describe node node-name | grep storage
  ephemeral-storage:  125293548Ki
  ephemeral-storage:  114396791822
Applied a ResourceQuota for ephemeral-storage in one project and deployed a pod on the same node:
# oc describe quota
Name:       compute-resources
Namespace:  testdd
Resource                    Used  Hard
--------                    ----  ----
limits.ephemeral-storage    1Gi   1Gi
requests.ephemeral-storage  1Gi   1Gi

# oc get pods -o wide
mysql-2-2krf4   1/1   Running   0   5m10s   10.129.2.8   node-name   <none>   <none>
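For reproducibility, a quota matching the describe output above would look like this (reconstructed from the output, not copied from the case):
# oc apply -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: testdd
spec:
  hard:
    requests.ephemeral-storage: 1Gi
    limits.ephemeral-storage: 1Gi
EOF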
Tried creating a 1Gi file with dd, going beyond the quota limits, inside the pod's emptyDir location on the node:
# cd /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~empty-dir/mysql-data/
# dd if=/dev/zero of=1g.bin bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.04763 s, 1.0 GB/s
# du -h
160K  ./#innodb_temp
32K   ./mysql
1.6M  ./performance_schema
80K   ./sys
0     ./sampledb
1.7G  .
After this, the pod was evicted as expected:
mysql-2-2krf4 0/1 Evicted 0 5m49s <none> node-name <none> <none>
To check whether the pod gets evicted after creating a 1G file at /var/lib/containers:
On the node, find the container's root filesystem:
# crictl inspect <container-id> | grep -i "root" -A 2
    "root": {
      "path": "/var/lib/containers/storage/overlay/e5ce4dfe909922ec65dabb86cbc84521d5e0dec21a547d31272330cade09e5af/merged"
    }
On the node:
# cd /var/lib/containers/storage/overlay/e5ce4dfe909922ec65dabb86cbc84521d5e0dec21a547d31272330cade09e5af/merged
# ls
bin  boot  dev  etc  help.1  home  lib  lib64  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
# ls tmp/
11-sample.txt  ks-script-1ivkqzo2  ks-script-heymndnb
# df /var/lib/containers/
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sdb        41922560 3902184  38020376  10% /var/lib/containers
# dd if=/dev/zero of=1g.bin bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.2062 s, 890 MB/s
# df /var/lib/containers/
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sdb        41922560 4950760  36971800  12% /var/lib/containers
In this case, the pod was not evicted and stayed in the Running state, without respecting the quota limits.
So, from these observations, it appears that the node's ephemeral-storage does not consider the combined size of the root filesystem and the disk added for /var/lib/containers. Likewise, the ephemeral-storage limits specified for a pod do not account for growth in /var/lib/containers, so the pod is not evicted.
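For reference, the two fields in question can also be read directly from the Node object (values as in the describe output above):
# oc get node node-name -o jsonpath='{.status.capacity.ephemeral-storage}{"\n"}'
125293548Ki
# oc get node node-name -o jsonpath='{.status.allocatable.ephemeral-storage}{"\n"}'
114396791822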
Note: Deleting the Node object and adding it back made no difference to the reported ephemeral-storage size.
Customer Justification / Notes
tmpfs wasn't considered because the space is too small: we are looking at using 1.5-3+ TB of ephemeral storage on a system with only 384 GB of memory, and tmpfs would typically be limited to half of that as well.
Chris and I had a meeting with one of the leads working on LSO; after testing, we deemed it not applicable due to its lack of dynamic provisioning support. Only one PVC is allowed per PV, and we would have to manage partitions in a very rigid, un-cloudlike fashion to support scaling the number of pods.
hostPath, as mentioned, has the issues of security and SCCs, in addition to the fact that we'd have to write jobs to clean up and free the underlying storage when pods are evicted.
Therefore, we were left with ephemeral storage backed by a block device. In the case of emptyDir, we found together with Aditya in the case, and through testing, that we cannot remount /var/lib/kubelet separately from /var or else the kubelet will not work correctly. If we mount /var separately in its entirety, then we cannot resize the backing storage behind emptyDir on day 2, which is one of our requirements.
Therefore we have to use ephemeral storage via the container filesystem, such as /tmp; this space comes from /var/lib/containers. Kubernetes supports requests and limits for ephemeral storage, which is one of our requirements, because we cannot have pods growing in size out of control and causing other pods (some cluster-critical, some application-critical) to be evicted. So the issue is that while IKS/upstream Kubernetes supports requests and limits for ephemeral-storage, OpenShift only enforces them for emptyDir (/var/lib/kubelet) itself. This does not cover our use of local ephemeral storage (/var/lib/containers, which within a pod could be /tmp or other non-"volume" storage). The resource stanza we rely on is sketched below.
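For clarity, the upstream pattern in question is the per-container ephemeral-storage request and limit. A minimal sketch follows, with the pod name and image as placeholders; the expectation is that writes to the container's writable layer (backed by /var/lib/containers) count against this limit, not only emptyDir usage under /var/lib/kubelet:
# oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-limit-demo               # hypothetical
  namespace: testdd
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
      resources:
        requests:
          ephemeral-storage: 1Gi
        limits:
          ephemeral-storage: 1Gi           # should bound writable-layer usage as well
EOF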
So there is a deficiency in OpenShift versus IKS/Kubernetes, and it is in exactly the area that we require.
Acceptance Criteria
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
Dependencies (internal and external)
Previous Work (Optional):
Open questions:
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
- is related to
  - OCPSTRAT-1592 Support for Configuring Additional Disks During OpenShift Installation - Phase I (New)
  - OCPSTRAT-1065 Enhancing Storage Capacity for Kubelet in an Existing OpenShift Cluster (New)
- relates to
  - OCPSTRAT-188 Split filesystem and make each partition first class citizen for kubelet (In Progress)