- Bug
- Resolution: Not a Bug
- Normal
- None
- 4.13.0
- No
- MCO Sprint 232
- 1
- False
Description of problem:
While performing a layered OS upgrade (either via the OpenShift e2e upgrade tests or by overriding the osImageURL field in a MachineConfig), numerous errors appear in the Machine Config Controller logs that resemble the following:

I0216 15:11:38.328052 1 node_controller.go:446] Pool infra[zone=us-east-1a]: node ip-10-0-134-120.ec2.internal: Reporting unready: node ip-10-0-134-120.ec2.internal is reporting OutOfDisk=Unknown
Version-Release number of selected component (if applicable):
How reproducible:
Always.
Steps to Reproduce:
1. Create an OpenShift 4.12 or 4.13 cluster.
2. Either run the OpenShift e2e upgrade tests, or create a MachineConfig that overrides osImageURL with a custom OS image.
3. Watch the Machine Config Controller logs while the node is updating.
Actual results:
Eventually, you'll see log entries that resemble:

I0216 15:07:08.852067 1 event.go:285] Event(v1.ObjectReference{Kind:"MachineConfig", Namespace:"", Name:"rendered-infra-3c7916178f0d7ae4209b1aba41a33b79", UID:"", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OSImageURLOverridden' OSImageURL was overridden via machineconfig in rendered-infra-3c7916178f0d7ae4209b1aba41a33b79 (was: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fc9c9ccd5d76269ffff5672f4751ac5b390d759c9ad02dcf141ef5c6cce4a713 is: quay.io/zzlotnik/testing:4.12-8.6)
I0216 15:11:38.328052 1 node_controller.go:446] Pool infra[zone=us-east-1a]: node ip-10-0-134-120.ec2.internal: Reporting unready: node ip-10-0-134-120.ec2.internal is reporting OutOfDisk=Unknown
I0216 15:11:38.356883 1 node_controller.go:446] Pool infra[zone=us-east-1a]: node ip-10-0-134-120.ec2.internal: changed taints
I0216 15:11:42.082709 1 node_controller.go:446] Pool infra[zone=us-east-1a]: node ip-10-0-134-120.ec2.internal: Reporting unready: node ip-10-0-134-120.ec2.internal is reporting Unschedulable
I0216 15:11:42.104540 1 node_controller.go:446] Pool infra[zone=us-east-1a]: node ip-10-0-134-120.ec2.internal: changed taints

However, this eventually clears and the node returns to service. Looking at the disk on the node, disk usage looks fine:

sh-4.4# df -h | grep -v "container" | grep -v "kubelet"
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  120G  8.9G  111G   8% /
tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
devtmpfs        7.7G     0  7.7G   0% /dev
tmpfs           7.7G     0  7.7G   0% /dev/shm
tmpfs           7.7G   48M  7.7G   1% /run
tmpfs           7.7G   12K  7.7G   1% /tmp
/dev/nvme0n1p3  350M  104M  224M  32% /boot
Expected results:
I would not have expected to see the OutOfDisk=Unknown indication.
Additional info:
- All credit for uncovering this goes to djoshy, who encountered this while looking at https://issues.redhat.com/browse/OCPBUGS-4820.
- I know we're not doing any kind of free-space check before we begin an OS upgrade, but it might be worth considering (see the sketch after this list).
- I initially thought it might have been because of this code path: https://github.com/openshift/machine-config-operator/blob/d22708eb4a26fb7c852a929affe1d50439b7e05f/pkg/daemon/update.go#L1658-L1711, since that path pulls the container using Podman but doesn't delete it from the disk afterward. However, I don't think that's actually an issue, because Podman and CRI-O appear to share the same container storage space and (presumably) the same garbage collection routines (if they exist, that is).
- I also thought it could be this path: https://github.com/openshift/machine-config-operator/blob/d22708eb4a26fb7c852a929affe1d50439b7e05f/pkg/daemon/update.go#L1946-L1961, where we extract the extensions container content to /tmp. However, it looks like we clean up afterward (the pattern is sketched below), so it's probably not that. Furthermore, looking at the /tmp dir on the affected node, I don't see anything in there, meaning that we (or another process) cleared it.
- It is entirely possible that this is a red herring, since the Unknown state means that Kubernetes cannot determine whether a resource is in a given condition (see: https://github.com/kubernetes/api/blob/v0.26.1/core/v1/types.go#L2567-L2575 and the illustration below). If that's the case, we can close this bug as-is. It was just a bit disconcerting to see, knowing that we don't have any pre-upgrade disk usage checks in place.
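On the free-space point above, here is a minimal sketch of what a pre-upgrade check could look like. This is not code that exists in the MCO today: checkFreeSpace, the 4 GiB threshold, and the choice of /var/lib/containers as the mount to watch are all assumptions for illustration.

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// minFreeBytes is an assumed threshold (4 GiB); a real check would need to
// account for the size of the incoming OS image plus scratch space.
const minFreeBytes = 4 * 1024 * 1024 * 1024

// checkFreeSpace is a hypothetical pre-upgrade guard: it fails if the
// filesystem backing path has fewer than minFreeBytes available.
func checkFreeSpace(path string) error {
	var stat unix.Statfs_t
	if err := unix.Statfs(path, &stat); err != nil {
		return fmt.Errorf("statfs %s: %w", path, err)
	}
	free := stat.Bavail * uint64(stat.Bsize)
	if free < minFreeBytes {
		return fmt.Errorf("only %d bytes free on %s, want at least %d", free, path, minFreeBytes)
	}
	return nil
}

func main() {
	// /var/lib/containers backs Podman/CRI-O image storage on RHCOS, so it is
	// the mount most likely to fill up while the new OS image is pulled.
	if err := checkFreeSpace("/var/lib/containers"); err != nil {
		fmt.Println("refusing to start OS upgrade:", err)
		return
	}
	fmt.Println("enough free space to proceed")
}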
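On the /tmp extraction point, the cleanup pattern described there (create a scratch directory, extract into it, remove it when done) boils down to something like the following sketch. The names are illustrative, not the MCO's actual functions.

package main

import (
	"fmt"
	"os"
)

// extractExtensions is an illustrative stand-in for the extraction step: it
// creates a scratch directory under /tmp, does its work there, and removes
// the directory afterward, so nothing is left behind to consume disk space.
func extractExtensions(work func(dir string) error) error {
	dir, err := os.MkdirTemp("", "extensions-")
	if err != nil {
		return fmt.Errorf("creating scratch dir: %w", err)
	}
	// The deferred removal keeps /tmp from filling up even if the
	// extraction step fails partway through.
	defer os.RemoveAll(dir)

	return work(dir)
}

func main() {
	err := extractExtensions(func(dir string) error {
		fmt.Println("extracting extensions content into", dir)
		return nil
	})
	if err != nil {
		fmt.Println("extraction failed:", err)
	}
}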
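Finally, for context on the Unknown status: Kubernetes conditions are tri-state, so a consumer that only distinguishes the healthy value from everything else will lump Unknown in with the unhealthy case, which appears to be why the controller logs the node as unready here. A rough illustration follows; it is not the MCO node controller's actual logic.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// describeCondition shows the three possible states of a node condition.
// Unknown means the control plane could not determine the state (for
// example, the kubelet stopped reporting while the node rebooted), and a
// naive "anything but the healthy value" check flags it as a problem.
func describeCondition(cond corev1.NodeCondition) string {
	switch cond.Status {
	case corev1.ConditionTrue:
		return fmt.Sprintf("%s=True", cond.Type)
	case corev1.ConditionFalse:
		return fmt.Sprintf("%s=False", cond.Type)
	default: // corev1.ConditionUnknown
		return fmt.Sprintf("%s=Unknown (state could not be determined)", cond.Type)
	}
}

func main() {
	// A condition resembling the one reported in this bug.
	cond := corev1.NodeCondition{Type: "OutOfDisk", Status: corev1.ConditionUnknown}
	fmt.Println(describeCondition(cond))
}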
Relates to:
- OCPBUGS-4820: Controller version mismatch causing degradation during upgrades (Closed)
- MCO-517: Prevent node availability check when the kubelet is shutdown (Closed)
- MCO-516: Add a free space check before performing upgrades (To Do)