- Bug
- Resolution: Not a Bug
- Normal
- None
- 4.13.0
- No
- MCO Sprint 232
- 1
- False
Description of problem:
While performing a layered OS upgrade (either via the OpenShift e2e upgrade tests or by overriding the osImageURL field in a MachineConfig), numerous errors appear in the Machine Config Controller logs that resemble the following:

I0216 15:11:38.328052 1 node_controller.go:446] Pool infra[zone=us-east-1a]: node ip-10-0-134-120.ec2.internal: Reporting unready: node ip-10-0-134-120.ec2.internal is reporting OutOfDisk=Unknown
Version-Release number of selected component (if applicable):
How reproducible:
Always.
Steps to Reproduce:
1. Create an OpenShift 4.12 or 4.13 cluster.
2. Either run the OpenShift e2e upgrade tests, or create a MachineConfig that overrides osImageURL with a custom OS image.
3. Watch the Machine Config Controller logs while the node is updating.
Actual results:
Eventually, you'll see log entries that resemble:

I0216 15:07:08.852067 1 event.go:285] Event(v1.ObjectReference{Kind:"MachineConfig", Namespace:"", Name:"rendered-infra-3c7916178f0d7ae4209b1aba41a33b79", UID:"", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OSImageURLOverridden' OSImageURL was overridden via machineconfig in rendered-infra-3c7916178f0d7ae4209b1aba41a33b79 (was: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fc9c9ccd5d76269ffff5672f4751ac5b390d759c9ad02dcf141ef5c6cce4a713 is: quay.io/zzlotnik/testing:4.12-8.6)
I0216 15:11:38.328052 1 node_controller.go:446] Pool infra[zone=us-east-1a]: node ip-10-0-134-120.ec2.internal: Reporting unready: node ip-10-0-134-120.ec2.internal is reporting OutOfDisk=Unknown
I0216 15:11:38.356883 1 node_controller.go:446] Pool infra[zone=us-east-1a]: node ip-10-0-134-120.ec2.internal: changed taints
I0216 15:11:42.082709 1 node_controller.go:446] Pool infra[zone=us-east-1a]: node ip-10-0-134-120.ec2.internal: Reporting unready: node ip-10-0-134-120.ec2.internal is reporting Unschedulable
I0216 15:11:42.104540 1 node_controller.go:446] Pool infra[zone=us-east-1a]: node ip-10-0-134-120.ec2.internal: changed taints

However, this eventually clears and the node returns to service. Looking at the disk on the node, disk usage looks fine:

sh-4.4# df -h | grep -v "container" | grep -v "kubelet"
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  120G  8.9G  111G   8% /
tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
devtmpfs        7.7G     0  7.7G   0% /dev
tmpfs           7.7G     0  7.7G   0% /dev/shm
tmpfs           7.7G   48M  7.7G   1% /run
tmpfs           7.7G   12K  7.7G   1% /tmp
/dev/nvme0n1p3  350M  104M  224M  32% /boot
Expected results:
I would not have expected to see the OutOfDisk=Unknown indication.
Additional info:
- All credit for uncovering this goes to djoshy, who encountered this while looking at https://issues.redhat.com/browse/OCPBUGS-4820.
- I know we're not doing any kind of free-space check before we begin an OS upgrade, but it might be worth considering (see the sketch after this list).
- I initially thought it might have been because of this code path: https://github.com/openshift/machine-config-operator/blob/d22708eb4a26fb7c852a929affe1d50439b7e05f/pkg/daemon/update.go#L1658-L1711, since that path pulls the container using Podman but doesn't delete it from the disk afterward. However, I don't think that's actually an issue, because Podman and CRI-O appear to share the same container storage space and (presumably) the same garbage collection routines (if they exist, that is).
- I also thought it could be this path: https://github.com/openshift/machine-config-operator/blob/d22708eb4a26fb7c852a929affe1d50439b7e05f/pkg/daemon/update.go#L1946-L1961, where we extract the extensions container content to /tmp. However, it looks like we clean up afterward (the pattern is sketched below), so it's probably not that. Furthermore, looking at the /tmp dir on the affected node, I don't see anything in there, meaning that we (or another process) cleared it.
- It is entirely possible that this is a red herring, since the Unknown state means that Kubernetes cannot determine whether a resource is in a given condition (see: https://github.com/kubernetes/api/blob/v0.26.1/core/v1/types.go#L2567-L2575 and the illustration below). If that's the case, we can close this bug as-is. It was just a bit disconcerting to see, knowing that we don't have any pre-upgrade disk usage checks in place.
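On the free-space point above, here is a minimal sketch of what a pre-upgrade check could look like. This is not code that exists in the MCO today: checkFreeSpace, the 4 GiB threshold, and the choice of /var/lib/containers as the mount to watch are all assumptions for illustration.

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// minFreeBytes is an assumed threshold (4 GiB); a real check would need to
// account for the size of the incoming OS image plus scratch space.
const minFreeBytes = 4 * 1024 * 1024 * 1024

// checkFreeSpace is a hypothetical pre-upgrade guard: it fails if the
// filesystem backing path has fewer than minFreeBytes available.
func checkFreeSpace(path string) error {
	var stat unix.Statfs_t
	if err := unix.Statfs(path, &stat); err != nil {
		return fmt.Errorf("statfs %s: %w", path, err)
	}
	free := stat.Bavail * uint64(stat.Bsize)
	if free < minFreeBytes {
		return fmt.Errorf("only %d bytes free on %s, want at least %d", free, path, minFreeBytes)
	}
	return nil
}

func main() {
	// /var/lib/containers backs Podman/CRI-O image storage on RHCOS, so it is
	// the mount most likely to fill up while the new OS image is pulled.
	if err := checkFreeSpace("/var/lib/containers"); err != nil {
		fmt.Println("refusing to start OS upgrade:", err)
		return
	}
	fmt.Println("enough free space to proceed")
}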
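On the /tmp extraction point, the cleanup pattern described there (create a scratch directory, extract into it, remove it when done) boils down to something like the following sketch. The names are illustrative, not the MCO's actual functions.

package main

import (
	"fmt"
	"os"
)

// extractExtensions is an illustrative stand-in for the extraction step: it
// creates a scratch directory under /tmp, does its work there, and removes
// the directory afterward, so nothing is left behind to consume disk space.
func extractExtensions(work func(dir string) error) error {
	dir, err := os.MkdirTemp("", "extensions-")
	if err != nil {
		return fmt.Errorf("creating scratch dir: %w", err)
	}
	// The deferred removal keeps /tmp from filling up even if the
	// extraction step fails partway through.
	defer os.RemoveAll(dir)

	return work(dir)
}

func main() {
	err := extractExtensions(func(dir string) error {
		fmt.Println("extracting extensions content into", dir)
		return nil
	})
	if err != nil {
		fmt.Println("extraction failed:", err)
	}
}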
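Finally, for context on the Unknown status: Kubernetes conditions are tri-state, so a consumer that only distinguishes the healthy value from everything else will lump Unknown in with the unhealthy case, which appears to be why the controller logs the node as unready here. A rough illustration follows; it is not the MCO node controller's actual logic.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// describeCondition shows the three possible states of a node condition.
// Unknown means the control plane could not determine the state (for
// example, the kubelet stopped reporting while the node rebooted), and a
// naive "anything but the healthy value" check flags it as a problem.
func describeCondition(cond corev1.NodeCondition) string {
	switch cond.Status {
	case corev1.ConditionTrue:
		return fmt.Sprintf("%s=True", cond.Type)
	case corev1.ConditionFalse:
		return fmt.Sprintf("%s=False", cond.Type)
	default: // corev1.ConditionUnknown
		return fmt.Sprintf("%s=Unknown (state could not be determined)", cond.Type)
	}
}

func main() {
	// A condition resembling the one reported in this bug.
	cond := corev1.NodeCondition{Type: "OutOfDisk", Status: corev1.ConditionUnknown}
	fmt.Println(describeCondition(cond))
}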
Relates to:
- OCPBUGS-4820: Controller version mismatch causing degradation during upgrades (Closed)
- MCO-517: Prevent node availability check when the kubelet is shutdown (Closed)
- MCO-516: Add a free space check before performing upgrades (To Do)