OpenShift Bugs / OCPBUGS-66931

machine-controller memory leak (40 GB) due to excessive AWS region validation in mx-central-1

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • Affects Version/s: 4.19.z

      The machine-controller container within the machine-api-controllers pod exhibits a severe memory leak, consuming 40 GB of memory after 7 days of operation while managing only 25 machines. The leak appears to be caused by excessive AWS region validation API calls that are not properly cached or cleaned up.

      Version Information

      * OpenShift Version: 4.19.17
      * Component: machine-api-controllers (machine-controller container)
      * Container Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcce4eca4d62bc32292cf8c9d2f2211b04a649d7ebdeddaa37465810a18d9e5d
      * Cluster Type: Management Cluster (ROSA HCP)
      * Cloud Provider: AWS
      * Region: mx-central-1 (Mexico Central)

      Current State

      * Memory Usage: 40,701 Mi (~40 GB)
      * Memory Request: 20 Mi (no limit set)
      * Actual vs Request: 2,035x over configured request
      * Pod Uptime: 7 days (created 2025-11-28T04:55:13Z)
      * Node Memory Pressure: 93% (54,209 Mi / 58,240 Mi)
      * Machines Managed: 25 machines (all Running phase)
      * Container Restarts: 0

      Expected Behavior

      For 25 machines in normal operation, memory usage should be < 100 Mi.

      Actual Behavior

      Memory grows continuously at ~5.7 GB/day, reaching 40 GB after 7 days.

      Root Cause Analysis

      Key Finding: Excessive AWS Region Validation

      Analysis of 24-hour logs reveals:
      * Region validation calls: 7,484 events
      * Frequency: 1 call every ~11.5 seconds
      * Reconciliations: 250 events
      * Ratio: ~30 region validations per reconciliation

      Log Pattern

      I1205 23:06:43.152640       1 client.go:333] Region mx-central-1 is not recognized by aws-sdk, trying to validate using API
      I1205 23:06:43.268401       1 client.go:310] AWS reports region mx-central-1 is valid

      This pattern repeats continuously throughout pod lifetime.

      Memory Leak Hypothesis

      The memory leak is likely caused by one or more of the following (a diagnostic sketch for narrowing these down follows the list):

      * No Region Validation Caching: The AWS SDK does not recognize mx-central-1 as a known region, triggering API validation on every AWS client creation
      * AWS Client Leaks: Each validation creates new AWS SDK clients/connections that are not properly closed or garbage collected
      * Goroutine Leaks: Validation goroutines may not be properly cleaned up after the API calls complete
      * Response Object Accumulation: Validation API responses accumulate in memory without proper cleanup
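
      To help tell these hypotheses apart without a full heap profile, one lightweight check is whether the goroutine count grows alongside heap usage. The sketch below is illustrative only and is not taken from the machine-api codebase; it simply logs runtime heap and goroutine statistics once a minute. Goroutine counts climbing in step with the region-validation log lines would support the goroutine-leak hypothesis, while flat goroutine counts with a growing heap would point at client/response object accumulation.

      package main

      import (
          "log"
          "runtime"
          "time"
      )

      func main() {
          ticker := time.NewTicker(1 * time.Minute)
          defer ticker.Stop()

          // Log heap size, live object count, and goroutine count once a minute.
          for range ticker.C {
              var m runtime.MemStats
              runtime.ReadMemStats(&m)
              log.Printf("heap_alloc=%d MiB heap_objects=%d goroutines=%d",
                  m.HeapAlloc/1024/1024, m.HeapObjects, runtime.NumGoroutine())
          }
      }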

      Supporting Evidence

      * Normal reconciliation rate (~0.4 per machine/hour) suggests controller logic is healthy
      * Memory growth correlates directly with region validation frequency
      * No stuck machines or abnormal machine states
      * No errors in logs beyond the region validation warnings

      Steps to Reproduce

      1. Deploy an OpenShift 4.19.17 cluster in AWS region mx-central-1
      2. Allow the machine-api-controllers pod to run for 7 days
      3. Monitor memory usage of the machine-controller container
      4. Observe the logs for "Region mx-central-1 is not recognized" messages

      Impact

      * Severity: High
      * Control plane node memory pressure (93% utilization)
      * Risk of OOM kill on control plane node
      * Potential cluster instability if memory exhausted
      * Impact on etcd and kube-apiserver co-located on same nodes

      Workaround

      Restart the machine-api-controllers pod periodically to release accumulated memory:

      oc delete pod <machine-api-controllers-pod> -n openshift-machine-api

      Recommendations for Fix

      * Update AWS SDK: Ensure mx-central-1 is in the known regions list of the AWS SDK version used
      * Implement Region Caching: Cache region validation results for the lifetime of the controller (see the sketch after this list)
      * Fix AWS Client Lifecycle: Ensure AWS SDK clients are properly reused or closed after operations
      * Add Memory Profiling: Include pprof endpoints to diagnose memory leaks in production (see the pprof example after this list)
      * Add Memory Limits: Set appropriate memory limits on the container to prevent node-level impact (after fixing the leak)
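
      As a sketch of the region-caching recommendation: the idea is to remember the result of the API-based validation for the lifetime of the process, so the AWS API is consulted at most once per region instead of on every AWS client creation. This is not the actual provider code; validateRegionViaAPI below is a placeholder standing in for the existing check in client.go.

      package main

      import (
          "fmt"
          "sync"
      )

      var (
          regionCacheMu sync.Mutex
          regionCache   = map[string]bool{} // region name -> result of API validation
      )

      // validateRegionViaAPI is a placeholder for the existing API-based check that
      // the controller logs as "trying to validate using API".
      func validateRegionViaAPI(region string) (bool, error) {
          // ... single call to the AWS API to check the region ...
          return true, nil
      }

      // regionIsValid consults the AWS API at most once per region for the lifetime
      // of the controller and serves every later lookup from the in-memory cache.
      func regionIsValid(region string) (bool, error) {
          regionCacheMu.Lock()
          defer regionCacheMu.Unlock()

          if valid, ok := regionCache[region]; ok {
              return valid, nil
          }
          valid, err := validateRegionViaAPI(region)
          if err != nil {
              return false, err
          }
          regionCache[region] = valid
          return valid, nil
      }

      func main() {
          for i := 0; i < 3; i++ {
              valid, _ := regionIsValid("mx-central-1")
              fmt.Println("mx-central-1 valid:", valid) // the API would be hit only on the first call
          }
      }

      For the memory-profiling recommendation, the Go standard library's net/http/pprof package exposes heap and goroutine profiles over HTTP. A minimal standalone example is below; the controller would likely wire this behind a flag rather than enable it unconditionally.

      package main

      import (
          "log"
          "net/http"
          _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
      )

      func main() {
          // Heap profiles can then be captured with:
          //   go tool pprof http://localhost:6060/debug/pprof/heap
          log.Fatal(http.ListenAndServe("localhost:6060", nil))
      }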

      Additional Context

      * Feature Gates: AzureWorkloadIdentity=true,GCPLabelsTags=true,MachineAPIMigration=false,VSphereHostVMGroupZonal=false,VSphereMultiDisk=false
      * Leader Election: Enabled (120s lease duration)
      * Log Verbosity: v=3
      * Cluster Namespaces: 130 total

      References

      * PagerDuty Incident: https://redhat.pagerduty.com/incidents/Q2W8ZDD1GENE9U
      * Alert: ExtremelyHighIndividualControlPlaneMemory
      * Cluster ID: ec95db5a-7bd7-41b2-8c0f-acd028b1ba2f
      * Management Cluster: hs-mc-c73cj8fag

      Evidence Files

      Evidence archive available at: /tmp/ocpbugs-machine-controller-leak-evidence.tar.gz (415K)

              Assignee: Unassigned
              Reporter: Dustin Row (drow.openshift.srep)