Bug
Resolution: Duplicate
4.19.z
The machine-controller container within the machine-api-controllers pod exhibits a severe memory leak, consuming 40 GB of memory after 7 days of operation while managing only 25 machines. The leak appears to be caused by excessive AWS region validation API calls that are not properly cached or cleaned up.
Version Information
* OpenShift Version: 4.19.17
* Component: machine-api-controllers (machine-controller container)
* Container Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcce4eca4d62bc32292cf8c9d2f2211b04a649d7ebdeddaa37465810a18d9e5d
* Cluster Type: Management Cluster (ROSA HCP)
* Cloud Provider: AWS
* Region: mx-central-1 (Mexico Central)
Current State
* Memory Usage: 40,701 Mi (~40 GB)
* Memory Request: 20 Mi (no limit set)
* Actual vs Request: 2,035x over configured request
* Pod Uptime: 7 days (created 2025-11-28T04:55:13Z)
* Node Memory Pressure: 93% (54,209 Mi / 58,240 Mi)
* Machines Managed: 25 machines (all Running phase)
* Container Restarts: 0
Expected Behavior
For 25 machines in normal operation, memory usage should be < 100 Mi.
Actual Behavior
Memory grows continuously at ~5.7 GB/day, reaching 40 GB after 7 days.
Root Cause Analysis
Key Finding: Excessive AWS Region Validation
Analysis of 24-hour logs reveals:
* Region validation calls: 7,484 events
* Frequency: 1 call every ~11.5 seconds
* Reconciliations: 250 events
* Ratio: ~30 region validations per reconciliation
Log Pattern
I1205 23:06:43.152640 1 client.go:333] Region mx-central-1 is not recognized by aws-sdk, trying to validate using API
I1205 23:06:43.268401 1 client.go:310] AWS reports region mx-central-1 is valid
This pattern repeats continuously throughout pod lifetime.
Memory Leak Hypothesis
The memory leak is likely caused by a combination of:
* No region validation caching: the AWS SDK does not recognize mx-central-1 as a known region, so API validation is triggered on every AWS client creation (see the caching sketch after this list)
* AWS client leaks: each validation creates new AWS SDK clients/connections that are not properly closed or garbage collected
* Goroutine leaks: validation goroutines may not be cleaned up after the API calls complete
* Response object accumulation: validation API responses accumulate in memory without proper cleanup
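If this hypothesis holds, caching the validation result for the lifetime of the controller process would eliminate all but the first API call per region. The sketch below illustrates that approach; the names regionCache and newRegionCache and the validate callback are illustrative only and do not correspond to the actual client.go implementation.

package awsclient

import "sync"

// regionCache remembers which regions have already been validated against the
// AWS API, so a region the SDK does not recognize (such as mx-central-1) is
// validated once per controller process instead of on every client creation.
type regionCache struct {
    mu    sync.Mutex
    valid map[string]bool
}

func newRegionCache() *regionCache {
    return &regionCache{valid: map[string]bool{}}
}

// isValidRegion returns the cached result when available; otherwise it calls
// the supplied validator (e.g. the existing API-backed check) and stores the
// outcome for the remaining lifetime of the process.
func (c *regionCache) isValidRegion(region string, validate func(string) (bool, error)) (bool, error) {
    c.mu.Lock()
    if ok, hit := c.valid[region]; hit {
        c.mu.Unlock()
        return ok, nil
    }
    c.mu.Unlock()

    ok, err := validate(region)
    if err != nil {
        return false, err // transient failures are not cached
    }

    c.mu.Lock()
    c.valid[region] = ok
    c.mu.Unlock()
    return ok, nil
}

With a per-process cache like this, the ~7,500 validation calls observed per day would drop to roughly one per region per controller restart, which would also help confirm or rule out the validation path as the source of the leak.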
Supporting Evidence
* Normal reconciliation rate (~0.4 per machine/hour) suggests controller logic is healthy
* Memory growth correlates directly with region validation frequency
* No stuck machines or abnormal machine states
* No errors in logs beyond the region validation warnings
Steps to Reproduce
1. Deploy an OpenShift 4.19.17 cluster in AWS region mx-central-1
2. Allow the machine-api-controllers pod to run for 7 days
3. Monitor memory usage of the machine-controller container
4. Observe the logs for "Region mx-central-1 is not recognized" messages
Impact
* Severity: High
* Control plane node memory pressure (93% utilization)
* Risk of OOM kill on control plane node
* Potential cluster instability if memory exhausted
* Impact on etcd and kube-apiserver co-located on same nodes
Workaround
Restart the machine-api-controllers pod periodically to release accumulated memory:
oc delete pod -n openshift-machine-api <machine-api-controllers-pod>
Recommendations for Fix
* Update AWS SDK: ensure mx-central-1 is in the known-regions list of the AWS SDK version used
* Implement Region Caching: cache region validation results for the lifetime of the controller
* Fix AWS Client Lifecycle: ensure AWS SDK clients are properly reused or closed after operations
* Add Memory Profiling: include pprof endpoints to diagnose memory leaks in production (see the sketch after this list)
* Add Memory Limits: set an appropriate memory limit on the container to prevent node-level impact (after fixing the leak)
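For the profiling recommendation, the Go standard library already provides the needed endpoints; below is a minimal sketch of exposing them from a controller process, assuming an extra HTTP listener can be added (the localhost:6060 address is an example, not a port the controller currently uses).

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve pprof on a side port so heap and goroutine profiles can be taken
    // from the long-running controller without restarting it, e.g.:
    //   go tool pprof http://localhost:6060/debug/pprof/heap
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... controller manager startup would continue here ...
    select {}
}

Once exposed, heap and goroutine profiles taken a few hours apart should show directly whether AWS client objects or validation goroutines are accumulating.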
Additional Context
* Feature Gates: AzureWorkloadIdentity=true,GCPLabelsTags=true,MachineAPIMigration=false,VSphereHostVMGroupZonal=false,VSphereMultiDisk=false
* Leader Election: Enabled (120s lease duration)
* Log Verbosity: v=3
* Cluster Namespaces: 130 total
References
* PagerDuty Incident: https://redhat.pagerduty.com/incidents/Q2W8ZDD1GENE9U
* Alert: ExtremelyHighIndividualControlPlaneMemory
* Cluster ID: ec95db5a-7bd7-41b2-8c0f-acd028b1ba2f
* Management Cluster: hs-mc-c73cj8fag
Evidence Files
Evidence archive available at: /tmp/ocpbugs-machine-controller-leak-evidence.tar.gz (415K)