OpenShift Bugs / OCPBUGS-66931

machine-controller memory leak (40 GB) due to excessive AWS region validation in mx-central-1

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • Affects Version/s: 4.19.z

      The machine-controller container within the machine-api-controllers pod exhibits a severe memory leak, consuming 40 GB of memory after 7 days of operation while managing only 25 machines. The leak appears to be caused by excessive AWS region validation API calls that are not properly cached or cleaned up.

      Version Information

      * OpenShift Version: 4.19.17
      * Component: machine-api-controllers (machine-controller container)
      * Container Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcce4eca4d62bc32292cf8c9d2f2211b04a649d7ebdeddaa37465810a18d9e5d
      * Cluster Type: Management Cluster (ROSA HCP)
      * Cloud Provider: AWS
      * Region: mx-central-1 (Mexico Central)

      Current State

      * Memory Usage: 40,701 Mi (~40 GB)
      * Memory Request: 20 Mi (no limit set)
      * Actual vs Request: 2,035x over configured request
      * Pod Uptime: 7 days (created 2025-11-28T04:55:13Z)
      * Node Memory Pressure: 93% (54,209 Mi / 58,240 Mi)
      * Machines Managed: 25 machines (all Running phase)
      * Container Restarts: 0

      Expected Behavior

      For 25 machines in normal operation, memory usage should be < 100 Mi.

      Actual Behavior

      Memory grows continuously at ~5.7 GB/day, reaching 40 GB after 7 days.

      Root Cause Analysis

      Key Finding: Excessive AWS Region Validation

      Analysis of 24-hour logs reveals:
      * Region validation calls: 7,484 events
      * Frequency: 1 call every ~11.5 seconds
      * Reconciliations: 250 events
      * Ratio: ~30 region validations per reconciliation

      Log Pattern

      I1205 23:06:43.152640       1 client.go:333] Region mx-central-1 is not recognized by aws-sdk, trying to validate using API
      I1205 23:06:43.268401       1 client.go:310] AWS reports region mx-central-1 is valid

      This pattern repeats continuously throughout pod lifetime.

      Memory Leak Hypothesis

      The memory leak is likely caused by one or more of the following (a diagnostic sketch for narrowing these down follows the list):

      * No Region Validation Caching: The AWS SDK does not recognize mx-central-1 as a known region, triggering API validation on every AWS client creation
      * AWS Client Leaks: Each validation creates new AWS SDK clients/connections that are not properly closed or garbage collected
      * Goroutine Leaks: Validation goroutines may not be properly cleaned up after the API calls complete
      * Response Object Accumulation: Validation API responses accumulate in memory without proper cleanup
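
      To help tell these hypotheses apart without a full heap profile, one lightweight check is whether the goroutine count grows alongside heap usage. The sketch below is illustrative only and is not taken from the machine-api codebase; it simply logs runtime heap and goroutine statistics once a minute. Goroutine counts climbing in step with the region-validation log lines would support the goroutine-leak hypothesis, while flat goroutine counts with a growing heap would point at client/response object accumulation.

      package main

      import (
          "log"
          "runtime"
          "time"
      )

      func main() {
          ticker := time.NewTicker(1 * time.Minute)
          defer ticker.Stop()

          // Log heap size, live object count, and goroutine count once a minute.
          for range ticker.C {
              var m runtime.MemStats
              runtime.ReadMemStats(&m)
              log.Printf("heap_alloc=%d MiB heap_objects=%d goroutines=%d",
                  m.HeapAlloc/1024/1024, m.HeapObjects, runtime.NumGoroutine())
          }
      }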

      Supporting Evidence

      * Normal reconciliation rate (~0.4 per machine/hour) suggests controller logic is healthy
      * Memory growth correlates directly with region validation frequency
      * No stuck machines or abnormal machine states
      * No errors in logs beyond the region validation warnings

      Steps to Reproduce

      1. Deploy an OpenShift 4.19.17 cluster in AWS region mx-central-1
      2. Allow the machine-api-controllers pod to run for 7 days
      3. Monitor memory usage of the machine-controller container
      4. Observe the logs for "Region mx-central-1 is not recognized" messages

      Impact

      * Severity: High
      * Control plane node memory pressure (93% utilization)
      * Risk of OOM kill on control plane node
      * Potential cluster instability if memory exhausted
      * Impact on etcd and kube-apiserver co-located on same nodes

      Workaround

      Restart the machine-api-controllers pod periodically to release accumulated memory:

      oc delete pod <machine-api-controllers-pod> -n openshift-machine-api

      Recommendations for Fix

      * Update AWS SDK: Ensure mx-central-1 is in the known regions list of the AWS SDK version used
      * Implement Region Caching: Cache region validation results for the lifetime of the controller (see the sketch after this list)
      * Fix AWS Client Lifecycle: Ensure AWS SDK clients are properly reused or closed after operations
      * Add Memory Profiling: Include pprof endpoints to diagnose memory leaks in production (see the pprof example after this list)
      * Add Memory Limits: Set appropriate memory limits on the container to prevent node-level impact (after fixing the leak)
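
      As a sketch of the region-caching recommendation: the idea is to remember the result of the API-based validation for the lifetime of the process, so the AWS API is consulted at most once per region instead of on every AWS client creation. This is not the actual provider code; validateRegionViaAPI below is a placeholder standing in for the existing check in client.go.

      package main

      import (
          "fmt"
          "sync"
      )

      var (
          regionCacheMu sync.Mutex
          regionCache   = map[string]bool{} // region name -> result of API validation
      )

      // validateRegionViaAPI is a placeholder for the existing API-based check that
      // the controller logs as "trying to validate using API".
      func validateRegionViaAPI(region string) (bool, error) {
          // ... single call to the AWS API to check the region ...
          return true, nil
      }

      // regionIsValid consults the AWS API at most once per region for the lifetime
      // of the controller and serves every later lookup from the in-memory cache.
      func regionIsValid(region string) (bool, error) {
          regionCacheMu.Lock()
          defer regionCacheMu.Unlock()

          if valid, ok := regionCache[region]; ok {
              return valid, nil
          }
          valid, err := validateRegionViaAPI(region)
          if err != nil {
              return false, err
          }
          regionCache[region] = valid
          return valid, nil
      }

      func main() {
          for i := 0; i < 3; i++ {
              valid, _ := regionIsValid("mx-central-1")
              fmt.Println("mx-central-1 valid:", valid) // the API would be hit only on the first call
          }
      }

      For the memory-profiling recommendation, the Go standard library's net/http/pprof package exposes heap and goroutine profiles over HTTP. A minimal standalone example is below; the controller would likely wire this behind a flag rather than enable it unconditionally.

      package main

      import (
          "log"
          "net/http"
          _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
      )

      func main() {
          // Heap profiles can then be captured with:
          //   go tool pprof http://localhost:6060/debug/pprof/heap
          log.Fatal(http.ListenAndServe("localhost:6060", nil))
      }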

      Additional Context

      * Feature Gates: AzureWorkloadIdentity=true,GCPLabelsTags=true,MachineAPIMigration=false,VSphereHostVMGroupZonal=false,VSphereMultiDisk=false
      * Leader Election: Enabled (120s lease duration)
      * Log Verbosity: v=3
      * Cluster Namespaces: 130 total

      References

      * PagerDuty Incident: https://redhat.pagerduty.com/incidents/Q2W8ZDD1GENE9U
      * Alert: ExtremelyHighIndividualControlPlaneMemory
      * Cluster ID: ec95db5a-7bd7-41b2-8c0f-acd028b1ba2f
      * Management Cluster: hs-mc-c73cj8fag

      Evidence Files

      Evidence archive available at: /tmp/ocpbugs-machine-controller-leak-evidence.tar.gz (415K)

              Assignee: Unassigned
              Reporter: Dustin Row (drow.openshift.srep)