OpenShift Bugs / OCPBUGS-60499

[BUG] TwistLock Defender restarts trigger catastrophic VG-Manager lock contention causing 26,000 log lines/second until disk space exhaustion and cluster-wide failure



      Issue:

      VG-Manager enters an infinite retry loop when TwistLock privileged storage scanning interferes with LVM lock file operations at /var/lock/vgmanager/vgmanager.lock.

      Every retry attempt generates an INFO-level log message, with no rate limiting and no retry limit:
      
      2025-08-07T15:26:26.624633544+00:00 stderr F {"level":"info","ts":"2025-08-07T15:26:26Z","msg":"Waiting for lock to be released","lockFile":"/var/lock/vgmanager/vgmanager.lock"}

      Description of problem:

      OpenShift Data Foundation VG-Manager experiences catastrophic lock contention logging when triggered by TwistLock/Prisma Cloud Defender pod restarts.

      REPRODUCTION CONFIRMED: August 12, 2025 - TwistLock Defender restart immediately triggered 259,221 identical log lines in 10 seconds (25,922 lines/second), causing:
      - 3.89 MB/second log generation rate
      - Complete disk space exhaustion (233GB logs in ~16 hours)
      - Node unresponsiveness requiring forced reboot
      - OVN database corruption across all 6 worker nodes
      - Cluster-wide networking failure
       

      Steps to Reproduce:

      1. Deploy OpenShift 4.18 with LVMS/ODF storage
      2. Deploy TwistLock/Prisma Cloud Defender as DaemonSet with privileged security context
      3. Restart TwistLock Defender pods: 
         oc rollout restart daemonset/twistlock-defender-ds -n twistlock
      4. Monitor VG-Manager logs immediately (a rate-measurement sketch follows these steps):
         oc logs -f -n openshift-storage <vg-manager-pod>
      5. Monitor disk space in real-time:
         watch "df -h /"
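
      To quantify the retry-log rate observed in step 4, a one-shot count over a short window can be used. This is a measurement sketch only; <vg-manager-pod> is a placeholder for the pod on the affected node, and the --since window can be adjusted:

         # Count "Waiting for lock to be released" lines emitted in the last 10 seconds
         # (replace <vg-manager-pod> with the actual pod name).
         oc logs -n openshift-storage <vg-manager-pod> --since=10s \
           | grep -c 'Waiting for lock to be released'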
      
      TRIGGER MECHANISM:
      - TwistLock Defender restart initiates privileged storage scanning
      - VG-Manager attempts LVM operations requiring exclusive lock
      - TwistLock storage scanning interferes with lock release mechanism
      - VG-Manager enters infinite retry loop logging "Waiting for lock to be released"
      - Log generation rate: 25,922 lines/second = 3.89 MB/second
      

      Actual results:

      CATASTROPHIC SYSTEM FAILURE:
      
      Timeline Evidence (August 7, 2025 Original Incident):
      - T+0: TwistLock Defender pods restart
      - T+X hours: VG-Manager log explosion begins
      - Result: 233GB logs generated, 100% disk utilization on 2 nodes
      - Impact: OVN database corruption on ALL 6 worker nodes
      - Recovery: Manual cleanup and forced reboot required
      
      Reproduction Evidence (August 12, 2025):
      - T+0: TwistLock Defender restart initiated
      - T+10 seconds: 259,221 log lines generated
      - Rate: 25,922 lines/second sustained
      - Impact: Node became unresponsive, forced reboot required
      - Commands: oc exec into TwistLock pods hung (confirms lock contention)
      
      Log Pattern Sample:
      {"level":"info","ts":"2025-08-12T21:24:23Z","msg":"Waiting for lock to be released","lockFile":"/var/lock/vgmanager/vgmanager.lock"}
      [Repeated 259,221 times in 10 seconds]
      
      Disk Impact Analysis:
      - VG-Manager logs: 233GB (52.5% of total filesystem)
      - Total pod logs: 235GB (VG-Manager = 99.1% of all pod logs)
      - Log dominance: 233GB of 239GB total log directory
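
      The per-pod figures above can be re-derived on the affected node. The sketch below assumes the standard kubelet log layout under /var/log/pods and uses a placeholder node name:

         # Sum the vg-manager pod log directories on the node; the glob assumes
         # the kubelet's <namespace>_<pod>_<uid> directory naming under /var/log/pods.
         oc debug node/<affected-node> -- chroot /host \
           sh -c 'du -sh /var/log/pods/openshift-storage_vg-manager-*'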
       

      Expected results:

      1. TwistLock Defender restarts should NOT interfere with LVMS operations
      
      2. VG-Manager lock contention handling should implement (see the sketch after this list):
         - Exponential backoff retry logic
         - Maximum retry count limits
         - Rate-limited logging (not INFO level for every retry)
         - Graceful failure handling without infinite loops
      
      3. Log rotation should handle high-volume scenarios:
         - kubelet log rotation should enforce size limits
         - Log generation should never exceed 100MB per container
         - System should remain stable during storage scanning operations
      
      4. No single pod should be capable of:
         - Consuming >50% of node disk space via logs
         - Causing cluster-wide networking failures
         - Requiring manual intervention for recovery
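
      As an illustration of the bounded retry behavior requested in item 2 above, the sketch below shows capped, exponentially backed-off waiting on the same lock file. VG-Manager itself is a Go controller, so this shell sketch only models the expected behavior; it is not the actual implementation:

         # Behavioral sketch only: bounded, exponentially backed-off waiting on
         # /var/lock/vgmanager/vgmanager.lock with one log line per attempt.
         LOCK=/var/lock/vgmanager/vgmanager.lock
         delay=1
         max_attempts=8
         for attempt in $(seq 1 "$max_attempts"); do
           # Non-blocking acquire; flock exits non-zero while the lock is held.
           if flock -n 9; then
             echo "lock acquired on attempt $attempt"
             # ... perform the LVM operation, then exit the loop ...
             break
           fi
           echo "attempt $attempt/$max_attempts: lock held, retrying in ${delay}s"
           sleep "$delay"
           delay=$((delay * 2))   # exponential backoff instead of a tight retry loop
         done 9>"$LOCK"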
      

      Additional info:

      STORAGE CONFIGURATION:
      - LVMS deployed with default configuration (no custom lvmd.yaml)
      - No explicit log level configuration found for VG-Manager
      - Standard kubelet log rotation settings (ContainerLogMaxSize: 10Mi, ContainerLogMaxFiles: 5)
      - Log rotation completely bypassed by high-volume generation rate
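
      The rotation settings quoted above can be confirmed per node. The sketch below assumes the rendered kubelet configuration is at /etc/kubernetes/kubelet.conf on the host (typical for RHCOS nodes) and uses a placeholder node name:

         # Print the effective kubelet log-rotation settings on a worker node.
         oc debug node/<worker-node> -- chroot /host \
           grep -iE 'containerLogMaxSize|containerLogMaxFiles' /etc/kubernetes/kubelet.conf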
      
      Attachments available from case 04225228:
      
      1. must-gather from cluster during incident timeframe
      2. sosreport from affected nodes (pre/post recovery)
      3. du output showing 233GB VG-Manager log consumption
      4. TwistLock Defender pod logs showing ODF image scanning errors
       
