Bug
Resolution: Unresolved
Normal
4.18.z
Quality / Stability / Reliability
False
Moderate
Issue:
VG-Manager enters an infinite retry loop when TwistLock privileged storage scanning interferes with LVM lock file operations at /var/lock/vgmanager/vgmanager.lock. Every retry attempt generates an INFO-level log message, with no rate limiting and no retry limit:

2025-08-07T15:26:26.624633544+00:00 stderr F {"level":"info","ts":"2025-08-07T15:26:26Z","msg":"Waiting for lock to be released","lockFile":"/var/lock/vgmanager/vgmanager.lock"}
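For context, here is a minimal Go sketch of the retry pattern this message implies (hypothetical, standard library only; not the actual vg-manager source, and it assumes the lock file is a plain flock(2)-style advisory lock): an unbounded loop that logs on every failed attempt, with no backoff growth, no retry cap, and no log rate limiting.

package main

import (
	"log"
	"os"
	"syscall"
	"time"
)

const lockFile = "/var/lock/vgmanager/vgmanager.lock"

// waitForLock illustrates the failure mode: every failed attempt to take
// the exclusive lock emits a log line, and nothing ever stops the loop.
func waitForLock() (*os.File, error) {
	f, err := os.OpenFile(lockFile, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	for {
		// Non-blocking exclusive lock; this keeps failing for as long as
		// another process (e.g. a privileged scanner) holds the file.
		if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err == nil {
			return f, nil
		}
		// Emitted on every iteration -> tens of thousands of lines per second.
		log.Printf(`{"level":"info","msg":"Waiting for lock to be released","lockFile":%q}`, lockFile)
		time.Sleep(time.Millisecond) // near-zero, constant retry interval
	}
}

func main() {
	f, err := waitForLock()
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	log.Println("lock acquired")
}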
Description of problem:
OpenShift Data Foundation VG-Manager experiences catastrophic lock contention logging when triggered by TwistLock/Prisma Cloud Defender pod restarts.

REPRODUCTION CONFIRMED (August 12, 2025): a TwistLock Defender restart immediately triggered 259,221 identical log lines in 10 seconds (25,922 lines/second), causing:
- Log generation rate of 3.89 MB/second
- Complete disk space exhaustion (233GB of logs in ~16 hours)
- Node unresponsiveness requiring a forced reboot
- OVN database corruption across all 6 worker nodes
- Cluster-wide networking failure
(These figures are cross-checked in the sketch below.)
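The reported figures are internally consistent. A quick back-of-the-envelope check (the ~150 bytes per line is an assumption based on the length of the sample message plus its CRI prefix):

package main

import "fmt"

func main() {
	const (
		lines     = 259_221 // identical lines observed
		window    = 10.0    // seconds
		bytesLine = 150.0   // approx. size of one logged line incl. CRI prefix (assumption)
	)
	lps := lines / window               // ≈ 25,922 lines/s
	mbps := lps * bytesLine / 1e6       // ≈ 3.9 MB/s
	gbPer16h := mbps * 3600 * 16 / 1000 // ≈ 224 GB in 16 hours, in line with the 233GB found on disk
	fmt.Printf("%.0f lines/s, %.2f MB/s, %.0f GB in 16 h\n", lps, mbps, gbPer16h)
}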
Steps to Reproduce:
1. Deploy OpenShift 4.18 with LVMS/ODF storage.
2. Deploy TwistLock/Prisma Cloud Defender as a DaemonSet with a privileged security context.
3. Restart the TwistLock Defender pods: oc rollout restart daemonset/twistlock-defender-ds -n twistlock
4. Monitor VG-Manager logs immediately: watch "oc logs -f -n openshift-storage <vg-manager-pod>"
5. Monitor disk space in real time: watch "df -h /"

TRIGGER MECHANISM:
- TwistLock Defender restart initiates privileged storage scanning
- VG-Manager attempts LVM operations requiring an exclusive lock
- TwistLock storage scanning interferes with the lock release mechanism
- VG-Manager enters an infinite retry loop logging "Waiting for lock to be released" (a lock-probe sketch for confirming this follows the list)
- Log generation rate: 25,922 lines/second = 3.89 MB/second
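If the lock contention needs to be confirmed while reproducing (steps 4-5), a small probe like the one below can be run on the affected node, e.g. from oc debug node/<node> (the host filesystem is mounted at /host there, so adjust the path or chroot /host first). This is a hypothetical diagnostic and assumes vgmanager.lock is a standard flock(2) advisory lock, which the "Waiting for lock to be released" message suggests.

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	const lockFile = "/var/lock/vgmanager/vgmanager.lock" // path taken from the vg-manager log message
	f, err := os.Open(lockFile)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	// Try to take the exclusive lock without blocking.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		// EWOULDBLOCK means another process currently holds the lock.
		fmt.Println("lock is HELD by another process:", err)
		os.Exit(2)
	}
	// We got the lock; release it immediately so vg-manager is not disturbed.
	syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	fmt.Println("lock is free")
}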
Actual results:
CATASTROPHIC SYSTEM FAILURE:

Timeline Evidence (August 7, 2025 original incident):
- T+0: TwistLock Defender pods restart
- T+X hours: VG-Manager log explosion begins
- Result: 233GB of logs generated, 100% disk utilization on 2 nodes
- Impact: OVN database corruption on ALL 6 worker nodes
- Recovery: Manual cleanup and forced reboot required

Reproduction Evidence (August 12, 2025):
- T+0: TwistLock Defender restart initiated
- T+10 seconds: 259,221 log lines generated
- Rate: 25,922 lines/second sustained
- Impact: Node became unresponsive, forced reboot required
- Commands: oc exec into TwistLock pods hung (confirms lock contention)

Log Pattern Sample:
{"level":"info","ts":"2025-08-12T21:24:23Z","msg":"Waiting for lock to be released","lockFile":"/var/lock/vgmanager/vgmanager.lock"}
[Repeated 259,221 times in 10 seconds]

Disk Impact Analysis:
- VG-Manager logs: 233GB (52.5% of total filesystem)
- Total pod logs: 235GB (VG-Manager = 99.1% of all pod logs)
- Log dominance: 233GB of the 239GB total log directory
Expected results:
1. TwistLock Defender restarts should NOT interfere with LVMS operations.
2. VG-Manager lock contention should implement (see the sketch after this list):
   - Exponential backoff retry logic
   - Maximum retry count limits
   - Rate-limited logging (not INFO level for every retry)
   - Graceful failure handling without infinite loops
3. Log rotation should handle high-volume scenarios:
   - kubelet log rotation should enforce size limits
   - Log generation should never exceed 100MB per container
   - System should remain stable during storage scanning operations
4. No single pod should be capable of:
   - Consuming >50% of node disk space via logs
   - Causing cluster-wide networking failures
   - Requiring manual intervention for recovery
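A minimal sketch of what item 2 could look like, using only the Go standard library (hypothetical; the actual fix would belong in the lvm-operator's vgmanager code and its structured logger): exponential backoff with a cap, a hard limit on attempts, and at most one log line per interval instead of one per attempt.

package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
	"time"
)

// acquireLockWithBackoff retries with exponential backoff, gives up after
// maxAttempts, and logs at most once per logEvery instead of once per attempt.
func acquireLockWithBackoff(path string, maxAttempts int, logEvery time.Duration) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	delay := 100 * time.Millisecond
	var lastLog time.Time
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err == nil {
			return f, nil
		}
		if time.Since(lastLog) >= logEvery {
			log.Printf("still waiting for %s (attempt %d/%d)", path, attempt, maxAttempts)
			lastLog = time.Now()
		}
		time.Sleep(delay)
		if delay < 10*time.Second { // cap the backoff
			delay *= 2
		}
	}
	f.Close()
	return nil, fmt.Errorf("gave up waiting for lock on %s after %d attempts", path, maxAttempts)
}

func main() {
	f, err := acquireLockWithBackoff("/var/lock/vgmanager/vgmanager.lock", 10, 30*time.Second)
	if err != nil {
		// Fail the operation and let the controller requeue instead of looping forever.
		log.Fatal(err)
	}
	defer f.Close()
	log.Println("lock acquired")
}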
Additional info:
STORAGE CONFIGURATION:
- LVMS deployed with default configuration (no custom lvmd.yaml)
- No explicit log level configuration found for VG-Manager
- Standard kubelet log rotation settings (ContainerLogMaxSize: 10Mi, ContainerLogMaxFiles: 5)
- Log rotation completely bypassed by the high-volume generation rate (see the calculation at the end of this section)

Attachments available from case 04225228:
1. must-gather from the cluster during the incident timeframe
2. sosreport from affected nodes (pre/post recovery)
3. du output showing 233GB VG-Manager log consumption
4. TwistLock Defender pod logs showing ODF image scanning errors
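For scale, a quick calculation of why the configured rotation limits cannot contain this rate (numbers taken from the report above; treats Mi as MiB and GB as 10^9 bytes):

package main

import "fmt"

func main() {
	const (
		rateMBs   = 3.89  // observed log generation rate, MB/s
		maxSizeMi = 10.0  // kubelet ContainerLogMaxSize, MiB
		maxFiles  = 5.0   // kubelet ContainerLogMaxFiles
		totalGB   = 233.0 // VG-Manager logs actually found on disk
	)
	fileFill := maxSizeMi * 1.048576 / rateMBs               // ≈ 2.7 s to fill one 10Mi file
	rotationCap := maxFiles * maxSizeMi * 1.048576 / rateMBs // ≈ 13.5 s of output in the whole 5x10Mi cap
	hoursFor233GB := totalGB * 1000 / rateMBs / 3600         // ≈ 16.6 h, matching the ~16 hour incident window
	fmt.Printf("one 10Mi file fills in %.1f s\n", fileFill)
	fmt.Printf("the full 5x10Mi rotation cap holds only %.1f s of output\n", rotationCap)
	fmt.Printf("writing 233 GB at this rate takes %.1f h\n", hoursFor233GB)
}

In other words, the 50 MiB total cap corresponds to roughly 13 seconds of output at the observed rate, so any rotation that reacts on the order of seconds to minutes is effectively bypassed.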