Bug
Resolution: Unresolved
Normal
4.18.z
Quality / Stability / Reliability
False
Moderate
Issue:
VG-Manager enters an infinite retry loop when TwistLock privileged storage scanning interferes with LVM lock file operations at /var/lock/vgmanager/vgmanager.lock. Every retry attempt generates an INFO-level log message, with no rate limiting and no retry limit:

2025-08-07T15:26:26.624633544+00:00 stderr F {"level":"info","ts":"2025-08-07T15:26:26Z","msg":"Waiting for lock to be released","lockFile":"/var/lock/vgmanager/vgmanager.lock"}
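For context, here is a minimal Go sketch of the retry pattern this message implies (hypothetical, standard library only; not the actual vg-manager source, and it assumes the lock file is a plain flock(2)-style advisory lock): an unbounded loop that logs on every failed attempt, with no backoff growth, no retry cap, and no log rate limiting.

package main

import (
	"log"
	"os"
	"syscall"
	"time"
)

const lockFile = "/var/lock/vgmanager/vgmanager.lock"

// waitForLock illustrates the failure mode: every failed attempt to take
// the exclusive lock emits a log line, and nothing ever stops the loop.
func waitForLock() (*os.File, error) {
	f, err := os.OpenFile(lockFile, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	for {
		// Non-blocking exclusive lock; this keeps failing for as long as
		// another process (e.g. a privileged scanner) holds the file.
		if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err == nil {
			return f, nil
		}
		// Emitted on every iteration -> tens of thousands of lines per second.
		log.Printf(`{"level":"info","msg":"Waiting for lock to be released","lockFile":%q}`, lockFile)
		time.Sleep(time.Millisecond) // near-zero, constant retry interval
	}
}

func main() {
	f, err := waitForLock()
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	log.Println("lock acquired")
}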
Description of problem:
OpenShift Data Foundation VG-Manager experiences catastrophic lock contention logging when triggered by TwistLock/Prisma Cloud Defender pod restarts.

REPRODUCTION CONFIRMED (August 12, 2025): a TwistLock Defender restart immediately triggered 259,221 identical log lines in 10 seconds (25,922 lines/second), causing:
- Log generation rate of 3.89 MB/second
- Complete disk space exhaustion (233GB of logs in ~16 hours)
- Node unresponsiveness requiring a forced reboot
- OVN database corruption across all 6 worker nodes
- Cluster-wide networking failure
(These figures are cross-checked in the sketch below.)
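The reported figures are internally consistent. A quick back-of-the-envelope check (the ~150 bytes per line is an assumption based on the length of the sample message plus its CRI prefix):

package main

import "fmt"

func main() {
	const (
		lines     = 259_221 // identical lines observed
		window    = 10.0    // seconds
		bytesLine = 150.0   // approx. size of one logged line incl. CRI prefix (assumption)
	)
	lps := lines / window               // ≈ 25,922 lines/s
	mbps := lps * bytesLine / 1e6       // ≈ 3.9 MB/s
	gbPer16h := mbps * 3600 * 16 / 1000 // ≈ 224 GB in 16 hours, in line with the 233GB found on disk
	fmt.Printf("%.0f lines/s, %.2f MB/s, %.0f GB in 16 h\n", lps, mbps, gbPer16h)
}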
Steps to Reproduce:
1. Deploy OpenShift 4.18 with LVMS/ODF storage.
2. Deploy TwistLock/Prisma Cloud Defender as a DaemonSet with a privileged security context.
3. Restart the TwistLock Defender pods: oc rollout restart daemonset/twistlock-defender-ds -n twistlock
4. Monitor VG-Manager logs immediately: watch "oc logs -f -n openshift-storage <vg-manager-pod>"
5. Monitor disk space in real time: watch "df -h /"

TRIGGER MECHANISM:
- TwistLock Defender restart initiates privileged storage scanning
- VG-Manager attempts LVM operations requiring an exclusive lock
- TwistLock storage scanning interferes with the lock release mechanism
- VG-Manager enters an infinite retry loop logging "Waiting for lock to be released" (a lock-probe sketch for confirming this follows the list)
- Log generation rate: 25,922 lines/second = 3.89 MB/second
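If the lock contention needs to be confirmed while reproducing (steps 4-5), a small probe like the one below can be run on the affected node, e.g. from oc debug node/<node> (the host filesystem is mounted at /host there, so adjust the path or chroot /host first). This is a hypothetical diagnostic and assumes vgmanager.lock is a standard flock(2) advisory lock, which the "Waiting for lock to be released" message suggests.

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	const lockFile = "/var/lock/vgmanager/vgmanager.lock" // path taken from the vg-manager log message
	f, err := os.Open(lockFile)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	// Try to take the exclusive lock without blocking.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		// EWOULDBLOCK means another process currently holds the lock.
		fmt.Println("lock is HELD by another process:", err)
		os.Exit(2)
	}
	// We got the lock; release it immediately so vg-manager is not disturbed.
	syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	fmt.Println("lock is free")
}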
Actual results:
CATASTROPHIC SYSTEM FAILURE:

Timeline Evidence (August 7, 2025 original incident):
- T+0: TwistLock Defender pods restart
- T+X hours: VG-Manager log explosion begins
- Result: 233GB of logs generated, 100% disk utilization on 2 nodes
- Impact: OVN database corruption on ALL 6 worker nodes
- Recovery: Manual cleanup and forced reboot required

Reproduction Evidence (August 12, 2025):
- T+0: TwistLock Defender restart initiated
- T+10 seconds: 259,221 log lines generated
- Rate: 25,922 lines/second sustained
- Impact: Node became unresponsive, forced reboot required
- Commands: oc exec into TwistLock pods hung (confirms lock contention)

Log Pattern Sample:
{"level":"info","ts":"2025-08-12T21:24:23Z","msg":"Waiting for lock to be released","lockFile":"/var/lock/vgmanager/vgmanager.lock"}
[Repeated 259,221 times in 10 seconds]

Disk Impact Analysis:
- VG-Manager logs: 233GB (52.5% of total filesystem)
- Total pod logs: 235GB (VG-Manager = 99.1% of all pod logs)
- Log dominance: 233GB of the 239GB total log directory
Expected results:
1. TwistLock Defender restarts should NOT interfere with LVMS operations.
2. VG-Manager lock contention should implement (see the sketch after this list):
   - Exponential backoff retry logic
   - Maximum retry count limits
   - Rate-limited logging (not INFO level for every retry)
   - Graceful failure handling without infinite loops
3. Log rotation should handle high-volume scenarios:
   - kubelet log rotation should enforce size limits
   - Log generation should never exceed 100MB per container
   - System should remain stable during storage scanning operations
4. No single pod should be capable of:
   - Consuming >50% of node disk space via logs
   - Causing cluster-wide networking failures
   - Requiring manual intervention for recovery
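A minimal sketch of what item 2 could look like, using only the Go standard library (hypothetical; the actual fix would belong in the lvm-operator's vgmanager code and its structured logger): exponential backoff with a cap, a hard limit on attempts, and at most one log line per interval instead of one per attempt.

package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
	"time"
)

// acquireLockWithBackoff retries with exponential backoff, gives up after
// maxAttempts, and logs at most once per logEvery instead of once per attempt.
func acquireLockWithBackoff(path string, maxAttempts int, logEvery time.Duration) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	delay := 100 * time.Millisecond
	var lastLog time.Time
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err == nil {
			return f, nil
		}
		if time.Since(lastLog) >= logEvery {
			log.Printf("still waiting for %s (attempt %d/%d)", path, attempt, maxAttempts)
			lastLog = time.Now()
		}
		time.Sleep(delay)
		if delay < 10*time.Second { // cap the backoff
			delay *= 2
		}
	}
	f.Close()
	return nil, fmt.Errorf("gave up waiting for lock on %s after %d attempts", path, maxAttempts)
}

func main() {
	f, err := acquireLockWithBackoff("/var/lock/vgmanager/vgmanager.lock", 10, 30*time.Second)
	if err != nil {
		// Fail the operation and let the controller requeue instead of looping forever.
		log.Fatal(err)
	}
	defer f.Close()
	log.Println("lock acquired")
}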
Additional info:
STORAGE CONFIGURATION:
- LVMS deployed with default configuration (no custom lvmd.yaml)
- No explicit log level configuration found for VG-Manager
- Standard kubelet log rotation settings (ContainerLogMaxSize: 10Mi, ContainerLogMaxFiles: 5)
- Log rotation completely bypassed by the high-volume generation rate (see the calculation at the end of this section)

Attachments available from case 04225228:
1. must-gather from the cluster during the incident timeframe
2. sosreport from affected nodes (pre/post recovery)
3. du output showing 233GB VG-Manager log consumption
4. TwistLock Defender pod logs showing ODF image scanning errors
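For scale, a quick calculation of why the configured rotation limits cannot contain this rate (numbers taken from the report above; treats Mi as MiB and GB as 10^9 bytes):

package main

import "fmt"

func main() {
	const (
		rateMBs   = 3.89  // observed log generation rate, MB/s
		maxSizeMi = 10.0  // kubelet ContainerLogMaxSize, MiB
		maxFiles  = 5.0   // kubelet ContainerLogMaxFiles
		totalGB   = 233.0 // VG-Manager logs actually found on disk
	)
	fileFill := maxSizeMi * 1.048576 / rateMBs               // ≈ 2.7 s to fill one 10Mi file
	rotationCap := maxFiles * maxSizeMi * 1.048576 / rateMBs // ≈ 13.5 s of output in the whole 5x10Mi cap
	hoursFor233GB := totalGB * 1000 / rateMBs / 3600         // ≈ 16.6 h, matching the ~16 hour incident window
	fmt.Printf("one 10Mi file fills in %.1f s\n", fileFill)
	fmt.Printf("the full 5x10Mi rotation cap holds only %.1f s of output\n", rotationCap)
	fmt.Printf("writing 233 GB at this rate takes %.1f h\n", hoursFor233GB)
}

In other words, the 50 MiB total cap corresponds to roughly 13 seconds of output at the observed rate, so any rotation that reacts on the order of seconds to minutes is effectively bypassed.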