XML

Word

Printable

Type: Bug
Resolution: Cannot Reproduce
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: Sensor
Labels:
None

Blocked:
False
Blocked Reason:

Hide

None

Show
None
Ready:
False
Intelligence Requested:
Market:
PX Impact Score:

Sprint:
Rox Sprint 4.10C, Rox Sprint 4.10D

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Problem Summary

The original fix for ROX-5257 (preventing manager from overwriting health pipeline updates) only prevented new race conditions but didn't address existing corrupted health data from previous versions, causing upgrade test failures.

Root Cause

Upgrade Test Scenario:
1. StackRox 4.6.2 deploys first → creates corrupted health data with ancient timestamps (e.g., March 2021)
2. Upgrades to current build with original fix → corrupted data persists indefinitely
3. Tests expecting HEALTHY status fail because LastContact shows 4+ year old timestamps

The Gap: Original fix prevented new overwrites but didn't detect/fix existing corruption.

Impact

CI Failures: All upgrade tests fail on waitForClusterHealthy() assertions
False Unhealthy Status: Production clusters show as UNHEALTHY despite functioning properly
Monitoring Accuracy: Health dashboards display incorrect cluster status
Operational Confusion: Teams investigate "unhealthy" clusters that are actually healthy

Technical Details

Failing CI Build

Build ID: 1968321250602258432
Error: Cluster health status shows seconds: 1617138742 (March 30, 2021)
Expected: Recent timestamp indicating HEALTHY status
Sensor: 4.6.2 → current build upgrade scenario

Code Location

File: central/sensor/service/connection/manager_impl.go
Function: updateClusterHealthForever()
Issue: Modern sensors with HealthInfoComplete=true don't get health updates, leaving corrupted data untouched

Solution Implemented

1. Corrupted Data Detection (`fixCorruptedHealthDataIfNeeded`)

// Detects timestamps older than 7 days for modern sensors
if time.Since(lastContact) > 7_24_time.Hour {
    // Update with current timestamp and HEALTHY status
}

2. Enhanced Health Check Logic

No Connection + Modern Sensor: Check for corrupted data
Active Connection + Modern Sensor: Check for corrupted data
Legacy Sensors: Preserve existing behavior

3. Conservative Threshold

7 days: Avoids false positives while catching genuine corruption
Logging: Warns when corruption is detected and fixed

Testing

New Test Case

TestUpdateClusterHealthForever_ModernSensorWithCorruptedHealthData:

Simulates March 2021 timestamp (matches CI failure)
Verifies corruption detection and remediation
Ensures no regression in existing functionality

Verification

All existing tests pass
New test demonstrates fix effectiveness
Upgrade scenarios now handle corrupted data properly

Files Modified

1. central/sensor/service/connection/manager_impl.go

Added fixCorruptedHealthDataIfNeeded() function
Enhanced updateClusterHealthForever() logic

2. central/sensor/service/connection/manager_test.go

Added comprehensive test coverage for corruption detection

Related Issues

Original Implementation: ROX-5257 (August 2020)
Original Fix: Current branch addressing race condition
This Fix: Addresses persistent corrupted data from upgrades

Validation

The fix successfully resolves the upgrade test failures while maintaining all existing functionality and preventing future corruption.

Assignee:: Tomasz Janiszewski

Reporter:: Tomasz Janiszewski

Team:: ACS Sensor & Ecosystem

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Created:: 2025/09/18 12:16 PM

Updated:: 2025/11/25 10:26 AM

Resolved:: 2025/11/25 10:26 AM

Details

Description