Uploaded image for project: 'Red Hat Advanced Cluster Security'
  1. Red Hat Advanced Cluster Security
  2. ROX-30946

Fix corrupted health data persistence in upgrade scenarios for modern sensors

    • Icon: Bug Bug
    • Resolution: Cannot Reproduce
    • Icon: Major Major
    • None
    • None
    • Sensor
    • None
    • False
    • Hide

      None

      Show
      None
    • False
    • Rox Sprint 4.10C, Rox Sprint 4.10D

      Problem Summary

      The original fix for ROX-5257 (preventing manager from overwriting health pipeline updates) only prevented new race conditions but didn't address existing corrupted health data from previous versions, causing upgrade test failures.

      Root Cause

      Upgrade Test Scenario:
      1. StackRox 4.6.2 deploys first → creates corrupted health data with ancient timestamps (e.g., March 2021)
      2. Upgrades to current build with original fix → corrupted data persists indefinitely
      3. Tests expecting HEALTHY status fail because LastContact shows 4+ year old timestamps

      The Gap: Original fix prevented new overwrites but didn't detect/fix existing corruption.

      Impact

      • CI Failures: All upgrade tests fail on waitForClusterHealthy() assertions
      • False Unhealthy Status: Production clusters show as UNHEALTHY despite functioning properly
      • Monitoring Accuracy: Health dashboards display incorrect cluster status
      • Operational Confusion: Teams investigate "unhealthy" clusters that are actually healthy

      Technical Details

      Failing CI Build

      • Build ID: 1968321250602258432
      • Error: Cluster health status shows seconds: 1617138742 (March 30, 2021)
      • Expected: Recent timestamp indicating HEALTHY status
      • Sensor: 4.6.2 → current build upgrade scenario

      Code Location

      • File: central/sensor/service/connection/manager_impl.go
      • Function: updateClusterHealthForever()
      • Issue: Modern sensors with HealthInfoComplete=true don't get health updates, leaving corrupted data untouched

      Solution Implemented

      1. Corrupted Data Detection (fixCorruptedHealthDataIfNeeded)

      // Detects timestamps older than 7 days for modern sensors
      if time.Since(lastContact) > 7_24_time.Hour {
          // Update with current timestamp and HEALTHY status
      }
      

      2. Enhanced Health Check Logic

      • No Connection + Modern Sensor: Check for corrupted data
      • Active Connection + Modern Sensor: Check for corrupted data
      • Legacy Sensors: Preserve existing behavior

      3. Conservative Threshold

      • 7 days: Avoids false positives while catching genuine corruption
      • Logging: Warns when corruption is detected and fixed

      Testing

      New Test Case

      TestUpdateClusterHealthForever_ModernSensorWithCorruptedHealthData:

      • Simulates March 2021 timestamp (matches CI failure)
      • Verifies corruption detection and remediation
      • Ensures no regression in existing functionality

      Verification

      • All existing tests pass
      • New test demonstrates fix effectiveness
      • Upgrade scenarios now handle corrupted data properly

      Files Modified

        1. central/sensor/service/connection/manager_impl.go
      • Added fixCorruptedHealthDataIfNeeded() function
      • Enhanced updateClusterHealthForever() logic

      2. central/sensor/service/connection/manager_test.go

      • Added comprehensive test coverage for corruption detection

      Related Issues

      • Original Implementation: ROX-5257 (August 2020)
      • Original Fix: Current branch addressing race condition
      • This Fix: Addresses persistent corrupted data from upgrades

      Validation

      The fix successfully resolves the upgrade test failures while maintaining all existing functionality and preventing future corruption.

              tjanisze@redhat.com Tomasz Janiszewski
              tjanisze@redhat.com Tomasz Janiszewski
              ACS Sensor & Ecosystem
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

                Created:
                Updated:
                Resolved: