ACM-27591: Fix flaky integration tests caused by resource cleanup race conditions


      Problem Statement

      Three integration tests were failing intermittently due to race conditions during test cleanup, causing random CI failures.

      Version Found

      Latest main branch (commit b1d3db00)

      Is it reproducible?

      Intermittent - occurs during test cleanup when resources are being deleted

      Steps to Reproduce

      1. Run integration tests: make integration-test/agent and make integration-test/operator
      2. Tests occasionally fail during cleanup phase with race conditions

      Actual Results

      Test 1: Migration ConfigMap Conflict

      • Error: configmaps "multicluster-global-hub-agent-sync-state" already exists
      • Location: test/integration/agent/migration/migration_*_test.go
      • Test Result: 1 Failed

      Test 2: Manager Reconciler Panic

      • Error: runtime error: invalid memory address or nil pointer dereference at manager_reconciler.go:211
      • Location: operator/pkg/controllers/manager/manager_reconciler.go
      • Stack trace shows a nil MGH object access in the defer function

      Test 3: Transport Offset Empty String

      • Error: Expected <string>: "" To satisfy matchers [test-topic-1, test-topic-2]
      • Location: test/integration/manager/status/transport_offset_test.go
      • Caused by querying ALL transport records instead of specific test data

      Expected Results

      All integration tests should pass reliably without race condition failures.

      Root Causes

      1. Migration ConfigMap Conflict

      • Two test suites (migration_from_syncer_test and migration_to_syncer_test) share the global AgentConfig
      • Both try to create the same ConfigMap, multicluster-global-hub-agent-sync-state
      • No cleanup between test runs causes "already exists" errors (see the sketch below)
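      For illustration, the conflict reduces to something like the sketch below (the helper name and wiring are assumptions, not the actual test code; only the ConfigMap name comes from the failing test):

      {code:go}
      package migration_test

      import (
          "context"

          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "sigs.k8s.io/controller-runtime/pkg/client"
      )

      // createSyncStateConfigMap mirrors what each suite effectively does with the
      // shared AgentConfig: build the same ConfigMap and try to create it. Whichever
      // suite runs second gets an AlreadyExists error, because nothing deleted the
      // object the first suite created.
      func createSyncStateConfigMap(ctx context.Context, c client.Client, namespace string) error {
          cm := &corev1.ConfigMap{
              ObjectMeta: metav1.ObjectMeta{
                  Name:      "multicluster-global-hub-agent-sync-state",
                  Namespace: namespace,
              },
          }
          return c.Create(ctx, cm)
      }
      {code}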

      2. Manager Reconciler Panic

      • During test cleanup, the MGH resource gets deleted while the controller is still reconciling
      • The defer function at lines 208-217 tries to update status using mgh.Namespace
      • The MGH object is nil after deletion, causing a nil pointer dereference (see the sketch below)
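      A minimal, self-contained sketch of the failure mode, using a stand-in type rather than the operator's real MGH object:

      {code:go}
      package main

      import "fmt"

      // MulticlusterGlobalHub is a stand-in for the real MGH API object.
      type MulticlusterGlobalHub struct {
          Namespace string
      }

      // reconcile reproduces the shape of the bug: the deferred status update
      // dereferences mgh unconditionally, but when the resource is deleted during
      // cleanup the object is nil.
      func reconcile(mgh *MulticlusterGlobalHub) (err error) {
          defer func() {
              // BUG: no nil check before touching mgh.
              fmt.Println("updating status in namespace", mgh.Namespace)
          }()
          // ... reconcile body elided ...
          return nil
      }

      func main() {
          _ = reconcile(nil) // panics: invalid memory address or nil pointer dereference
      }
      {code}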

      3. Transport Offset Query

      • The test used db.Find(&positions) to query ALL transport records
      • That pulled in old-format records from the migration tests that lack the @partition suffix
      • Splitting those keys on @ produces empty strings, failing the assertions (see the sketch below)
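      A rough sketch of why the unscoped query trips the matcher, assuming the test derives topics by splitting the stored key on @ (the key format shown is an assumption):

      {code:go}
      package main

      import (
          "fmt"
          "strings"
      )

      // topicOf assumes the stored key has the new "<topic>@<partition>" shape.
      // Old-format rows picked up by the unscoped db.Find(&positions) do not have
      // that shape, so the extracted topic comes back as an empty string.
      func topicOf(key string) string {
          return strings.Split(key, "@")[0]
      }

      func main() {
          fmt.Printf("%q\n", topicOf("test-topic-1@0")) // "test-topic-1"
          fmt.Printf("%q\n", topicOf(""))               // "" -> fails the topic matcher
      }
      {code}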

      Resolution

      All three issues were fixed by ensuring proper test isolation:

      Migration Tests (PR #2184)

      • Delete any existing ConfigMap in BeforeAll to ensure a clean state
      • Use local namespace constants in AfterAll instead of the global config (sketched below)
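      A sketch of the isolation fix, assuming a controller-runtime client and a Ginkgo Ordered suite (the variable names and namespace constant are illustrative; the actual change is in PR #2184):

      {code:go}
      package migration_test

      import (
          "context"

          . "github.com/onsi/ginkgo/v2"
          . "github.com/onsi/gomega"
          corev1 "k8s.io/api/core/v1"
          apierrors "k8s.io/apimachinery/pkg/api/errors"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "sigs.k8s.io/controller-runtime/pkg/client"
      )

      // Suite-level client and context, assumed to be initialized in BeforeSuite.
      var (
          ctx           context.Context
          runtimeClient client.Client
      )

      // localAgentNamespace is a suite-local constant, used instead of the shared
      // global AgentConfig so cleanup never depends on another suite's state.
      const localAgentNamespace = "multicluster-global-hub-agent"

      var _ = Describe("migration syncer", Ordered, func() {
          syncStateCM := &corev1.ConfigMap{
              ObjectMeta: metav1.ObjectMeta{
                  Name:      "multicluster-global-hub-agent-sync-state",
                  Namespace: localAgentNamespace,
              },
          }

          BeforeAll(func() {
              // Remove any ConfigMap left behind by a sibling suite; NotFound is fine.
              if err := runtimeClient.Delete(ctx, syncStateCM); err != nil {
                  Expect(apierrors.IsNotFound(err)).To(BeTrue())
              }
          })

          AfterAll(func() {
              // Tear down with the local constant rather than the global config.
              if err := runtimeClient.Delete(ctx, syncStateCM); err != nil {
                  Expect(apierrors.IsNotFound(err)).To(BeTrue())
              }
          })
      })
      {code}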

      Manager Reconciler (PR #2184)

      • Add a nil check in the defer function before accessing the MGH object
      • Skip the status update gracefully if the MGH was deleted (see the sketch below)
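      The shape of the added guard, on the same stand-in type as the earlier sketch (not the literal reconciler code):

      {code:go}
      package main

      import "fmt"

      // MulticlusterGlobalHub is a stand-in for the real MGH API object.
      type MulticlusterGlobalHub struct {
          Namespace string
      }

      // reconcile shows the guarded defer: if the MGH was deleted mid-reconcile and
      // the object is nil, the status update is skipped instead of panicking.
      func reconcile(mgh *MulticlusterGlobalHub) (err error) {
          defer func() {
              if mgh == nil {
                  // MGH was deleted during cleanup; nothing to update.
                  return
              }
              fmt.Println("updating status in namespace", mgh.Namespace)
          }()
          // ... reconcile body elided ...
          return nil
      }

      func main() {
          _ = reconcile(nil) // returns cleanly; no panic during cleanup
      }
      {code}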

      Transport Offset Test (PR #2185)

      • Query only the 4 records created by the test, using a WHERE clause
      • Add an assertion to verify that exactly 4 records are found (see the sketch below)
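      A sketch of the scoped query, assuming a GORM handle and Gomega assertions (the model and key values are illustrative; the actual change is in PR #2185):

      {code:go}
      package status_test

      import (
          . "github.com/onsi/gomega"
          "gorm.io/gorm"
      )

      // TransportRecord is a stand-in for the real transport model.
      type TransportRecord struct {
          Name    string
          Payload []byte
      }

      // assertTestOffsets scopes the query to the four keys this test wrote instead
      // of loading every row in the transport table, then checks the exact count.
      func assertTestOffsets(db *gorm.DB) {
          keys := []string{
              "test-topic-1@0", "test-topic-1@1",
              "test-topic-2@0", "test-topic-2@1",
          }
          var positions []TransportRecord
          Expect(db.Where("name IN ?", keys).Find(&positions).Error).NotTo(HaveOccurred())
          Expect(positions).To(HaveLen(4)) // exactly the records created by this test
      }
      {code}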

      Related PRs

      • PR #2184: migration test cleanup and manager reconciler nil check
      • PR #2185: transport offset test query scoping

      Test Results

      Before

      • Migration tests: 18 Passed, 1 Failed (random)
      • Manager integration: Panic during cleanup
      • Status tests: 36 Passed, 1 Failed (random)

      After

      • Migration tests: 19/19 Passed
      • Manager integration: No panic
      • Status tests: 37/37 Passed

      Additional Information

      • Severity: Medium (affects CI reliability but not production)
      • All fixes follow defensive programming best practices
      • Common theme: Ensure test isolation by avoiding shared global state

      Generated with Claude Code
