  1. Red Hat Advanced Cluster Management
  2. ACM-27085

Flaky integration test: ServiceMonitor deletion test fails intermittently due to two-phase cleanup logic


    • Quality / Stability / Reliability
    • 0.5
    • GH Train-34, GH Train-35

      Description of problem:

      The integration test "should delete the ServiceMonitor when mgh deleted" was flaky, passing only about 75% of the time. It failed intermittently because of a race condition introduced by the two-phase cleanup in the manager reconciler's pruneResources function.

      Version-Release number of selected component (if applicable):

      Global Hub 1.7.0

      How reproducible:

      Intermittent: approximately 75% pass rate before the fix (about one failure in every four runs)
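To put the 75% figure in perspective, the following sketch estimates how often a test with that pass rate survives several consecutive runs, and how many runs are needed before a failure becomes likely. It assumes runs are independent, which the ticket does not state; the function names are illustrative only.

```go
package main

import "fmt"

// probAllPass returns the probability that n independent runs all pass,
// given a per-run pass rate.
func probAllPass(passRate float64, n int) float64 {
	p := 1.0
	for i := 0; i < n; i++ {
		p *= passRate
	}
	return p
}

// runsToDetect returns the smallest number of runs for which at least one
// failure occurs with probability >= confidence. It returns -1 when the
// inputs make that impossible or unbounded (assumption: independent runs).
func runsToDetect(passRate, confidence float64) int {
	if passRate < 0 || passRate >= 1 || confidence >= 1 {
		return -1
	}
	n := 0
	allPass := 1.0
	for 1-allPass < confidence {
		allPass *= passRate
		n++
	}
	return n
}

func main() {
	fmt.Printf("%.3f\n", probAllPass(0.75, 4)) // prints 0.316
	fmt.Println(runsToDetect(0.75, 0.95))      // prints 11
}
```

Under this independence assumption, a test that passes 75% of the time still passes four runs in a row roughly a third of the time, so a longer run count gives stronger evidence when checking for flakiness.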

      Steps to Reproduce:

      1. Run the integration test "should delete the ServiceMonitor when mgh deleted"
      2. The test calls reconcile() once after deleting the MulticlusterGlobalHub instance
      3. Observe that ServiceMonitor deletion may not occur in the first reconciliation

      Actual results:

      The test failed intermittently because cleanup happened in two phases:
      - Phase 1: pruneResources deletes the ManagedClusterMigrations and returns early, leaving the ServiceMonitor in place
      - Phase 2: On the next reconciliation (automatic rescheduling), the ServiceMonitor gets deleted
      - The test called reconcile() only once and relied on the controller's automatic rescheduling for Phase 2, so its assertion raced against the second reconciliation
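The two phases above can be sketched with a small in-memory model. This is not the Global Hub code: `fakeCluster` and `pruneTwoPhase` are hypothetical stand-ins that only mirror the early-return shape described in this ticket.

```go
package main

import "fmt"

// fakeCluster is a hypothetical in-memory stand-in for cluster state.
type fakeCluster struct {
	migrations     []string // names of ManagedClusterMigration objects
	serviceMonitor bool     // true while the ServiceMonitor still exists
}

// pruneTwoPhase mimics the reported behavior: when migrations exist, it
// deletes them and returns early, leaving the ServiceMonitor for a later
// reconciliation. The return value reports whether another pass is needed.
func pruneTwoPhase(c *fakeCluster) (requeue bool) {
	if len(c.migrations) > 0 {
		c.migrations = nil
		return true // early return: ServiceMonitor is not touched yet
	}
	c.serviceMonitor = false
	return false
}

func main() {
	c := &fakeCluster{migrations: []string{"migration-a"}, serviceMonitor: true}

	// First reconciliation: only the migrations are removed.
	requeue := pruneTwoPhase(c)
	fmt.Println("pass 1: serviceMonitor exists:", c.serviceMonitor, "requeue:", requeue)

	// Second reconciliation: the ServiceMonitor is finally removed.
	requeue = pruneTwoPhase(c)
	fmt.Println("pass 2: serviceMonitor exists:", c.serviceMonitor, "requeue:", requeue)
}
```

In this model, a test that asserts on the ServiceMonitor after a single call sees it still present whenever migrations existed, which matches the failure mode described above.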

      Expected results:

      ServiceMonitor should be deleted in a single reconciliation call, eliminating the need for multiple reconciliation cycles.

      Additional info:

      • Root cause: An early return in pruneResources skipped ServiceMonitor cleanup whenever migrations still existed
      • Fix: Modified pruneResources to delete both the migrations and the ServiceMonitor in a single reconciliation call (PR #2131)
      • Impact: Integration test flakiness and inefficient resource cleanup (two reconciliations instead of one)
      • Test results after fix: 4/4 test runs passed (100% success rate)
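The shape of the fix can be illustrated with a minimal sketch. It is not the diff from PR #2131: `fakeState` and `pruneSinglePass` are hypothetical names, and the real function presumably issues API deletions and handles errors, which this model omits.

```go
package main

import "fmt"

// fakeState is a hypothetical in-memory stand-in for cluster state.
type fakeState struct {
	migrations     []string // names of ManagedClusterMigration objects
	serviceMonitor bool     // true while the ServiceMonitor still exists
}

// pruneSinglePass removes the migrations and the ServiceMonitor in the same
// call, with no early return between the two steps.
func pruneSinglePass(s *fakeState) {
	if len(s.migrations) > 0 {
		s.migrations = nil
		// No return here: fall through to the ServiceMonitor cleanup.
	}
	s.serviceMonitor = false
}

func main() {
	s := &fakeState{migrations: []string{"migration-a"}, serviceMonitor: true}

	// A single reconciliation now leaves nothing behind.
	pruneSinglePass(s)
	fmt.Println("migrations left:", len(s.migrations))
	fmt.Println("serviceMonitor exists:", s.serviceMonitor)
}
```

With cleanup done in one call, a test that reconciles once and then asserts no longer depends on when the controller reschedules.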

      🤖 Generated with Claude Code

              rh-ee-myan Meng Yan
              rh-ee-myan Meng Yan
              Yaheng Liu Yaheng Liu