-
Bug
-
Resolution: Done
-
Major
-
Global Hub 1.7.0
-
Quality / Stability / Reliability
-
0.5
-
False
-
-
False
-
-
-
GH Train-34, GH Train-35
-
None
Description of problem:
The integration test "should delete the ServiceMonitor when mgh deleted" was flaky, passing in roughly 75% of runs. It failed intermittently because of a race condition introduced by the two-phase cleanup in the manager reconciler's pruneResources function.
Version-Release number of selected component (if applicable):
Global Hub 1.7.0
How reproducible:
Intermittent - approximately 75% pass rate before fix
Steps to Reproduce:
- Run the integration test "should delete the ServiceMonitor when mgh deleted"
- The test calls reconcile() once after deleting the MulticlusterGlobalHub instance
- Observe that the ServiceMonitor may not be deleted during the first reconciliation
Actual results:
The test failed intermittently because:
- Phase 1: pruneResources deletes the ManagedClusterMigrations and returns early, leaving the ServiceMonitor in place
- Phase 2: On the next reconciliation (triggered by automatic rescheduling), the ServiceMonitor is deleted
- The test called reconcile() only once and relied on the controller's automatic rescheduling to run Phase 2, so its assertion raced against the second reconciliation
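The two-phase behavior described above can be illustrated with a minimal in-memory sketch. This is not the actual pruneResources code: the `cluster` type and the `pruneTwoPhase` function are hypothetical stand-ins, assuming only what this report states (migrations are deleted first, followed by an early return).

```go
package main

import "fmt"

// cluster is a hypothetical in-memory stand-in for the API server state.
type cluster struct {
	migrations     int  // ManagedClusterMigration objects still present
	serviceMonitor bool // whether the ServiceMonitor still exists
}

// pruneTwoPhase models the pre-fix behavior described in this report:
// when migrations exist, it deletes them and returns early, leaving the
// ServiceMonitor for a later reconciliation. It reports whether another
// reconciliation is still needed.
func pruneTwoPhase(c *cluster) (requeue bool) {
	if c.migrations > 0 {
		c.migrations = 0
		return true // early return: the ServiceMonitor is untouched
	}
	c.serviceMonitor = false
	return false
}

func main() {
	c := &cluster{migrations: 2, serviceMonitor: true}

	// First reconciliation: only the migrations are removed.
	requeue := pruneTwoPhase(c)
	fmt.Println(requeue, c.serviceMonitor) // true true

	// Second reconciliation: the ServiceMonitor is finally removed.
	requeue = pruneTwoPhase(c)
	fmt.Println(requeue, c.serviceMonitor) // false false
}
```

A test that asserts on the ServiceMonitor after only the first call sees it still present unless a second reconciliation happens to run first, which matches the intermittent failure.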
Expected results:
The ServiceMonitor should be deleted in a single reconciliation call, with no dependence on further reconciliation cycles.
Additional info:
- Root cause: Early return in pruneResources prevented ServiceMonitor cleanup when migrations existed
- Fix: Modified pruneResources to delete both the migrations and the ServiceMonitor in a single reconciliation call (PR #2131)
- Impact: Integration test flakiness and inefficient resource cleanup (two reconciliations instead of one)
- Test results after fix: 4/4 test runs passed (100% success rate)
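The shape of the fix can be sketched as follows. This is an illustrative model only, not the code from PR #2131: the `hubState` type and the `pruneSinglePass` function are hypothetical, and the sketch assumes nothing beyond what is stated above (both resource kinds are deleted within one call, with no early return).

```go
package main

import "fmt"

// hubState is a hypothetical in-memory stand-in for the API server state.
type hubState struct {
	migrations     int  // ManagedClusterMigration objects still present
	serviceMonitor bool // whether the ServiceMonitor still exists
}

// pruneSinglePass models the post-fix behavior described in this report:
// the migrations and the ServiceMonitor are both deleted in the same call,
// so the result no longer depends on a second reconciliation.
func pruneSinglePass(s *hubState) {
	s.migrations = 0         // delete migrations, but do not return early
	s.serviceMonitor = false // delete the ServiceMonitor in the same pass
}

func main() {
	s := &hubState{migrations: 2, serviceMonitor: true}

	// A single reconciliation now leaves nothing behind.
	pruneSinglePass(s)
	fmt.Println(s.migrations, s.serviceMonitor) // 0 false
}
```

With this shape, a test that calls reconcile() once can assert on the ServiceMonitor immediately, without waiting for the controller to reschedule.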
🤖 Generated with Claude Code