Bug
Resolution: Unresolved
Affects Version: 4.21
(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:
[Monitor:lease-checker][Jira:"Test Framework"] monitor test lease-checker interval construction
Significant regression detected.
Fisher's Exact probability of a regression: 99.99%.
Test pass rate dropped from 100.00% to 87.50%.
Sample (being evaluated) Release: 4.21
Start Time: 2025-11-19T00:00:00Z
End Time: 2025-11-26T12:00:00Z
Success Rate: 87.50%
Successes: 21
Failures: 3
Flakes: 0
Base (historical) Release: 4.19
Start Time: 2025-05-18T00:00:00Z
End Time: 2025-06-17T00:00:00Z
Success Rate: 100.00%
Successes: 108
Failures: 0
Flakes: 0
View the test details report for additional context.
Summary
Monitor test lease-checker interval construction fails when a pod logs duplicate lease acquisition events
Description
The test [Monitor:lease-checker][Jira:"Test Framework"] monitor test lease-checker interval construction fails when a pod logs duplicate LeaseAcquiringStarted and LeaseAcquired events for the same lease acquisition attempt; the duplicated LeaseAcquired event is what triggers the interval-construction error.
Root Cause
The lease-checker monitor test has a logic flaw in how it handles duplicate lease acquisition events. When processing lease intervals:
- It expects alternating LeaseAcquiringStarted and LeaseAcquired events
- When it sees a LeaseAcquiringStarted event, it saves the timestamp (only if not already set)
- When it sees a LeaseAcquired event, it creates an interval and clears the saved timestamp
- If duplicate LeaseAcquiringStarted events exist, the second one is ignored
- If duplicate LeaseAcquired events exist, the second one fails because the timestamp was cleared after the first one
The pattern in this failure:
- Event 1: LeaseAcquiringStarted at 07:00:38.868 → lastAcquiringFrom set
- Event 2: LeaseAcquiringStarted at 07:00:38.868 (duplicate) → ignored (lastAcquiringFrom already set)
- Event 3: LeaseAcquired at 07:00:55.991 → interval created, lastAcquiringFrom cleared
- Event 4: LeaseAcquired at 07:00:55.991 (duplicate) → ERROR! lastAcquiringFrom is nil
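To make the failure mode concrete, below is a minimal, self-contained sketch that replays these four events through the same alternating state machine described above. The leaseEvent type and the reason strings are hypothetical stand-ins for the monitorapi intervals, not the real types; only the control flow mirrors the logic quoted under Evidence.

package main

import (
    "fmt"
    "time"
)

// leaseEvent is a hypothetical stand-in for a monitorapi lease interval; only
// the reason and timestamp matter for reproducing the state-machine flaw.
type leaseEvent struct {
    reason string
    at     time.Time
}

func main() {
    started := time.Date(2025, 11, 26, 7, 0, 38, 868000000, time.UTC)
    acquired := time.Date(2025, 11, 26, 7, 0, 55, 991000000, time.UTC)

    // The duplicated sequence observed in the failed job.
    events := []leaseEvent{
        {"StartedAcquiring", started},
        {"StartedAcquiring", started}, // duplicate, harmlessly ignored
        {"Acquired", acquired},
        {"Acquired", acquired}, // duplicate, hits the error branch below
    }

    var lastAcquiringFrom *time.Time
    for i, ev := range events {
        switch ev.reason {
        case "StartedAcquiring":
            if lastAcquiringFrom == nil { // the second StartedAcquiring is skipped here
                t := ev.at
                lastAcquiringFrom = &t
            }
        case "Acquired":
            if lastAcquiringFrom == nil {
                // This is the "missing acquiring stage" branch the monitor test reports.
                fmt.Printf("event %d: missing acquiring stage\n", i+1)
                continue
            }
            fmt.Printf("event %d: interval %s -> %s\n", i+1,
                lastAcquiringFrom.Format("15:04:05.000"), ev.at.Format("15:04:05.000"))
            lastAcquiringFrom = nil // cleared, so the duplicate Acquired has no start time
        }
    }
}

Running it prints a constructed interval for event 3 and the missing-acquiring-stage condition for event 4, matching the error below.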
Error Message
failed during interval construction missing acquiring stage for namespace/openshift-managed-node-metadata-operator node/ip-10-0-40-117.us-west-2.compute.internal pod/managed-node-metadata-operator-b6f598659-685m6 uid/0aa111c0-c26e-41f3-93d0-9a78e8a8cae2 container/managed-node-metadata-operator: all intervals
    Nov 26 07:00:38.868 - 1s I namespace/openshift-managed-node-metadata-operator node/ip-10-0-40-117.us-west-2.compute.internal pod/managed-node-metadata-operator-b6f598659-685m6 uid/0aa111c0-c26e-41f3-93d0-9a78e8a8cae2 container/managed-node-metadata-operator reason/StartedAcquiring I1126 07:00:38.868879 1 leaderelection.go:257] attempting to acquire leader lease openshift-managed-node-metadata-operator/cdb62f5e.managed.openshift.com...
    Nov 26 07:00:38.868 - 1s I namespace/openshift-managed-node-metadata-operator node/ip-10-0-40-117.us-west-2.compute.internal pod/managed-node-metadata-operator-b6f598659-685m6 uid/0aa111c0-c26e-41f3-93d0-9a78e8a8cae2 container/managed-node-metadata-operator reason/StartedAcquiring I1126 07:00:38.868879 1 leaderelection.go:257] attempting to acquire leader lease openshift-managed-node-metadata-operator/cdb62f5e.managed.openshift.com...
    Nov 26 07:00:55.991 - 1s I namespace/openshift-managed-node-metadata-operator node/ip-10-0-40-117.us-west-2.compute.internal pod/managed-node-metadata-operator-b6f598659-685m6 uid/0aa111c0-c26e-41f3-93d0-9a78e8a8cae2 container/managed-node-metadata-operator reason/Acquired I1126 07:00:55.991000 1 leaderelection.go:271] successfully acquired lease openshift-managed-node-metadata-operator/cdb62f5e.managed.openshift.com
    Nov 26 07:00:55.991 - 1s I namespace/openshift-managed-node-metadata-operator node/ip-10-0-40-117.us-west-2.compute.internal pod/managed-node-metadata-operator-b6f598659-685m6 uid/0aa111c0-c26e-41f3-93d0-9a78e8a8cae2 container/managed-node-metadata-operator reason/Acquired I1126 07:00:55.991000 1 leaderelection.go:271] successfully acquired lease openshift-managed-node-metadata-operator/cdb62f5e.managed.openshift.com
Evidence
Test Code Location
pkg/monitortests/testframework/operatorloganalyzer/operator_lease_check.go:56-89
The problematic logic:
for podLocator, perPodLeaseIntervals := range podToLeaseIntervals {
    var lastAcquiringFrom *time.Time
    for _, interval := range perPodLeaseIntervals {
        switch interval.Message.Reason {
        case monitorapi.LeaseAcquiringStarted:
            // only overwrite if there isn't one already
            if lastAcquiringFrom == nil {
                lastAcquiringFrom = &interval.From
            }
        case monitorapi.LeaseAcquired:
            if lastAcquiringFrom == nil {
                // ERROR: This fails when we have duplicate Acquired events
                allIntervals := []string{}
                for _, curr := range perPodLeaseIntervals {
                    allIntervals = append(allIntervals, curr.String())
                }
                errs = append(errs, fmt.Errorf("missing acquiring stage for %v: all intervals\n\t%v", podLocator, strings.Join(allIntervals, "\n\t")))
            } else {
                ret = append(ret, monitorapi.NewInterval(monitorapi.SourcePodLog, monitorapi.Info).
                    Locator(interval.Locator).
                    Message(monitorapi.NewMessage().
                        Reason(monitorapi.LeaseAcquiring).
                        Constructed(monitorapi.ConstructionOwnerLeaseChecker).
                        HumanMessage("Waiting for lease."),
                    ).
                    Display().
                    Build(*lastAcquiringFrom, interval.From),
                )
                lastAcquiringFrom = nil // Cleared here, causing next Acquired to fail
            }
        }
    }
}
Why Duplicate Events Occur
The managed-node-metadata-operator (or its log collector) appears to have logged each lease acquisition event twice. This could be due to:
- Log line deduplication issues
- Multiple log collectors capturing the same events
- Pod log streaming capturing the same content multiple times
- Timestamp collision causing duplicate interval creation
Impact
- Test flakes intermittently when pods log duplicate lease events
- Affects ROSA STS platform testing
- Pass rate: 94.29% (33/35 runs passed in last 14 days)
- Medium risk failure
Affected Versions
- 4.21 (observed in ROSA STS)
- Likely affects all versions with the lease-checker monitor test
Recommendations
The test should be more resilient to duplicate events. Fix options:
Option 1: Deduplicate events before processing
Add deduplication logic before processing lease intervals to remove exact duplicate events with the same timestamp, locator, and reason.
// Deduplicate intervals before processing
dedupedLeaseIntervals := []monitorapi.Interval{}
seen := map[string]bool{}
for _, interval := range leaseIntervals {
    key := fmt.Sprintf("%v|%v|%v", interval.Locator.OldLocator(), interval.From.Unix(), interval.Message.Reason)
    if !seen[key] {
        seen[key] = true
        dedupedLeaseIntervals = append(dedupedLeaseIntervals, interval)
    }
}
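As a quick illustration of the key, the standalone sketch below (using a hypothetical event struct rather than monitorapi.Interval) shows that the locator|timestamp|reason key collapses the four duplicated events from this failure down to two unique entries:

package main

import (
    "fmt"
    "time"
)

func main() {
    started := time.Date(2025, 11, 26, 7, 0, 38, 868000000, time.UTC)
    acquired := time.Date(2025, 11, 26, 7, 0, 55, 991000000, time.UTC)
    pod := "pod/managed-node-metadata-operator-b6f598659-685m6"

    // The four log events from the failed job, including the exact duplicates.
    events := []struct {
        locator string
        from    time.Time
        reason  string
    }{
        {pod, started, "StartedAcquiring"},
        {pod, started, "StartedAcquiring"}, // exact duplicate
        {pod, acquired, "Acquired"},
        {pod, acquired, "Acquired"}, // exact duplicate
    }

    // Same locator|timestamp|reason key as the proposed deduplication.
    seen := map[string]bool{}
    unique := 0
    for _, ev := range events {
        key := fmt.Sprintf("%v|%v|%v", ev.locator, ev.from.Unix(), ev.reason)
        if !seen[key] {
            seen[key] = true
            unique++
        }
    }
    fmt.Println("unique events:", unique) // prints: unique events: 2
}

One design note: interval.From.Unix() truncates to whole seconds, so two genuinely distinct events that land in the same second would also be merged; keying on UnixNano() would limit the deduplication to exact duplicates.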
Option 2: Handle duplicate Acquired events gracefully
Modify the logic to skip duplicate LeaseAcquired events when lastAcquiringFrom is nil:
case monitorapi.LeaseAcquired:
    if lastAcquiringFrom == nil {
        // Check whether this is a duplicate Acquired event: the previously
        // constructed interval ends at the first Acquired timestamp, so a
        // duplicate Acquired carries that same timestamp and locator.
        if len(ret) > 0 {
            lastInterval := ret[len(ret)-1]
            if lastInterval.To.Equal(interval.From) && lastInterval.Locator.OldLocator() == interval.Locator.OldLocator() {
                // This is a duplicate, skip it
                continue
            }
        }
        // Otherwise, this is a real error
        allIntervals := []string{}
        for _, curr := range perPodLeaseIntervals {
            allIntervals = append(allIntervals, curr.String())
        }
        errs = append(errs, fmt.Errorf("missing acquiring stage for %v: all intervals\n\t%v", podLocator, strings.Join(allIntervals, "\n\t")))
    } else {
        ret = append(ret, monitorapi.NewInterval(monitorapi.SourcePodLog, monitorapi.Info).
            Locator(interval.Locator).
            Message(monitorapi.NewMessage().
                Reason(monitorapi.LeaseAcquiring).
                Constructed(monitorapi.ConstructionOwnerLeaseChecker).
                HumanMessage("Waiting for lease."),
            ).
            Display().
            Build(*lastAcquiringFrom, interval.From),
        )
        lastAcquiringFrom = nil
    }
Option 3: Use a counter-based approach
Track how many LeaseAcquiringStarted and LeaseAcquired events we've seen and ensure they match:
for podLocator, perPodLeaseIntervals := range podToLeaseIntervals {
    var lastAcquiringFrom *time.Time
    acquiringCount := 0
    acquiredCount := 0
    for _, interval := range perPodLeaseIntervals {
        switch interval.Message.Reason {
        case monitorapi.LeaseAcquiringStarted:
            acquiringCount++
            if lastAcquiringFrom == nil {
                lastAcquiringFrom = &interval.From
            }
        case monitorapi.LeaseAcquired:
            acquiredCount++
            if lastAcquiringFrom == nil {
                // Only error if we truly haven't seen a StartedAcquiring
                if acquiringCount == 0 {
                    errs = append(errs, fmt.Errorf("missing acquiring stage for %v", podLocator))
                }
                continue // Skip duplicate
            }
            // Same interval construction as the current code.
            ret = append(ret, monitorapi.NewInterval(monitorapi.SourcePodLog, monitorapi.Info).
                Locator(interval.Locator).
                Message(monitorapi.NewMessage().
                    Reason(monitorapi.LeaseAcquiring).
                    Constructed(monitorapi.ConstructionOwnerLeaseChecker).
                    HumanMessage("Waiting for lease."),
                ).
                Display().
                Build(*lastAcquiringFrom, interval.From),
            )
            lastAcquiringFrom = nil
        }
    }
    // acquiringCount and acquiredCount can be compared here if the counts are
    // expected to match once duplicates are accounted for.
}
Additional Information
- Prow Job: periodic-ci-openshift-release-master-nightly-4.21-e2e-rosa-sts-ovn
- Build ID: 1993549919880744960
- Job URL: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-rosa-sts-ovn/1993549919880744960
- Platform: ROSA STS (AWS)
- Failed Pod: managed-node-metadata-operator-b6f598659-685m6 in namespace openshift-managed-node-metadata-operator
- Test Framework: Monitor test (runs during e2e test execution to check for operator issues)
Filed by: dgoodwin@redhat.com