OpenShift Bugs · OCPBUGS-66065

ROSA openshift-managed-node-metadata-operator logs duplicate lease acquisition lines
      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [Monitor:lease-checker][Jira:"Test Framework"] monitor test lease-checker interval construction

      Significant regression detected.
      Fisher's Exact probability of a regression: 99.99%.
      Test pass rate dropped from 100.00% to 87.50%.

      Sample (being evaluated) Release: 4.21
      Start Time: 2025-11-19T00:00:00Z
      End Time: 2025-11-26T12:00:00Z
      Success Rate: 87.50%
      Successes: 21
      Failures: 3
      Flakes: 0
      Base (historical) Release: 4.19
      Start Time: 2025-05-18T00:00:00Z
      End Time: 2025-06-17T00:00:00Z
      Success Rate: 100.00%
      Successes: 108
      Failures: 0
      Flakes: 0

      View the test details report for additional context.

      Summary

      Monitor test lease-checker interval construction fails when duplicate LeaseAcquiringStarted events are logged

      Description

      The test [Monitor:lease-checker][Jira:"Test Framework"] monitor test lease-checker interval construction fails when a pod logs duplicate LeaseAcquiringStarted events for the same lease acquisition attempt.

      Root Cause

      The lease-checker monitor test has a logic flaw in how it handles duplicate lease acquisition events. When processing lease intervals:

      1. It expects alternating LeaseAcquiringStarted and LeaseAcquired events
      2. When it sees a LeaseAcquiringStarted event, it saves the timestamp (only if not already set)
      3. When it sees a LeaseAcquired event, it creates an interval and clears the saved timestamp
      4. If duplicate LeaseAcquiringStarted events exist, the second one is ignored
      5. If duplicate LeaseAcquired events exist, the second one fails because the timestamp was cleared after the first one

      The pattern in this failure:

      • Event 1: LeaseAcquiringStarted at 07:00:38.868 → lastAcquiringFrom set
      • Event 2: LeaseAcquiringStarted at 07:00:38.868 (duplicate) → ignored (lastAcquiringFrom already set)
      • Event 3: LeaseAcquired at 07:00:55.991 → interval created, lastAcquiringFrom cleared
      • Event 4: LeaseAcquired at 07:00:55.991 (duplicate) → ERROR! lastAcquiringFrom is nil
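      The failure mode above can be reproduced in isolation. The sketch below is a minimal, self-contained model of the checker's state machine, using a hypothetical `event` type and `construct` function in place of the real `monitorapi.Interval` handling; only the pairing logic mirrors the actual test code.

      ```go
      package main

      import (
      	"fmt"
      	"time"
      )

      // event is a hypothetical stand-in for monitorapi.Interval:
      // just a reason string and a start time.
      type event struct {
      	reason string
      	from   time.Time
      }

      // construct mirrors the lease-checker's interval construction: it pairs
      // each LeaseAcquiringStarted with the next LeaseAcquired, and records an
      // error when an Acquired arrives with no saved start time.
      func construct(events []event) (intervals []string, errs []string) {
      	var lastAcquiringFrom *time.Time
      	for _, e := range events {
      		switch e.reason {
      		case "LeaseAcquiringStarted":
      			if lastAcquiringFrom == nil { // duplicate starts after the first are ignored
      				from := e.from
      				lastAcquiringFrom = &from
      			}
      		case "LeaseAcquired":
      			if lastAcquiringFrom == nil {
      				// A duplicate Acquired lands here: the state was already cleared.
      				errs = append(errs, "missing acquiring stage")
      			} else {
      				intervals = append(intervals, fmt.Sprintf("%s -> %s",
      					lastAcquiringFrom.Format("15:04:05.000"),
      					e.from.Format("15:04:05.000")))
      				lastAcquiringFrom = nil // cleared: a second Acquired now errors
      			}
      		}
      	}
      	return intervals, errs
      }

      func main() {
      	t0 := time.Date(2025, 11, 26, 7, 0, 38, 868000000, time.UTC)
      	t1 := time.Date(2025, 11, 26, 7, 0, 55, 991000000, time.UTC)
      	// The observed pattern: both log lines were captured twice.
      	intervals, errs := construct([]event{
      		{"LeaseAcquiringStarted", t0},
      		{"LeaseAcquiringStarted", t0}, // duplicate start, ignored
      		{"LeaseAcquired", t1},         // interval created, state cleared
      		{"LeaseAcquired", t1},         // duplicate Acquired -> error
      	})
      	fmt.Println(intervals) // [07:00:38.868 -> 07:00:55.991]
      	fmt.Println(errs)      // [missing acquiring stage]
      }
      ```

      Running it shows one valid interval followed by one spurious "missing acquiring stage" error, matching the four-event trace above.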

      Error Message

      failed during interval construction
      missing acquiring stage for namespace/openshift-managed-node-metadata-operator node/ip-10-0-40-117.us-west-2.compute.internal pod/managed-node-metadata-operator-b6f598659-685m6 uid/0aa111c0-c26e-41f3-93d0-9a78e8a8cae2 container/managed-node-metadata-operator: all intervals
      	Nov 26 07:00:38.868 - 1s    I namespace/openshift-managed-node-metadata-operator node/ip-10-0-40-117.us-west-2.compute.internal pod/managed-node-metadata-operator-b6f598659-685m6 uid/0aa111c0-c26e-41f3-93d0-9a78e8a8cae2 container/managed-node-metadata-operator reason/StartedAcquiring I1126 07:00:38.868879       1 leaderelection.go:257] attempting to acquire leader lease openshift-managed-node-metadata-operator/cdb62f5e.managed.openshift.com...
      	Nov 26 07:00:38.868 - 1s    I namespace/openshift-managed-node-metadata-operator node/ip-10-0-40-117.us-west-2.compute.internal pod/managed-node-metadata-operator-b6f598659-685m6 uid/0aa111c0-c26e-41f3-93d0-9a78e8a8cae2 container/managed-node-metadata-operator reason/StartedAcquiring I1126 07:00:38.868879       1 leaderelection.go:257] attempting to acquire leader lease openshift-managed-node-metadata-operator/cdb62f5e.managed.openshift.com...
      	Nov 26 07:00:55.991 - 1s    I namespace/openshift-managed-node-metadata-operator node/ip-10-0-40-117.us-west-2.compute.internal pod/managed-node-metadata-operator-b6f598659-685m6 uid/0aa111c0-c26e-41f3-93d0-9a78e8a8cae2 container/managed-node-metadata-operator reason/Acquired I1126 07:00:55.991000       1 leaderelection.go:271] successfully acquired lease openshift-managed-node-metadata-operator/cdb62f5e.managed.openshift.com
      	Nov 26 07:00:55.991 - 1s    I namespace/openshift-managed-node-metadata-operator node/ip-10-0-40-117.us-west-2.compute.internal pod/managed-node-metadata-operator-b6f598659-685m6 uid/0aa111c0-c26e-41f3-93d0-9a78e8a8cae2 container/managed-node-metadata-operator reason/Acquired I1126 07:00:55.991000       1 leaderelection.go:271] successfully acquired lease openshift-managed-node-metadata-operator/cdb62f5e.managed.openshift.com
      

      Evidence

      Test Code Location

      pkg/monitortests/testframework/operatorloganalyzer/operator_lease_check.go:56-89

      The problematic logic:

      for podLocator, perPodLeaseIntervals := range podToLeaseIntervals {
          var lastAcquiringFrom *time.Time
          for _, interval := range perPodLeaseIntervals {
              switch interval.Message.Reason {
              case monitorapi.LeaseAcquiringStarted:
                  // only overwrite if there isn't one already
                  if lastAcquiringFrom == nil {
                      lastAcquiringFrom = &interval.From
                  }
      
              case monitorapi.LeaseAcquired:
                  if lastAcquiringFrom == nil {
                      // ERROR: This fails when we have duplicate Acquired events
                      allIntervals := []string{}
                      for _, curr := range perPodLeaseIntervals {
                          allIntervals = append(allIntervals, curr.String())
                      }
                      errs = append(errs, fmt.Errorf("missing acquiring stage for %v: all intervals\n\t%v", podLocator, strings.Join(allIntervals, "\n\t")))
                  } else {
                      ret = append(ret, monitorapi.NewInterval(monitorapi.SourcePodLog, monitorapi.Info).
                          Locator(interval.Locator).
                          Message(monitorapi.NewMessage().
                              Reason(monitorapi.LeaseAcquiring).
                              Constructed(monitorapi.ConstructionOwnerLeaseChecker).
                              HumanMessage("Waiting for lease."),
                          ).
                          Display().
                          Build(*lastAcquiringFrom, interval.From),
                      )
                      lastAcquiringFrom = nil  // Cleared here, causing next Acquired to fail
                  }
              }
          }
      }
      

      Why Duplicate Events Occur

      The managed-node-metadata-operator (or its log collector) appears to have logged each lease acquisition event twice. This could be due to:

      • Log line deduplication issues
      • Multiple log collectors capturing the same events
      • Pod log streaming capturing the same content multiple times
      • Timestamp collision causing duplicate interval creation

      Impact

      • Test flakes intermittently when pods log duplicate lease events
      • Affects ROSA STS platform testing
      • Pass rate: 94.29% (33/35 runs passed in last 14 days)
      • Medium risk failure

      Affected Versions

      • 4.21 (observed in ROSA STS)
      • Likely affects all versions with the lease-checker monitor test

      Recommendations

      The test should be more resilient to duplicate events. Fix options:

      Option 1: Deduplicate events before processing

      Add deduplication logic before processing lease intervals to remove exact duplicate events with the same timestamp, locator, and reason.

      // Deduplicate intervals before processing
      dedupedLeaseIntervals := []monitorapi.Interval{}
      seen := map[string]bool{}
      for _, interval := range leaseIntervals {
          key := fmt.Sprintf("%v|%v|%v", interval.Locator.OldLocator(), interval.From.Unix(), interval.Message.Reason)
          if !seen[key] {
              seen[key] = true
              dedupedLeaseIntervals = append(dedupedLeaseIntervals, interval)
          }
      }
      
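      As a runnable illustration of the same idea, the sketch below dedupes a slice of hypothetical `event` values keyed on reason and timestamp; the real code would operate on `monitorapi.Interval` and, as in the snippet above, also include the locator in the key (all events here come from a single pod, so it is omitted).

      ```go
      package main

      import (
      	"fmt"
      	"time"
      )

      // event is a hypothetical stand-in for monitorapi.Interval.
      type event struct {
      	reason string
      	from   time.Time
      }

      // dedupe keeps the first occurrence of each (reason, timestamp) pair
      // and preserves the original ordering.
      func dedupe(events []event) []event {
      	seen := map[string]bool{}
      	out := []event{}
      	for _, e := range events {
      		key := fmt.Sprintf("%s|%d", e.reason, e.from.UnixNano())
      		if !seen[key] {
      			seen[key] = true
      			out = append(out, e)
      		}
      	}
      	return out
      }

      func main() {
      	t := time.Now()
      	events := []event{
      		{"LeaseAcquiringStarted", t},
      		{"LeaseAcquiringStarted", t}, // exact duplicate, dropped
      		{"LeaseAcquired", t.Add(17 * time.Second)},
      		{"LeaseAcquired", t.Add(17 * time.Second)}, // exact duplicate, dropped
      	}
      	fmt.Println(len(dedupe(events))) // 2
      }
      ```

      With deduplication applied first, the existing pairing logic sees strictly alternating events and needs no other change.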

      Option 2: Handle duplicate Acquired events gracefully

      Modify the logic to skip duplicate LeaseAcquired events when lastAcquiringFrom is nil:

      case monitorapi.LeaseAcquired:
          if lastAcquiringFrom == nil {
              // Check if this might be a duplicate by looking at the previous interval
              // If the previous interval was also LeaseAcquired with the same timestamp, skip it
              if len(ret) > 0 {
                  lastInterval := ret[len(ret)-1]
                  if lastInterval.From == interval.From &&
                     lastInterval.Locator.OldLocator() == interval.Locator.OldLocator() {
                      // This is a duplicate, skip it
                      continue
                  }
              }
              // Otherwise, this is a real error
              allIntervals := []string{}
              for _, curr := range perPodLeaseIntervals {
                  allIntervals = append(allIntervals, curr.String())
              }
              errs = append(errs, fmt.Errorf("missing acquiring stage for %v: all intervals\n\t%v", podLocator, strings.Join(allIntervals, "\n\t")))
          } else {
              ret = append(ret, monitorapi.NewInterval(monitorapi.SourcePodLog, monitorapi.Info).
                  Locator(interval.Locator).
                  Message(monitorapi.NewMessage().
                      Reason(monitorapi.LeaseAcquiring).
                      Constructed(monitorapi.ConstructionOwnerLeaseChecker).
                      HumanMessage("Waiting for lease."),
                  ).
                  Display().
                  Build(*lastAcquiringFrom, interval.From),
              )
              lastAcquiringFrom = nil
          }
      

      Option 3: Use a counter-based approach

      Track how many LeaseAcquiringStarted and LeaseAcquired events we've seen and ensure they match:

      for podLocator, perPodLeaseIntervals := range podToLeaseIntervals {
          var lastAcquiringFrom *time.Time
          acquiringCount := 0
          acquiredCount := 0
      
          for _, interval := range perPodLeaseIntervals {
              switch interval.Message.Reason {
              case monitorapi.LeaseAcquiringStarted:
                  acquiringCount++
                  if lastAcquiringFrom == nil {
                      lastAcquiringFrom = &interval.From
                  }
      
              case monitorapi.LeaseAcquired:
                  acquiredCount++
                  if lastAcquiringFrom == nil {
                      // Only error if we truly haven't seen a StartedAcquiring
                      if acquiringCount == 0 {
                          errs = append(errs, fmt.Errorf("missing acquiring stage for %v", podLocator))
                      }
                      continue  // Skip duplicate
                  }
                  ret = append(ret, ...) // build the LeaseAcquiring interval exactly as in the original code
                  lastAcquiringFrom = nil
              }
          }
      }
      

      Additional Information

      Filed by: dgoodwin@redhat.com

              rhn-engineering-dgoodwin (Devan Goodwin)
              openshift-trt (OpenShift Technical Release Team)