OpenShift Bugs / OCPBUGS-62755

[release-4.20] cluster-api operator blips degraded every 35 minutes


    • Quality / Stability / Reliability
    • False
    • CLOUD Sprint 278
    • 1
    • In Progress
    • Bug Fix
      The bug occurred when the cluster-api-operator's kubeconfig controller attempted to use a regenerated authentication token secret before the token value was fully populated.

      Users would see recurring, transient reconciliation errors in the event log every 30 minutes, each causing the operator to blip briefly into a degraded state.

      The controller now waits for the authentication token to be populated within the secret before proceeding, which prevents the operator from going into a degraded state.

      With this fix, routine token rotation completes without generating unnecessary error events or degraded-state blips.
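The waiting behavior described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code (the real controller is not shown in this issue): `ReconcileResult`, `reconcile_kubeconfig`, and the `hypothetical-user` name are made up for the example, and the assumption is that the token lives under a `token` key in the secret's data.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class ReconcileResult:
    kubeconfig: Optional[str]  # rendered kubeconfig, or None if not ready yet
    requeue: bool              # True when the caller should retry later


def reconcile_kubeconfig(secret_data: Mapping[str, bytes]) -> ReconcileResult:
    """Hypothetical reconcile step: treat an empty token as 'not ready'."""
    # Assumption: the token is stored under the "token" key of the secret.
    token = secret_data.get("token", b"")
    if not token:
        # The secret was regenerated but the token is not populated yet.
        # Ask for a retry instead of raising an error, so that no
        # "token can't be empty" failure is reported and nothing degrades.
        return ReconcileResult(kubeconfig=None, requeue=True)

    # Token is present: render a (heavily simplified) kubeconfig fragment.
    kubeconfig = (
        "users:\n"
        "- name: hypothetical-user\n"
        "  user:\n"
        f"    token: {token.decode()}\n"
    )
    return ReconcileResult(kubeconfig=kubeconfig, requeue=False)
```

The design point is that "not populated yet" is modeled as a retry signal rather than an error, so a routine rotation never surfaces as a failed reconcile.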

      This is a clone of issue OCPBUGS-62620. The following is the description of the original issue:

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [sig-arch] events should not repeat pathologically

      Extreme regression detected.
      Fisher's exact test probability of a regression: 100.00%.
      Test pass rate dropped from 100.00% to 46.43%.

      Sample (being evaluated) Release: 4.21
      Start Time: 2025-09-03T00:00:00Z
      End Time: 2025-10-01T12:00:00Z
      Success Rate: 46.43%
      Successes: 13
      Failures: 15
      Flakes: 0
      Base (historical) Release: 4.19
      Start Time: 2025-05-18T00:00:00Z
      End Time: 2025-06-17T00:00:00Z
      Success Rate: 100.00%
      Successes: 79
      Failures: 0
      Flakes: 0
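As a rough cross-check of the figures above, the sketch below recomputes the pass rates and a one-sided Fisher's exact test from the reported counts. It assumes the reported "probability of a regression" corresponds to one minus the one-sided p-value; that mapping is an assumption about Component Readiness, not something stated in this issue. The function name `fisher_one_sided` is illustrative.

```python
from math import comb


def fisher_one_sided(sample_fail: int, sample_pass: int,
                     base_fail: int, base_pass: int) -> float:
    """P-value for seeing at least `sample_fail` failures in the sample,
    given the table margins (hypergeometric upper tail)."""
    n_sample = sample_fail + sample_pass
    total_fail = sample_fail + base_fail
    total = n_sample + base_fail + base_pass
    max_fail = min(n_sample, total_fail)
    # Number of ways to place k failures in the sample, summed over the tail.
    tail = sum(
        comb(total_fail, k) * comb(total - total_fail, n_sample - k)
        for k in range(sample_fail, max_fail + 1)
    )
    return tail / comb(total, n_sample)


# Counts taken from the report: sample 4.21 vs. base 4.19.
sample_pass, sample_fail = 13, 15
base_pass, base_fail = 79, 0

sample_rate = 100 * sample_pass / (sample_pass + sample_fail)
base_rate = 100 * base_pass / (base_pass + base_fail)
p_value = fisher_one_sided(sample_fail, sample_pass, base_fail, base_pass)

print(f"sample pass rate: {sample_rate:.2f}%")
print(f"base pass rate:   {base_rate:.2f}%")
print(f"1 - p, as a percentage: {100 * (1 - p_value):.2f}%")
```

Because every one of the 15 failures falls in the sample and none in the base, the p-value is extremely small, which is consistent with the report rounding to 100.00%.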

      View the test details report for additional context.

      You can see the operator blipping degraded at a regular cadence in this chart.

      The full error is:

      {  1 events happened too frequently
      
      event happened 24 times, something is wrong: clusteroperator/cluster-api hmsg/dbc7a25cc8 - reason/Status degraded error generating kubeconfig: token can't be empty (15:39:29Z) result=reject }
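For triage, the failure line above can be picked apart with a small parser. This is only a sketch: the regular expression is fitted to this one message, and `ASSUMED_THRESHOLD = 20` is a placeholder for whatever repeat limit the test actually enforces, which this issue does not state.

```python
import re

# Assumption: placeholder repeat limit; the real test's limit is not given here.
ASSUMED_THRESHOLD = 20

# Fitted to the single message quoted in this issue, not a general grammar.
EVENT_RE = re.compile(
    r"event happened (?P<count>\d+) times, something is wrong: "
    r"(?P<locator>\S+) .* result=(?P<result>\w+)"
)

line = (
    "event happened 24 times, something is wrong: "
    "clusteroperator/cluster-api hmsg/dbc7a25cc8 - reason/Status degraded "
    "error generating kubeconfig: token can't be empty (15:39:29Z) "
    "result=reject }"
)

match = EVENT_RE.search(line)
if match is None:
    raise ValueError("line does not look like a repeated-event failure")

count = int(match.group("count"))
locator = match.group("locator")
result = match.group("result")
exceeds = count > ASSUMED_THRESHOLD

print(f"{locator}: repeated {count} times, result={result}, "
      f"over assumed threshold: {exceeds}")
```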
      

      This is failing routinely in this multiarch job, but I can't seem to find any hits on amd64.

      Note that 4.19 is fully passing, but 4.20 looks just as bad as 4.21, so we need to know ASAP how serious this is and whether we can still GA 4.20 with it.

      Filed by: dgoodwin@redhat.com

              ddonati@redhat.com Damiano Donati
              openshift-trt OpenShift Technical Release Team
              Zhaohua Sun