OpenShift Bugs / OCPBUGS-29453

catalogd crash loops after etcd restore


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Normal
    • Fix Version/s: 4.16.0
    • Affects Version/s: 4.15.0, 4.16.0
    • Component/s: OLM / Registry
    • Release Note Text:

      * Previously, the catalogd component could crash loop after an etcd restore. This was due to the garbage collection process causing a looping failure state when the API server was unreachable. This bug fix updates catalogd to add a retry loop, and as a result catalogd no longer crashes in this scenario. (link:https://issues.redhat.com/browse/OCPBUGS-29453[*OCPBUGS-29453*])
    • Release Note Type: Bug Fix
    • Done
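
      The release note above says the fix adds a retry loop around catalogd's garbage collection. What follows is a minimal sketch of that pattern under stated assumptions, not the actual catalogd implementation: collectOrphanedCatalogs is a hypothetical stand-in for the startup garbage collection, and the backoff values are illustrative. retry.OnError and wait.Backoff come from client-go/apimachinery.

      // Sketch only: hypothetical helper name, illustrative backoff values.
      package main

      import (
          "context"
          "errors"
          "log"
          "time"

          "k8s.io/apimachinery/pkg/util/wait"
          "k8s.io/client-go/util/retry"
      )

      // collectOrphanedCatalogs stands in for the startup garbage collection
      // that needs the API server; it fails while the API server is unreachable.
      func collectOrphanedCatalogs(ctx context.Context) error {
          // ... list catalog resources and delete orphaned unpacked content ...
          return errors.New("api server unreachable")
      }

      func main() {
          ctx := context.Background()

          // Retry with exponential backoff instead of propagating the error and
          // exiting, which is what makes the kubelet back off and the pod crash loop.
          backoff := wait.Backoff{
              Duration: 2 * time.Second, // initial delay
              Factor:   2.0,             // double the delay on each attempt
              Jitter:   0.1,
              Steps:    10,              // give etcd/the API server time to recover
              Cap:      2 * time.Minute, // upper bound on a single delay
          }
          err := retry.OnError(backoff,
              func(error) bool { return true }, // treat every failure as transient here
              func() error { return collectOrphanedCatalogs(ctx) },
          )
          if err != nil {
              log.Printf("garbage collection still failing after retries: %v", err)
          }
      }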

      Description of problem:

      The etcd team has introduced an e2e test that exercises a full etcd backup and restore cycle in OCP [1].
      
      We run those tests as part of our PR builds, and since 4.15 [2] (also on 4.16 [3]) we have seen failed runs with catalogd-controller-manager crash looping:
      
      1 events happened too frequently
      event [namespace/openshift-catalogd node/ip-10-0-25-29.us-west-2.compute.internal pod/catalogd-controller-manager-768bb57cdb-nwbhr hmsg/47b381d71b - Back-off restarting failed container manager in pod catalogd-controller-manager-768bb57cdb-nwbhr_openshift-catalogd(aa38d084-ecb7-4588-bd75-f95adb4f5636)] happened 44 times}
      
      
      I assume something in that controller doesn't deal gracefully with the etcd restoration process, or with the API server being down for some time (a rough sketch of that failure mode follows after the links below).
      
      
      [1] https://github.com/openshift/origin/blob/master/test/extended/dr/recovery.go#L97
      
      [2] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1205/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-etcd-recovery/1757443629380538368
      
      [3] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1191/pull-ci-openshift-cluster-etcd-operator-release-4.15-e2e-aws-etcd-recovery/1752293248543494144
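
      For illustration only, here is a rough sketch of that suspected failure mode, not the actual catalogd code: a hypothetical collectOrphanedCatalogs stands in for whatever startup work needs the API server. Treating its transient error as fatal makes the container exit non-zero, and the kubelet's restart back-off is exactly the event the invariant counts.

      // Sketch of the suspected failure mode; all names are hypothetical.
      package main

      import (
          "context"
          "errors"
          "log"
      )

      // collectOrphanedCatalogs stands in for startup work that needs the API
      // server; it fails while the API server is still recovering from the restore.
      func collectOrphanedCatalogs(ctx context.Context) error {
          return errors.New("connection refused")
      }

      func main() {
          // Fail-fast on a transient error: the process exits with status 1, the
          // kubelet restarts the container, and repeated failures produce the
          // "Back-off restarting failed container" events seen in CI.
          if err := collectOrphanedCatalogs(context.Background()); err != nil {
              log.Fatalf("garbage collection failed: %v", err)
          }
          // ... start the controller manager ...
      }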
      
      

      Version-Release number of selected component (if applicable):

      > 4.15

      How reproducible:

      Always, by running the test.

      Steps to Reproduce:

      Run the test:
      
      [sig-etcd][Feature:DisasterRecovery][Suite:openshift/etcd/recovery][Timeout:2h] [Feature:EtcdRecovery][Disruptive] Recover with snapshot with two unhealthy nodes and lost quorum [Serial]     
      
      and observe the event invariant failing on the crash loop (a client-go sketch for spotting the back-off events follows below)
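
      For illustration, a small client-go sketch that looks for the kubelet back-off events the invariant counts. The openshift-catalogd namespace and the reason=BackOff field selector match the failure output above; the kubeconfig handling and filtering are assumptions, not part of the origin test suite.

      // Sketch only: checks for BackOff events on catalogd-controller-manager pods.
      package main

      import (
          "context"
          "fmt"
          "log"
          "strings"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Build a client from the local kubeconfig (e.g. after the restore test ran).
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              log.Fatal(err)
          }
          client, err := kubernetes.NewForConfig(cfg)
          if err != nil {
              log.Fatal(err)
          }

          // BackOff events in openshift-catalogd correspond to the crash loop.
          events, err := client.CoreV1().Events("openshift-catalogd").List(context.TODO(),
              metav1.ListOptions{FieldSelector: "reason=BackOff"})
          if err != nil {
              log.Fatal(err)
          }
          for _, ev := range events.Items {
              if strings.Contains(ev.InvolvedObject.Name, "catalogd-controller-manager") {
                  fmt.Printf("%s: %s (count=%d)\n", ev.InvolvedObject.Name, ev.Message, ev.Count)
              }
          }
      }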

      Actual results:

      catalogd-controller-manager crash loops and causes our CI jobs to fail

      Expected results:

      Our e2e job is green again and catalogd-controller-manager no longer crash loops.

      Additional info:

       

            Assignee: Bryce Palmer (rh-ee-bpalmer)
            Reporter: Thomas Jungblut (tjungblu@redhat.com)
            QA Contact: Jia Fan
            Votes: 0
            Watchers: 8
