OpenShift Bugs / OCPBUGS-43113

SNO Snapshot Controller excessive restarts


    • Important
    • OCPEDGE Sprint 261, OCPEDGE Sprint 263
      [sig-architecture] platform pods in ns/openshift-cluster-storage-operator should not exit an excessive amount of times

      The snapshot controller on SNO restarts excessively while the kube-apiserver operator is progressing; the failures occur because the controller cannot pull volume snapshots from the API server during startup.

      After some investigation, the best approach appears to be adjusting the interval the snapshot controller waits before resuming its operation. We can't rely on health-check or startup probes for this deployment, because the restart mechanism is part of the operand itself; it is not Kubernetes restarting the pod. It may be best to use the --retry-crd-interval-max flag for SNO deployments of the operand, so the controller tolerates the API server being unreachable during rollouts. The operator applies the operand with these args, and the deployment is run through a template processor that we should be able to hook into to update this behavior. (template replace logic)
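      A minimal sketch of the template-replace idea described above, in Go. The placeholder name, flag value, and topology check are all assumptions for illustration; they are not the operator's actual template variables or chosen intervals:

      ```go
      package main

      import (
      	"fmt"
      	"strings"
      )

      // deploymentTemplate is a hypothetical fragment of the snapshot-controller
      // operand manifest; ${RETRY_CRD_INTERVAL_MAX} is an assumed placeholder,
      // not the operator's real template variable.
      const deploymentTemplate = `args:
  - --leader-election=false
  - --retry-crd-interval-max=${RETRY_CRD_INTERVAL_MAX}
`

      // renderTemplate substitutes the retry interval based on topology: on
      // single-node (SNO) clusters, a longer maximum retry interval tolerates
      // the API server being unreachable during rollouts.
      func renderTemplate(singleNode bool) string {
      	interval := "30s" // assumed default, for illustration only
      	if singleNode {
      		interval = "5m" // illustrative SNO value, not a confirmed choice
      	}
      	return strings.ReplaceAll(deploymentTemplate, "${RETRY_CRD_INTERVAL_MAX}", interval)
      }

      func main() {
      	fmt.Print(renderTemplate(true))
      }
      ```

      The point of doing the substitution in the template processor (rather than via probes) is that the flag changes the operand's own retry behavior, which is where the restart loop originates.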

      Note: This error appears to be present in the 4.17 branches as well.

      Ex run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29183/pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade/1844749579753361408

              sakbas@redhat.com Suleyman Akbas
              ehila@redhat.com Egli Hila
              Neil Hamza
              Votes: 0
              Watchers: 2
