OpenShift Bugs / OCPBUGS-26037

HCP: oc commands fail after a long run of 10+ days with "etcdserver: no leader"


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Undefined
    • None
    • 4.14.z
    • Etcd
    • No
    • False

      Description of problem:

      A ticket has been opened with SRE to help collect log info from the management cluster:

      https://issues.redhat.com/browse/OHSS-30451 

      A Reliability-v2 test (loaded long run) started on this cluster at [2023-12-18 08:12:15 UTC]. Since about [2024-01-02 10:56:54 UTC] (10+ days later), oc commands have been failing intermittently with 'Unable to connect to the server: EOF' or 'error: EOF', and responses have been slow. The issue has lasted for 20+ hours so far. A 7-day reliability test was run on this cluster before this one, and I did not see the issue then.

      [Log collected from SRE] From the openshift-oauth-apiserver logs:

      E0102 10:50:45.084622       1 watcher.go:253] watch chan error: etcdserver: no leader
      I0102 10:50:45.084678       1 reflector.go:453] storage/cacher.go:/oauth/authorizetokens: retrying watch of *oauth.OAuthAuthorizeToken internal error: Internal error occurred: etcdserver: no leader
      E0102 10:50:45.086396       1 watcher.go:253] watch chan error: etcdserver: no leader
      I0102 10:50:45.086432       1 reflector.go:453] storage/cacher.go:/oauth/authorizetokens: retrying watch of *oauth.OAuthAuthorizeToken internal error: Internal error occurred: etcdserver: no leader
      E0102 10:50:45.086937       1 watcher.go:253] watch chan error: etcdserver: no leader
      I0102 10:50:45.086976       1 reflector.go:453] storage/cacher.go:/groups: retrying watch of *user.Group internal error: Internal error occurred: etcdserver: no leader
      E0102 10:50:45.087680       1 watcher.go:253] watch chan error: etcdserver: no leader
      I0102 10:50:45.087709       1 reflector.go:453] storage/cacher.go:/oauth/authorizetokens: retrying watch of *oauth.OAuthAuthorizeToken internal error: Internal error occurred: etcdserver: no leader
      E0102 10:50:45.087877       1 watcher.go:253] watch chan error: etcdserver: no leader
      I0102 10:50:45.087908       1 reflector.go:453] storage/cacher.go:/useridentities: retrying watch of *user.Identity internal error: Internal error occurred: etcdserver: no leader
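To see which watched resources are affected, the excerpt above can be tallied with a small script. This is a minimal sketch, assuming the logs keep the klog line layout shown here; the function name `summarize_no_leader` and the regular expressions are illustrative, not part of any OpenShift tooling. The sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# klog-style line: severity letter + MMDD, time, PID, "file:line]", message.
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<src>[^\]]+)\] (?P<msg>.*)$"
)
# Resource path in reflector retry messages, e.g. "storage/cacher.go:/groups:".
RESOURCE_RE = re.compile(r"storage/cacher\.go:(?P<resource>/[^:]+):")


def summarize_no_leader(lines):
    """Return (watch error count, retries per resource) for 'no leader' lines."""
    watch_errors = 0
    retries = Counter()
    for line in lines:
        m = KLOG_RE.match(line.strip())
        if not m or "etcdserver: no leader" not in m["msg"]:
            continue
        if m["sev"] == "E":
            watch_errors += 1
        res = RESOURCE_RE.search(m["msg"])
        if res:
            retries[res["resource"]] += 1
    return watch_errors, retries


# Four lines copied from the excerpt in this report.
SAMPLE = [
    "E0102 10:50:45.084622       1 watcher.go:253] watch chan error: etcdserver: no leader",
    "I0102 10:50:45.084678       1 reflector.go:453] storage/cacher.go:/oauth/authorizetokens: "
    "retrying watch of *oauth.OAuthAuthorizeToken internal error: "
    "Internal error occurred: etcdserver: no leader",
    "E0102 10:50:45.086937       1 watcher.go:253] watch chan error: etcdserver: no leader",
    "I0102 10:50:45.086976       1 reflector.go:453] storage/cacher.go:/groups: "
    "retrying watch of *user.Group internal error: "
    "Internal error occurred: etcdserver: no leader",
]

errors, per_resource = summarize_no_leader(SAMPLE)
print(errors, dict(per_resource))
```

Running the same function over the full log file would show whether the "no leader" errors are spread across all watched resources or concentrated on a few.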

      Version-Release number of selected component (if applicable):

      4.14.3   

      How reproducible:

      First time

      Steps to Reproduce:

      1. Create an HCP cluster
      2. Run the reliability-v2 test: https://github.com/openshift/svt/tree/master/reliability-v2. The test runs for a long time and simulates multiple customers using the cluster with oc commands. The config is: 1 admin, 10 dev-test, 1 dev-prod
      3. Monitor the test for failures during the long run
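For step 3, a standalone probe can help separate the EOF failures from other errors while the test runs. This is a hedged sketch, not how reliability-v2 itself monitors failures: the helper names `classify` and `run_oc`, the 60-second timeout, and the choice of `oc get projects` as the probe command are all assumptions made for illustration. Only the two error strings come from this report.

```python
import subprocess
import time

# Error strings quoted in this report; any other failure is labeled "other".
EOF_MARKERS = ("Unable to connect to the server: EOF", "error: EOF")


def classify(returncode, stderr):
    """Label one oc invocation as 'ok', 'eof', or 'other'."""
    if returncode == 0:
        return "ok"
    if any(marker in stderr for marker in EOF_MARKERS):
        return "eof"
    return "other"


def run_oc(args, timeout=60):
    """Run one oc command and return (label, elapsed seconds).

    Assumes the oc binary is on PATH and already logged in to the cluster.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            ["oc", *args], capture_output=True, text=True, timeout=timeout
        )
        label = classify(proc.returncode, proc.stderr)
    except subprocess.TimeoutExpired:
        label = "timeout"
    return label, time.monotonic() - start
```

Calling `run_oc(["get", "projects"])` at a fixed interval and logging the label together with the elapsed time would give a timeline of both the EOF failures and the slow responses.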

      Actual results:

      oc commands failed with 'Unable to connect to the server: EOF' or 'error: EOF' after a long run of 10+ days.

      Expected results:

      oc commands succeed without failures or slow responses.

      Additional info:

      OpenShift Version:          4.14.3
      Region: us-west-2
      Availability: Control Plane: MultiAZ, Data Plane: MultiAZ
      Nodes: Compute (Autoscaled): 3-6, Compute (current): 3

      If you need more debug logs or info from the management cluster, please ask in https://issues.redhat.com/browse/OHSS-30451 so that SRE can help collect them. QE can help collect logs or info from the host cluster.

            dwest@redhat.com Dean West
            rhn-support-qili Qiujie Li
            ge liu ge liu