Description of problem:
A ticket was opened with SRE to help collect log information from the management cluster:
https://issues.redhat.com/browse/OHSS-30451
A Reliability-v2 test (loaded long run) was started on this cluster at [2023-12-18 08:12:15 UTC]. Since about [2024-01-02 10:56:54 UTC] (10+ days later), oc commands have been failing intermittently with 'Unable to connect to the server: EOF' or 'error: EOF', and responses have been slow. The issue has lasted for 20+ hours so far. Before this test, a 7-day reliability test was run on this cluster and I did not see this issue.
[Log collected from SRE] From the openshift-oauth-apiserver logs:
E0102 10:50:45.084622 1 watcher.go:253] watch chan error: etcdserver: no leader
I0102 10:50:45.084678 1 reflector.go:453] storage/cacher.go:/oauth/authorizetokens: retrying watch of *oauth.OAuthAuthorizeToken internal error: Internal error occurred: etcdserver: no leader
E0102 10:50:45.086396 1 watcher.go:253] watch chan error: etcdserver: no leader
I0102 10:50:45.086432 1 reflector.go:453] storage/cacher.go:/oauth/authorizetokens: retrying watch of *oauth.OAuthAuthorizeToken internal error: Internal error occurred: etcdserver: no leader
E0102 10:50:45.086937 1 watcher.go:253] watch chan error: etcdserver: no leader
I0102 10:50:45.086976 1 reflector.go:453] storage/cacher.go:/groups: retrying watch of *user.Group internal error: Internal error occurred: etcdserver: no leader
E0102 10:50:45.087680 1 watcher.go:253] watch chan error: etcdserver: no leader
I0102 10:50:45.087709 1 reflector.go:453] storage/cacher.go:/oauth/authorizetokens: retrying watch of *oauth.OAuthAuthorizeToken internal error: Internal error occurred: etcdserver: no leader
E0102 10:50:45.087877 1 watcher.go:253] watch chan error: etcdserver: no leader
I0102 10:50:45.087908 1 reflector.go:453] storage/cacher.go:/useridentities: retrying watch of *user.Identity internal error: Internal error occurred: etcdserver: no leader
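For anyone triaging a larger log dump, the entries above could be tallied with something like the following. This is only a rough sketch: the regular expressions are my reading of the klog-style header seen in these lines, the function name summarize_no_leader is hypothetical, and the sample input is copied from the log excerpt in this report.

```python
import re
from collections import Counter

# Assumed klog-style header: severity letter, MMDD, time, thread/pid, file:line]
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<pid>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)
# Resource kind named in reflector "retrying watch of *<kind>" messages.
WATCH_RE = re.compile(r"retrying watch of \*(?P<kind>[\w.]+)")


def summarize_no_leader(lines):
    """Count 'etcdserver: no leader' entries by severity and by watched kind."""
    severities = Counter()
    kinds = Counter()
    for raw in lines:
        m = KLOG_RE.match(raw.strip())
        if not m or "etcdserver: no leader" not in m["msg"]:
            continue  # skip unparseable or unrelated lines
        severities[m["sev"]] += 1
        w = WATCH_RE.search(m["msg"])
        if w:
            kinds[w["kind"]] += 1
    return severities, kinds


# Sample lines taken from the excerpt in this report.
SAMPLE = [
    "E0102 10:50:45.084622 1 watcher.go:253] watch chan error: etcdserver: no leader",
    "I0102 10:50:45.084678 1 reflector.go:453] storage/cacher.go:/oauth/authorizetokens: "
    "retrying watch of *oauth.OAuthAuthorizeToken internal error: "
    "Internal error occurred: etcdserver: no leader",
    "I0102 10:50:45.086976 1 reflector.go:453] storage/cacher.go:/groups: "
    "retrying watch of *user.Group internal error: "
    "Internal error occurred: etcdserver: no leader",
    "I0102 10:50:45.087908 1 reflector.go:453] storage/cacher.go:/useridentities: "
    "retrying watch of *user.Identity internal error: "
    "Internal error occurred: etcdserver: no leader",
]

if __name__ == "__main__":
    sev, kinds = summarize_no_leader(SAMPLE)
    print(dict(sev))
    print(dict(kinds))
```

Grouping by watched kind may help show whether the "no leader" errors are spread across all resources (pointing at etcd itself) or concentrated on a few.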
Version-Release number of selected component (if applicable):
4.14.3
How reproducible:
First time
Steps to Reproduce:
1. Create an HCP cluster.
2. Run the reliability-v2 test: https://github.com/openshift/svt/tree/master/reliability-v2
   The test runs for a long time and simulates multiple customers using the cluster with oc commands. The config is: 1 admin, 10 dev-test, 1 dev-prod.
3. Monitor the test failures during the long run.
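To separate the EOF failures described in this report from other oc errors while monitoring, a small classifier along these lines could be used. This is a minimal sketch and not how reliability-v2 itself tracks failures; the names classify and run_and_classify, the category labels, and the 60-second timeout are all assumptions made for illustration.

```python
import subprocess

# The two error strings quoted in this report.
EOF_MARKERS = ("Unable to connect to the server: EOF", "error: EOF")


def classify(returncode, stderr):
    """Map one command result to 'ok', 'eof', or 'other'."""
    if returncode == 0:
        return "ok"
    if any(marker in stderr for marker in EOF_MARKERS):
        return "eof"
    return "other"


def run_and_classify(argv, timeout=60):
    """Run a command (e.g. an oc invocation) and classify its outcome.

    The timeout value is an arbitrary assumption; a hung command is
    reported as 'timeout' so slow responses are visible separately.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    return classify(proc.returncode, proc.stderr)


# Example (requires oc on PATH and a logged-in session):
#   run_and_classify(["oc", "get", "projects"])
```

Counting the 'eof' and 'timeout' results per hour would give a simple timeline of when the failures started.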
Actual results:
oc commands failed with 'Unable to connect to the server: EOF' or 'error: EOF' after the test had been running for 10+ days.
Expected results:
oc commands complete without failures or slow responses.
Additional info:
OpenShift Version: 4.14.3
Region: us-west-2
Availability:
  Control Plane: MultiAZ
  Data Plane: MultiAZ
Nodes:
  Compute (Autoscaled): 3-6
  Compute (current): 3
If you need more debug logs or information from the management cluster, please ask in https://issues.redhat.com/browse/OHSS-30451 so that SRE can help collect them. QE can help collect logs or information from the host cluster.