
Type: Bug
Resolution: Cannot Reproduce
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.14.z
Component/s: Etcd
Labels:
- PerfScale

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
No

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

A ticket has been opened with SRE to help collect logs from the management cluster:

https://issues.redhat.com/browse/OHSS-30451

A Reliability-v2 test (loaded long run) started on this cluster at [2023-12-18 08:12:15 UTC]. Since about [2024-01-02 10:56:54 UTC] (about 15 days later), oc commands have intermittently failed with 'Unable to connect to the server: EOF' or 'error: EOF', and responses have been slow. The issue has persisted for 20+ hours so far. A 7-day reliability test was run on this cluster before this test, and I did not see this issue.

[Log collected from SRE] From the openshift-oauth-apiserver logs:

E0102 10:50:45.084622       1 watcher.go:253] watch chan error: etcdserver: no leader
I0102 10:50:45.084678       1 reflector.go:453] storage/cacher.go:/oauth/authorizetokens: retrying watch of *oauth.OAuthAuthorizeToken internal error: Internal error occurred: etcdserver: no leader
E0102 10:50:45.086396       1 watcher.go:253] watch chan error: etcdserver: no leader
I0102 10:50:45.086432       1 reflector.go:453] storage/cacher.go:/oauth/authorizetokens: retrying watch of *oauth.OAuthAuthorizeToken internal error: Internal error occurred: etcdserver: no leader
E0102 10:50:45.086937       1 watcher.go:253] watch chan error: etcdserver: no leader
I0102 10:50:45.086976       1 reflector.go:453] storage/cacher.go:/groups: retrying watch of *user.Group internal error: Internal error occurred: etcdserver: no leader
E0102 10:50:45.087680       1 watcher.go:253] watch chan error: etcdserver: no leader
I0102 10:50:45.087709       1 reflector.go:453] storage/cacher.go:/oauth/authorizetokens: retrying watch of *oauth.OAuthAuthorizeToken internal error: Internal error occurred: etcdserver: no leader
E0102 10:50:45.087877       1 watcher.go:253] watch chan error: etcdserver: no leader
I0102 10:50:45.087908       1 reflector.go:453] storage/cacher.go:/useridentities: retrying watch of *user.Identity internal error: Internal error occurred: etcdserver: no leader
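
To see which watches are affected and how often, an excerpt like the one above can be summarized with a short script. This is only a sketch: it assumes the standard klog line header seen in the excerpt, and the names `summarize_no_leader`, `KLOG_RE`, `PATH_RE`, and `SAMPLE_LOG` are hypothetical helpers, not part of any OpenShift tooling.

```python
import re
from collections import Counter

# klog header as seen in the excerpt:
# severity letter, MMDD, HH:MM:SS.micros, thread id, "file:line]", message
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ "
    r"(?P<src>[^\s\]]+)\] (?P<msg>.*)$"
)
# Storage path in reflector retry messages, e.g. "storage/cacher.go:/groups:"
PATH_RE = re.compile(r"storage/cacher\.go:(?P<path>/[^:\s]+):")


def summarize_no_leader(lines):
    """Count 'etcdserver: no leader' watch errors and retries per storage path."""
    watch_errors = 0
    retries = Counter()
    for line in lines:
        m = KLOG_RE.match(line.strip())
        if not m or "etcdserver: no leader" not in m["msg"]:
            continue
        if m["sev"] == "E":
            watch_errors += 1
        p = PATH_RE.search(m["msg"])
        if p:
            retries[p["path"]] += 1
    return watch_errors, retries


# Four lines copied from the excerpt above.
SAMPLE_LOG = """\
E0102 10:50:45.084622       1 watcher.go:253] watch chan error: etcdserver: no leader
I0102 10:50:45.084678       1 reflector.go:453] storage/cacher.go:/oauth/authorizetokens: retrying watch of *oauth.OAuthAuthorizeToken internal error: Internal error occurred: etcdserver: no leader
E0102 10:50:45.086937       1 watcher.go:253] watch chan error: etcdserver: no leader
I0102 10:50:45.086976       1 reflector.go:453] storage/cacher.go:/groups: retrying watch of *user.Group internal error: Internal error occurred: etcdserver: no leader
"""

if __name__ == "__main__":
    errors, per_path = summarize_no_leader(SAMPLE_LOG.splitlines())
    print("watch errors:", errors)
    for path, count in per_path.most_common():
        print(path, count)
```

Run against the full pod log, the per-path counts would show whether the "no leader" condition hits all watched resources equally or only some of them.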

Version-Release number of selected component (if applicable):

4.14.3

How reproducible:

First time

Steps to Reproduce:

1. Create an HCP cluster
2. Run the reliability-v2 test https://github.com/openshift/svt/tree/master/reliability-v2. The test runs for a long time and simulates multiple customers' usage of the cluster with oc commands. The config is: 1 admin, 10 dev-test, 1 dev-prod
3. Monitor the test failures during the long run
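
For step 3, one way to separate the EOF failures described in this report from other oc failures is sketched below. This is an illustration only, not how the reliability-v2 test tracks results: `classify` and `run_oc` are hypothetical helpers, and the sketch assumes the two error strings appear on stderr and that `oc` is on the PATH.

```python
import subprocess

# The two failure messages quoted in this report.
EOF_MARKERS = ("Unable to connect to the server: EOF", "error: EOF")


def classify(returncode, stderr):
    """Classify one oc invocation as 'ok', 'eof', or 'other_failure'."""
    if returncode == 0:
        return "ok"
    if any(marker in stderr for marker in EOF_MARKERS):
        return "eof"
    return "other_failure"


def run_oc(args, timeout=60):
    """Run `oc <args>` and classify the result (hypothetical helper).

    Assumes `oc` is installed; a missing binary raises FileNotFoundError.
    """
    try:
        proc = subprocess.run(
            ["oc", *args], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # Slow responses beyond the timeout are reported separately.
        return "timeout"
    return classify(proc.returncode, proc.stderr)
```

Calling something like `run_oc(["get", "pods"])` in a loop and recording each result with a timestamp would give a failure rate over time that can be lined up against the management cluster logs.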

Actual results:

oc commands failed with 'Unable to connect to the server: EOF' or 'error: EOF' after the test had been running for 10+ days.

Expected results:

oc commands have no failures and no slow responses.

Additional info:

OpenShift Version: 4.14.3
Region: us-west-2
Availability:
Control Plane: MultiAZ
Data Plane: MultiAZ
Nodes:
Compute (Autoscaled): 3-6
Compute (current): 3

If you need more debug logs or information from the management cluster, please ask in https://issues.redhat.com/browse/OHSS-30451 so that SRE can help collect them. QE can help collect logs or information from the host cluster.

Assignee: Dean West

Reporter: Qiujie Li

Need Info From: None

Contributors: None

QA Contact: Ge Liu

Doc Contact: None

Votes: 0

Watchers: 4

Created: 2024/01/04 2:27 AM

Updated: 2025/07/24 11:33 AM

Resolved: 2024/01/12 8:00 AM
