Type: Story
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Labels:
None

Activity Type:
Future Sustainability
Blocked:
False
Blocked Reason:
None
Ready:
False
Epic Link:
etcdsup
Story Points:
None

Target Version:
None
Release Blocker:
None
Sprint:
None

As a Support Eng, I need to know whether an issue is etcd related or not to choose the right escalation path.

As an etcd Eng, I need to understand what the issue is without spending a whole day on must-gathers only to find that the right information is missing.

—

rhn-support-pducai also wrote a nice troubleshooting script in https://github.com/peterducai/openshift-etcd-suite/blob/main/etcd.sh

He also documented a lot of learnings in: https://gitlab.cee.redhat.com/sbr-shift-emea/troubleshooter4/-/blob/master/etcd.md

We should add more of that information to the must-gather script:

- Add pprof goroutine dumps from CEO (see bug for more info)
- CP disk benchmarks (~~ETCD-269~~) to understand whether the disks are good enough for the cluster size
- ~~etcdctl key distribution by resource type and their total and average size (e.g. secrets / configmaps)~~
  - done in https://github.com/openshift/must-gather/pull/372
- etcd file disk size (i.e. defrag percentage)
- Clock sync troubles, missing NTP setup (chrony journalctl logs)
- Basic network troubleshooting like "ip -s link show" to understand dropped packet rates on CP nodes
- Audit log information to answer queries like:
  - who is listing all pods (or other resources) every minute?
  - who creates so many watches?
  - Michael Washer had plenty of others in https://access.redhat.com/support/cases/#/case/03121434
  - see also https://access.redhat.com/solutions/5743951
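
The audit log questions above ("who is listing all pods", "who creates so many watches") could be answered with something like the following sketch. It assumes the audit log is available as one JSON event per line with the standard Kubernetes audit event fields `verb`, `user.username`, `userAgent` and `objectRef.resource`. The function name `top_callers` and the sample events are made up for illustration and are not part of must-gather.

```python
import json
from collections import Counter


def top_callers(audit_lines, verb, resource=None, limit=5):
    """Count audit events per (username, userAgent) for one verb.

    audit_lines: iterable of strings, one JSON audit event per line.
    verb:        audit verb to match, e.g. "list" or "watch".
    resource:    optional objectRef.resource filter, e.g. "pods".
    """
    counts = Counter()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict) or event.get("verb") != verb:
            continue
        ref = event.get("objectRef") or {}
        if resource is not None and ref.get("resource") != resource:
            continue
        user = (event.get("user") or {}).get("username", "<unknown>")
        agent = event.get("userAgent", "<unknown>")
        counts[(user, agent)] += 1
    return counts.most_common(limit)


# Hypothetical events, hand-written for illustration only.
SAMPLE = [
    '{"verb": "list", "user": {"username": "system:serviceaccount:demo:poller"},'
    ' "userAgent": "poller/1.0", "objectRef": {"resource": "pods"}}',
    '{"verb": "list", "user": {"username": "system:serviceaccount:demo:poller"},'
    ' "userAgent": "poller/1.0", "objectRef": {"resource": "pods"}}',
    '{"verb": "list", "user": {"username": "alice"},'
    ' "userAgent": "kubectl/v1.27", "objectRef": {"resource": "pods"}}',
    '{"verb": "watch", "user": {"username": "alice"},'
    ' "userAgent": "kubectl/v1.27", "objectRef": {"resource": "secrets"}}',
    'not json',
]

if __name__ == "__main__":
    for (user, agent), n in top_callers(SAMPLE, "list", "pods"):
        print(f"{n:6d}  {user}  ({agent})")
```

Calling `top_callers(lines, "watch")` without a resource filter gives the equivalent ranking for watch creation. Grouping by user agent as well as username is a design choice here, since several clients often share one service account.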

AC:

Implement the above requirements into must-gather
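
As one illustration of the "defrag percentage" requirement, the sketch below reads "defrag percentage" as the share of the on-disk database file that a defragmentation could reclaim. That reading is an assumption, not something the ticket defines. It also assumes the input is the JSON printed by `etcdctl endpoint status -w json`, with `dbSize` and `dbSizeInUse` under each entry's `Status`. The function name and sample values are hypothetical.

```python
import json


def defrag_percentage(endpoint_status_json):
    """Return {endpoint: reclaimable percent} from endpoint status JSON.

    Reclaimable percent = 100 * (dbSize - dbSizeInUse) / dbSize,
    i.e. how much of the file on disk is free space inside the database.
    """
    result = {}
    for entry in json.loads(endpoint_status_json):
        status = entry["Status"]
        db_size = status["dbSize"]
        in_use = status["dbSizeInUse"]
        # Guard against an empty or unreported database size.
        percent = 100 * (db_size - in_use) / db_size if db_size else 0.0
        result[entry["Endpoint"]] = round(percent, 1)
    return result


# Hypothetical status output, hand-written for illustration only.
SAMPLE_STATUS = json.dumps([
    {"Endpoint": "https://10.0.0.1:2379",
     "Status": {"dbSize": 104857600, "dbSizeInUse": 52428800}},
    {"Endpoint": "https://10.0.0.2:2379",
     "Status": {"dbSize": 104857600, "dbSizeInUse": 94371840}},
])

if __name__ == "__main__":
    for endpoint, percent in defrag_percentage(SAMPLE_STATUS).items():
        print(f"{endpoint}: {percent}% reclaimable")
```

A high value on one member and a low value on the others would suggest that a defrag ran unevenly across the cluster.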

duplicates: OCPBUGS-16223 Include etcd object count while collecting must-gather (Closed)

1.	Add etcd metrics to must-gather	Closed	Natalie Ammerman (Inactive)
2.	Add network interface stats to must-gather	Closed	Natalie Ammerman (Inactive)
3.	Add clock sync info to must-gather	Closed	Natalie Ammerman (Inactive)
4.	Add pprof goroutine dumps from CEO	Closed	Natalie Ammerman (Inactive)

Assignee: Natalie Ammerman (Inactive)

Reporter: Thomas Jungblut

Need Info From: None

Votes: 0

Watchers: 4

Created: 2022/05/12 7:24 AM

Updated: 2025/09/13 9:26 AM

Resolved: 2025/07/01 7:24 PM
