-
Story
-
Resolution: Done
-
Normal
-
Future Sustainability
As a Support Eng, I need to know whether an issue is etcd related or not to choose the right escalation path.
As an etcd Eng, I need to understand what the issue is without spending a whole day on must-gathers only to find out that the right information is missing.
—
rhn-support-pducai also wrote a nice troubleshooting script in https://github.com/peterducai/openshift-etcd-suite/blob/main/etcd.sh
He also documented a lot of learnings in: https://gitlab.cee.redhat.com/sbr-shift-emea/troubleshooter4/-/blob/master/etcd.md
We should add more of that information to the must-gather script:
- add pprof goroutine dumps from CEO (see bug for more info)
- CP disk benchmarks (ETCD-269) to understand whether the disks are good enough for the cluster size
- etcdctl key distribution by resource type and their total and average size (e.g. secrets / configmaps)
- etcd file disk size (i.e. defrag percentage)
- Clock sync troubles, missing NTP setup (chrony journalctl logs)
- Basic network troubleshooting like "ip -s link show" to understand dropped packet rates on CP nodes
- Audit log information to answer queries like:
  - who is listing all pods (or other resources) every minute?
  - who creates so many watches?
- Michael Washer had plenty of others in https://access.redhat.com/support/cases/#/case/03121434
- see also https://access.redhat.com/solutions/5743951
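To make the "key distribution by resource type" item more concrete, here is a minimal sketch of the aggregation step. It assumes the must-gather script has already produced (key, value size in bytes) pairs, for example by iterating over etcdctl output, and that keys look like /<prefix>/<resource>/..., so the second path segment is taken as the resource type. The function name key_distribution and the input shape are hypothetical; keys for resources in named API groups may need a different split.

```python
from collections import defaultdict


def key_distribution(entries):
    """Aggregate (key, size_bytes) pairs by resource type.

    Assumption: keys look like /<prefix>/<resource>/..., e.g.
    /kubernetes.io/secrets/<namespace>/<name>, so the second path
    segment is used as the resource type.
    """
    stats = defaultdict(lambda: [0, 0])  # resource -> [count, total_bytes]
    for key, size in entries:
        parts = key.strip("/").split("/")
        resource = parts[1] if len(parts) > 1 else "(unknown)"
        stats[resource][0] += 1
        stats[resource][1] += size
    return {
        resource: {
            "count": count,
            "total_bytes": total,
            "avg_bytes": total / count,
        }
        for resource, (count, total) in stats.items()
    }


if __name__ == "__main__":
    # Illustrative sample data, not taken from a real cluster.
    sample = [
        ("/kubernetes.io/secrets/ns1/a", 100),
        ("/kubernetes.io/secrets/ns2/b", 300),
        ("/kubernetes.io/configmaps/ns1/c", 50),
    ]
    report = key_distribution(sample)
    # Print the largest resource types first.
    for resource, s in sorted(report.items(), key=lambda kv: -kv[1]["total_bytes"]):
        print(f"{resource}: count={s['count']} total={s['total_bytes']}B avg={s['avg_bytes']:.1f}B")
```

Sorting by total size puts the resource types that contribute most to the database size at the top, which is usually what a support engineer wants to see first.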
AC:
- Implement the above requirements in must-gather
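The audit log questions from the description ("who is listing all pods", "who creates so many watches") could be answered with a small aggregation like the sketch below. It assumes the audit log is available as JSON lines in the audit.k8s.io/v1 Event format, with the verb, user.username and objectRef.resource fields. The helper name top_callers is hypothetical. Depending on the audit policy, one request may be logged at more than one stage, so the counts are relative indicators rather than exact request numbers.

```python
import json
from collections import Counter


def top_callers(lines, verb, resource=None, limit=10):
    """Count audit events per username for a given verb (and optional resource).

    Assumption: each line is one JSON-encoded audit event with the fields
    "verb", "user.username" and "objectRef.resource".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if event.get("verb") != verb:
            continue
        ref = event.get("objectRef") or {}
        if resource is not None and ref.get("resource") != resource:
            continue
        user = (event.get("user") or {}).get("username", "(unknown)")
        counts[user] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    # Illustrative events, not taken from a real cluster.
    sample = [
        '{"verb": "list", "user": {"username": "a"}, "objectRef": {"resource": "pods"}}',
        '{"verb": "watch", "user": {"username": "c"}, "objectRef": {"resource": "secrets"}}',
    ]
    print("pod listers:", top_callers(sample, "list", "pods"))
    print("watchers:", top_callers(sample, "watch"))
```

Leaving resource unset counts the verb across all resources, which fits the watch question; passing "pods" narrows it to the pod-listing question.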
- duplicates: OCPBUGS-16223 Include etcd object count while collecting must-gather (Closed)