-
Epic
-
Resolution: Unresolved
-
Disruption Correlated Event & Resource Monitoring
-
Future Sustainability
-
0% To Do, 0% In Progress, 100% Done
When we see disruption, other events in the system often occur around the same time and point us toward the root cause.
etcd
- compaction
- disk write latency
- network degradation
- leader elections
API server
- increased latency
- request timeouts / failures
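The correlation idea above can be sketched as a simple time-window match: given one disruption interval, return the etcd and API server events that overlap it or fall within a short padding window around it. This is only an illustration under stated assumptions. The `Interval` type, its fields, the event labels, and the 60-second default window are all hypothetical and are not taken from any existing tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A timed observation. All field names here are hypothetical."""
    source: str   # e.g. "etcd" or "apiserver"
    kind: str     # e.g. "leader-election", "compaction"
    start: float  # Unix seconds
    end: float    # Unix seconds


def correlated_events(disruption, events, window_s=60.0):
    """Return events that overlap the disruption, padded by window_s on both sides.

    The result is sorted by start time so the earliest candidate cause comes first.
    """
    lo = disruption.start - window_s
    hi = disruption.end + window_s
    hits = [e for e in events if e.start <= hi and e.end >= lo]
    return sorted(hits, key=lambda e: e.start)
```

A wider window catches slow-building causes such as disk write latency, at the cost of more unrelated matches, so the padding is worth tuning per event type.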
Prometheus metrics collection example
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{subresource!="log",verb!~"WATCH|WATCHLIST|PROXY"}[5m])) by(resource,le))
histogram_quantile(0.99, irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m]))
(increase(etcd_server_leader_changes_seen_total{service="etcd"}[1m])) + 0.1
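One way these queries might be collected is through the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The sketch below is a minimal example, not the job's actual collection code: the dictionary keys, the function names, and the base URL are assumptions, and a real cluster would typically also need a bearer token and TLS configuration, which are omitted here.

```python
import json
import urllib.parse
import urllib.request

# PromQL expressions copied from the description; the keys are hypothetical names.
QUERIES = {
    "apiserver_p99_latency": (
        'histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket'
        '{subresource!="log",verb!~"WATCH|WATCHLIST|PROXY"}[5m])) by(resource,le))'
    ),
    "etcd_wal_fsync_p99": (
        'histogram_quantile(0.99, irate(etcd_disk_wal_fsync_duration_seconds_bucket'
        '{job="etcd"}[5m]))'
    ),
    "etcd_leader_changes": (
        '(increase(etcd_server_leader_changes_seen_total{service="etcd"}[1m])) + 0.1'
    ),
}


def build_query_url(base_url, promql, at=None):
    """Build an instant-query URL; `at` is an optional Unix timestamp."""
    params = {"query": promql}
    if at is not None:
        params["time"] = f"{at:.3f}"
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode(params)


def parse_instant_vector(payload):
    """Turn a decoded API response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample's "value" is [timestamp, "<value as string>"].
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def run_query(base_url, promql, at=None, timeout=10.0):
    """Execute one instant query (unauthenticated; an assumption for brevity)."""
    url = build_query_url(base_url, promql, at)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

Passing the disruption start time as `at` evaluates each expression at the moment of interest instead of at "now", which keeps the metric samples aligned with the disruption interval.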
- relates to: OCPBUGS-55755 Gather CI metrics to monitor apiserver/etcd health (Closed)