Type: Epic
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Labels:
None

Epic Name:
Disruption Correlated Event & Resource Monitoring
Epic Status:
Done
Activity Type:
Future Sustainability
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Blocked:
False
Blocked Reason:

None
Ready:
False
Color Status:
Not Selected
Size:
None

Target Version:
None
Release Blocker:
None

When we see disruption, other events occurring in the system at the same time often bring us close to the root cause. Events and resource metrics worth correlating with disruption include:

etcd

compaction
disk write latency
network degradation
leader elections

API server

increased latency
request timeouts / failures
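
One way to make the correlation concrete is to check which of the events listed above fall inside, or shortly before, a disruption interval. The following is a minimal sketch only: the `Interval` and `Event` types, the `correlated_events` helper, the 30-second lookback, and the sample data are all assumptions made for illustration and do not describe any existing tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Interval:
    """A disruption window (hypothetical type, for illustration only)."""
    start: datetime
    end: datetime


@dataclass(frozen=True)
class Event:
    """A point-in-time system event, e.g. an etcd leader election (hypothetical)."""
    source: str
    kind: str
    at: datetime


def correlated_events(disruption, events, lookback=timedelta(seconds=30)):
    """Return events that occurred during the disruption or up to `lookback` before it."""
    window_start = disruption.start - lookback
    return [e for e in events if window_start <= e.at <= disruption.end]


if __name__ == "__main__":
    # Made-up sample data; real events would come from cluster monitoring.
    t0 = datetime(2025, 1, 1, 0, 0, 0)
    disruption = Interval(start=t0, end=t0 + timedelta(seconds=20))
    events = [
        Event("etcd", "leader election", t0 - timedelta(seconds=5)),
        Event("etcd", "compaction", t0 - timedelta(minutes=10)),
        Event("apiserver", "request timeout", t0 + timedelta(seconds=3)),
    ]
    for e in correlated_events(disruption, events):
        print(e.source, e.kind, e.at.isoformat())
```

The lookback is included because a cause such as a leader election may begin slightly before the disruption it triggers; the right value would need to be tuned against real data.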

Example Prometheus queries for collecting these metrics:

histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{subresource!="log",verb!~"WATCH|WATCHLIST|PROXY"}[5m])) by(resource,le))

histogram_quantile(0.99, irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m]))


(increase(etcd_server_leader_changes_seen_total{service="etcd"}[1m])) + 0.1
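
To gather these series for a given time window, the queries above could be sent to the Prometheus HTTP API's `/api/v1/query_range` endpoint. The sketch below only builds the request URLs and performs no network calls. The dictionary labels, the `query_range_url` helper, the base URL, and the 30-second step are assumptions for illustration; authentication (for example a bearer token) is not handled.

```python
from urllib.parse import urlencode

# The three PromQL expressions from this issue. The dictionary keys are
# illustrative labels, not names used by any existing tooling.
QUERIES = {
    "apiserver_p99_latency": (
        "histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket"
        '{subresource!="log",verb!~"WATCH|WATCHLIST|PROXY"}[5m])) by(resource,le))'
    ),
    "etcd_wal_fsync_p99": (
        "histogram_quantile(0.99, irate(etcd_disk_wal_fsync_duration_seconds_bucket"
        '{job="etcd"}[5m]))'
    ),
    "etcd_leader_changes": (
        '(increase(etcd_server_leader_changes_seen_total{service="etcd"}[1m])) + 0.1'
    ),
}


def query_range_url(base_url, query, start, end, step="30s"):
    """Build a /api/v1/query_range URL; start and end are Unix timestamps."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Assumed local Prometheus address and an arbitrary one-hour window.
    base = "http://localhost:9090"
    start, end = 1735689600, 1735693200
    for label, promql in QUERIES.items():
        print(label, query_range_url(base, promql, start, end))
```

Using `query_range` rather than an instant query returns one sample per step across the window, which makes it possible to line the results up against disruption intervals afterwards.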

relates to

OCPBUGS-55755 Gather CI metrics to monitor apiserver/etcd health

Closed

Assignee: Unassigned

Reporter: Forrest Babcock

Need Info From: None

Contributors: None

QA Contact: None

Doc Contact: None

Votes: 0

Watchers: 2

Created: 2025/03/10 12:05 AM

Updated: 2025/08/27 4:12 PM