  1. OCP Technical Release Team
  2. TRT-2031

Enhance Disruption Associated Event & Resource Tracking Support


    • Type: Epic
    • Resolution: Unresolved
    • Priority: Undefined
    • Disruption Correlated Event & Resource Monitoring
    • Future Sustainability
    • 0% To Do, 0% In Progress, 100% Done

      When we see disruption, other events in the system often occur around the same time, and tracking them can get us closer to the root cause.

       

      etcd

      • compaction
      • disk write latency
      • network degradation
      • leader elections

      API server

      • increased latency
      • request timeouts / failures
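
One way to picture the correlation described above is to take a disruption interval and select every recorded event that falls inside it, plus a little padding on either side. The sketch below assumes events have already been collected as timestamped records. The `Event` type, the `events_near` helper, the source and kind labels, and the 30-second padding are all hypothetical names and values chosen for illustration, not part of any existing tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """A timestamped event from some component (hypothetical shape)."""
    source: str   # e.g. "etcd" or "apiserver"
    kind: str     # e.g. "leader-election", "compaction", "request-timeout"
    at: datetime


def events_near(disruption_start, disruption_end, events,
                padding=timedelta(seconds=30)):
    """Return events inside the disruption interval widened by `padding`,
    sorted by time. The padding value is an assumption, not a tuned number."""
    lo = disruption_start - padding
    hi = disruption_end + padding
    return sorted((e for e in events if lo <= e.at <= hi), key=lambda e: e.at)
```

Events that land just before the start of the interval are usually the most interesting ones to inspect first, which is why the window is padded rather than exact.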

      Example Prometheus queries for collecting these metrics:

      histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{subresource!="log",verb!~"WATCH|WATCHLIST|PROXY"}[5m])) by(resource,le))
      
      histogram_quantile(0.99, irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m]))
      
      
      (increase(etcd_server_leader_changes_seen_total{service="etcd"}[1m])) + 0.1
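
To line these queries up with a disruption window, they could be run through the Prometheus HTTP API's `/api/v1/query_range` endpoint and the returned samples scanned for values above a threshold. The following is a minimal sketch under that assumption. The helper names `query_range_url` and `samples_above`, the default step, and any threshold passed in are hypothetical, and the code only builds the URL and parses a response that has already been decoded from JSON, so authentication and the actual HTTP call are left out.

```python
from urllib.parse import urlencode


def query_range_url(base_url, query, start, end, step="30s"):
    """Build a query_range URL. `start` and `end` are Unix timestamps
    (or RFC 3339 strings); `step` is the query resolution."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def samples_above(response, threshold):
    """Return (labels, timestamp, value) for each sample above `threshold`
    in a decoded query_range response (result type "matrix")."""
    hits = []
    for series in response["data"]["result"]:
        for ts, raw in series["values"]:
            value = float(raw)  # sample values arrive as strings, e.g. "0.25" or "NaN"
            if value > threshold:  # NaN never compares greater, so it is skipped
                hits.append((series["metric"], float(ts), value))
    return hits
```

Setting `start` and `end` to the bounds of an observed disruption and feeding each of the queries above through `query_range_url` would give one time series per query to compare against the disruption interval.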
      

              Assignee: Unassigned
              Reporter: Forrest Babcock (rh-ee-fbabcock)