Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-38751

openshift-marketplace pods affect etcd sync latencies

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Normal Normal
    • None
    • 4.16
    • OLM
    • None
    • Rejected
    • False
    • Hide

      None

      Show
      None

      Description of problem:

      On a freshly installed cluster, etcd disk sync latencies (backend and WAL sync durations) sit higher than the SLAs (10ms) and leave one etcd member considerably higher than the other two. (example: 23ms, 6ms, 6ms)
      Looking further into the metrics, each time one of the default four CatalogSources spawn a pod, this pod writes ~1GB worth of data to disk, increments writes_merge by ~5k, and spikes both WAL and backend sync durations for etcd.
      
      The magnitude of this spike is higher than we are even able to reproduce with `fio`.
      (At least, as of today.)
      

      Version-Release number of selected component (if applicable):

      4.16.0-0.nightly-2024-08-19-053157
      4.15.0-0.nightly-2024-08-19-025745 

      How reproducible:

      100%
      

      Steps to Reproduce:

          1. Install a cluster
          2. Plot the following queries:
      changes in etcd DB sync latency: histogram_quantile(0.99, sum(irate(etcd_disk_backend_commit_duration_seconds_bucket{namespace="openshift-etcd"}[2m])) by (pod, le))
      changes in etcd WAL sync latency: histogram_quantile(0.99, sum(irate(etcd_disk_wal_fsync_duration_seconds_bucket{namespace="openshift-etcd"}[2m])) by (pod, le))
      namespace cpu usage: topk( 10 , sum( irate(container_cpu_usage_seconds_total{container!~"POD|",namespace=~"openshift-.*"}[$interval])*100 ) by (namespace) )
      

      Actual results:

      See spikes in CPU from openshift-marketplace namespace that align with spikes in fsync latencies.
      

      Expected results:

      Don't see any spikes in etcd latency during stable operation.
      

      Additional info:

      I have must-gather and prometheus dumps to share links in the comments.
      
      screenshots:
      Screenshot 2024-08-16 at 17.24.57_etcd_backend-sync_linear copy.png
      Screenshot 2024-08-16 at 17.25.06_namespace-cpu_full copy.png
      
      See this doc for more information: https://docs.google.com/document/d/1xV5g_W2rpyKoVrzn4DdcdrJtvCmr1fGNEsvtZvaL-Jk/edit
      

              lmohanty@redhat.com Lalatendu Mohanty
              ancollin@redhat.com Andrew Collins
              Jian Zhang Jian Zhang
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

                Created:
                Updated: