OCPBUGS-8214

[Reliability] compact cluster: high etcd heartbeat failures on 2 nodes


    • Type: Bug
    • Resolution: Can't Do
    • Priority: Normal
    • Affects Version/s: 4.13
    • Component/s: Etcd
      Description of problem:

      A 4.13 compact cluster on AWS shows a high number of etcd heartbeat send failures on 2 of the 3 nodes while running the reliability-v2 workload (details in Steps to Reproduce below).

      Version-Release number of selected component (if applicable):

      4.13.0-0.nightly-2023-02-22-192922

      How reproducible:

      Seen the first time the reliability test was run on a compact cluster.

      Steps to Reproduce:

      1. Install a 4.13 compact cluster on AWS with m5.2xlarge instances and the OVN network type. (A compact cluster has three master nodes that also act as worker nodes.)
      2. Run the reliability-v2 test to continuously load the cluster with 5 developer users for 7 days. The 5 developers concurrently loop through tasks (new project, new app from OpenShift templates, curl the app URL, build, scale test pods up and down, check pods, delete project).
      https://github.com/openshift/svt/tree/master/reliability-v2
      3. Check the dittybopper etcd-cluster-info dashboard, or query the metric directly from Prometheus as sketched below.
      https://github.com/cloud-bulldozer/performance-dashboards
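      For step 3, the same signal can also be read without the dashboard by querying the in-cluster Prometheus. A minimal sketch, assuming the default openshift-monitoring stack and that a token for the prometheus-k8s service account is accepted by the thanos-querier route (both are assumptions, not part of the original test setup):

      TOKEN=$(oc -n openshift-monitoring create token prometheus-k8s)
      HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
      # Per-pod counter of heartbeats etcd failed to send to its peers
      curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
        --data-urlencode 'query=etcd_server_heartbeat_send_failures_total{namespace="openshift-etcd"}'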

      Actual results:

      1. 4k+ etcd heartbeat send failures on 2 of the nodes (10.0.182.142 and 10.0.193.40).
      Metrics:
      etcd_server_heartbeat_send_failures_total{namespace="openshift-etcd",pod=~"$pod"}
      See screenshot https://drive.google.com/file/d/1Pf2f-4qmY-tLlYaJoQCnX13dwAWkTlC6/view?usp=share_link
      
      2. The etcd_disk_wal_fsync_duration on all 3 nodes jumps up and down during the test; the maximum reaches 200 ms to 300 ms. The high/low periods follow a pattern similar to the test pod distribution across the nodes and to each node's CPU usage, disk throughput, and network utilization.
      See screenshots https://drive.google.com/file/d/10CiNTK6LpzrGMDzUx1s6yri7D3anWQ35/view?usp=share_link
      https://drive.google.com/file/d/1DtHqEnIEWPCTh7EnN9AfzE2-jEUIALEZ/view?usp=share_link
      
      Metrics:
      histogram_quantile(0.99, sum(irate(etcd_disk_wal_fsync_duration_seconds_bucket{namespace="openshift-etcd"}[2m])) by (pod, le))  
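      If further triage is needed, the standard upstream etcd metrics below (metric names are not taken from this report) can be graphed alongside the WAL fsync latency to see whether backend commits, leader changes, or failed proposals move with it:
      histogram_quantile(0.99, sum(irate(etcd_disk_backend_commit_duration_seconds_bucket{namespace="openshift-etcd"}[2m])) by (pod, le))
      sum(rate(etcd_server_leader_changes_seen_total{namespace="openshift-etcd"}[5m])) by (pod)
      sum(rate(etcd_server_proposals_failed_total{namespace="openshift-etcd"}[5m])) by (pod)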
      
      ---------
      https://docs.openshift.com/container-platform/4.12/scalability_and_performance/recommended-host-practices.html says:
      "Slow disks and disk activity from other processes can cause long fsync latencies.
      Those latencies can cause etcd to miss heartbeats, not commit new proposals to the disk on time, and ultimately experience request timeouts and temporary leader loss."
      
      I suspect the user workload generated by the reliability test is competing with etcd for I/O.
      
      However, despite the high WAL fsync latency and the heartbeat failures, there was no noticeable impact on the users' tasks in the reliability test.
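      To separate a genuinely slow disk from I/O contention with the workload, the same recommended-host-practices page describes a fio-based suitability check that reports the 99th percentile fdatasync latency of the etcd disk (the doc recommends keeping it under roughly 10 ms). A rough sketch, assuming the container image path quay.io/cloud-bulldozer/etcd-perf referenced in recent versions of that doc is still current:

      # On one master, drop into the host and run the check against the etcd data directory
      oc debug node/<master-node>
      chroot /host
      podman run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/cloud-bulldozer/etcd-perf

      Running it once while the reliability workload is active and once while the cluster is idle would show how much of the 200 ms to 300 ms spike is contention rather than raw disk capability.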

      Expected results:

      1. Why does the wal_fsync_duration on all 3 nodes jump up and down?
      2. Is the high number of etcd heartbeat failures a risk? (See the sketch after this list for reading the failures against the configured heartbeat interval and election timeout.)
      3. On a non-compact cluster with half the VM size (m5.xlarge), the reliability test can run with 15 developer users while wal_fsync_duration stays under 15 ms. We need documentation on how to plan user workloads on a compact cluster so they do not impact etcd.
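      For question 2, the missed-heartbeat count is only meaningful relative to the timings etcd is actually running with: upstream etcd defaults to a 100 ms heartbeat interval and a 1000 ms election timeout (OpenShift may tune these per platform), so 200 ms to 300 ms fsync spikes can cause individual heartbeats to be missed without coming close to triggering a leader election, which would be consistent with the lack of user-visible impact. A minimal sketch for reading the configured values, assuming the etcd pods carry the app=etcd label and that the heartbeat/election settings appear somewhere in the pod spec:

      # Dump one etcd pod spec and grep for the configured heartbeat/election timings
      POD=$(oc -n openshift-etcd get pods -l app=etcd -o jsonpath='{.items[0].metadata.name}')
      oc -n openshift-etcd get pod "$POD" -o yaml | grep -iE 'heartbeat|election'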

      Additional info:

      must-gather is uploaded here https://drive.google.com/file/d/1sQ4EUoJI8lYuda-ANXLMiK8uqvisH9aC/view?usp=share_link

      Attachments:
        1. screenshot-1.png (74 kB, Thomas Jungblut)

              People: Dean West (dwest@redhat.com), Qiujie Li (rhn-support-qili), Ge Liu