OCPBUGS-8214

[Reliability] compact cluster: high etcd heartbeat failures on 2 nodes


    • Type: Bug
    • Resolution: Can't Do
    • Priority: Normal
    • Affects Version/s: 4.13
    • Component/s: Etcd
      Description of problem:

      A 4.13 compact cluster on AWS shows a high number of etcd heartbeat send failures on 2 of the 3 nodes while running the reliability-v2 workload (details in Steps to Reproduce below).

      Version-Release number of selected component (if applicable):

      4.13.0-0.nightly-2023-02-22-192922

      How reproducible:

      Seen the first time the reliability test was run on a compact cluster.

      Steps to Reproduce:

      1. Install a 4.13 compact cluster on AWS with m5.2xlarge instances and the OVN network type. (A compact cluster has three master nodes that also act as worker nodes.)
      2. Run the reliability-v2 test to continuously load the cluster with 5 developer users for 7 days. The 5 developers concurrently loop through tasks (new project, new app from OpenShift templates, curl the app URL, build, scale test pods up and down, check pods, delete project).
      https://github.com/openshift/svt/tree/master/reliability-v2
      3. Check the dittybopper etcd-cluster-info dashboard, or query the metric directly from Prometheus as sketched below.
      https://github.com/cloud-bulldozer/performance-dashboards
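      For step 3, the same signal can also be read without the dashboard by querying the in-cluster Prometheus. A minimal sketch, assuming the default openshift-monitoring stack and that a token for the prometheus-k8s service account is accepted by the thanos-querier route (both are assumptions, not part of the original test setup):

      TOKEN=$(oc -n openshift-monitoring create token prometheus-k8s)
      HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
      # Per-pod counter of heartbeats etcd failed to send to its peers
      curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
        --data-urlencode 'query=etcd_server_heartbeat_send_failures_total{namespace="openshift-etcd"}'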

      Actual results:

      1. 4k+ etcd heartbeat send failures on 2 of the nodes (10.0.182.142 and 10.0.193.40).
      Metrics:
      etcd_server_heartbeat_send_failures_total{namespace="openshift-etcd",pod=~"$pod"}
      See screenshot https://drive.google.com/file/d/1Pf2f-4qmY-tLlYaJoQCnX13dwAWkTlC6/view?usp=share_link
      
      2. The etcd_disk_wal_fsync_duration on all 3 nodes jumps up and down during the test; the maximum reaches 200 ms to 300 ms. The high/low periods follow a pattern similar to the test pod distribution across the nodes and to each node's CPU usage, disk throughput, and network utilization.
      See screenshots https://drive.google.com/file/d/10CiNTK6LpzrGMDzUx1s6yri7D3anWQ35/view?usp=share_link
      https://drive.google.com/file/d/1DtHqEnIEWPCTh7EnN9AfzE2-jEUIALEZ/view?usp=share_link
      
      Metrics:
      histogram_quantile(0.99, sum(irate(etcd_disk_wal_fsync_duration_seconds_bucket{namespace="openshift-etcd"}[2m])) by (pod, le))  
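      If further triage is needed, the standard upstream etcd metrics below (metric names are not taken from this report) can be graphed alongside the WAL fsync latency to see whether backend commits, leader changes, or failed proposals move with it:
      histogram_quantile(0.99, sum(irate(etcd_disk_backend_commit_duration_seconds_bucket{namespace="openshift-etcd"}[2m])) by (pod, le))
      sum(rate(etcd_server_leader_changes_seen_total{namespace="openshift-etcd"}[5m])) by (pod)
      sum(rate(etcd_server_proposals_failed_total{namespace="openshift-etcd"}[5m])) by (pod)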
      
      ---------
      https://docs.openshift.com/container-platform/4.12/scalability_and_performance/recommended-host-practices.html says:
      "Slow disks and disk activity from other processes can cause long fsync latencies.
      Those latencies can cause etcd to miss heartbeats, not commit new proposals to the disk on time, and ultimately experience request timeouts and temporary leader loss."
      
      I suspect the user workload generated by the reliability test is competing with etcd for I/O.
      
      However, despite the high WAL fsync latency and the heartbeat failures, there was no noticeable impact on the users' tasks in the reliability test.
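      To separate a genuinely slow disk from I/O contention with the workload, the same recommended-host-practices page describes a fio-based suitability check that reports the 99th percentile fdatasync latency of the etcd disk (the doc recommends keeping it under roughly 10 ms). A rough sketch, assuming the container image path quay.io/cloud-bulldozer/etcd-perf referenced in recent versions of that doc is still current:

      # On one master, drop into the host and run the check against the etcd data directory
      oc debug node/<master-node>
      chroot /host
      podman run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/cloud-bulldozer/etcd-perf

      Running it once while the reliability workload is active and once while the cluster is idle would show how much of the 200 ms to 300 ms spike is contention rather than raw disk capability.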

      Expected results:

      1. Why does the wal_fsync_duration on all 3 nodes jump up and down?
      2. Is the high number of etcd heartbeat failures a risk? (See the sketch after this list for reading the failures against the configured heartbeat interval and election timeout.)
      3. On a non-compact cluster with half the VM size (m5.xlarge), the reliability test can run with 15 developer users while wal_fsync_duration stays under 15 ms. We need documentation on how to plan user workloads on a compact cluster so they do not impact etcd.
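      For question 2, the missed-heartbeat count is only meaningful relative to the timings etcd is actually running with: upstream etcd defaults to a 100 ms heartbeat interval and a 1000 ms election timeout (OpenShift may tune these per platform), so 200 ms to 300 ms fsync spikes can cause individual heartbeats to be missed without coming close to triggering a leader election, which would be consistent with the lack of user-visible impact. A minimal sketch for reading the configured values, assuming the etcd pods carry the app=etcd label and that the heartbeat/election settings appear somewhere in the pod spec:

      # Dump one etcd pod spec and grep for the configured heartbeat/election timings
      POD=$(oc -n openshift-etcd get pods -l app=etcd -o jsonpath='{.items[0].metadata.name}')
      oc -n openshift-etcd get pod "$POD" -o yaml | grep -iE 'heartbeat|election'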

      Additional info:

      must-gather is uploaded here https://drive.google.com/file/d/1sQ4EUoJI8lYuda-ANXLMiK8uqvisH9aC/view?usp=share_link

      Attachments:
        1. screenshot-1.png (74 kB, Thomas Jungblut)

              People: Dean West (dwest@redhat.com), Qiujie Li (rhn-support-qili), Ge Liu