OpenShift Bugs / OCPBUGS-18995

SDN: 4.14 after ec4 has a higher pod ready latency compared to 4.13.10

    • Critical
    • SDN Sprint 242
    • Release Note Not Required
    • In Progress

      Description of problem:

      This is to track the SDN-specific issue in https://issues.redhat.com/browse/OCPBUGS-18389.

      The 4.14 nightly has a higher pod ready latency than 4.14 ec4 and 4.13.z in the node-density (lite) test.

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-2023-09-11-201102

      How reproducible:

      Every time

      Steps to Reproduce:

      1. Install an SDN cluster and scale up to 24 worker nodes; install 3 infra nodes and move the monitoring, ingress, and registry components to the infra nodes.
      2. Run the node-density (lite) test with 245 pods per node (see the sketch after this list).
      3. Compare the pod ready latency to 4.13.z and 4.14 ec4.
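      A minimal sketch of steps 1 and 2, assuming the documented infra nodePlacement override and the upstream kube-burner ocp wrapper; the exact scale CI job and its flags are not captured in this bug, so treat this as illustrative only:

      # Step 1 (partial): pin the default router to the infra nodes; monitoring and
      # the image registry are moved with similar nodeSelector overrides.
      % oc patch ingresscontroller/default -n openshift-ingress-operator --type=merge \
          -p '{"spec":{"nodePlacement":{"nodeSelector":{"matchLabels":{"node-role.kubernetes.io/infra":""}}}}}'

      # Step 2: drive the node-density (lite) workload at 245 pods per node
      # (flag name assumed from the public kube-burner ocp wrapper).
      % kube-burner ocp node-density --pods-per-node=245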

      Actual results:

      The 4.14 nightly has a higher pod ready latency than 4.14 ec4 and 4.13.10.

      Expected results:

      4.14 should have a similar pod ready latency to previous releases.

      Additional info:

       
      OCP Version | Flexy Id | Scale CI Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather
      4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link
      4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link

       

      With the new multus image provided by dcbw@redhat.com in https://issues.redhat.com/browse/OCPBUGS-18389, the 24-node SDN cluster's latency is similar to the run without the fix.

      % oc -n openshift-network-operator get deployment.apps/network-operator -o yaml | grep MULTUS_IMAGE -A 1
              - name: MULTUS_IMAGE
                value: quay.io/dcbw/multus-cni:informer 
       % oc get pod -n openshift-multus -o yaml | grep image: | grep multus
            image: quay.io/dcbw/multus-cni:informer
      ....
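      For completeness, a sketch of how the image override above can be applied; this assumes the MULTUS_IMAGE environment variable on the network-operator deployment (as reflected in the grep output above) is honored by CNO, and that the CVO is prevented from reverting the edit (e.g. scaled down first):

      % oc -n openshift-network-operator set env deployment/network-operator MULTUS_IMAGE=quay.io/dcbw/multus-cni:informer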
      OCP Version | Flexy Id | Scale CI Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather
      4.14.0-0.nightly-2023-09-11-201102 (quay.io/dcbw/multus-cni:informer) | 232389 | 314 | f2c290c1-73ea-4f10-a797-3ab9d45e94b3 | aws | amd64 | SDN | 24 | 245 | 61234 | 311776 | https://drive.google.com/file/d/1o7JXJAd_V3Fzw81pTaLXQn1ms44lX6v5/view?usp=drive_link
      4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link
      4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link

       

      zshi@redhat.com / pliurh requested modifying the multus-daemon-config ConfigMap by removing the readinessindicatorfile flag:

      1. Scale down the CNO deployment to 0.
      2. Edit the configmap to remove 80-openshift-network.conf (sdn) or 10-ovn-kubernetes.conf (ovn-k).
      3. Restart (delete) the multus pod on each worker.

      Steps:

      1. oc scale --replicas=0 -n openshift-network-operator deployments network-operator
      2. oc edit cm multus-daemon-config -n openshift-multus, and remove the line "readinessindicatorfile": "/host/run/multus/cni/net.d/80-openshift-network.conf", (a non-interactive alternative is sketched below)
      3. oc get po -n openshift-multus | grep multus | egrep -v "multus-additional|multus-admission" | awk '{print $1}' | xargs oc delete po -n openshift-multus
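      Step 2 above uses an interactive oc edit; a non-interactive equivalent is sketched below. It assumes the daemon config is stored under the daemon-config.json key of the ConfigMap with readinessindicatorfile as a top-level field (both inferred from the line quoted in step 2, not verified here):

      % oc -n openshift-multus get cm multus-daemon-config -o json \
          | jq '.data["daemon-config.json"] |= (fromjson | del(.readinessindicatorfile) | tojson)' \
          | oc replace -f -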

      Now the readinessindicatorfile flag is removed and all multus pods are restarted.

       

      % oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
      0  

      Test Result: P99 is better compared to without the fix (readinessindicatorfile removed), but it is still worse than ec4, and avg is still bad.
       

      OCP Version | Flexy Id | Scale CI Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather
      4.14.0-0.nightly-2023-09-11-201102 (quay.io/dcbw/multus-cni:informer, readinessindicatorfile flag removed) | 232389 | 316 | d7a754aa-4f52-49eb-80cf-907bee38a81b | aws | amd64 | SDN | 24 | 245 | 51775 | 105296 | https://drive.google.com/file/d/1h-3JeZXQRO-zsgWzen6aNDQfSDqoKAs2/view?usp=drive_link

      zshi@redhat.com / pliurh requested setting logLevel to debug in addition to removing the readinessindicatorfile flag.

      Edit the cm to set "logLevel": "verbose" -> "debug" and restart all multus pods (sketched below).
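      The same non-interactive pattern can flip the log level, again assuming the daemon-config.json key:

      % oc -n openshift-multus get cm multus-daemon-config -o json \
          | jq '.data["daemon-config.json"] |= (fromjson | .logLevel = "debug" | tojson)' \
          | oc replace -f -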

      Now the logLevel is debug and all multus pods are restarted.

      % oc get cm multus-daemon-config -n openshift-multus -o yaml | grep logLevel
              "logLevel": "debug",
      % oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
      0 
      OCP Version | Flexy Id | Scale CI Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather
      4.14.0-0.nightly-2023-09-11-201102 (quay.io/dcbw/multus-cni:informer, readinessindicatorfile flag removed, logLevel=debug) | 232389 | 320 | 5d1d3e6a-bfa1-4a4b-bbfc-daedc5605f7d | aws | amd64 | SDN | 24 | 245 | 49586 | 105314 | https://drive.google.com/file/d/1p1PDbnqm0NlWND-komc9jbQ1PyQMeWcV/view?usp=drive_link

       