OpenShift Bugs · OCPBUGS-19679

SDN: 4.14 after ec4 has a higher pod ready latency compared to 4.13.10 [backport 4.14]


    • Critical

      May have impact on annotation admission controller functionality

      Description of problem:

      This is to track the SDN specific issue in https://issues.redhat.com/browse/OCPBUGS-18389
      
      4.14 nightly has a higher pod ready latency compared to 4.14 ec4 and 4.13.z in node-density (lite) test

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-2023-09-11-201102

      How reproducible:

      Every time

      Steps to Reproduce:

      1. Install an SDN cluster, scale up to 24 worker nodes, install 3 infra nodes, and move the monitoring, ingress, and registry components to the infra nodes.
      2. Run the node-density (lite) test with 245 pods per node.
      3. Compare the pod ready latency to 4.13.z and 4.14 ec4.
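The node-density (lite) run in step 2 is typically driven by kube-burner. A minimal sketch of pulling the avg and P99 PodReady numbers out of its podLatency summary; the invocation flags, summary file name, and JSON shape are all assumptions for illustration:

```shell
# Assumed invocation (flags vary by kube-burner version):
#   kube-burner ocp node-density --pods-per-node=245
#
# Illustrative podLatency summary (file name and JSON shape are assumptions):
cat > /tmp/podLatency-summary.json <<'EOF'
[{"quantileName":"Ready","avg":2186,"P99":3256}]
EOF

# Extract avg and P99 PodReady latency with awk (no jq dependency):
awk -F'[:,}]' '/"quantileName":"Ready"/ {
  for (i = 1; i <= NF; i++) {
    if ($i ~ /"avg"/) avg = $(i+1)
    if ($i ~ /"P99"/) p99 = $(i+1)
  }
  printf "avg=%sms p99=%sms\n", avg, p99
}' /tmp/podLatency-summary.json
# → avg=2186ms p99=3256ms
```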

      Actual results:

      4.14 nightly has a higher pod ready latency compared to 4.14 ec4 and 4.13.10

      Expected results:

      4.14 should have pod ready latency similar to previous releases

      Additional info:

       
      | OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | 4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link |
      | 4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link |
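For scale, the regression factor between the two runs above can be computed directly from the measurements:

```shell
# Regression factor of the 2023-09-02 nightly vs 4.14.0-ec.4
# (avg 58725 vs 2186 ms, P99 294279 vs 3256 ms, from the results above):
awk 'BEGIN { printf "avg: %.1fx  p99: %.1fx\n", 58725/2186, 294279/3256 }'
# → avg: 26.9x  p99: 90.4x
```

So the nightly is roughly 27x worse on average pod ready latency and 90x worse at P99 than ec4.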

       

      With the new multus image provided by dcbw@redhat.com in https://issues.redhat.com/browse/OCPBUGS-18389, the 24-node SDN cluster's pod ready latency is similar to the run without the fix.

      % oc -n openshift-network-operator get deployment.apps/network-operator -o yaml | grep MULTUS_IMAGE -A 1
              - name: MULTUS_IMAGE
                value: quay.io/dcbw/multus-cni:informer 
       % oc get pod -n openshift-multus -o yaml | grep image: | grep multus
            image: quay.io/dcbw/multus-cni:informer
      ....
      | OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | 4.14.0-0.nightly-2023-09-11-201102 with quay.io/dcbw/multus-cni:informer | 232389 | 314 | f2c290c1-73ea-4f10-a797-3ab9d45e94b3 | aws | amd64 | SDN | 24 | 245 | 61234 | 311776 | https://drive.google.com/file/d/1o7JXJAd_V3Fzw81pTaLXQn1ms44lX6v5/view?usp=drive_link |
      | 4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link |
      | 4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link |

       

      zshi@redhat.com and pliurh requested modifying the multus-daemon-config ConfigMap by removing the readinessindicatorfile flag:

      1. scale down CNO deployment to 0
      2. edit configmap to remove 80-openshift-network.conf (sdn) or 10-ovn-kubernetes.conf (ovn-k)
      3. restart (delete) multus pod on each worker

      Steps:

      1. oc scale --replicas=0 -n openshift-network-operator deployments network-operator
      2. oc edit cm multus-daemon-config -n openshift-multus, and remove the line "readinessindicatorfile": "/host/run/multus/cni/net.d/80-openshift-network.conf",
      3. oc get po -n openshift-multus | grep multus | egrep -v "multus-additional|multus-admission" | awk '{print $1}' | xargs oc delete po -n openshift-multus
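The ConfigMap edit in step 2 amounts to deleting one line from the embedded daemon-config JSON. A sketch on a local copy; the JSON below is a trimmed stand-in for the real daemon config, which has more keys:

```shell
# Trimmed stand-in for the multus daemon config (the real ConfigMap has more keys):
cat > /tmp/daemon-config.json <<'EOF'
{
  "logLevel": "verbose",
  "readinessindicatorfile": "/host/run/multus/cni/net.d/80-openshift-network.conf",
  "socketDir": "/host/run/multus/socket"
}
EOF

# Drop the readinessindicatorfile entry; removing a middle key keeps the JSON valid:
grep -v readinessindicatorfile /tmp/daemon-config.json > /tmp/daemon-config.new.json

# Verify zero matches remain after the edit:
grep -c readinessindicatorfile /tmp/daemon-config.new.json || true
# → 0
```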

      Now the readinessindicatorfile flag is removed and all multus pods are restarted.

       

      % oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
      0  

      Test result: P99 is better compared to the run without the fix (removing readinessindicatorfile), but is still worse than ec4; the average is still bad.
       

      | OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | 4.14.0-0.nightly-2023-09-11-201102 with quay.io/dcbw/multus-cni:informer, readinessindicatorfile flag removed | 232389 | 316 | d7a754aa-4f52-49eb-80cf-907bee38a81b | aws | amd64 | SDN | 24 | 245 | 51775 | 105296 | https://drive.google.com/file/d/1h-3JeZXQRO-zsgWzen6aNDQfSDqoKAs2/view?usp=drive_link |

      zshi@redhat.com and pliurh requested setting logLevel to debug in addition to removing the readinessindicatorfile flag:

      Edit the ConfigMap to change "logLevel" from "verbose" to "debug" and restart all multus pods.
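On a local copy of the daemon config, this edit is a one-line substitution (the snippet shape is an assumption, mirroring the oc edit above):

```shell
# Flip logLevel from verbose to debug, as done via oc edit on the ConfigMap:
printf '%s\n' '"logLevel": "verbose",' |
  sed 's/"logLevel": "verbose"/"logLevel": "debug"/'
# → "logLevel": "debug",
```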

      Now the logLevel is debug and all multus pods are restarted.

      % oc get cm multus-daemon-config -n openshift-multus -o yaml | grep logLevel
              "logLevel": "debug",
      % oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
      0 
      | OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | 4.14.0-0.nightly-2023-09-11-201102 with quay.io/dcbw/multus-cni:informer, readinessindicatorfile flag removed, logLevel=debug | 232389 | 320 | 5d1d3e6a-bfa1-4a4b-bbfc-daedc5605f7d | aws | amd64 | SDN | 24 | 245 | 49586 | 105314 | https://drive.google.com/file/d/1p1PDbnqm0NlWND-komc9jbQ1PyQMeWcV/view?usp=drive_link |

       
            pliurh Peng Liu
            rhn-support-qili Qiujie Li