-
Bug
-
Resolution: Done-Errata
-
Critical
-
4.14, 4.15
-
Critical
-
No
-
Approved
-
True
-
Description of problem:
This tracks the SDN-specific part of https://issues.redhat.com/browse/OCPBUGS-18389: 4.14 nightly has a higher pod ready latency than 4.14 ec4 and 4.13.z in the node-density (lite) test.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-11-201102
How reproducible:
Every time
Steps to Reproduce:
1. Install an SDN cluster, scale up to 24 worker nodes, install 3 infra nodes, and move the monitoring, ingress, and registry components to the infra nodes.
2. Run the node-density (lite) test with 245 pods per node (a hedged example invocation is sketched below).
3. Compare the pod ready latency to 4.13.z and 4.14 ec4.
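A minimal sketch of step 2, assuming the node-density (lite) run is driven by the kube-burner ocp wrapper; the wrapper and flag names are assumptions here, not taken from the scale-ci job itself.
# run the node-density workload against the cluster in the current KUBECONFIG context
% kube-burner ocp node-density --pods-per-node=245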
Actual results:
4.14 nightly has a higher pod ready latency compared to 4.14 ec4 and 4.13.10
Expected results:
4.14 should have a pod ready latency similar to previous releases
Additional info:
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link |
4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link |
With the new multus image provided by dcbw@redhat.com in https://issues.redhat.com/browse/OCPBUGS-18389, the 24-node SDN cluster's latency is similar to that without the fix.
% oc -n openshift-network-operator get deployment.apps/network-operator -o yaml | grep MULTUS_IMAGE -A 1
        - name: MULTUS_IMAGE
          value: quay.io/dcbw/multus-cni:informer
% oc get pod -n openshift-multus -o yaml | grep image: | grep multus
      image: quay.io/dcbw/multus-cni:informer
      ....
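For completeness, a hedged sketch of how the test image could be swapped in; the ticket does not record the exact method, so the oc set env call below is an assumption, and it presumes the cluster-version operator is not reverting changes to the network-operator deployment.
# point the network operator at the informer test image (assumed method, not confirmed by the ticket)
% oc -n openshift-network-operator set env deployment/network-operator MULTUS_IMAGE=quay.io/dcbw/multus-cni:informer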
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
4.14.0-0.nightly-2023-09-11-201102 quay.io/dcbw/multus-cni:informer | 232389 | 314 | f2c290c1-73ea-4f10-a797-3ab9d45e94b3 | aws | amd64 | SDN | 24 | 245 | 61234 | 311776 | https://drive.google.com/file/d/1o7JXJAd_V3Fzw81pTaLXQn1ms44lX6v5/view?usp=drive_link |
4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link |
4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link |
zshi@redhat.com and pliurh requested modifying the multus-daemon-config ConfigMap by removing the readinessindicatorfile flag:
- scale down CNO deployment to 0
- edit configmap to remove 80-openshift-network.conf (sdn) or 10-ovn-kubernetes.conf (ovn-k)
- restart (delete) multus pod on each worker
Steps:
- oc scale --replicas=0 -n openshift-network-operator deployments network-operator
- oc edit cm multus-daemon-config -n openshift-multus, and remove the line "readinessindicatorfile": "/host/run/multus/cni/net.d/80-openshift-network.conf", (see the ConfigMap fragment sketched after these steps)
- oc get po -n openshift-multus | grep multus | egrep -v "multus-additional|multus-admission" | awk '{print $1}' | xargs oc delete po -n openshift-multus
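For orientation, a minimal sketch of the relevant part of the daemon-config.json key inside the multus-daemon-config ConfigMap; only the logLevel and readinessindicatorfile entries come from this ticket, the surrounding keys are elided.
{
  ...
  "logLevel": "verbose",
  "readinessindicatorfile": "/host/run/multus/cni/net.d/80-openshift-network.conf",
  ...
}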
Now the readinessindicatorfile flag is removed and all multus pods are restarted:
% oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
0
Test result: P99 is better than without the fix (removing the readinessindicatorfile flag) but still worse than ec4; the average is still bad.
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
4.14.0-0.nightly-2023-09-11-201102 quay.io/dcbw/multus-cni:informer and remove readinessindicatorfile flag | 232389 | 316 | d7a754aa-4f52-49eb-80cf-907bee38a81b | aws | amd64 | SDN | 24 | 245 | 51775 | 105296 | https://drive.google.com/file/d/1h-3JeZXQRO-zsgWzen6aNDQfSDqoKAs2/view?usp=drive_link |
zshi@redhat.com and pliurh requested setting logLevel to debug in addition to removing the readinessindicatorfile flag:
Edit the ConfigMap to change "logLevel": "verbose" to "logLevel": "debug" and restart all multus pods (the edited fragment is sketched below).
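Continuing the daemon-config.json fragment sketched above, after the edit the line would read (other keys unchanged and elided):
{
  ...
  "logLevel": "debug",
  ...
}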
Now the logLevel is debug and all multus pods are restarted:
% oc get cm multus-daemon-config -n openshift-multus -o yaml | grep logLevel
    "logLevel": "debug",
% oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
0
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
4.14.0-0.nightly-2023-09-11-201102 quay.io/dcbw/multus-cni:informer and remove readinessindicatorfile flag and logLevel=debug | 232389 | 320 | 5d1d3e6a-bfa1-4a4b-bbfc-daedc5605f7d | aws | amd64 | SDN | 24 | 245 | 49586 | 105314 | https://drive.google.com/file/d/1p1PDbnqm0NlWND-komc9jbQ1PyQMeWcV/view?usp=drive_link |
- clones
-
OCPBUGS-18995 SDN: 4.14 after ec4 has a higher pod ready latency compared to 4.13.10
- Closed
- depends on
-
OCPBUGS-18995 SDN: 4.14 after ec4 has a higher pod ready latency compared to 4.13.10
- Closed
- duplicates
-
OCPBUGS-19642 SDN: 4.14 after ec4 has a higher pod ready latency compared to 4.13.10
- Closed
- links to
-
RHSA-2023:5006 OpenShift Container Platform 4.14.z security update