OpenShift Bugs · OCPBUGS-19679

SDN: 4.14 after ec4 has a higher pod ready latency compared to 4.13.10 [backport 4.14]


    • Critical

      May have impact on annotation admission controller functionality

      Description of problem:

      This is to track the SDN specific issue in https://issues.redhat.com/browse/OCPBUGS-18389
      
      4.14 nightly has a higher pod ready latency compared to 4.14 ec4 and 4.13.z in node-density (lite) test

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-2023-09-11-201102

      How reproducible:

      Every time

      Steps to Reproduce:

      1. Install an SDN cluster, scale up to 24 worker nodes, install 3 infra nodes, and move the monitoring, ingress, and registry components to the infra nodes.
      2. Run the node-density (lite) test with 245 pods per node.
      3. Compare the pod ready latency to 4.13.z and 4.14 ec4.
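The node-density (lite) run in step 2 is typically driven by kube-burner. A minimal sketch of pulling the avg and P99 PodReady numbers out of its podLatency summary; the invocation flags, summary file name, and JSON shape are all assumptions for illustration:

```shell
# Assumed invocation (flags vary by kube-burner version):
#   kube-burner ocp node-density --pods-per-node=245
#
# Illustrative podLatency summary (file name and JSON shape are assumptions):
cat > /tmp/podLatency-summary.json <<'EOF'
[{"quantileName":"Ready","avg":2186,"P99":3256}]
EOF

# Extract avg and P99 PodReady latency with awk (no jq dependency):
awk -F'[:,}]' '/"quantileName":"Ready"/ {
  for (i = 1; i <= NF; i++) {
    if ($i ~ /"avg"/) avg = $(i+1)
    if ($i ~ /"P99"/) p99 = $(i+1)
  }
  printf "avg=%sms p99=%sms\n", avg, p99
}' /tmp/podLatency-summary.json
# → avg=2186ms p99=3256ms
```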

      Actual results:

      4.14 nightly has a higher pod ready latency compared to 4.14 ec4 and 4.13.10

      Expected results:

      4.14 should have pod ready latency similar to previous releases

      Additional info:

       
      | OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | 4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link |
      | 4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link |
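For scale, the regression factor between the two runs above can be computed directly from the measurements:

```shell
# Regression factor of the 2023-09-02 nightly vs 4.14.0-ec.4
# (avg 58725 vs 2186 ms, P99 294279 vs 3256 ms, from the results above):
awk 'BEGIN { printf "avg: %.1fx  p99: %.1fx\n", 58725/2186, 294279/3256 }'
# → avg: 26.9x  p99: 90.4x
```

So the nightly is roughly 27x worse on average pod ready latency and 90x worse at P99 than ec4.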

       

      With the new multus image provided by dcbw@redhat.com in https://issues.redhat.com/browse/OCPBUGS-18389, the 24-node SDN cluster's pod ready latency is similar to the run without the fix.

      % oc -n openshift-network-operator get deployment.apps/network-operator -o yaml | grep MULTUS_IMAGE -A 1
              - name: MULTUS_IMAGE
                value: quay.io/dcbw/multus-cni:informer 
       % oc get pod -n openshift-multus -o yaml | grep image: | grep multus
            image: quay.io/dcbw/multus-cni:informer
      ....
      | OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | 4.14.0-0.nightly-2023-09-11-201102 with quay.io/dcbw/multus-cni:informer | 232389 | 314 | f2c290c1-73ea-4f10-a797-3ab9d45e94b3 | aws | amd64 | SDN | 24 | 245 | 61234 | 311776 | https://drive.google.com/file/d/1o7JXJAd_V3Fzw81pTaLXQn1ms44lX6v5/view?usp=drive_link |
      | 4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link |
      | 4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link |

       

      zshi@redhat.com and pliurh requested modifying the multus-daemon-config ConfigMap by removing the readinessindicatorfile flag:

      1. scale down CNO deployment to 0
      2. edit configmap to remove 80-openshift-network.conf (sdn) or 10-ovn-kubernetes.conf (ovn-k)
      3. restart (delete) multus pod on each worker

      Steps:

      1. oc scale --replicas=0 -n openshift-network-operator deployments network-operator
      2. oc edit cm multus-daemon-config -n openshift-multus, and remove the line "readinessindicatorfile": "/host/run/multus/cni/net.d/80-openshift-network.conf",
      3. oc get po -n openshift-multus | grep multus | egrep -v "multus-additional|multus-admission" | awk '{print $1}' | xargs oc delete po -n openshift-multus
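The ConfigMap edit in step 2 amounts to deleting one line from the embedded daemon-config JSON. A sketch on a local copy; the JSON below is a trimmed stand-in for the real daemon config, which has more keys:

```shell
# Trimmed stand-in for the multus daemon config (the real ConfigMap has more keys):
cat > /tmp/daemon-config.json <<'EOF'
{
  "logLevel": "verbose",
  "readinessindicatorfile": "/host/run/multus/cni/net.d/80-openshift-network.conf",
  "socketDir": "/host/run/multus/socket"
}
EOF

# Drop the readinessindicatorfile entry; removing a middle key keeps the JSON valid:
grep -v readinessindicatorfile /tmp/daemon-config.json > /tmp/daemon-config.new.json

# Verify zero matches remain after the edit:
grep -c readinessindicatorfile /tmp/daemon-config.new.json || true
# → 0
```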

      Now the readinessindicatorfile flag is removed and all multus pods are restarted.

       

      % oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
      0  

      Test result: P99 is better compared to the run without the fix (removing readinessindicatorfile), but is still worse than ec4; the average is still bad.
       

      | OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | 4.14.0-0.nightly-2023-09-11-201102 with quay.io/dcbw/multus-cni:informer, readinessindicatorfile flag removed | 232389 | 316 | d7a754aa-4f52-49eb-80cf-907bee38a81b | aws | amd64 | SDN | 24 | 245 | 51775 | 105296 | https://drive.google.com/file/d/1h-3JeZXQRO-zsgWzen6aNDQfSDqoKAs2/view?usp=drive_link |

      zshi@redhat.com and pliurh requested setting logLevel to debug in addition to removing the readinessindicatorfile flag:

      Edit the ConfigMap to change "logLevel" from "verbose" to "debug" and restart all multus pods.
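On a local copy of the daemon config, this edit is a one-line substitution (the snippet shape is an assumption, mirroring the oc edit above):

```shell
# Flip logLevel from verbose to debug, as done via oc edit on the ConfigMap:
printf '%s\n' '"logLevel": "verbose",' |
  sed 's/"logLevel": "verbose"/"logLevel": "debug"/'
# → "logLevel": "debug",
```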

      Now the logLevel is debug and all multus pods are restarted.

      % oc get cm multus-daemon-config -n openshift-multus -o yaml | grep logLevel
              "logLevel": "debug",
      % oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
      0 
      | OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | 4.14.0-0.nightly-2023-09-11-201102 with quay.io/dcbw/multus-cni:informer, readinessindicatorfile flag removed, logLevel=debug | 232389 | 320 | 5d1d3e6a-bfa1-4a4b-bbfc-daedc5605f7d | aws | amd64 | SDN | 24 | 245 | 49586 | 105314 | https://drive.google.com/file/d/1p1PDbnqm0NlWND-komc9jbQ1PyQMeWcV/view?usp=drive_link |

       
            pliurh Peng Liu
            rhn-support-qili Qiujie Li