OCPBUGS-15846: [OVN-IC] regression in services creation at scale compared to IC v3 image


      This is an OVN-IC 120-node environment running 4.14.0-0.nightly-2023-06-30-131338.
      The node-density-cni test is run on this 120-node environment with 80 pods per node. It tries to create 4015 deployments (each with 2 pods) and 4015 services. However, service creation started failing after 1464 services (i.e. Service/webserver-1-1464):
      [2023-07-02T15:13:26.029+0000] {subprocess.py:93} INFO - time="2023-07-02 15:13:26" level=error msg="Error creating object Service/webserver-1-1464 in namespace 42242802-node-density-cni-20230702: Post \"https://api.venkataanil-ovn-ic-4.14-aws-ovn-medium-cp.perfscale.devcluster.openshift.com:6443/api/v1/namespaces/42242802-node-density-cni-20230702/services?timeout=15s\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

      Some services between 1464 and 4015 were still created intermittently. For example, Service/webserver-1-3645 was created successfully, while its neighbours 3644 and 3648 were not:
      [2023-07-02T15:20:59.328+0000] {subprocess.py:93} INFO - time="2023-07-02 15:20:59" level=error msg="Error creating object Service/webserver-1-3644 in namespace 42242802-node-density-cni-20230702: Post \"https://api.venkataanil-ovn-ic-4.14-aws-ovn-medium-cp.perfscale.devcluster.openshift.com:6443/api/v1/namespaces/42242802-node-density-cni-20230702/services?timeout=15s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
      [2023-07-02T15:21:00.178+0000] {subprocess.py:93} INFO - time="2023-07-02 15:21:00" level=error msg="Error creating object Service/webserver-1-3648 in namespace 42242802-node-density-cni-20230702: Post \"https://api.venkataanil-ovn-ic-4.14-aws-ovn-medium-cp.perfscale.devcluster.openshift.com:6443/api/v1/namespaces/42242802-node-density-cni-20230702/services?timeout=15s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
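
      As context for the raw errors above, the same failure mode can be reproduced outside of kube-burner with a small client loop like the hypothetical sketch below: it creates Services one at a time with the same 15s client-side timeout and reports which requests exceed it. The namespace name, service count, and service template are illustrative assumptions, not the exact kube-burner objects.

      # Hypothetical standalone reproducer for the service-creation loop: create many
      # Services in one namespace with a 15s client timeout and log which calls time out.
      from kubernetes import client, config

      def create_services(namespace: str, count: int, timeout_s: int = 15) -> None:
          config.load_kube_config()  # or config.load_incluster_config() inside a pod
          v1 = client.CoreV1Api()
          for i in range(1, count + 1):
              svc = client.V1Service(
                  metadata=client.V1ObjectMeta(name=f"webserver-1-{i}"),
                  spec=client.V1ServiceSpec(
                      selector={"app": f"webserver-1-{i}"},
                      ports=[client.V1ServicePort(port=8080, target_port=8080)],
                  ),
              )
              try:
                  v1.create_namespaced_service(namespace, svc, _request_timeout=timeout_s)
              except Exception as exc:  # ApiException or a client-side timeout
                  print(f"Error creating object Service/webserver-1-{i}: {exc}")

      if __name__ == "__main__":
          create_services("node-density-cni-repro", 4015)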

      We observed a dip in CPU idle time (reaching 0.06%) and high user CPU (726% out of 800%) on the workers during this period. Grafana dashboards:
      https://grafana.rdu2.scalelab.redhat.com:3000/dashboard/snapshot/8s5LxbA0r3tO3TAqnaRGN5IqTYL3rFQ9
      https://grafana.rdu2.scalelab.redhat.com:3000/d/FwPsenaaa/kube-burner-report-icv3?orgId=1&from=1688310000000&to=1688324399000&var-Datasource=AWS+Pro+-+ripsaw-kube-burner&var-platform=&var-platform=AWS&var-sdn=&var-sdn=OVNKubernetes&var-workload=node-density-cni&var-worker_nodes=120&var-uuid=42242802-node-density-cni-20230702&var-master=ip-10-0-150-139.us-west-2.compute.internal&var-worker=ip-10-0-128-149.us-west-2.compute.internal&var-infra=ip-10-0-129-136.us-west-2.compute.internal&var-namespace=All&var-latencyPercentile=P99
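
      The same CPU numbers can be pulled straight from Prometheus instead of the Grafana snapshot with a hypothetical helper like the one below. The Prometheus URL and bearer token are assumptions (e.g. the cluster monitoring route and a token from `oc whoami -t`); the queries use the standard node-exporter metric node_cpu_seconds_total, where the 726% out of 800% figure reads as user-mode CPU summed across the workers' vCPUs.

      # Hypothetical Prometheus queries to reproduce the worker CPU numbers above.
      # PROM_URL and TOKEN are assumptions; the metrics are standard node-exporter ones.
      import requests

      PROM_URL = "https://prometheus.example.com"
      TOKEN = "sha256~..."

      QUERIES = {
          # per-node idle CPU percentage; the affected workers dip to ~0.06%
          "idle_pct": 'avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100',
          # per-node user-mode CPU, in "percent of one core" summed over cores (726 of 800 here)
          "user_pct": 'sum by (instance) (rate(node_cpu_seconds_total{mode="user"}[5m])) * 100',
      }

      def run_query(expr: str) -> list:
          resp = requests.get(
              f"{PROM_URL}/api/v1/query",
              params={"query": expr},
              headers={"Authorization": f"Bearer {TOKEN}"},
              verify=False,  # lab clusters often use self-signed certs
          )
          resp.raise_for_status()
          return resp.json()["data"]["result"]

      for name, expr in QUERIES.items():
          for sample in run_query(expr):
              print(name, sample["metric"].get("instance"), sample["value"][1])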

      must-gather for the above run: http://ec2-54-212-114-216.us-west-2.compute.amazonaws.com:7070/index/venkataanil/4.14-aws-ovn-medium-cp/manual__2023-07-02T10:03:38.526493%2B00:00-AWS-4.14.0-ovnkubernetes/must_gather/2023-07-02_06:30_PM/must-gather-2023-07-02_06-17_PM.tar.xz

      Similar behaviour was observed in another test run, where service creation started failing after 1868 services (i.e. Service/webserver-1-1868):
      [2023-07-02, 12:28:21 UTC] {subprocess.py:93} INFO - time="2023-07-02 12:28:21" level=error msg="Error creating object Service/webserver-1-1868 in namespace e2a73d17-node-density-cni-20230702: Post \"https://api.venkataanil-ovn-ic-cni-4.14-aws-ovn-medium-cp.perfscale.devcluster.openshift.com:6443/api/v1/namespaces/e2a73d17-node-density-cni-20230702/services?timeout=15s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
      [2023-07-02, 12:28:21 UTC] {subprocess.py:93} INFO - time="2023-07-02 12:28:21" level=error msg="Retrying object creation"

      must-gather for the second run: http://ec2-54-212-114-216.us-west-2.compute.amazonaws.com:7070/index/venkataanil/4.14-aws-ovn-medium-cp/manual__2023-07-02T11:20:21.494581%2B00:00-AWS-4.14.0-ovnkubernetes/must_gather/2023-07-02_01:57_PM/must-gather-2023-07-02_01-39_PM.tar.xz

      So this test (creating services) is consistently failing at scale.

      This is a regression with OVN-IC, as this issue is not seen on:

      1. an OVN legacy environment created with the same nightly image
      2. an OVN-IC environment created with the OVN v3 image (quay.io/itssurya/dev-images:ic-scale-v3); a quick way to confirm which image a cluster is actually running is sketched after this list
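
      As referenced in item 2, here is a minimal sketch for checking which ovn-kubernetes image a cluster is running, so the IC nightly, the legacy build, and the ic-scale-v3 dev image can be told apart. It assumes the usual OpenShift layout of an ovnkube-node DaemonSet in the openshift-ovn-kubernetes namespace.

      # Minimal sketch: print the images of the ovnkube-node DaemonSet containers.
      # The namespace/DaemonSet names follow the usual OpenShift OVN-Kubernetes layout
      # and are assumptions for this illustration.
      from kubernetes import client, config

      config.load_kube_config()
      apps = client.AppsV1Api()

      ds = apps.read_namespaced_daemon_set("ovnkube-node", "openshift-ovn-kubernetes")
      for container in ds.spec.template.spec.containers:
          print(container.name, container.image)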

      must-gather with the OVN v3 image (which had a successful run): http://storage.scalelab.redhat.com/anilvenkata/must-gather-icv3-cni.tar.xz

      and Grafana dashboard: https://grafana.rdu2.scalelab.redhat.com:3000/d/FwPsenaaa/kube-burner-report-icv3?orgId=1&from=1688651100000&to=1688652959000
