OpenShift Bugs / OCPBUGS-32025

ovnkube_clustermanager_egress_ips_rebalance_total metric not incrementing correctly


      Description of problem:

      During egress IP (EIP) failover scale testing, with 24000 EIPs in the environment and 200 EIPs on each node, we observed an incorrect increment of ovnkube_clustermanager_egress_ips_rebalance_total. The metric increased by only 64 even though the restarted node was hosting 200 EIPs.
      
      The test creates 24000 namespaces. In each namespace it creates 1 EIP object with 1 EIP address and 1 Deployment with 1 pod replica, so in total it creates 24000 namespaces, 24000 EIPs, and 24000 pods. We observed that each node has 200 EIPs and 200 pods:
      [root@f20-h01-000-r640 kube-burner-ocp]# oc get egressip -oyaml | grep node:  | sort | uniq -c
          200       node: f20-h11-000-r640.rdu2.scalelab.redhat.com
          200       node: f20-h14-000-r640.rdu2.scalelab.redhat.com
          200       node: f20-h17-000-r640.rdu2.scalelab.redhat.com
          200       node: f20-h18-000-r640.rdu2.scalelab.redhat.com
          200       node: f20-h19-000-r640.rdu2.scalelab.redhat.com
          200       node: f20-h21-000-r640.rdu2.scalelab.redhat.com
          200       node: f20-h22-000-r640.rdu2.scalelab.redhat.com
          200       node: f20-h25-000-r640.rdu2.scalelab.redhat.com
          200       node: f20-h27-000-r640.rdu2.scalelab.redhat.com
          200       node: f20-h29-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h01-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h03-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h05-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h06-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h10-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h11-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h13-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h14-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h15-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h17-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h18-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h19-000-r640.rdu2.scalelab.redhat.com
          200       node: f21-h21-000-r640.rdu2.scalelab.redhat.com
          200       node: f30-h06-000-r640.rdu2.scalelab.redhat.com
          200       node: f30-h07-000-r640.rdu2.scalelab.redhat.com
          200       node: f30-h09-000-r640.rdu2.scalelab.redhat.com
          200       node: f30-h10-000-r640.rdu2.scalelab.redhat.com
          200       node: f30-h11-000-r640.rdu2.scalelab.redhat.com
          200       node: f30-h13-000-r640.rdu2.scalelab.redhat.com
          200       node: f30-h14-000-r640.rdu2.scalelab.redhat.com
          200       node: f30-h15-000-r640.rdu2.scalelab.redhat.com
          200       node: f30-h18-000-r640.rdu2.scalelab.redhat.com
          200       node: f30-h19-000-r640.rdu2.scalelab.redhat.com
          200       node: f30-h21-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h01-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h02-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h03-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h05-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h06-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h07-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h09-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h10-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h11-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h13-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h14-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h15-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h17-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h18-000-r640.rdu2.scalelab.redhat.com
          200       node: f31-h19-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h01-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h02-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h03-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h05-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h06-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h07-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h09-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h10-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h11-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h13-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h14-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h15-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h17-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h18-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h19-000-r640.rdu2.scalelab.redhat.com
          200       node: f32-h21-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h01-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h02-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h05-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h06-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h07-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h09-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h10-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h11-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h13-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h14-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h15-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h17-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h18-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h19-000-r640.rdu2.scalelab.redhat.com
          200       node: f33-h21-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h01-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h02-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h03-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h05-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h06-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h07-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h09-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h10-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h11-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h13-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h14-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h15-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h17-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h18-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h19-000-r640.rdu2.scalelab.redhat.com
          200       node: f34-h21-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h01-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h02-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h03-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h05-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h06-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h07-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h09-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h10-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h11-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h13-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h14-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h15-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h17-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h19-000-r640.rdu2.scalelab.redhat.com
          200       node: f35-h21-000-r640.rdu2.scalelab.redhat.com
          200       node: f36-h01-000-r640.rdu2.scalelab.redhat.com
          200       node: f36-h02-000-r640.rdu2.scalelab.redhat.com
          200       node: f36-h03-000-r640.rdu2.scalelab.redhat.com
          200       node: f36-h05-000-r640.rdu2.scalelab.redhat.com
          200       node: f36-h06-000-r640.rdu2.scalelab.redhat.com
          200       node: f36-h07-000-r640.rdu2.scalelab.redhat.com
          200       node: f36-h09-000-r640.rdu2.scalelab.redhat.com
          200       node: f36-h10-000-r640.rdu2.scalelab.redhat.com
          200       node: f36-h11-000-r640.rdu2.scalelab.redhat.com
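
      For scripted checks, the same per-node tally can be sketched in Python. This is only an illustration of what the grep/sort/uniq pipeline above computes. The sample YAML and the node names worker-a and worker-b are hypothetical; real input would be the output of `oc get egressip -oyaml`.

```python
from collections import Counter


def count_eips_per_node(egressip_yaml: str) -> Counter:
    """Count assigned egress IPs per node, like `grep node: | sort | uniq -c`."""
    counts = Counter()
    for line in egressip_yaml.splitlines():
        stripped = line.strip()
        # A status item may start with a YAML list dash ("- node: <name>").
        if stripped.startswith("- "):
            stripped = stripped[2:].strip()
        if stripped.startswith("node:"):
            counts[stripped[len("node:"):].strip()] += 1
    return counts


# Hypothetical, shortened sample (not taken from the cluster in this report).
SAMPLE = """\
status:
  items:
  - egressIP: 10.0.0.1
    node: worker-a
  - egressIP: 10.0.0.2
    node: worker-a
  - egressIP: 10.0.0.3
    node: worker-b
"""

per_node = count_eips_per_node(SAMPLE)
for node, count in sorted(per_node.items()):
    print(f"{count:7d} node: {node}")
```

      Like the grep pipeline, this matches lines textually and does not parse the YAML, so any other field named `node` would also be counted.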
      
      We rebooted node f36-h09-000-r640.rdu2.scalelab.redhat.com, which had 200 EIPs:
      [root@f20-h01-000-r640 kube-burner-ocp]# oc get egressip -oyaml | grep "node: f36-h09-000-r640.rdu2.scalelab.redhat.com"  | sort | uniq -c
          200       node: f36-h09-000-r640.rdu2.scalelab.redhat.com
      
      However, the metric increased by only 64 instead of 200.
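
      The comparison can be sketched as follows, assuming the metric is read from two Prometheus text-format scrapes, one before and one after the reboot. The absolute counter values in BEFORE and AFTER are hypothetical; only the metric name, the observed increase of 64, and the expected 200 come from this report. The helper names are illustrative.

```python
METRIC = "ovnkube_clustermanager_egress_ips_rebalance_total"


def read_counter(exposition_text: str, name: str = METRIC) -> float:
    """Sum all samples of `name` found in Prometheus text-format output."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # Sample lines look like: name{labels} value [timestamp]
        sample_name = line.split("{", 1)[0].split(" ", 1)[0]
        if sample_name != name:
            continue
        rest = line[line.rfind("}") + 1:] if "{" in line else line[len(name):]
        total += float(rest.split()[0])
    return total


def rebalance_shortfall(before: str, after: str, expected: int) -> float:
    """Return how far the counter increase falls short of `expected`."""
    delta = read_counter(after) - read_counter(before)
    return expected - delta


# Hypothetical scrapes: the starting value 0 is assumed, not observed.
BEFORE = f"# TYPE {METRIC} counter\n{METRIC} 0\n"
AFTER = f"# TYPE {METRIC} counter\n{METRIC} 64\n"

shortfall = rebalance_shortfall(BEFORE, AFTER, expected=200)
print(f"increase is short by {shortfall:.0f} EIPs")
```

      With the values reported here, the counter accounts for 64 of the 200 EIPs that were on the rebooted node, leaving 136 unaccounted for.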
      
      After the reboot, no EIPs are assigned to the node, as expected:
      [root@f20-h01-000-r640 kube-burner-ocp]# oc get egressip -oyaml | grep "node: f36-h09-000-r640.rdu2.scalelab.redhat.com"  | sort | uniq -c
      [root@f20-h01-000-r640 kube-burner-ocp]# 
      
      
      More details about the testing and observations are at https://docs.google.com/document/d/17NGv6pR-3VFVD5hFzpcdYt5hRZChOSIzmm0BPc-Rdx8/edit?usp=sharing 
      
      Environment details -
      OCP deployed on bare metal nodes - 120 workers, 2 infra, 3 masters.
      All nodes have the same configuration -
      CPUs: 80   Memory: 384G   NIC bandwidth: 25 Gb/s

      Version-Release number of selected component (if applicable):

          4.16

      How reproducible:

          Always

      Steps to Reproduce:

      The Perf and OVN teams developed a custom workload for EIP scale testing. The OCP deployment is currently available for debugging this issue.

      Actual results:

          The metric's increment is less than the number of EIPs assigned to the node that restarted.

      Expected results:

          The metric's increment should be equal to the number of EIPs assigned to the node that restarted.

      Additional info:

          

            sdn-team-bot sdn-team bot
            vkommadi@redhat.com VENKATA ANIL kumar KOMMADDI
            Sachin Ninganure Sachin Ninganure