OCPBUGS-5358

[PAO] Possible improper tuning of PAO leading to CPU contention


      Description of problem:

      - UNKNOWN EVENT around 13:24 on December 30th
      - System load begins to increase a little more than normal (from <10 to 30-60)
      - OVS has a hiccup around 13:27 on December 30th, likely due to the event that caused CPU contention
      - This increases load further (load now above 300)
      - NFS (for PVCs) flakes several times in between
      - System is rebooted around 18:48 on December 30th

      Version-Release number of selected component (if applicable):

      4.7.37

      How reproducible:

      Not very, as many clusters have this configuration

      Steps to Reproduce:

      According to the PerformanceProfile, 8 cores should be reserved (in both the kubelet config and the kernelArgs):
      
      ```
         $ oc get performanceprofile perf-profile-2m-worker -o yaml             
         apiVersion: performance.openshift.io/v2
         kind: PerformanceProfile
         metadata:
           annotations:
             kubeletconfig.experimental: |
               {"systemReserved": {"memory": "16Gi", "cpu": "8", "ephemeral-storage": "1Gi"}}
         spec:
           cpu:
             isolated: 2-19,22-39,42-59,62-79
             reserved: 0,1,20,21,40,41,60,61
      ```
      
      However, the MachineConfig that was rolled out shows all 80 cores in tuned.non_isolcpus:
      ```
         $ oc get mc $(oc get mcp workerperf | awk '/workerperf/ {print $2}') -o yaml | yq -e '.spec.kernelArguments'
           "tuned.non_isolcpus=0000ffff,ffffffff,ffffffff",

         $ python hexmask_converter.py 0000ffffffffffffffffffff
         0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79
      ```
      
      I think it should be this instead:
      ```
         "tuned.non_isolcpus=30000300,00300003"   $ python hexmask_converter.py 3000030000300003
         0,1,20,21,40,41,60,61
      ```

      In the previous case (03211783), system processes such as 'ovs-vswitchd' were confined to 4 specific cores (0,20,40,60) prior to adjustments; no processes from containers/pods were to use them. Later in that case, this was adjusted to 8 cores: 0,1,20,21,40,41,60,61.

      However, based on the above findings, I had Joe Teagno [RH Consultant] log in and check which cores 'ovs-vswitchd' is allowed to use, and it looks like it can use any of the 80 cores -- I would expect the affinity list to contain the cores '0,1,20,21,40,41,60,61', but instead it contains '0-79':
      ```
         [root@worker-01 ~]# hostname --fqdn
         worker-01.kub1.elsstx13.vzwops.com
         
         [root@worker-01 ~]# taskset -c -p $(ps aux | awk '/[o]vs-vswitch/ {print $2}')
         pid 2552's current affinity list: 0-79
      ```
      
      So if a process from a pod/container tried to use the same core(s) that ovs-vswitchd was using, and ovs-vswitchd became blocked, ovs-vswitchd could dynamically throttle itself. That would delay pod traffic, drive load higher, and eventually lead to system instability. It typically manifests in logging output like this:
      ```
         $ less 0010-sosreport-worker-01-2022-12-30-tgkedrp.tar.xz/sosreport-worker-01-2022-12-30-tgkedrp/sos_commands/logs/journalctl_--no-pager
         Dec 30 13:24:27 worker-01 ovs-vswitchd[2570]: ovs|04469|ovs_rcu(urcu5)|WARN|blocked 1000 ms waiting for handler205 to quiesce
         Dec 30 13:24:28 worker-01 ovs-vswitchd[2570]: ovs|04470|ovs_rcu(urcu5)|WARN|blocked 2000 ms waiting for handler205 to quiesce
         Dec 30 13:24:30 worker-01 ovs-vswitchd[2570]: ovs|04471|ovs_rcu(urcu5)|WARN|blocked 4000 ms waiting for handler205 to quiesce
         Dec 30 13:24:34 worker-01 ovs-vswitchd[2570]: ovs|04472|ovs_rcu(urcu5)|WARN|blocked 8000 ms waiting for handler205 to quiesce
         Dec 30 13:24:42 worker-01 ovs-vswitchd[2570]: ovs|04473|ovs_rcu(urcu5)|WARN|blocked 16000 ms waiting for handler205 to quiesce
         Dec 30 13:24:58 worker-01 ovs-vswitchd[2570]: ovs|04474|ovs_rcu(urcu5)|WARN|blocked 32001 ms waiting for handler205 to quiesce
         Dec 30 13:25:30 worker-01 ovs-vswitchd[2570]: ovs|04475|ovs_rcu(urcu5)|WARN|blocked 64000 ms waiting for handler205 to quiesce
         Dec 30 13:26:34 worker-01 ovs-vswitchd[2570]: ovs|04476|ovs_rcu(urcu5)|WARN|blocked 128000 ms waiting for handler205 to quiesce
         Dec 30 13:27:29 worker-01 ovs-vswitchd[2570]: ovs|2755338|ovs_rcu|WARN|blocked 1000 ms waiting for handler205 to quiesce
         Dec 30 13:27:30 worker-01 ovs-vswitchd[2570]: ovs|2755339|ovs_rcu|WARN|blocked 2001 ms waiting for handler205 to quiesce
         Dec 30 13:27:32 worker-01 ovs-vswitchd[2570]: ovs|2755340|ovs_rcu|WARN|blocked 4001 ms waiting for handler205 to quiesce
         Dec 30 13:27:36 worker-01 ovs-vswitchd[2570]: ovs|2755341|ovs_rcu|WARN|blocked 8000 ms waiting for handler205 to quiesce
         Dec 30 13:27:44 worker-01 ovs-vswitchd[2570]: ovs|2755342|ovs_rcu|WARN|blocked 16000 ms waiting for handler205 to quiesce
         
         $ journalctl --file 0010-sosreport-worker-01-2022-12-30-tgkedrp.tar.xz/sosreport-worker-01-2022-12-30-tgkedrp/var/log/journal/ebdaaa6a15e74934ab46801b03c54f1c/system@0005f10d9d553026-4d6eb778c3078e7f.journal~ | less
         Dec 30 13:27:28 worker-01 ovs-vswitchd[2570]: ovs|00663|ofproto_dpif_upcall(handler164)|INFO|received packet on unassociated datapath port 4294967295
         Dec 30 13:27:28 worker-01 ovs-vswitchd[2570]: ovs|00112|ofproto_dpif_upcall(handler167)|INFO|received packet on unassociated datapath port 4294967295
         Dec 30 13:27:28 worker-01 ovs-vswitchd[2570]: ovs|01625|ofproto_dpif_upcall(handler163)|INFO|received packet on unassociated datapath port 4294967295
         Dec 30 13:27:28 worker-01 ovs-vswitchd[2570]: ovs|00036|ofproto_dpif_upcall(handler169)|INFO|received packet on unassociated datapath port 4294967295
         Dec 30 13:27:28 worker-01 ovs-vswitchd[2570]: ovs|01626|ofproto_dpif_upcall(handler163)|INFO|received packet on unassociated datapath port 4294967295
         Dec 30 13:27:28 worker-01 ovs-vswitchd[2570]: ovs|00269|ofproto_dpif_upcall(revalidator221)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:3c42e3d0-4deb-4aec-9a81-20ac2203f615
         Dec 30 13:27:28 worker-01 ovs-vswitchd[2570]: ovs|00270|ofproto_dpif_upcall(revalidator221)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:58da8da4-69b5-4fcb-a510-edec0e4cc660
         Dec 30 13:27:28 worker-01 ovs-vswitchd[2570]: ovs|00044|ofproto_dpif_upcall(revalidator223)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:0d07ff63-cb49-4017-aa87-a2ca23bce173
         Dec 30 13:27:28 worker-01 ovs-vswitchd[2570]: ovs|00050|ofproto_dpif_upcall(revalidator226)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:c59e776c-42cf-42ea-973d-81703da83dd0
         Dec 30 13:27:28 worker-01 ovs-vswitchd[2570]: ovs|00271|ofproto_dpif_upcall(revalidator221)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:40f3b297-351c-41ce-a031-88e70b214b0f
         Dec 30 13:27:28 worker-01 ovs-vswitchd[2570]: ovs|00051|ofproto_dpif_upcall(revalidator226)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:0686536b-a7be-4e7b-9a25-56c4d91ebed8
         Dec 30 13:27:28 worker-01 ovs-vswitchd[2570]: ovs|00272|ofproto_dpif_upcall(revalidator221)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:bc0affb0-a985-4a7d-86b1-7ee08d032bce
      ```
      
      However, Nabeel Cocker [RH Consultant] mentioned that the OVS / OVN-Kubernetes team opened up all cores to 'ovs-vswitchd' deliberately, as a workaround to try to prevent the issues seen in case 03211783.
      
      I think we will need to speak internally with members of the PAO team, as well as some of the consultants who worked closely on getting this set up, about whether this is the path forward or whether another approach is needed to mitigate this issue and to prevent OVS processes from contending with pod/container processes by way of the tuning in the Performance Add-On Operator.

      Actual results:

      OVS fights with other processes over the same core(s); we also see 'stalld' go into action at this time.

      Expected results:

      The node never has an issue with OVS, thanks to proper PAO tuning

      Additional info:

       
