OpenShift Bugs / OCPBUGS-54610

During OCP upgrade from 4.12 to 4.14, with SDN CO still at 4.13, iptables-restore takes ~5x more time (for the same number of svcs/pods)


      Context:

      Before starting the upgrade, the cluster was:

      • all COs  at 4.12.30 (incl. SDN CO)
      • all masters at 4.12.30 / CoreOS 412.86 (RHEL 8.6)
      • all other nodes  at 4.12.30 / CoreOS 412.86 (RHEL 8.6)

      Then the upgrade from 4.12.30 to 4.14.39 was started.

      It was paused because of an issue on quay.io that prevented image pulling.

      At the moment, the situation is as follows (a cross-check sketch follows the list):

      • all COs updated to 4.13.52 (incl. SDN CO)
      • all masters updated to 4.13.52 / CoreOS 413.92 (RHEL 9.2)
      • all other nodes still at 4.12.30 / CoreOS 412.86 (RHEL 8.6)
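
      For reference, this state can be cross-checked against the must-gather with omc (the same tool used for the log commands below). The commands are standard oc-style queries and are only a sketch; adjust them to whatever data was actually collected:

      ~~~
      # Cross-check the upgrade/version state from the must-gather.
      omc get clusterversion              # overall upgrade progress (4.12.30 -> 4.14.39)
      omc get clusteroperator network     # SDN/network CO version (expected 4.13.52 here)
      omc get nodes -o wide               # OS IMAGE column shows the 412.86 vs 413.92 mix
      ~~~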

      Issue

      We now see consistently higher iptables-restore times after the SDN upgrade from 4.12 to 4.13 (a short averaging sketch follows the samples):

      1. Worker1 - before SDN upgrade (that is, SDN CO still at 4.12):
        ~~~
        $ omc get pod -n openshift-sdn -o wide | grep worker1 | awk '{print $1}' | xargs -I {} omc logs -n openshift-sdn {} -c sdn | grep 'iptables restore' | grep -oE 'total time.*' | head
        total time: 2841ms):
        total time: 2048ms):
        total time: 2004ms):
        total time: 2041ms):
        total time: 2579ms):
        total time: 2639ms):
        total time: 2018ms):
        total time: 2029ms):
        total time: 2891ms):
        total time: 2850ms):
        ~~~
      2. Worker1 - after SDN upgrade (that is, SDN CO now at 4.13):
        ~~~
        $ omc get pod -n openshift-sdn -o wide | grep worker1 | awk '{print $1}' | xargs -I {} omc logs -n openshift-sdn {} -c sdn | grep 'iptables restore' | grep -oE 'total time.*' | head
        total time: 6361ms):
        total time: 11866ms):
        total time: 11121ms):
        total time: 9864ms):
        total time: 10564ms):
        total time: 10325ms):
        total time: 10485ms):
        total time: 10224ms):
        total time: 11450ms):
        total time: 11144ms):
        ~~~
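
      To quantify the regression beyond the first ten samples, the same pipeline can be fed through awk to average the reported times. This is only an illustrative sketch that reuses the grep pattern above; the awk aggregation is not part of the original data collection:

      ~~~
      # Average iptables-restore time (ms) over all matching log lines of worker1's sdn pod.
      omc get pod -n openshift-sdn -o wide | grep worker1 | awk '{print $1}' \
        | xargs -I {} omc logs -n openshift-sdn {} -c sdn \
        | grep 'iptables restore' | grep -oE 'total time: [0-9]+ms' \
        | awk '{gsub(/[^0-9]/, "", $3); sum += $3; n++} END {if (n) printf "samples=%d avg=%.0fms\n", n, sum/n}'
      ~~~

      Averaging just the ten samples shown above gives roughly 2.4s per restore before the SDN upgrade vs. roughly 10.3s after, i.e. the ~4-5x slowdown from the summary.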
         

      NOTE

      Before the start of the upgrade, all nodes were on CoreOS 412.86 (RHEL 8.6) and there was no issue.
      Now we have a mix of CoreOS 412.86 (RHEL 8.6) / CoreOS 413.92 (RHEL 9.2),
      and we see the issue on BOTH node types; therefore we suspect it is something related to the CNI and not the node OS version (a per-node-type sketch follows).
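
      A minimal sketch of how the per-node-type comparison can be repeated, assuming node names are taken from the OS IMAGE column of "omc get nodes -o wide" (the second name below, master0, is a placeholder for a CoreOS 413.92 node, not a name taken from this cluster):

      ~~~
      # Compare restore times on one CoreOS 412.86 (RHEL 8.6) node and one CoreOS 413.92 (RHEL 9.2) node.
      for node in worker1 master0; do
        echo "== $node =="
        omc get pod -n openshift-sdn -o wide | grep "$node" | awk '{print $1}' \
          | xargs -I {} omc logs -n openshift-sdn {} -c sdn \
          | grep 'iptables restore' | grep -oE 'total time.*' | head -5
      done
      ~~~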
