  1. OpenShift Bugs
  2. OCPBUGS-14247

[OVNK][CI] e2e-aws-ovn-upgrade-local-gateway is broken and perma failing

    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Critical
    • None
    • 4.14.0
    • None
    • Important
    • No
    • SDN Sprint 237, SDN Sprint 238, SDN Sprint 239
    • 3
    • Approved
    • False

      Description of problem:

      The pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway job has been perma-failing downstream on master.

      Job History: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway

      I suspect this started perma-failing after we merged https://github.com/openshift/machine-config-operator/pull/3676/files

      Debugging notes so far:

      • Triggered a local gateway (LGW) upgrade.
      • The OVNK startup logs show br-ex forwarding being enabled:
        2023-05-27T17:22:04.794324968Z I0527 17:22:04.794309 4541 ovs.go:200] Exec(59): /usr/sbin/sysctl -w net.ipv4.conf.br-ex.forwarding=1
        2023-05-27T17:22:04.794857428Z I0527 17:22:04.794835 4541 ovs.go:203] Exec(59): stdout: "net.ipv4.conf.br-ex.forwarding = 1\n"
        2023-05-27T17:22:04.794857428Z I0527 17:22:04.794850 4541 ovs.go:204] Exec(59): stderr: ""
        So OVNK sets forwarding to 1, but at some point during the OCP upgrade in LGW mode it is reset to 0. I am not sure why. With forwarding at 0, all operators become degraded and the upgrade doesn't finish.
      • I had to restart the ovnkube-node pods so that the value is set back to 1 and the upgrade completes, but the disruption tests still fail. We need to look into why this is happening.

      Version-Release number of selected component (if applicable):

      4.14

      How reproducible:

      Always

      Steps to Reproduce:

      1. Trigger a PR that will run the lgw upgrade job

      Actual results:

      upgrade fails

      Expected results:

      upgrade should pass

      Additional info:

      During cluster install: On a master node:
      [surya@hidden-temple ~]$ oc -n openshift-ovn-kubernetes debug node/ip-10-0-145-222.us-west-2.compute.internal
      Starting pod/ip-10-0-145-222us-west-2computeinternal-debug ...
      To use host binaries, run `chroot /host`
      Pod IP: 10.0.145.222
      If you don't see a command prompt, try pressing enter.
      sh-4.4# chroot /host
      sh-5.1# sysctl -a | grep forwarding | grep mp0 | grep br-ex
      sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
      sh-5.1# sysctl -a | grep forwarding | grep mp0 
      sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
      net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
      net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
      net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
      net.ipv6.conf.ovn-k8s-mp0.forwarding = 0
      net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
      sh-5.1# sysctl -a | grep forwarding | grep br-ex
      sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
      net.ipv4.conf.br-ex.bc_forwarding = 0
      net.ipv4.conf.br-ex.forwarding = 1
      net.ipv4.conf.br-ex.mc_forwarding = 0
      net.ipv6.conf.br-ex.forwarding = 0
      net.ipv6.conf.br-ex.mc_forwarding = 0
      
      Other options:
      net.ipv4.conf.all.forwarding = 1
      net.ipv4.ip_forward = 1
      
      On a worker node:
      [surya@hidden-temple ~]$ oc -n openshift-ovn-kubernetes debug node/ip-10-0-135-158.us-west-2.compute.internal
      Starting pod/ip-10-0-135-158us-west-2computeinternal-debug ...
      To use host binaries, run `chroot /host`
      Pod IP: 10.0.135.158
      If you don't see a command prompt, try pressing enter.
      sh-4.4# chroot /host
      sh-5.1# sysctl -a | grep forwarding | grep br-ex
      sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
      net.ipv4.conf.br-ex.bc_forwarding = 0
      net.ipv4.conf.br-ex.forwarding = 1
      net.ipv4.conf.br-ex.mc_forwarding = 0
      net.ipv6.conf.br-ex.forwarding = 0
      net.ipv6.conf.br-ex.mc_forwarding = 0
      sh-5.1# sysctl -a | grep forwarding | grep mp0
      sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
      net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
      net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
      net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
      net.ipv6.conf.ovn-k8s-mp0.forwarding = 0
      net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
      
      net.ipv4.conf.all.forwarding = 1 
      
      =========
      A bit later on:
      etcd   4.14.0-0.ci.test-2023-05-30-055806-ci-op-cztlhsfm-initial   True   True   False   15m   NodeInstallerProgressing: 1 nodes are at revision 6; 2 nodes are at revision 7
      In the same way, kube-apiserver, kube-controller-manager, etc. start their node installer work.
      
      ====
      oc -n openshift-ovn-kubernetes debug node/ip-10-0-135-158.us-west-2.compute.internal                                                                
      Starting pod/ip-10-0-135-158us-west-2computeinternal-debug ...                                                                                                               
      To use host binaries, run `chroot /host`                                                                                                                                     
      Pod IP: 10.0.135.158                                                                                                                                                         
      If you don't see a command prompt, try pressing enter.                                                                                                                       
      sh-4.4# chroot /host                                                                                           
      sh-5.1# sysctl -a | grep forwarding | grep br-ex
      net.ipv4.conf.br-ex.bc_forwarding = 0
      net.ipv4.conf.br-ex.forwarding = 1
      net.ipv4.conf.br-ex.mc_forwarding = 0
      net.ipv6.conf.br-ex.forwarding = 0
      net.ipv6.conf.br-ex.mc_forwarding = 0
      sh-5.1# sysctl -a | grep forwarding | grep br-ex
      net.ipv4.conf.br-ex.bc_forwarding = 0
      net.ipv4.conf.br-ex.forwarding = 0
      net.ipv4.conf.br-ex.mc_forwarding = 0
      net.ipv6.conf.br-ex.forwarding = 0
      net.ipv6.conf.br-ex.mc_forwarding = 0
      sh-5.1# sysctl -w net.ipv4.conf.br-ex.forwarding=1
      net.ipv4.conf.br-ex.forwarding = 1
      sh-5.1# sysctl -w net.ipv4.conf.ovn-k8s-mp0.forwarding=1
      net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
      
      ====
      [surya@hidden-temple ~]$ oc debug node/ip-10-0-151-194.us-west-2.compute.internal -n openshift-ovn-kubernetes
      Starting pod/ip-10-0-151-194us-west-2computeinternal-debug ...
      To use host binaries, run `chroot /host`
      Pod IP: 10.0.151.194
      If you don't see a command prompt, try pressing enter.
      sh-4.4# chroot /host
      sh-5.1# sysctl -a | grep forwarding | grep mp0
      net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
      net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
      net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
      net.ipv6.conf.ovn-k8s-mp0.forwarding = 0
      net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
      sh-5.1# sysctl -a | grep forwarding | grep br-ex
      net.ipv4.conf.br-ex.bc_forwarding = 0
      net.ipv4.conf.br-ex.forwarding = 1
      net.ipv4.conf.br-ex.mc_forwarding = 0
      net.ipv6.conf.br-ex.forwarding = 0
      net.ipv6.conf.br-ex.mc_forwarding = 0
      sh-5.1# sysctl -a | grep forwarding | grep br-ex
      net.ipv4.conf.br-ex.bc_forwarding = 0
      net.ipv4.conf.br-ex.forwarding = 0
      net.ipv4.conf.br-ex.mc_forwarding = 0
      net.ipv6.conf.br-ex.forwarding = 0
      net.ipv6.conf.br-ex.mc_forwarding = 0
      sh-5.1# sysctl -w net.ipv4.conf.br-ex.forwarding=1
      net.ipv4.conf.br-ex.forwarding = 1
      
      =====
      [surya@hidden-temple ~]$ oc debug node/ip-10-0-199-154.us-west-2.compute.internal -n openshift-ovn-kubernetes                                                                
      Starting pod/ip-10-0-199-154us-west-2computeinternal-debug ...                                                                                                               
      To use host binaries, run `chroot /host`                                                                                                                                     
      Pod IP: 10.0.199.154                                                                                                                                                         
      If you don't see a command prompt, try pressing enter.                                                                                                                       
      sh-4.4# chroot /host
      sh-5.1# sysctl -a | grep forwarding | grep mp0
      net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
      net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
      net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
      net.ipv6.conf.ovn-k8s-mp0.forwarding = 0
      net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
      sh-5.1# sysctl -a | grep forwarding | grep br-ex
      net.ipv4.conf.br-ex.bc_forwarding = 0
      net.ipv4.conf.br-ex.forwarding = 0
      net.ipv4.conf.br-ex.mc_forwarding = 0
      net.ipv6.conf.br-ex.forwarding = 0
      net.ipv6.conf.br-ex.mc_forwarding = 0
      sh-5.1# sysctl -w net.ipv4.conf.br-ex.forwarding=1
      net.ipv4.conf.br-ex.forwarding = 1
      sh-5.1# sysctl -a | grep forwarding | grep br-ex
      net.ipv4.conf.br-ex.bc_forwarding = 0
      net.ipv4.conf.br-ex.forwarding = 1
      net.ipv4.conf.br-ex.mc_forwarding = 0
      net.ipv6.conf.br-ex.forwarding = 0
      net.ipv6.conf.br-ex.mc_forwarding = 0
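
The per-node checks above were done by hand. A minimal sketch of how they could be scripted, assuming `oc` is logged in to the cluster and workers carry the usual `node-role.kubernetes.io/worker` label. The function names `ipv4_forwarding_disabled` and `check_workers` are hypothetical helpers made up for this illustration, not existing tooling.

```shell
#!/bin/sh
# Sketch only: the function names are illustrative, not part of any existing tooling.

# Read `sysctl -a`-style lines on stdin and print each interface whose
# per-interface IPv4 forwarding is 0. bc_forwarding, mc_forwarding and
# IPv6 entries are ignored.
ipv4_forwarding_disabled() {
    awk -F' = ' '/^net\.ipv4\.conf\.[^ ]+\.forwarding = 0[ \t]*$/ {
        name = $1
        sub(/^net\.ipv4\.conf\./, "", name)
        sub(/\.forwarding$/, "", name)
        print name
    }'
}

# Run the check on every worker node through `oc debug`.
# Needs cluster access; only br-ex and ovn-k8s-mp0 are inspected.
check_workers() {
    for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
        off=$(oc -n openshift-ovn-kubernetes debug "$node" -- chroot /host sysctl -a 2>/dev/null \
            | grep -E 'br-ex|ovn-k8s-mp0' | ipv4_forwarding_disabled | tr '\n' ' ')
        echo "$node forwarding disabled on: ${off:-none}"
    done
}
```

Calling `check_workers` at a few points during the upgrade would show on which nodes br-ex has dropped back to 0 without opening a debug shell on each one.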
      

      Immediately after the auth and ingress upgrade, the node-tuning operator upgrades; then console and auth start having connection issues. Checking br-ex forwarding at that point shows it is now 0. I am not sure I follow exactly how, why, and when the MCO scripts take effect, but we need to look into the ordering in which this is done during upgrades.

      This happens only on worker nodes, not master nodes. I also don't know why this is not a problem in shared gateway (SGW) mode.
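
To narrow down when the value flips relative to the operator upgrades, one option is to poll the sysctl on a worker and log a timestamp on every change. This is a rough sketch, assuming it is run from a `chroot /host` shell on the node; `watch_sysctl_file` is a hypothetical helper name, and `/proc/sys/net/ipv4/conf/br-ex/forwarding` is the procfs file behind `net.ipv4.conf.br-ex.forwarding`.

```shell
#!/bin/sh
# Sketch only: watch_sysctl_file is an illustrative helper, not existing tooling.

# Poll a sysctl file and print a UTC-timestamped line whenever its value changes.
#   $1: path to poll
#   $2: seconds between polls (default 1)
#   $3: number of polls, 0 means run until interrupted (default 0)
watch_sysctl_file() {
    path=$1
    interval=${2:-1}
    limit=${3:-0}
    prev=""
    n=0
    while [ "$limit" -eq 0 ] || [ "$n" -lt "$limit" ]; do
        cur=$(cat "$path" 2>/dev/null || echo "unreadable")
        if [ "$cur" != "$prev" ]; then
            printf '%s %s: %s -> %s\n' \
                "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$path" "${prev:-<start>}" "$cur"
            prev=$cur
        fi
        n=$((n + 1))
        sleep "$interval"
    done
}
```

For example, `watch_sysctl_file /proc/sys/net/ipv4/conf/br-ex/forwarding 1` left running during the upgrade would give timestamps that can be lined up against the clusteroperator and MCO logs.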

              trozet@redhat.com Tim Rozet
              sseethar Surya Seetharaman
              Anurag Saxena Anurag Saxena
              Surya Seetharaman