-
Bug
-
Resolution: Done-Errata
-
Critical
-
None
-
4.14.0
-
None
-
Important
-
No
-
SDN Sprint 237, SDN Sprint 238, SDN Sprint 239
-
3
-
Approved
-
False
-
Description of problem:
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway has been perma-failing downstream on master: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway
My intuition is that this started perma-failing after we merged https://github.com/openshift/machine-config-operator/pull/3676/files
Debugging notes so far:
- Triggered an LGW upgrade.
- At one point I checked the OVNK startup logs and saw the following:
  2023-05-27T17:22:04.794324968Z I0527 17:22:04.794309 4541 ovs.go:200] Exec(59): /usr/sbin/sysctl -w net.ipv4.conf.br-ex.forwarding=1
  2023-05-27T17:22:04.794857428Z I0527 17:22:04.794835 4541 ovs.go:203] Exec(59): stdout: "net.ipv4.conf.br-ex.forwarding = 1\n"
  2023-05-27T17:22:04.794857428Z I0527 17:22:04.794850 4541 ovs.go:204] Exec(59): stderr: ""
- So we set net.ipv4.conf.br-ex.forwarding to 1, but at some point during the LGW upgrade it is reset to 0. I am not sure why; the result is that all cluster operators go degraded and the upgrade never finishes.
- I had to restart the ovnkube-node pods so that the value is set back to 1 and the upgrade completes, but the disruption tests still fail. We need to look into why this is happening (see the check/restore sketch after these notes).
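For reference, a minimal sketch of the manual check/restore described above. The node name is a placeholder and the app=ovnkube-node label selector is an assumption about the daemonset's pod labels:

# Check the current value on the host (replace <worker-node> with a real node name):
oc -n openshift-ovn-kubernetes debug node/<worker-node> -- chroot /host \
  sysctl -n net.ipv4.conf.br-ex.forwarding

# If it has been reset to 0, either set it back directly ...
oc -n openshift-ovn-kubernetes debug node/<worker-node> -- chroot /host \
  sysctl -w net.ipv4.conf.br-ex.forwarding=1

# ... or restart the ovnkube-node pod on that node so it re-applies the sysctl itself:
oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-node \
  --field-selector spec.nodeName=<worker-node>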
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Trigger a PR that runs the lgw upgrade job (pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway); see the polling sketch below.
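To catch the moment the sysctl flips during the reproduced upgrade, a polling loop like the one below can be run from a workstation with access to the cluster. This is only an illustrative sketch, not part of the CI job; the worker label selector and the one-minute interval are my own choices.

# Poll net.ipv4.conf.br-ex.forwarding on every worker once a minute so the reset
# can be correlated with operator upgrade events (timestamps printed by `date`).
while true; do
  date
  for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
    echo -n "${node#node/}: "
    oc -n openshift-ovn-kubernetes debug "$node" -- chroot /host \
      sysctl -n net.ipv4.conf.br-ex.forwarding 2>/dev/null
  done
  sleep 60
done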
Actual results:
upgrade fails
Expected results:
upgrade should pass
Additional info:
During cluster install:

On a master node:
[surya@hidden-temple ~]$ oc -n openshift-ovn-kubernetes debug node/ip-10-0-145-222.us-west-2.compute.internal
Starting pod/ip-10-0-145-222us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.145.222
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# sysctl -a | grep forwarding | grep mp0 | grep br-ex
sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
sh-5.1# sysctl -a | grep forwarding | grep mp0
sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
sh-5.1# sysctl -a | grep forwarding | grep br-ex
sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 1
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0

Other options:
net.ipv4.conf.all.forwarding = 1
net.ipv4.ip_forward = 1

On a worker node:
[surya@hidden-temple ~]$ oc -n openshift-ovn-kubernetes debug node/ip-10-0-135-158.us-west-2.compute.internal
Starting pod/ip-10-0-135-158us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.135.158
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# sysctl -a | grep forwarding | grep br-ex
sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 1
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
sh-5.1# sysctl -a | grep forwarding | grep mp0
sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
net.ipv4.conf.all.forwarding = 1

=========
A bit later on:
etcd   4.14.0-0.ci.test-2023-05-30-055806-ci-op-cztlhsfm-initial   True   True   False   15m   NodeInstallerProgressing: 1 nodes are at revision 6; 2 nodes are at revision 7
kube-apiserver, kube-controller-manager, etc. start their node installer work the same way.

====
oc -n openshift-ovn-kubernetes debug node/ip-10-0-135-158.us-west-2.compute.internal
Starting pod/ip-10-0-135-158us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.135.158
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# sysctl -a | grep forwarding | grep br-ex
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 1
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
sh-5.1# sysctl -a | grep forwarding | grep br-ex
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 0
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
sh-5.1# sysctl -w net.ipv4.conf.br-ex.forwarding=1
net.ipv4.conf.br-ex.forwarding = 1
sh-5.1# sysctl -w net.ipv4.conf.ovn-k8s-mp0.forwarding=1
net.ipv4.conf.ovn-k8s-mp0.forwarding = 1

====
[surya@hidden-temple ~]$ oc debug node/ip-10-0-151-194.us-west-2.compute.internal -n openshift-ovn-kubernetes
Starting pod/ip-10-0-151-194us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.151.194
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# sysctl -a | grep forwarding | grep mp0
net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
sh-5.1# sysctl -a | grep forwarding | grep br-ex
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 1
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
sh-5.1# sysctl -a | grep forwarding | grep br-ex
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 0
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
sh-5.1# sysctl -w net.ipv4.conf.br-ex.forwarding=1
net.ipv4.conf.br-ex.forwarding = 1

=====
[surya@hidden-temple ~]$ oc debug node/ip-10-0-199-154.us-west-2.compute.internal -n openshift-ovn-kubernetes
Starting pod/ip-10-0-199-154us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.199.154
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# sysctl -a | grep forwarding | grep mp0
net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
sh-5.1# sysctl -a | grep forwarding | grep br-ex
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 0
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
sh-5.1# sysctl -w net.ipv4.conf.br-ex.forwarding=1
net.ipv4.conf.br-ex.forwarding = 1
sh-5.1# sysctl -a | grep forwarding | grep br-ex
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 1
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
Immediately after the authentication and ingress operators upgrade, the node-tuning operator upgrades; then console and authentication start having connection issues, and when we check br-ex forwarding it is now 0. I do not follow exactly how, why, and when the MCO scripts take effect, so we need to look into the ordering of those steps during an upgrade.
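To narrow down which host-side configuration wins, one place to look is the sysctl drop-ins and the tuned profile on an affected worker. This is only a sketch of where I would start, under the assumption that whatever resets the value comes from a standard sysctl.d or tuned location; none of these paths are confirmed from the job logs:

# Inside `oc debug node/<worker-node>` + `chroot /host` on an affected worker:
# 1. Which sysctl drop-ins mention forwarding, and in what order are they applied?
grep -rn "forwarding\|ip_forward" /etc/sysctl.conf /etc/sysctl.d /run/sysctl.d /usr/lib/sysctl.d 2>/dev/null

# 2. What is the active tuned profile (applied by the node-tuning operator), and does it touch forwarding?
cat /etc/tuned/active_profile 2>/dev/null
grep -rn "forwarding\|ip_forward" /etc/tuned /usr/lib/tuned 2>/dev/null

# From the workstation: which rendered MachineConfig the worker pool is rolling to during the upgrade.
oc get mcp worker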
This happens only on worker nodes, not on master nodes. I also don't know why this is not a problem in SGW mode.
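When comparing against SGW, it is worth double-checking which gateway mode the cluster under test is actually running. A quick check, assuming the usual gatewayConfig knob in the cluster Network operator config:

# routingViaHost: true corresponds to local gateway (LGW) mode; false or unset is shared gateway (SGW).
oc get network.operator.openshift.io cluster \
  -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig.routingViaHost}{"\n"}'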
- links to
-
RHEA-2023:5006 rpm