-
Bug
-
Resolution: Done-Errata
-
Critical
-
None
-
4.14.0
-
None
-
Important
-
No
-
SDN Sprint 237, SDN Sprint 238, SDN Sprint 239
-
3
-
Approved
-
False
-
Description of problem:
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway has been perma-failing downstream on master: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway
My intuition is that this started perma-failing after we merged https://github.com/openshift/machine-config-operator/pull/3676/files
Debugging notes so far:
- Triggered an LGW upgrade.
- At one point I checked the OVNK startup logs and saw the following:
  2023-05-27T17:22:04.794324968Z I0527 17:22:04.794309 4541 ovs.go:200] Exec(59): /usr/sbin/sysctl -w net.ipv4.conf.br-ex.forwarding=1
  2023-05-27T17:22:04.794857428Z I0527 17:22:04.794835 4541 ovs.go:203] Exec(59): stdout: "net.ipv4.conf.br-ex.forwarding = 1\n"
  2023-05-27T17:22:04.794857428Z I0527 17:22:04.794850 4541 ovs.go:204] Exec(59): stderr: ""
- So we set net.ipv4.conf.br-ex.forwarding to 1, but at some point during the LGW upgrade it is reset to 0. I am not sure why; the result is that all cluster operators go degraded and the upgrade never finishes.
- I had to restart the ovnkube-node pods so that the value is set back to 1 and the upgrade completes, but the disruption tests still fail. We need to look into why this is happening (see the check/restore sketch after these notes).
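For reference, a minimal sketch of the manual check/restore described above. The node name is a placeholder and the app=ovnkube-node label selector is an assumption about the daemonset's pod labels:

# Check the current value on the host (replace <worker-node> with a real node name):
oc -n openshift-ovn-kubernetes debug node/<worker-node> -- chroot /host \
  sysctl -n net.ipv4.conf.br-ex.forwarding

# If it has been reset to 0, either set it back directly ...
oc -n openshift-ovn-kubernetes debug node/<worker-node> -- chroot /host \
  sysctl -w net.ipv4.conf.br-ex.forwarding=1

# ... or restart the ovnkube-node pod on that node so it re-applies the sysctl itself:
oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-node \
  --field-selector spec.nodeName=<worker-node>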
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Trigger a PR that runs the lgw upgrade job (pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway); see the polling sketch below.
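To catch the moment the sysctl flips during the reproduced upgrade, a polling loop like the one below can be run from a workstation with access to the cluster. This is only an illustrative sketch, not part of the CI job; the worker label selector and the one-minute interval are my own choices.

# Poll net.ipv4.conf.br-ex.forwarding on every worker once a minute so the reset
# can be correlated with operator upgrade events (timestamps printed by `date`).
while true; do
  date
  for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
    echo -n "${node#node/}: "
    oc -n openshift-ovn-kubernetes debug "$node" -- chroot /host \
      sysctl -n net.ipv4.conf.br-ex.forwarding 2>/dev/null
  done
  sleep 60
done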
Actual results:
upgrade fails
Expected results:
upgrade should pass
Additional info:
During cluster install:

On a master node:
[surya@hidden-temple ~]$ oc -n openshift-ovn-kubernetes debug node/ip-10-0-145-222.us-west-2.compute.internal
Starting pod/ip-10-0-145-222us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.145.222
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# sysctl -a | grep forwarding | grep mp0 | grep br-ex
sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
sh-5.1# sysctl -a | grep forwarding | grep mp0
sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
sh-5.1# sysctl -a | grep forwarding | grep br-ex
sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 1
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0

Other options:
net.ipv4.conf.all.forwarding = 1
net.ipv4.ip_forward = 1

On a worker node:
[surya@hidden-temple ~]$ oc -n openshift-ovn-kubernetes debug node/ip-10-0-135-158.us-west-2.compute.internal
Starting pod/ip-10-0-135-158us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.135.158
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# sysctl -a | grep forwarding | grep br-ex
sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 1
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
sh-5.1# sysctl -a | grep forwarding | grep mp0
sysctl: unable to open directory "/proc/sys/fs/binfmt_misc/"
net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
net.ipv4.conf.all.forwarding = 1

=========
A bit later on:
etcd   4.14.0-0.ci.test-2023-05-30-055806-ci-op-cztlhsfm-initial   True   True   False   15m   NodeInstallerProgressing: 1 nodes are at revision 6; 2 nodes are at revision 7
kube-apiserver, kube-controller-manager, etc. start their node installer work the same way.

====
oc -n openshift-ovn-kubernetes debug node/ip-10-0-135-158.us-west-2.compute.internal
Starting pod/ip-10-0-135-158us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.135.158
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# sysctl -a | grep forwarding | grep br-ex
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 1
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
sh-5.1# sysctl -a | grep forwarding | grep br-ex
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 0
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
sh-5.1# sysctl -w net.ipv4.conf.br-ex.forwarding=1
net.ipv4.conf.br-ex.forwarding = 1
sh-5.1# sysctl -w net.ipv4.conf.ovn-k8s-mp0.forwarding=1
net.ipv4.conf.ovn-k8s-mp0.forwarding = 1

====
[surya@hidden-temple ~]$ oc debug node/ip-10-0-151-194.us-west-2.compute.internal -n openshift-ovn-kubernetes
Starting pod/ip-10-0-151-194us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.151.194
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# sysctl -a | grep forwarding | grep mp0
net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
sh-5.1# sysctl -a | grep forwarding | grep br-ex
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 1
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
sh-5.1# sysctl -a | grep forwarding | grep br-ex
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 0
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
sh-5.1# sysctl -w net.ipv4.conf.br-ex.forwarding=1
net.ipv4.conf.br-ex.forwarding = 1

=====
[surya@hidden-temple ~]$ oc debug node/ip-10-0-199-154.us-west-2.compute.internal -n openshift-ovn-kubernetes
Starting pod/ip-10-0-199-154us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.199.154
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# sysctl -a | grep forwarding | grep mp0
net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
sh-5.1# sysctl -a | grep forwarding | grep br-ex
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 0
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
sh-5.1# sysctl -w net.ipv4.conf.br-ex.forwarding=1
net.ipv4.conf.br-ex.forwarding = 1
sh-5.1# sysctl -a | grep forwarding | grep br-ex
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 1
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 0
net.ipv6.conf.br-ex.mc_forwarding = 0
Immediately after the authentication and ingress operators upgrade, the node-tuning operator upgrades; then console and authentication start having connection issues, and when we check br-ex forwarding it is now 0. I do not follow exactly how, why, and when the MCO scripts take effect, so we need to look into the ordering of those steps during an upgrade.
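To narrow down which host-side configuration wins, one place to look is the sysctl drop-ins and the tuned profile on an affected worker. This is only a sketch of where I would start, under the assumption that whatever resets the value comes from a standard sysctl.d or tuned location; none of these paths are confirmed from the job logs:

# Inside `oc debug node/<worker-node>` + `chroot /host` on an affected worker:
# 1. Which sysctl drop-ins mention forwarding, and in what order are they applied?
grep -rn "forwarding\|ip_forward" /etc/sysctl.conf /etc/sysctl.d /run/sysctl.d /usr/lib/sysctl.d 2>/dev/null

# 2. What is the active tuned profile (applied by the node-tuning operator), and does it touch forwarding?
cat /etc/tuned/active_profile 2>/dev/null
grep -rn "forwarding\|ip_forward" /etc/tuned /usr/lib/tuned 2>/dev/null

# From the workstation: which rendered MachineConfig the worker pool is rolling to during the upgrade.
oc get mcp worker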
This happens only on worker nodes, not on master nodes. I also don't know why this is not a problem in SGW mode.
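When comparing against SGW, it is worth double-checking which gateway mode the cluster under test is actually running. A quick check, assuming the usual gatewayConfig knob in the cluster Network operator config:

# routingViaHost: true corresponds to local gateway (LGW) mode; false or unset is shared gateway (SGW).
oc get network.operator.openshift.io cluster \
  -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig.routingViaHost}{"\n"}'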
- links to
-
RHEA-2023:5006 rpm