
OCPBUGS-11710: Connection problems with OVN-Kubernetes on OpenShift Container Platform 4.12 on AWS post hibernation

    • Severity: Moderate
    • Sprints: SDN Sprint 235, SDN Sprint 236, SDN Sprint 237, SDN Sprint 238, SDN Sprint 239, SDN Sprint 240, SDN Sprint 241, SDN Sprint 242, SDN Sprint 243, SDN Sprint 244, SDN Sprint 245
    • Release Note Text: Previously, an external neighbor could change its Media Access Control (MAC) address while the cluster was shutting down or hibernating. Although a Gratuitous Address Resolution Protocol (GARP) announcement was supposed to inform the other neighbors about this change, the cluster did not process the GARP. After restarting the cluster, the neighbor might no longer be reachable from the OVN-Kubernetes cluster network because the outdated MAC address was still in use. With this release, an update enables an aging mechanism so that a neighbor's MAC address is refreshed every 300 seconds. (link:https://issues.redhat.com/browse/OCPBUGS-11710[*OCPBUGS-11710*])
    • Release Note Type: Bug Fix
    • Done
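
      The aging mechanism in the release note above can be pictured with a short sketch. This is an illustrative model only, with hypothetical names, not OVN's actual implementation: each learned IP-to-MAC entry carries a timestamp, and a periodic sweep evicts any entry older than the threshold (300 seconds in the fix), which forces a fresh ARP resolution that picks up a changed MAC.

          package main

          import (
              "fmt"
              "sync"
              "time"
          )

          // macBinding loosely models a learned IP-to-MAC entry, similar in
          // spirit to an OVN MAC_Binding row. All names here are illustrative.
          type macBinding struct {
              mac       string
              learnedAt time.Time
          }

          type bindingTable struct {
              mu      sync.Mutex
              entries map[string]macBinding // keyed by neighbor IP
          }

          // sweep evicts entries older than threshold. Once a stale entry is
          // gone, the next packet triggers a new ARP request, which learns
          // the neighbor's current MAC.
          func (t *bindingTable) sweep(threshold time.Duration) {
              t.mu.Lock()
              defer t.mu.Unlock()
              for ip, b := range t.entries {
                  if time.Since(b.learnedAt) > threshold {
                      delete(t.entries, ip)
                  }
              }
          }

          func main() {
              t := &bindingTable{entries: map[string]macBinding{
                  "10.0.0.10": {mac: "0a:58:0a:00:00:0a", learnedAt: time.Now().Add(-6 * time.Minute)},
                  "10.0.0.11": {mac: "0a:58:0a:00:00:0b", learnedAt: time.Now()},
              }}
              t.sweep(300 * time.Second) // the 300 s threshold from the fix
              fmt.Println(t.entries)     // only the fresh entry survives
          }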

      Description of problem:

       

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      1.

      2.

      3.

       

      Actual results:

       

      Expected results:

       

      Additional info:

      Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

      Affected Platforms:

      Is it an:

      1. internal CI failure
      2. customer issue / SD
      3. internal Red Hat testing failure

       

      If it is an internal Red Hat testing failure:

      • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges, or BM+kubevirt).

       

      If it is a CI failure:

       

      • Did it happen in different CI lanes? If so, please provide links to multiple failures with the same error instance.
      • Did it happen in both sdn and ovn jobs? If so, please provide links to multiple failures with the same error instance.
      • Did it happen on other platforms (e.g. aws, azure, gcp, baremetal, etc.)? If so, please provide links to multiple failures with the same error instance.
      • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run.
      • If it's a connectivity issue:
        • What is the srcNode, srcIP, srcNamespace and srcPodName?
        • What is the dstNode, dstIP, dstNamespace and dstPodName?
        • What is the traffic path? (examples: pod2pod? pod2external? pod2svc? pod2Node? etc.)

       

      If it is a customer / SD issue:

       

      • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
      • Don’t presume that Engineering has access to Salesforce.
      • Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment. The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
      • Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc.).
      • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
        • If the issue is in a customer namespace then provide a namespace inspect.
        • If it is a connectivity issue:
          • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
          • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
          • What is the traffic path? (examples: pod2pod? pod2external? pod2svc? pod2Node? etc.)
          • Please provide the UTC timestamp of the networking outage window from the must-gather.
          • Please provide tcpdump pcaps taken during the outage, filtered on the src/dst IPs provided above.
        • If it is not a connectivity issue:
          • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure, etc.) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when the problem happened, if any.
      • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
      • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
      • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”


            Pablo Alonso Rodriguez added a comment -

            jcaamano@redhat.com I think that looks like the way to go. At least it looks a bit more similar to what the kernel does (re-validate the entries or not based on "positive feedback"), so OVN-Kubernetes behavior would be closer to what most people are used to.

            Thanks!

            Jaime Caamaño Ruiz added a comment -

            I asked the OVN team. There is a mechanism that refreshes the timestamp of the MAC binding entries based on their usage. That is useful because it potentially enables us to lower that timeout value in case it becomes a viable path forward to further improve those other cases. There is no additional mechanism to keep any other aspect of those entries updated based on any traffic other than ARP.

            I created a new bug for the 4.14 backport.
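
            As an aside for readers, the refresh-on-use mechanism described in the comment above could be modeled by extending the bindingTable sketch from the release note section with a "touch" on every use of an entry, so bindings that carry traffic keep a fresh timestamp and only idle ones age out. A hypothetical illustration, not the actual OVN code:

                // touch renews an entry's timestamp whenever traffic uses it,
                // so the periodic sweep only evicts bindings that have been
                // idle for the full threshold. Illustrative only; extends the
                // bindingTable type sketched earlier in this ticket.
                func (t *bindingTable) touch(ip string) {
                    t.mu.Lock()
                    defer t.mu.Unlock()
                    if b, ok := t.entries[ip]; ok {
                        b.learnedAt = time.Now()
                        t.entries[ip] = b
                    }
                }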

            Chris Fields added a comment -

            Thanks for the update, rhn-support-sreber!

            I created a new bug that seeks a better solution for customers who are experiencing the symptoms in https://access.redhat.com/solutions/7037795: https://issues.redhat.com/browse/OCPBUGS-31643

            Pablo Alonso Rodriguez added a comment -

            Regarding reducing the timeout... that also looks like a good idea for another feature ticket. IMHO, we should try to mimic what the kernel does, not only for the more consistent behavior but also because it has years of people relying on those defaults. However, I might be missing specific reasons why this is not as good an idea as it sounds.

            Pablo Alonso Rodriguez added a comment -

            Hi.

            Local gateway can definitely be a valid workaround... except when egress IPs are configured, because egress IPs always work through the gateway routers regardless of the mode (I have not checked the most recent versions, but in not-so-recent ones it works this way), or when OVS hardware offload is desired (I have no examples of either situation, though).

            However, I agree that for these cases it would be a relief, not a solution. I just tried to bring up all the ideas I had, but maybe the other folks here can comment on why their respective customers cannot use local gateway as a workaround.

            Jaime Caamaño Ruiz added a comment -

            rhn-support-palonsor

            I will have to insist: the hibernation scenario is not comparable to the scenarios you are describing, so at least in my view it is not the first or the most likely one. In the hibernation scenario we clearly know why GARPs are not doing their job, and we have a proper fix for that, while in the other scenarios we don't. So we would be backporting something hoping it fixes a problem without knowing why or how that problem happens, or whether there is a more proper, easier, or quicker fix.

            Also, looking into these customer use cases is a good time to find out whether we should (and could afford to) reduce the timeout.

            I see the use case of a self-healing capability in case a MAC binding entry becomes stale. And that's OK as long as the customers understand that is what they are getting, rather than their problem being resolved, which we don't really know. I would also have a hard time understanding why local gateway mode is not a better option for these customers as far as a remediation or relief goes.

            We just need to keep in mind that backports are not free and they also introduce the risk of other issues, so we shouldn't be doing them lightly.

            Pablo Alonso Rodriguez added a comment -

            jcaamano@redhat.com the hibernation is likely to have been the first scenario and the most likely one, but not the only one.

            Regarding the 5-minute blackout... To be honest, I missed the part that it was 300 seconds (and not 30 seconds, like the kernel base_reachable_time_ms sysctl); I mistakenly expected that we would mimic the kernel behavior, my bad.

            A 5-minute blackout is not ideal for sure, but it is at least something: "VIP may eventually work" is better than "VIP won't work until human interaction happens". Especially if "human interaction" means that customers have to either wipe OVN databases (which has cluster-wide impact in 4.13 and lower) or manually mess with the SBDB (with the corresponding risk in case there is a mistake or typo).

            So in the case of these customers, we would just be providing some relief: "if you encounter the issue, the impact is a 5-minute blackout on VIP movements" versus needing to mess with complex places. That wouldn't remove the need to investigate the GARP failures, on the other hand.
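
            For context on the comparison above: the kernel default Pablo mentions can be read from procfs on any Linux node. A minimal, runnable sketch; base_reachable_time_ms is a real sysctl (default 30000 ms), and the kernel re-validates neighbor entries at a randomized interval around it rather than deleting them outright:

                package main

                import (
                    "fmt"
                    "os"
                    "strings"
                )

                func main() {
                    // Base interval (milliseconds) the kernel uses to decide when a
                    // neighbor entry needs re-validation; the default is 30000 (30 s),
                    // versus the 300 s aging threshold discussed in this bug.
                    data, err := os.ReadFile("/proc/sys/net/ipv4/neigh/default/base_reachable_time_ms")
                    if err != nil {
                        fmt.Fprintln(os.Stderr, err)
                        os.Exit(1)
                    }
                    fmt.Println("base_reachable_time_ms =", strings.TrimSpace(string(data)))
                }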

            Jaime Caamaño Ruiz added a comment -

            rhn-support-palonsor

            I am a bit confused here, because this was targeting a scenario where the cluster is shut down or hibernating, but now we are saying gratuitous ARPs cannot be taken for granted for many other reasons, which puts us in a completely different scenario. For the cases you are describing, these are even bare-metal, cluster-internal GARPs, which should have fewer reasons not to work?

            Let me put this in a different perspective: for these customers that are hitting this issue so frequently, are they OK with the 5-minute blackout they will suffer, which is the fixed period by which this solution clears old entries from the MAC binding table? I wouldn't think so; while I agree that this is better than nothing, a 5-minute blackout when a VIP moves around is clearly unlikely to be the best scenario either.

            So overall, I am assuming there are specific reasons and/or a significant frequency of this GARP not working that justify the backport, but at the same time, if that is the case, I am not sure this is the proper solution.

            I will ask the OVN team whether they implemented any additional mechanism to refresh or invalidate entries beyond this static aging mechanism, because I think they had an idea to do so. Otherwise these customers might be better served by the other workaround: using local gateway mode.

            Pablo Alonso Rodriguez added a comment - edited

            jcaamano@redhat.com if my neurons have not disconnected too much during these holidays, this impacts any scenario where an IP on the same subnet as the nodes changes its MAC and the gratuitous ARPs cannot be properly received.

            This, in turn, can impact a number of VIP scenarios (where the IP is always the VIP but the MAC changes to that of whatever node holds the VIP), including (but not limited to):

            • Pods connecting to control plane or ingress VIPs. We even have some cluster operators that do this.
            • Application pods connecting to other applications on the same cluster via the ingress VIP. Although not optimal from a networking point of view, this is a quite usual scenario, sometimes even due to legit limitations like funny corner cases related to TLS certificates, endpoints exposed by auto-discovery, etc.
            • External services used by applications that are deployed in the same subnet as the nodes and use a keepalived-like VIP.

            The fact that a gratuitous ARP works around the issue greatly reduces its likelihood. However, gratuitous ARPs are not delivered by a reliable mechanism, so they cannot be taken for granted (because of a poor network, some improper security policy blocking them, temporary issues...).
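
            To make the GARP mechanism discussed above concrete: a gratuitous ARP is an unsolicited ARP packet whose sender and target protocol addresses are both the VIP, announcing the VIP's new MAC to everyone on the segment. A minimal sketch of the 28-byte ARP payload, with example addresses; actually sending it would also require an Ethernet header and a raw socket, which are omitted here:

                package main

                import (
                    "encoding/binary"
                    "fmt"
                    "net"
                )

                // buildGARP returns the ARP payload of a gratuitous ARP reply:
                // sender IP == target IP == the VIP, carrying the new owner's MAC.
                func buildGARP(vip net.IP, newMAC net.HardwareAddr) []byte {
                    p := make([]byte, 28)
                    binary.BigEndian.PutUint16(p[0:2], 1)      // hardware type: Ethernet
                    binary.BigEndian.PutUint16(p[2:4], 0x0800) // protocol type: IPv4
                    p[4] = 6                                   // hardware address length
                    p[5] = 4                                   // protocol address length
                    binary.BigEndian.PutUint16(p[6:8], 2)      // opcode: ARP reply
                    copy(p[8:14], newMAC)                      // sender MAC: new VIP owner
                    copy(p[14:18], vip.To4())                  // sender IP: the VIP
                    copy(p[18:24], newMAC)                     // target MAC
                    copy(p[24:28], vip.To4())                  // target IP: the VIP again
                    return p
                }

                func main() {
                    mac, _ := net.ParseMAC("0a:58:0a:00:00:0c") // example MAC
                    fmt.Printf("% x\n", buildGARP(net.ParseIP("192.0.2.100"), mac))
                }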

            Jaime Caamaño Ruiz added a comment - edited

            rhn-support-cfields

            The code change is extensive, but I don't anticipate many conflicts on a backport to 4.14, so I don't think it would require much more effort than a regular backport. The version of OVN already present in 4.14 should be sufficient.

            Please note that an associated OVN issue with this change has come up: https://issues.redhat.com/browse/FDP-439
            We still don't know the full relationship or the extent of the consequences, but it has not made much noise yet. We probably need feedback from the OVN team on whether they recommend holding the backport.

            Another thing I am not clear on yet is the use case of all the customers asking for a backport. Do they all use hibernation on AWS hosted clusters? Or are we talking about a different use case?
