OpenShift Bugs / OCPBUGS-9199

Incorrect NAT when using cluster networking in control-plane nodes to install a VRRP Cluster


    • Moderate
    • Rejected
    • All
    • If docs needed, set a value

      Description of problem:
      In an OCP VRRP deployment (using OCP cluster networking), each control-plane node has an additional data interface configured alongside the regular management interface. In some deployments, the kubernetes service address 172.30.0.1:443 is NATed to the data interface address instead of the management interface address (10.40.1.4:6443 instead of 10.30.1.4:6443, as we configure on the bootstrap node), even though the default route points to the 10.30.1.0 network. Because of that, all requests to 172.30.0.1:443 fail. After 10-15 minutes, OCP corrects itself and NATs correctly to 10.30.1.4:6443.

      Version-Release number of selected component (if applicable):

      How reproducible:

      Steps to Reproduce:

      1. Provision an OCP cluster using cluster networking for DNS & load balancing instead of an external DNS & load balancer. Provision each host with one management interface and an additional interface for the data network. Along with the OCP manifests, add a manifest that creates a pod which triggers communication with the kube-apiserver.

      2. Start the cluster installation.

      3. While the first two master nodes are installing, check the custom pod's log to see GET requests to the kube-apiserver time out. Check the nft tables and chase the IP chains to see that the kubernetes service IP address was NATed to the data IP address instead of the management IP address. This does not happen every time; we have seen roughly a 50:50 chance.
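      Steps 1 and 3 can be sketched concretely. The probe pod name/image and the rule string below are illustrative, not captured from an affected cluster; the addresses are the ones quoted in this report:

      ```shell
      # Step 1 probe (hypothetical pod name and image): poll the kubernetes
      # service VIP so timeouts show up in the pod log during installation.
      #   oc run api-probe --image=registry.access.redhat.com/ubi8/ubi-minimal \
      #     --restart=Never -- /bin/sh -c \
      #     'while true; do curl -sk --max-time 5 https://172.30.0.1:443/version || echo timeout; sleep 10; done'
      #
      # Step 3 inspection (run as root on a control-plane node): dump the NAT
      # ruleset and follow the chain that rewrites the service VIP.
      #   nft list ruleset ip | grep -B1 -A4 '172.30.0.1'
      #
      # In the failing state the DNAT target is the data-network address rather
      # than the management address. Filtering the DNAT line out of an
      # illustrative rule string:
      rule="ip daddr 172.30.0.1 tcp dport 443 dnat to 10.40.1.4:6443"
      echo "$rule" | grep -o 'dnat to [0-9.]*:[0-9]*'
      # A healthy node would instead show 'dnat to 10.30.1.4:6443'.
      ```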

      Actual results:
      172.30.0.1:443 is NATed to the data interface address and API requests time out; after 10-15 minutes OCP corrects that by itself.

      Expected results:
      The incorrect NAT should not happen: 172.30.0.1:443 should be NATed to the management interface address (10.30.1.4:6443) from the start.

      Additional info:
      ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895
      ClusterVersion: Stable at "4.8.29"
      ClusterOperators:
      clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      clusteroperator/baremetal is degraded because metal3 deployment inaccessible
      clusteroperator/console is not available (RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."
      clusteroperator/ingress is degraded because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
      clusteroperator/insights is degraded because Unable to report: unable to build request to connect to Insights server: Post "https://cloud.redhat.com/api/ingress/v1/upload": dial tcp: lookup cloud.redhat.com on 172.30.0.10:53: read udp 10.128.0.26:53697->172.30.0.10:53: i/o timeout
      clusteroperator/network is progressing: DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
