- Bug
- Resolution: Done-Errata
- Critical
- 4.14.z
- Critical
- No
- Rejected
- x86_64
- False
- Release Note Not Required
- In Progress
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Clone of https://issues.redhat.com/browse/OCPBUGS-32141 for 4.16
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Description of problem:
VIPs are on a different network than the machine network on a 4.14 cluster.
Failing cluster: 4.14
Infrastructure
--------------
Platform: VSphere
Install Type: IPI
apiServerInternalIP: 10.8.0.83
apiServerInternalIPs: 10.8.0.83
ingressIP: 10.8.0.84
ingressIPs: 10.8.0.84
All internal IP addresses of all nodes match the Machine Network.
Machine Network: 10.8.42.0/23
Node name IP Address Matches CIDR
..............................................................................................................
sv1-prd-ocp-int-bn8ln-master-0 10.8.42.24 YES
sv1-prd-ocp-int-bn8ln-master-1 10.8.42.35 YES
sv1-prd-ocp-int-bn8ln-master-2 10.8.42.36 YES
sv1-prd-ocp-int-bn8ln-worker-0-5rbwr 10.8.42.32 YES
sv1-prd-ocp-int-bn8ln-worker-0-h7fq7 10.8.42.49 YES
Logs from one of the haproxy pods:
oc logs -n openshift-vsphere-infra haproxy-sv1-prd-ocp-int-bn8ln-master-0 haproxy-monitor
.....
2024-04-02T18:48:57.534824711Z time="2024-04-02T18:48:57Z" level=info msg="An error occurred while trying to read master nodes details from api-vip:kube-apiserver: failed find a interface for the ip 10.8.0.83"
2024-04-02T18:48:57.534849744Z time="2024-04-02T18:48:57Z" level=info msg="Trying to read master nodes details from localhost:kube-apiserver"
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=error msg="Could not retrieve subnet for IP 10.8.0.83" err="failed find a interface for the ip 10.8.0.83"
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=info msg="GetLBConfig failed, sleep half of interval and retry" kubeconfigPath=/var/lib/kubelet/kubeconfig
2024-04-02T18:49:00.572652095Z time="2024-04-02T18:49:00Z" level=error msg="Could not retrieve subnet for IP 10.8.0.83" err="failed find a interface for the ip 10.8.0.83"
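The error above is consistent with the addresses listed earlier: the API VIP 10.8.0.83 falls outside the machine network 10.8.42.0/23, so no node interface owns a subnet containing it. A minimal sketch of that lookup (illustrative Python, not the actual Go code in the haproxy-monitor; the interface name ens192 is a made-up example):

```python
import ipaddress

def find_interface_for_ip(ip, interface_subnets):
    """Return the first interface whose subnet contains ip, else None."""
    addr = ipaddress.ip_address(ip)
    for name, cidr in interface_subnets.items():
        if addr in ipaddress.ip_network(cidr):
            return name
    return None

# Node interfaces on the failing cluster carry only the machine network
# (ens192 is a hypothetical interface name for illustration).
interfaces = {"ens192": "10.8.42.0/23"}

print(find_interface_for_ip("10.8.42.24", interfaces))  # node IP -> ens192
print(find_interface_for_ip("10.8.0.83", interfaces))   # API VIP -> None
```

With no interface matching the VIP, the monitor cannot compute the VIP's subnet, which is exactly the "failed find a interface for the ip 10.8.0.83" error in the logs.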
There is a KCS article that addresses this:
https://access.redhat.com/solutions/7037425
However, this same configuration works in production on 4.12.
Working cluster:
Infrastructure
--------------
Platform: VSphere
Install Type: IPI
apiServerInternalIP: 10.8.0.73
apiServerInternalIPs: 10.8.0.73
ingressIP: 10.8.0.72
ingressIPs: 10.8.0.72
All internal IP addresses of all nodes match the Machine Network.
Machine Network: 10.8.38.0/23
Node name IP Address Matches CIDR
..............................................................................................................
sb1-prd-ocp-int-qls2m-cp4d-4875s 10.8.38.29 YES
sb1-prd-ocp-int-qls2m-cp4d-phczw 10.8.38.19 YES
sb1-prd-ocp-int-qls2m-cp4d-ql5sj 10.8.38.43 YES
sb1-prd-ocp-int-qls2m-cp4d-svzl7 10.8.38.27 YES
sb1-prd-ocp-int-qls2m-cp4d-x286s 10.8.38.18 YES
sb1-prd-ocp-int-qls2m-cp4d-xk48m 10.8.38.40 YES
sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 YES
sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 YES
sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 YES
sb1-prd-ocp-int-qls2m-worker-njzdx 10.8.38.15 YES
sb1-prd-ocp-int-qls2m-worker-rhqn5 10.8.38.39 YES
Logs from one of the haproxy pods:
2023-08-18T21:12:19.730010034Z time="2023-08-18T21:12:19Z" level=info msg="API is not reachable through HAProxy"
2023-08-18T21:12:19.755357706Z time="2023-08-18T21:12:19Z" level=info msg="Config change detected" configChangeCtr=1 curConfig="{6443 9445 29445 [
2023-08-18T21:12:19.782529185Z time="2023-08-18T21:12:19Z" level=info msg="Removing existing nat PREROUTING rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT"
2023-08-18T21:12:19.794532220Z time="2023-08-18T21:12:19Z" level=info msg="Removing existing nat OUTPUT rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT -o lo"
2023-08-18T21:12:25.816406455Z time="2023-08-18T21:12:25Z" level=info msg="Config change detected" configChangeCtr=2 curConfig="{6443 9445 29445 [{sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 6443}
{sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 6443} {sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}] }"
2023-08-18T21:12:25.919248671Z time="2023-08-18T21:12:25Z" level=info msg="Removing existing nat PREROUTING rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT"
2023-08-18T21:12:25.965663811Z time="2023-08-18T21:12:25Z" level=info msg="Removing existing nat OUTPUT rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT -o lo"
2023-08-18T21:12:32.005310398Z time="2023-08-18T21:12:32Z" level=info msg="Config change detected" configChangeCtr=3 curConfig="{6443 9445 29445 [{sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 6443} {sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 6443}
{sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}
] }"
The traffic is being redirected on the working cluster.
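On the working 4.12 cluster, the monitor keeps the API reachable by installing a nat REDIRECT rule that sends traffic for VIP:6443 to haproxy's local port 9445. The rule arguments shown in the log lines above can be reconstructed with a small sketch (`api_redirect_spec` is a hypothetical helper, not part of any Red Hat tooling):

```python
def api_redirect_spec(vip, api_port=6443, lb_port=9445):
    """Build the iptables nat-rule arguments seen in the haproxy-monitor logs."""
    return (f"--dst {vip} -p tcp --dport {api_port} "
            f"-j REDIRECT --to-ports {lb_port} "
            f"-m comment --comment OCP_API_LB_REDIRECT")

# Matches the PREROUTING spec logged on the working cluster;
# the OUTPUT variant additionally appends "-o lo".
print(api_redirect_spec("10.8.0.73"))
```

On the failing 4.14 cluster this rule is never installed, because the subnet lookup for the VIP fails first.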
Found this in the sos report, in sos_commands/firewall_tables/nft_-a_list_ruleset:
table ip nat { # handle 2
    chain PREROUTING
    chain INPUT { # handle 2
        type nat hook input priority 100; policy accept;
    }
    chain POSTROUTING { # handle 3
        type nat hook postrouting priority srcnat; policy accept;
        counter packets 245475292 bytes 16221809463 jump OVN-KUBE-EGRESS-SVC # handle 25
        oifname "ovn-k8s-mp0" counter packets 58115015 bytes 4184247096 jump OVN-KUBE-SNAT-MGMTPORT # handle 16
        counter packets 187360548 bytes 12037581317 jump KUBE-POSTROUTING # handle 10
    }
    chain OUTPUT { # handle 4
        type nat hook output priority -100; policy accept;
        oifname "lo" meta l4proto tcp ip daddr 10.8.0.73 tcp dport 6443 counter packets 0 bytes 0 redirect to :9445 # handle 67
        counter packets 245122162 bytes 16200621351 jump OVN-KUBE-EXTERNALIP # handle 29
        counter packets 245122163 bytes 16200621411 jump OVN-KUBE-NODEPORT # handle 27
        counter packets 245122166 bytes 16200621591 jump OVN-KUBE-ITP # handle 24
    }
    ... many more lines ...
These rules were not added by the customer.
None of the redirect rules appear in the same file on 4.14 (the failing cluster).
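A quick way to confirm this difference between the two sos reports is to scan the nft_-a_list_ruleset output for the "redirect to :9445" rule that appears only on the working cluster (`has_api_redirect` is a hypothetical helper for illustration):

```python
def has_api_redirect(ruleset_text):
    """True if the nft ruleset shows the VIP redirect to haproxy's port 9445."""
    return "redirect to :9445" in ruleset_text

# Abbreviated OUTPUT-chain lines from the two sos reports.
working = 'oifname "lo" meta l4proto tcp ip daddr 10.8.0.73 tcp dport 6443 counter packets 0 bytes 0 redirect to :9445'
failing = 'counter packets 245122162 bytes 16200621351 jump OVN-KUBE-EXTERNALIP'

print(has_api_redirect(working))  # True
print(has_api_redirect(failing))  # False
```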
Version (if applicable): OCP 4.14
How reproducible: 100%
Steps to Reproduce:
This is the install script that our Ansible job uses to install 4.12. If you need it cleared up let me know; all the items in {{ }} are just variables for file paths.
{code:none}
cp -r {{ item.0.cluster_name }}/install-config.yaml {{ openshift_base }}{{ item.0.cluster_name }}/
./openshift-install create manifests --dir {{ openshift_base }}{{ item.0.cluster_name }}/
cp -r machineconfigs/* {{ openshift_base }}{{ item.0.cluster_name }}/openshift/
cp -r {{ item.0.cluster_name }}/customizations/* {{ openshift_base }}{{ item.0.cluster_name }}/openshift/
./openshift-install create ignition-configs --dir {{ openshift_base }}{{ item.0.cluster_name }}/
./openshift-install create cluster --dir {{ openshift_base }}{{ item.0.cluster_name }} --log-level=debug
{code}
We are installing IPI on VMware. The API and Ingress VIPs are configured on our external load balancer appliance (Citrix ADCs, if that matters).
Actual results:
In 4.14, following the same install workflow, the haproxy pods crashloop and do not work: neither the API nor the Ingress VIP binds to masters or workers.
Expected results:
For 4.12: after a 4.12 install completes, looking in VMware at our master and worker nodes, all of them have an IP address from the machine network assigned to them, and one node each among the masters and the workers has the corresponding VIP bound to it as well.
Additional info:
blocks:
- OCPBUGS-36278 [4.15] haproxy crashlooping fresh install Openshift 4.14.10 (Closed)
clones:
- OCPBUGS-32141 haproxy crashlooping fresh install Openshift 4.14.10 (Closed)
depends on:
- OCPBUGS-32141 haproxy crashlooping fresh install Openshift 4.14.10 (Closed)
is cloned by:
- OCPBUGS-36278 [4.15] haproxy crashlooping fresh install Openshift 4.14.10 (Closed)
links to:
- RHBA-2024:4316 OpenShift Container Platform 4.16.z bug fix update