- Bug
- Resolution: Done-Errata
- Critical
- 4.14.z
- Critical
- No
- Rejected
- x86_64
- False
- Release Note Not Required
- In Progress
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Clone of https://issues.redhat.com/browse/OCPBUGS-32141 for 4.16
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Description of problem:
VIPs are on a different network than the machine network on a 4.14 cluster.
Failing cluster: 4.14
Infrastructure
--------------
Platform: VSphere
Install Type: IPI
apiServerInternalIP: 10.8.0.83
apiServerInternalIPs: 10.8.0.83
ingressIP: 10.8.0.84
ingressIPs: 10.8.0.84
All internal IP addresses of all nodes match the Machine Network.
Machine Network: 10.8.42.0/23
Node name IP Address Matches CIDR
..............................................................................................................
sv1-prd-ocp-int-bn8ln-master-0 10.8.42.24 YES
sv1-prd-ocp-int-bn8ln-master-1 10.8.42.35 YES
sv1-prd-ocp-int-bn8ln-master-2 10.8.42.36 YES
sv1-prd-ocp-int-bn8ln-worker-0-5rbwr 10.8.42.32 YES
sv1-prd-ocp-int-bn8ln-worker-0-h7fq7 10.8.42.49 YES
Logs from one of the haproxy pods:
oc logs -n openshift-vsphere-infra haproxy-sv1-prd-ocp-int-bn8ln-master-0 haproxy-monitor
.....
2024-04-02T18:48:57.534824711Z time="2024-04-02T18:48:57Z" level=info msg="An error occurred while trying to read master nodes details from api-vip:kube-apiserver: failed find a interface for the ip 10.8.0.83"
2024-04-02T18:48:57.534849744Z time="2024-04-02T18:48:57Z" level=info msg="Trying to read master nodes details from localhost:kube-apiserver"
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=error msg="Could not retrieve subnet for IP 10.8.0.83" err="failed find a interface for the ip 10.8.0.83"
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=info msg="GetLBConfig failed, sleep half of interval and retry" kubeconfigPath=/var/lib/kubelet/kubeconfig
2024-04-02T18:49:00.572652095Z time="2024-04-02T18:49:00Z" level=error msg="Could not retrieve subnet for IP 10.8.0.83" err="failed find a interface for the ip 10.8.0.83"
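The error above is consistent with the addresses listed earlier: the API VIP 10.8.0.83 falls outside the machine network 10.8.42.0/23, so no node interface owns a subnet containing it. A minimal sketch of that lookup (illustrative Python, not the actual Go code in the haproxy-monitor; the interface name ens192 is a made-up example):

```python
import ipaddress

def find_interface_for_ip(ip, interface_subnets):
    """Return the first interface whose subnet contains ip, else None."""
    addr = ipaddress.ip_address(ip)
    for name, cidr in interface_subnets.items():
        if addr in ipaddress.ip_network(cidr):
            return name
    return None

# Node interfaces on the failing cluster carry only the machine network
# (ens192 is a hypothetical interface name for illustration).
interfaces = {"ens192": "10.8.42.0/23"}

print(find_interface_for_ip("10.8.42.24", interfaces))  # node IP -> ens192
print(find_interface_for_ip("10.8.0.83", interfaces))   # API VIP -> None
```

With no interface matching the VIP, the monitor cannot compute the VIP's subnet, which is exactly the "failed find a interface for the ip 10.8.0.83" error in the logs.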
There is a KCS article that addresses this:
https://access.redhat.com/solutions/7037425
However, this same configuration works in production on 4.12.
Working cluster:
Infrastructure
--------------
Platform: VSphere
Install Type: IPI
apiServerInternalIP: 10.8.0.73
apiServerInternalIPs: 10.8.0.73
ingressIP: 10.8.0.72
ingressIPs: 10.8.0.72
All internal IP addresses of all nodes match the Machine Network.
Machine Network: 10.8.38.0/23
Node name IP Address Matches CIDR
..............................................................................................................
sb1-prd-ocp-int-qls2m-cp4d-4875s 10.8.38.29 YES
sb1-prd-ocp-int-qls2m-cp4d-phczw 10.8.38.19 YES
sb1-prd-ocp-int-qls2m-cp4d-ql5sj 10.8.38.43 YES
sb1-prd-ocp-int-qls2m-cp4d-svzl7 10.8.38.27 YES
sb1-prd-ocp-int-qls2m-cp4d-x286s 10.8.38.18 YES
sb1-prd-ocp-int-qls2m-cp4d-xk48m 10.8.38.40 YES
sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 YES
sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 YES
sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 YES
sb1-prd-ocp-int-qls2m-worker-njzdx 10.8.38.15 YES
sb1-prd-ocp-int-qls2m-worker-rhqn5 10.8.38.39 YES
Logs from one of the haproxy pods:
2023-08-18T21:12:19.730010034Z time="2023-08-18T21:12:19Z" level=info msg="API is not reachable through HAProxy"
2023-08-18T21:12:19.755357706Z time="2023-08-18T21:12:19Z" level=info msg="Config change detected" configChangeCtr=1 curConfig="{6443 9445 29445 [
2023-08-18T21:12:19.782529185Z time="2023-08-18T21:12:19Z" level=info msg="Removing existing nat PREROUTING rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT"
2023-08-18T21:12:19.794532220Z time="2023-08-18T21:12:19Z" level=info msg="Removing existing nat OUTPUT rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT -o lo"
2023-08-18T21:12:25.816406455Z time="2023-08-18T21:12:25Z" level=info msg="Config change detected" configChangeCtr=2 curConfig="{6443 9445 29445 [{sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 6443}
{sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 6443} {sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}] }"
2023-08-18T21:12:25.919248671Z time="2023-08-18T21:12:25Z" level=info msg="Removing existing nat PREROUTING rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT"
2023-08-18T21:12:25.965663811Z time="2023-08-18T21:12:25Z" level=info msg="Removing existing nat OUTPUT rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT -o lo"
2023-08-18T21:12:32.005310398Z time="2023-08-18T21:12:32Z" level=info msg="Config change detected" configChangeCtr=3 curConfig="{6443 9445 29445 [{sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 6443} {sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 6443}
{sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}
] }"
The traffic is being redirected on the working cluster.
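On the working 4.12 cluster, the monitor keeps the API reachable by installing a nat REDIRECT rule that sends traffic for VIP:6443 to haproxy's local port 9445. The rule arguments shown in the log lines above can be reconstructed with a small sketch (`api_redirect_spec` is a hypothetical helper, not part of any Red Hat tooling):

```python
def api_redirect_spec(vip, api_port=6443, lb_port=9445):
    """Build the iptables nat-rule arguments seen in the haproxy-monitor logs."""
    return (f"--dst {vip} -p tcp --dport {api_port} "
            f"-j REDIRECT --to-ports {lb_port} "
            f"-m comment --comment OCP_API_LB_REDIRECT")

# Matches the PREROUTING spec logged on the working cluster;
# the OUTPUT variant additionally appends "-o lo".
print(api_redirect_spec("10.8.0.73"))
```

On the failing 4.14 cluster this rule is never installed, because the subnet lookup for the VIP fails first.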
Found this in the sos report, in sos_commands/firewall_tables/nft_-a_list_ruleset:
table ip nat { # handle 2
    chain PREROUTING
    chain INPUT { # handle 2
        type nat hook input priority 100; policy accept;
    }
    chain POSTROUTING { # handle 3
        type nat hook postrouting priority srcnat; policy accept;
        counter packets 245475292 bytes 16221809463 jump OVN-KUBE-EGRESS-SVC # handle 25
        oifname "ovn-k8s-mp0" counter packets 58115015 bytes 4184247096 jump OVN-KUBE-SNAT-MGMTPORT # handle 16
        counter packets 187360548 bytes 12037581317 jump KUBE-POSTROUTING # handle 10
    }
    chain OUTPUT { # handle 4
        type nat hook output priority -100; policy accept;
        oifname "lo" meta l4proto tcp ip daddr 10.8.0.73 tcp dport 6443 counter packets 0 bytes 0 redirect to :9445 # handle 67
        counter packets 245122162 bytes 16200621351 jump OVN-KUBE-EXTERNALIP # handle 29
        counter packets 245122163 bytes 16200621411 jump OVN-KUBE-NODEPORT # handle 27
        counter packets 245122166 bytes 16200621591 jump OVN-KUBE-ITP # handle 24
    }
    ... many more lines ...
These rules were not added by the customer.
None of the redirect rules appear in the same file on 4.14 (the failing cluster).
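A quick way to confirm this difference between the two sos reports is to scan the nft_-a_list_ruleset output for the "redirect to :9445" rule that appears only on the working cluster (`has_api_redirect` is a hypothetical helper for illustration):

```python
def has_api_redirect(ruleset_text):
    """True if the nft ruleset shows the VIP redirect to haproxy's port 9445."""
    return "redirect to :9445" in ruleset_text

# Abbreviated OUTPUT-chain lines from the two sos reports.
working = 'oifname "lo" meta l4proto tcp ip daddr 10.8.0.73 tcp dport 6443 counter packets 0 bytes 0 redirect to :9445'
failing = 'counter packets 245122162 bytes 16200621351 jump OVN-KUBE-EXTERNALIP'

print(has_api_redirect(working))  # True
print(has_api_redirect(failing))  # False
```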
Version (if applicable): OCP 4.14
How reproducible: 100%
Steps to Reproduce:
This is the install script that our Ansible job uses to install 4.12. If you need it cleared up let me know; all the items in {{ }} are just variables for file paths.
{code:none}
cp -r {{ item.0.cluster_name }}/install-config.yaml {{ openshift_base }}{{ item.0.cluster_name }}/
./openshift-install create manifests --dir {{ openshift_base }}{{ item.0.cluster_name }}/
cp -r machineconfigs/* {{ openshift_base }}{{ item.0.cluster_name }}/openshift/
cp -r {{ item.0.cluster_name }}/customizations/* {{ openshift_base }}{{ item.0.cluster_name }}/openshift/
./openshift-install create ignition-configs --dir {{ openshift_base }}{{ item.0.cluster_name }}/
./openshift-install create cluster --dir {{ openshift_base }}{{ item.0.cluster_name }} --log-level=debug
{code}
We are installing IPI on VMware. The API and Ingress VIPs are configured on our external load balancer appliance (Citrix ADCs, if that matters).
Actual results:
In 4.14, following the same install workflow, the haproxy pods crashloop and do not work: neither the API nor the Ingress VIP binds to masters or workers.
Expected results:
For 4.12: after a 4.12 install completes, looking in VMware at our master and worker nodes, all of them have an IP address from the machine network assigned to them, and one node each among the masters and the workers has the corresponding VIP bound to it as well.
Additional info:
blocks:
- OCPBUGS-36278 [4.15] haproxy crashlooping fresh install Openshift 4.14.10 (Closed)
clones:
- OCPBUGS-32141 haproxy crashlooping fresh install Openshift 4.14.10 (Closed)
depends on:
- OCPBUGS-32141 haproxy crashlooping fresh install Openshift 4.14.10 (Closed)
is cloned by:
- OCPBUGS-36278 [4.15] haproxy crashlooping fresh install Openshift 4.14.10 (Closed)
links to:
- RHBA-2024:4316 OpenShift Container Platform 4.16.z bug fix update