OpenShift Bugs · OCPBUGS-35743

[4.16] haproxy crashlooping on fresh install of OpenShift 4.14.10


      ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      Clone of https://issues.redhat.com/browse/OCPBUGS-32141 for 4.16
      ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

      Description of problem:
      VIPs are on a different network than the machine network on a 4.14 cluster.

      Failing cluster: 4.14

      Infrastructure
      --------------
      Platform: vSphere
      Install Type: IPI
      apiServerInternalIP: 10.8.0.83
      apiServerInternalIPs: 10.8.0.83
      ingressIP: 10.8.0.84
      ingressIPs: 10.8.0.84

      All internal IP addresses of all nodes match the Machine Network.

      Machine Network: 10.8.42.0/23

      Node name IP Address Matches CIDR
      ..............................................................................................................
      sv1-prd-ocp-int-bn8ln-master-0 10.8.42.24 YES
      sv1-prd-ocp-int-bn8ln-master-1 10.8.42.35 YES
      sv1-prd-ocp-int-bn8ln-master-2 10.8.42.36 YES
      sv1-prd-ocp-int-bn8ln-worker-0-5rbwr 10.8.42.32 YES
      sv1-prd-ocp-int-bn8ln-worker-0-h7fq7 10.8.42.49 YES
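
      The mismatch above can be verified mechanically. A minimal sketch with Python's `ipaddress` module, using the node IPs and VIPs copied from the tables above:

```python
import ipaddress

# Values copied from the failing 4.14 cluster above.
machine_network = ipaddress.ip_network("10.8.42.0/23")
node_ips = ["10.8.42.24", "10.8.42.35", "10.8.42.36", "10.8.42.32", "10.8.42.49"]
vips = {"apiServerInternalIP": "10.8.0.83", "ingressIP": "10.8.0.84"}

# Every node IP is inside the machine network...
for ip in node_ips:
    assert ipaddress.ip_address(ip) in machine_network

# ...but neither VIP is, which is exactly the configuration the KCS flags.
for name, vip in vips.items():
    inside = ipaddress.ip_address(vip) in machine_network
    print(f"{name} {vip} in {machine_network}: {inside}")
```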

      Logs from one of the haproxy pods:

      oc logs -n openshift-vsphere-infra haproxy-sv1-prd-ocp-int-bn8ln-master-0 haproxy-monitor
      .....
      2024-04-02T18:48:57.534824711Z time="2024-04-02T18:48:57Z" level=info msg="An error occurred while trying to read master nodes details from api-vip:kube-apiserver: failed find a interface for the ip 10.8.0.83"
      2024-04-02T18:48:57.534849744Z time="2024-04-02T18:48:57Z" level=info msg="Trying to read master nodes details from localhost:kube-apiserver"
      2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=error msg="Could not retrieve subnet for IP 10.8.0.83" err="failed find a interface for the ip 10.8.0.83"
      2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
      2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=info msg="GetLBConfig failed, sleep half of interval and retry" kubeconfigPath=/var/lib/kubelet/kubeconfig
      2024-04-02T18:49:00.572652095Z time="2024-04-02T18:49:00Z" level=error msg="Could not retrieve subnet for IP 10.8.0.83" err="failed find a interface for the ip 10.8.0.83"

      There is a KCS that addresses this:
      https://access.redhat.com/solutions/7037425

      However, this same configuration works in production on 4.12.

      working cluster:
      Infrastructure
      --------------
      Platform: vSphere
      Install Type: IPI
      apiServerInternalIP: 10.8.0.73
      apiServerInternalIPs: 10.8.0.73
      ingressIP: 10.8.0.72
      ingressIPs: 10.8.0.72

      All internal IP addresses of all nodes match the Machine Network.

      Machine Network: 10.8.38.0/23

      Node name IP Address Matches CIDR
      ..............................................................................................................
      sb1-prd-ocp-int-qls2m-cp4d-4875s 10.8.38.29 YES
      sb1-prd-ocp-int-qls2m-cp4d-phczw 10.8.38.19 YES
      sb1-prd-ocp-int-qls2m-cp4d-ql5sj 10.8.38.43 YES
      sb1-prd-ocp-int-qls2m-cp4d-svzl7 10.8.38.27 YES
      sb1-prd-ocp-int-qls2m-cp4d-x286s 10.8.38.18 YES
      sb1-prd-ocp-int-qls2m-cp4d-xk48m 10.8.38.40 YES
      sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 YES
      sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 YES
      sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 YES
      sb1-prd-ocp-int-qls2m-worker-njzdx 10.8.38.15 YES
      sb1-prd-ocp-int-qls2m-worker-rhqn5 10.8.38.39 YES

      Logs from one of the haproxy pods:

      2023-08-18T21:12:19.730010034Z time="2023-08-18T21:12:19Z" level=info msg="API is not reachable through HAProxy"
      2023-08-18T21:12:19.755357706Z time="2023-08-18T21:12:19Z" level=info msg="Config change detected" configChangeCtr=1 curConfig="{6443 9445 29445 [{sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 6443} {sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 6443} {sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}] }"
      2023-08-18T21:12:19.782529185Z time="2023-08-18T21:12:19Z" level=info msg="Removing existing nat PREROUTING rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT"
      2023-08-18T21:12:19.794532220Z time="2023-08-18T21:12:19Z" level=info msg="Removing existing nat OUTPUT rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT -o lo"
      2023-08-18T21:12:25.816406455Z time="2023-08-18T21:12:25Z" level=info msg="Config change detected" configChangeCtr=2 curConfig="{6443 9445 29445 [{sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 6443} {sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 6443} {sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}] }"
      2023-08-18T21:12:25.919248671Z time="2023-08-18T21:12:25Z" level=info msg="Removing existing nat PREROUTING rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT"
      2023-08-18T21:12:25.965663811Z time="2023-08-18T21:12:25Z" level=info msg="Removing existing nat OUTPUT rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT -o lo"
      2023-08-18T21:12:32.005310398Z time="2023-08-18T21:12:32Z" level=info msg="Config change detected" configChangeCtr=3 curConfig="{6443 9445 29445 [{sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 6443} {sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 6443} {sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}] }"

      The VIP traffic is being redirected on the working cluster. The rules were found in the sosreport, in sos_commands/firewall_tables/nft_-a_list_ruleset:

      table ip nat { # handle 2
          chain PREROUTING { # handle 1
              type nat hook prerouting priority dstnat; policy accept;
              meta l4proto tcp ip daddr 10.8.0.73 tcp dport 6443 counter packets 0 bytes 0 redirect to :9445 # handle 66
              counter packets 82025408 bytes 5088067290 jump OVN-KUBE-ETP # handle 30
              counter packets 82025421 bytes 5088068062 jump OVN-KUBE-EXTERNALIP # handle 28
              counter packets 82025439 bytes 5088069114 jump OVN-KUBE-NODEPORT # handle 26
          }
          chain INPUT { # handle 2
              type nat hook input priority 100; policy accept;
          }
          chain POSTROUTING { # handle 3
              type nat hook postrouting priority srcnat; policy accept;
              counter packets 245475292 bytes 16221809463 jump OVN-KUBE-EGRESS-SVC # handle 25
              oifname "ovn-k8s-mp0" counter packets 58115015 bytes 4184247096 jump OVN-KUBE-SNAT-MGMTPORT # handle 16
              counter packets 187360548 bytes 12037581317 jump KUBE-POSTROUTING # handle 10
          }
          chain OUTPUT { # handle 4
              type nat hook output priority -100; policy accept;
              oifname "lo" meta l4proto tcp ip daddr 10.8.0.73 tcp dport 6443 counter packets 0 bytes 0 redirect to :9445 # handle 67
              counter packets 245122162 bytes 16200621351 jump OVN-KUBE-EXTERNALIP # handle 29
              counter packets 245122163 bytes 16200621411 jump OVN-KUBE-NODEPORT # handle 27
              counter packets 245122166 bytes 16200621591 jump OVN-KUBE-ITP # handle 24
          }

      ... many more lines ...
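
      The two "redirect to :9445" rules (handles 66 and 67) are what make the out-of-subnet VIP work on 4.12: API traffic addressed to the VIP on port 6443 is NAT-redirected to the local haproxy frontend on 9445. A toy model of that port mapping, assuming those fixed ports:

```python
# Hypothetical model of the two nftables REDIRECT rules shown above:
# packets destined for the API VIP on 6443 are redirected to the local
# haproxy frontend port 9445; all other traffic is untouched.
API_VIP = "10.8.0.73"

def nat_dport(daddr, dport):
    """Destination port after NAT, per the PREROUTING/OUTPUT redirect rules."""
    if daddr == API_VIP and dport == 6443:
        return 9445  # redirect to :9445 (handles 66 and 67)
    return dport

print(nat_dport("10.8.0.73", 6443))   # VIP traffic: redirected to 9445
print(nat_dport("10.8.38.24", 6443))  # direct-to-master traffic: unchanged
```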

      These rules were not added by the customer.

      None of these redirect rules appear in the same file for 4.14 (the failing cluster).

      OCP version (if applicable): 4.14
      How reproducible: 100%

      Steps to Reproduce:
      This is the install script that our ansible job uses to install 4.12
      
      If you need it cleared up, let me know; all the items in {{ }} are just Ansible variables for file paths.
      
      cp -r {{  item.0.cluster_name }}/install-config.yaml {{ openshift_base }}{{  item.0.cluster_name }}/
      ./openshift-install create manifests --dir {{ openshift_base }}{{  item.0.cluster_name }}/
      cp -r machineconfigs/* {{ openshift_base }}{{  item.0.cluster_name }}/openshift/
      cp -r {{  item.0.cluster_name }}/customizations/* {{ openshift_base }}{{  item.0.cluster_name }}/openshift/
      ./openshift-install create ignition-configs --dir {{ openshift_base }}{{  item.0.cluster_name }}/
      ./openshift-install create cluster --dir {{ openshift_base }}{{  item.0.cluster_name }} --log-level=debug
      
      We are installing IPI on VMware.
      
      API and Ingress VIPs are configured on our external load balancer appliance (Citrix ADCs, if that matters).
      
      
          

      Actual results:

      haproxy pods crashloop and do not work. In 4.14, following the same install workflow, neither the API nor the Ingress VIP binds to the masters or workers, and we see haproxy crashlooping.

      Expected results:

      For 4.12: after a completed 4.12 install, looking in VMware at our master and worker nodes, all of them have an IP address from the machine network assigned, and one node from the masters and one from the workers also have the respective VIP bound to them.

      Additional info:

      
          

            mkowalsk@redhat.com Mat Kowalski
            rhn-support-brstone Brian Stone
            Zhanqi Zhao