OCPBUGS-18771

Keepalived pods crash and fail to start on worker nodes (Ingress VIP)

    • Moderate
    • No
    • Rejected
    • False
    • Previously, new logic introduced for determining where to run the Keepalived process did not consider the ingress VIP or VIPs. As a result, the Keepalived pods might not have run on ingress nodes, which could break the cluster. With this fix, the logic now includes the ingress VIP or VIPs, and the Keepalived pods should always be available. (link:https://issues.redhat.com/browse/OCPBUGS-18771[*OCPBUGS-18771*])
    • Bug Fix
    • Done

      Description of problem:

      A customer reported that keepalived pods crash and fail to start on worker nodes (Ingress VIP). The expectation is that the keepalived pod (labeled app=kni-infra-vrrp) starts. This affects everyone using OCP v4.13 together with an Ingress VIP and could be a bug in the nodeip-configuration service in v4.13.

      Details:

      -> There are two problems in OCP v4.13: the regular expression does not match, and the chroot command fails because of missing ldd libraries inside the container. This has been fixed in 4.14, but not in 4.13.

      -> The nodeip-configuration service creates the /run/nodeip-configuration/remote-worker file based on onPremPlatformAPIServerInternalIPs (apiVIP) and ignores onPremPlatformIngressIPs (ingressVIP), as can be seen in the source code.

      -> The keepalived process then does not start because the remote-worker file exists.

      -> The liveness probes fail because the keepalived process does not exist.
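
      The chain above can be illustrated with a small Go sketch. It assumes that the remote-worker decision reduces to checking whether any of the supplied VIPs falls inside one of the node's local subnets; the report does not spell out the exact check, so this is an illustration of the symptom, not the runtimecfg implementation. The function name isRemoteWorker and all addresses are hypothetical.

```go
package main

import (
	"fmt"
	"net"
)

// isRemoteWorker is a hypothetical helper, not the real runtimecfg code.
// It reports true when none of the given VIPs lies inside any of the
// node's local subnets (given as interface addresses in CIDR notation).
func isRemoteWorker(nodeCIDRs, vips []string) (bool, error) {
	for _, c := range nodeCIDRs {
		_, subnet, err := net.ParseCIDR(c)
		if err != nil {
			return false, fmt.Errorf("parse node CIDR %q: %w", c, err)
		}
		for _, v := range vips {
			ip := net.ParseIP(v)
			if ip == nil {
				return false, fmt.Errorf("parse VIP %q", v)
			}
			if subnet.Contains(ip) {
				// At least one VIP is reachable on a local subnet.
				return false, nil
			}
		}
	}
	return true, nil
}

func main() {
	// Made-up example: a worker that shares a subnet with the ingress VIP
	// but not with the API VIP.
	nodeCIDRs := []string{"192.0.2.10/24"}
	apiVIPs := []string{"198.51.100.5"}
	ingressVIPs := []string{"192.0.2.7"}

	apiOnly, err := isRemoteWorker(nodeCIDRs, apiVIPs)
	if err != nil {
		panic(err)
	}
	allVIPs := append(append([]string{}, apiVIPs...), ingressVIPs...)
	both, err := isRemoteWorker(nodeCIDRs, allVIPs)
	if err != nil {
		panic(err)
	}

	fmt.Println("API VIPs only, treated as remote worker:", apiOnly)
	fmt.Println("API and ingress VIPs, treated as remote worker:", both)
}
```

      Under these assumptions, a node that hosts only the ingress VIP is classified as a remote worker when just the API VIPs are passed in, which matches the reported symptom of the remote-worker file being created and keepalived not starting.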

      The fix is quite simple (as highlighted by the customer): the nodeip-configuration.service template needs to be extended to consider the Ingress VIPs as well. This is the source code where the changes need to be made.

      As the following code snippet shows, the node-ip command ranges only over onPremPlatformAPIServerInternalIPs and ignores onPremPlatformIngressIPs.

      node-ip \
          set \
          --platform {{ .Infra.Status.PlatformStatus.Type }} \
          {{if not (isOpenShiftManagedDefaultLB .) -}}
          --user-managed-lb \
          {{end -}}
          {{if or (eq .IPFamilies "IPv6") (eq .IPFamilies "DualStackIPv6Primary") -}}
          --prefer-ipv6 \
          {{end -}}
          --retry-on-failure \
          {{ range onPremPlatformAPIServerInternalIPs . }}{{.}} {{end}}; \
          do \
          sleep 5; \
          done" 
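
      One way the proposed change could look is a second range over the ingress VIPs on the same argument line. The Go sketch below renders only that line with text/template. The template function names are taken from this report, but their stub implementations, the config type, and the addresses are assumptions for illustration; this is not the merged fix.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// config is a hypothetical stand-in for the render context; the real
// template receives a much richer object.
type config struct {
	APIVIPs     []string
	IngressVIPs []string
}

// vipArgs mirrors only the VIP argument line of the unit, extended with a
// second range over the ingress VIPs.
const vipArgs = `{{ range onPremPlatformAPIServerInternalIPs . }}{{.}} {{end}}` +
	`{{ range onPremPlatformIngressIPs . }}{{.}} {{end}}`

// renderVIPArgs renders the argument line using stubbed template functions.
func renderVIPArgs(c config) (string, error) {
	funcs := template.FuncMap{
		// Stubs: the real helpers read the values from the cluster
		// infrastructure status.
		"onPremPlatformAPIServerInternalIPs": func(c config) []string { return c.APIVIPs },
		"onPremPlatformIngressIPs":           func(c config) []string { return c.IngressVIPs },
	}
	t, err := template.New("vips").Funcs(funcs).Parse(vipArgs)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, c); err != nil {
		return "", err
	}
	// Each address is followed by a space; trim the trailing one.
	return strings.TrimSpace(sb.String()), nil
}

func main() {
	out, err := renderVIPArgs(config{
		APIVIPs:     []string{"198.51.100.5"},
		IngressVIPs: []string{"192.0.2.7"},
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out)
}
```

      With both ranges in place, the node-ip command receives the API and ingress VIPs together, so a node that hosts only the ingress VIP is no longer evaluated against the API VIPs alone.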

      The difference between OCP v4.12 and v4.13 in keepalived pod behavior is also shown in the attached image.

      Version-Release number of selected component (if applicable):

      v4.13

      Actual results:

      The keepalived pods crash and fail to start on worker nodes (Ingress VIP).

      Expected results:

      The expectation is that the keepalived pod (labeled by app=kni-infra-vrrp) should start.

      Additional info:

       

              bnemec@redhat.com Benjamin Nemec
              rhn-support-mmarkand Mridul Markandey
              Zhanqi Zhao Zhanqi Zhao