OpenShift Bugs: OCPBUGS-20025

Keepalived pods crash and fail to start on worker nodes (Ingress VIP)

      This is a clone of issue OCPBUGS-18771. The following is the description of the original issue:

      Description of problem:

      The customer reported that keepalived pods crash and fail to start on worker nodes (Ingress VIP). The expectation is that the keepalived pod (labeled app=kni-infra-vrrp) should start. This affects everyone using OCP v4.13 together with an Ingress VIP and points to a potential bug in the nodeip-configuration service in v4.13.

      More details as below:

      -> There are 2 problems in OCP v4.13: the regular expression won't match, and the chroot command will fail because of missing ldd libraries inside the container. This has been fixed in 4.14, but not in 4.13.

      -> The nodeip-configuration service creates the /run/nodeip-configuration/remote-worker file based on onPremPlatformAPIServerInternalIPs (apiVIP) and ignores onPremPlatformIngressIPs (ingressVIP), as can be seen in the source code.

      -> The keepalived process then won't start because the remote-worker file exists.

      -> The liveness probes will fail because the keepalived process does not exist.

      The fix is quite simple (as highlighted by the customer): the nodeip-configuration.service template needs to be extended to consider the Ingress VIPs as well. This is the source code where the changes need to be made.

      As the following code snippet shows, the node-ip command ranges only over onPremPlatformAPIServerInternalIPs and ignores onPremPlatformIngressIPs.

      node-ip \
          set \
          --platform {{ .Infra.Status.PlatformStatus.Type }} \
          {{if not (isOpenShiftManagedDefaultLB .) -}}
          --user-managed-lb \
          {{end -}}
          {{if or (eq .IPFamilies "IPv6") (eq .IPFamilies "DualStackIPv6Primary") -}}
          --prefer-ipv6 \
          {{end -}}
          --retry-on-failure \
          {{ range onPremPlatformAPIServerInternalIPs . }}{{.}} {{end}}; \
          do \
          sleep 5; \
          done" 
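
      The failure chain and the effect of the proposed change can be modelled roughly as follows. This is a simplified Python sketch, not the actual nodeip-configuration code: it assumes a node is marked as a remote worker when none of the VIPs passed to node-ip lies in a subnet configured on that node. The function names and the addresses (documentation ranges) are hypothetical.

```python
import ipaddress
from typing import Iterable


def is_remote_worker(node_subnets: Iterable[str], vips: Iterable[str]) -> bool:
    """Assumed rule (not taken from the real code): a node is a remote
    worker if no VIP falls inside any subnet configured on the node."""
    networks = [ipaddress.ip_network(s) for s in node_subnets]
    addresses = [ipaddress.ip_address(v) for v in vips]
    return not any(addr in net for addr in addresses for net in networks)


def keepalived_starts(remote_worker_file_exists: bool) -> bool:
    """Per the report, keepalived does not start when the
    remote-worker file exists."""
    return not remote_worker_file_exists


# Hypothetical worker that shares a subnet with the Ingress VIP only.
worker_subnets = ["198.51.100.0/24"]
api_vips = ["192.0.2.10"]
ingress_vips = ["198.51.100.10"]

# v4.13 template as quoted: only the API VIPs are passed to node-ip.
flagged_413 = is_remote_worker(worker_subnets, api_vips)

# Proposed change: the Ingress VIPs are passed as well.
flagged_fixed = is_remote_worker(worker_subnets, api_vips + ingress_vips)

print("v4.13:    remote-worker =", flagged_413,
      "| keepalived starts =", keepalived_starts(flagged_413))
print("proposed: remote-worker =", flagged_fixed,
      "| keepalived starts =", keepalived_starts(flagged_fixed))
```

      Under these assumptions, the worker is flagged as a remote worker only when the Ingress VIP is left out of the argument list, which matches the behaviour described in this report.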

      The difference between OCP v4.12 and v4.13 related to the keepalived pod is also shown in the attached image.

      Version-Release number of selected component (if applicable):

      v4.13

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      The keepalived pods crash and fail to start on worker nodes (Ingress VIP).

      Expected results:

      The expectation is that the keepalived pod (labeled by app=kni-infra-vrrp) should start.

      Additional info:

       

            bnemec@redhat.com Benjamin Nemec
            openshift-crt-jira-prow OpenShift Prow Bot
            Zhanqi Zhao Zhanqi Zhao