OpenShift Bugs / OCPBUGS-45978

openshift-router times out when the IPI keepalived pod runs the chk_ingress VRRP script

    • Quality / Stability / Reliability
    • False
    • Important
    • Rejected

      Description of problem:

      • Customer is running OCP 4.12.53 Baremetal IPI.
      • Occasionally, the chk_ingress VRRP script run by the integrated keepalived that manages the ingress VIP fails because the `timeout 0.9` wrapper expires (exit status 124):
      vrrp_script chk_ingress {
          script "/usr/bin/timeout 0.9 /usr/bin/curl -o /dev/null -Lfs http://localhost:1936/healthz/ready"
          interval 1
          weight 20
          rise 3
          fall 2
      }
      2024-12-04T05:38:26.658757185+00:00 stderr F Wed Dec  4 05:38:26 2024: VRRP_Script(chk_ingress) failed (exited with status 124)
      • When this happens, the openshift-router process appears to freeze for roughly 1.5s (the strace excerpt below shows no syscall activity between 05:38:24.766 and 05:38:26.415):
      2998696 05:38:24.766046 close(15)       = 0
      2998688 05:38:24.766091 <... nanosleep resumed>NULL) = 0
      2998696 05:38:24.766107 futex(0xc000588148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
      2998688 05:38:24.766118 futex(0x301d998, FUTEX_WAIT_PRIVATE, 0, {tv_sec=4, tv_nsec=155358380} <unfinished ...>
      2998692 05:38:26.415706 <... epoll_pwait resumed>[{events=EPOLLIN, data={u32=1197311848, u64=140644196386664}}], 128, 4155, NULL, 3813427388) = 1
      2998692 05:38:26.415804 futex(0x301d998, FUTEX_WAKE_PRIVATE, 1) = 1
      2998692 05:38:26.415859 accept4(7, {sa_family=AF_INET6, sin6_port=htons(45398), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 15
      2998692 05:38:26.415933 epoll_ctl(4, EPOLL_CTL_ADD, 15, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=1197309928, u64=140644196384744}} <unfinished ...>
      • We checked the metrics and found no throttling; the customer reports that the clusters affected by this issue are mostly idle.
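
The stall in the strace excerpt above can be located mechanically by scanning for large gaps between consecutive timestamps. The following is a minimal sketch, assuming the trace was captured with per-line PID and microsecond timestamps in the same layout as the excerpt. The name `find_gaps` and the 0.9 s default threshold (chosen to match `timeout 0.9` in the check) are illustrative, not part of any existing tool.

```python
import re
from datetime import datetime

# Matches "PID HH:MM:SS.micros rest-of-line", the layout seen in the excerpt.
LINE_RE = re.compile(r"^\s*(\d+)\s+(\d{2}:\d{2}:\d{2}\.\d{6})\s+(.*)$")


def find_gaps(lines, threshold=0.9):
    """Return (gap_seconds, previous_line, next_line) for every gap above threshold."""
    gaps = []
    prev_ts = None
    prev_line = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not a timestamped strace line
        ts = datetime.strptime(match.group(2), "%H:%M:%S.%f")
        if prev_ts is not None:
            delta = (ts - prev_ts).total_seconds()
            if delta < 0:
                # Timestamps carry no date, so treat a negative step as a midnight wrap.
                delta += 86400
            if delta > threshold:
                gaps.append((delta, prev_line.strip(), line.strip()))
        prev_ts, prev_line = ts, line
    return gaps


# First lines of the excerpt from this report, copied verbatim.
EXCERPT = """\
2998696 05:38:24.766046 close(15)       = 0
2998688 05:38:24.766091 <... nanosleep resumed>NULL) = 0
2998696 05:38:24.766107 futex(0xc000588148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
2998688 05:38:24.766118 futex(0x301d998, FUTEX_WAIT_PRIVATE, 0, {tv_sec=4, tv_nsec=155358380} <unfinished ...>
2998692 05:38:26.415706 <... epoll_pwait resumed>[{events=EPOLLIN, data={u32=1197311848, u64=140644196386664}}], 128, 4155, NULL, 3813427388) = 1
"""

for gap, before, after in find_gaps(EXCERPT.splitlines()):
    print(f"{gap:.3f}s gap between:\n  {before}\n  {after}")
```

Run against a full trace file, this lists every pause longer than the check's timeout, which makes it easier to correlate stalls with the keepalived failure timestamps.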

      Version-Release number of selected component (if applicable):

          

      How reproducible:

      In customer environment    

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

          The openshift-router process fails the check run by the keepalived pods because of a timeout, which causes the ingress VIP to flap.
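
To reason about how individual timeouts translate into state changes, the `rise 3` / `fall 2` settings from the `chk_ingress` block can be modelled as a small state machine. This is a simplified sketch of the counter semantics as far as the keepalived documentation describes them (a script is marked down after `fall` consecutive failures and up again after `rise` consecutive successes). It is not keepalived's actual implementation, and the names `simulate` and `start_ok` are hypothetical.

```python
def simulate(results, rise=3, fall=2, start_ok=True):
    """Model the script state after each run.

    results: iterable of bools, True meaning the check exited with status 0.
    Returns a list with the modelled state (True = up) after every run.
    """
    ok = start_ok
    streak = 0  # consecutive results that contradict the current state
    states = []
    for passed in results:
        if passed == ok:
            streak = 0
        else:
            streak += 1
            needed = fall if ok else rise
            if streak >= needed:
                ok = not ok
                streak = 0
        states.append(ok)
    return states


# One isolated timeout: the modelled state never leaves "up".
print(simulate([True, False, True, True]))

# Two consecutive timeouts: the state goes down and needs three passes to recover.
print(simulate([True, False, False, True, True, True]))
```

Under this model, a single timed-out run does not change state, while two consecutive ones do, after which three consecutive passes are needed to recover. With a positive `weight`, the VRRP priority is raised by that amount only while the script is considered up, so each down period lowers the node's effective priority.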

      Expected results:

           The openshift-router process should be able to reply to the curl request in less than 0.9s during normal operation.
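
To measure how often the endpoint misses this budget independently of keepalived, a probe loop can mirror the check and record latency. This is a rough sketch under stated assumptions: `probe` and `watch` are hypothetical helper names, and `urllib`'s `timeout` applies to each blocking socket operation rather than to total wall-clock time, so it only approximates `/usr/bin/timeout 0.9`. Like `curl -Lf`, `urlopen` follows redirects and treats HTTP error statuses as failures.

```python
import http.client
import time
import urllib.request


def probe(url, timeout=0.9):
    """Perform one GET and return (ok, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            ok = 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        ok = False
    return ok, time.monotonic() - start


def watch(url, interval=1.0, timeout=0.9, count=None):
    """Probe every `interval` seconds and print runs that fail or exceed the budget."""
    runs = 0
    while count is None or runs < count:
        ok, elapsed = probe(url, timeout)
        if not ok or elapsed >= timeout:
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp} slow or failed probe: ok={ok} elapsed={elapsed:.3f}s")
        runs += 1
        time.sleep(interval)
```

Calling `watch("http://localhost:1936/healthz/ready")` on the node that hosts the router uses the same URL and 1 s interval as the `chk_ingress` script. The printed timestamps can then be compared with the `VRRP_Script(chk_ingress) failed` log entries.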

      Additional info:

          

              mmasters1@redhat.com Miciah Masters
              fcristin1@redhat.com Francesco Cristini
              Melvin Joseph Melvin Joseph