OpenShift Bugs / OCPBUGS-67191

Intermittent external connectivity failures (EOF/TLS errors) due to ovn-controller releasing/claiming lports unexpectedly on OCP 4.18/4.19 (ARO)


    • Incidents & Support
    • Important
    • Production

      Description of problem:

      Customer is experiencing intermittent external connectivity failures after upgrading to OCP 4.18 and 4.19 (ARO environment). The issue affects any HTTPS traffic going through the cluster (ArgoCD CLI, oc login --web, console login, and API calls). Approximately every second or third request fails with EOF, SSL_ERROR_SYSCALL, or TCP resets.

      Network must-gather and OVN logs show abnormal ovn-controller behavior, including:

      • Frequent memory_trim cycles every 30 seconds, suggesting event backlog or inactivity.
      • ovn-controller unexpectedly releasing and reclaiming logical ports (lports) for active pods.
      • Transitions of lport state to down and up without pod restarts.
      • Possible temporary desync with Southbound DB.
      • Inconsistent NAT/SNAT rules observed during failing windows.

      These symptoms cause intermittent loss of routing/NAT flows and unstable external connectivity.

      This does NOT match the behavior of OCPBUGS-16217 exactly and appears to be either a regression in OVN-Kubernetes for 4.18/4.19 or a new bug introduced in OVN-IC for these versions.

      Impact is high: the customer’s CI/CD pipelines are blocked, ArgoCD cannot be used reliably, and user login flows intermittently fail.

      Engineering support is needed to analyze ovn-controller behavior and determine whether this is a regression or a new OVN-Kubernetes issue.

      Version-Release number of selected component (if applicable):

      How reproducible:

      The issue is consistently reproducible.
      Any external HTTPS/API request made to the cluster (ArgoCD CLI, oc login --web, or console access) fails intermittently but predictably, with approximately 1 out of every 2–3 requests returning:

      EOF
      rpc error: code = Unknown desc = EOF
      SSL_ERROR_SYSCALL
      or TCP connection resets (RST)

      The behavior is reproducible from:

      • Azure DevOps agents
      • Customer local environment
      • Multiple networks and clients

      Reproduction is also observed when repeatedly running:

      argocd login
      argocd app get <app>
      oc login --web
      curl -vk https://<gitops-server-route>/

      Although any single request may succeed, failures appear in every run of several consecutive requests, so the issue is fully reproducible.
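To put a number on the failure rate described above, a small helper can repeat one request and count the runs that exit non-zero. This is a minimal sketch: the function name count_failures and the ROUTE_URL variable are illustrative and not part of the original report, and the route placeholder must be filled in by whoever runs it.

```shell
#!/bin/sh
# Run a command N times and print how many runs exited non-zero.
# Usage: count_failures <runs> <command> [args...]
# The body runs in a subshell so its variables do not leak to the caller.
count_failures() (
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard output; only the exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures"
)

# Example (ROUTE_URL is a placeholder for the affected route):
#   ROUTE_URL="https://<gitops-server-route>/"
#   count_failures 30 curl -sk --max-time 10 -o /dev/null "$ROUTE_URL"
```

Because curl exits non-zero on TLS handshake failures, empty replies, and connection resets, the printed count divided by the number of runs approximates the share of failing requests.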

      Steps to Reproduce:

      1. Deploy or upgrade an OpenShift cluster to 4.18.x or 4.19.x (ARO).

      2. Ensure OVN-Kubernetes is configured with standard routing and external access patterns (default for ARO).

      3. From an external client (Azure DevOps agent or local workstation), run repeated HTTPS/API requests against a cluster route or API endpoint, for example:

      argocd login <gitops-url>
      argocd app get <app>
      oc login --web
      curl -vk https://openshift-gitops-server-openshift-gitops.apps.<cluster>/

      4. Observe that approximately every second or third request fails with:

      EOF
      rpc error: code = Unknown desc = Post ... EOF
      SSL_ERROR_SYSCALL
      TCP resets

      5. Check OVN logs on an affected node and observe:

      Frequent memory_trim events
      ovn-controller releasing and reclaiming lports unexpectedly
      lport state transitions (Setting lport ... down)
      Signs of intermittent desync with Southbound DB
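The release/claim churn listed above can be summarized per logical port by filtering the ovn-controller log. The sketch below is hedged: the function name lport_flaps is illustrative, and the matched phrases ("Releasing lport", "Claiming lport", "Setting lport ... down") are assumed from the wording in this report, so they may need adjusting to the exact log text on the affected nodes.

```shell
#!/bin/sh
# Read ovn-controller log lines on stdin and print, per logical port,
# how many release/claim/down events were seen, busiest port first.
# Assumption: the lport name is the word that follows "lport" in each line.
lport_flaps() {
  awk '
    /Releasing lport|Claiming lport|Setting lport .* down/ {
      for (i = 1; i < NF; i++) {
        if ($i == "lport") { count[$(i + 1)]++; break }
      }
    }
    END { for (p in count) print count[p], p }
  ' | sort -rn
}

# Example (pod name is a placeholder; the namespace and container name
# are assumptions about the cluster layout):
#   oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod> -c ovn-controller | lport_flaps
```

A port that shows many events without a corresponding pod restart is a candidate for the unexpected release/claim behavior described in this report.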

      Removing and reapplying the k8s.ovn.org/egress-assignable label temporarily restores stability, but the issue returns after node or ovnkube-node restarts.
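For reference, the temporary workaround above (removing and reapplying the label) might look like the following. This is only a sketch of the relabeling step, not a fix: the function name relabel_egress_node and the DRY_RUN switch are illustrative additions, and the node name is supplied by the caller.

```shell
#!/bin/sh
# Remove and re-add the k8s.ovn.org/egress-assignable label on one node.
# Set DRY_RUN=1 to print the oc commands instead of running them.
# The body runs in a subshell so its variables do not leak to the caller.
relabel_egress_node() (
  node=$1
  run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then
    run="echo"
  fi
  # A trailing "-" on the label key removes the label.
  $run oc label node "$node" k8s.ovn.org/egress-assignable-
  # Re-add the label with an empty value.
  $run oc label node "$node" k8s.ovn.org/egress-assignable=""
)

# Example (node name is a placeholder):
#   DRY_RUN=1 relabel_egress_node <node-name>
```

As noted in the report, the effect does not survive node or ovnkube-node restarts, so this is useful mainly for confirming the correlation with the label.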

              bbennett@redhat.com Ben Bennett
              rhn-support-ravellan Ronald Avellaneda
              Anurag Saxena Anurag Saxena