Bug | Resolution: Cannot Reproduce | Major | 4.19 | Incidents & Support | False | Important | All | Production
Description of problem:
Customer is experiencing intermittent external connectivity failures after upgrading to OCP 4.18 and 4.19 (ARO environment). The issue affects any HTTPS traffic going through the cluster (ArgoCD CLI, oc login --web, console login, and API calls). Approximately every second or third request fails with EOF, SSL_ERROR_SYSCALL, or TCP resets.
Network must-gather and OVN logs show abnormal ovn-controller behavior, including:
- Frequent memory_trim cycles every 30 seconds, suggesting event backlog or inactivity.
- ovn-controller unexpectedly releasing and reclaiming logical ports (lports) for active pods.
- Transitions of lport state to down and up without pod restarts.
- Possible temporary desync with Southbound DB.
- Inconsistent NAT/SNAT rules observed during failing windows.
These symptoms cause intermittent loss of routing/NAT flows and unstable external connectivity.
This does NOT match the behavior of OCPBUGS-16217 exactly and appears to be either a regression in OVN-Kubernetes for 4.18/4.19 or a new bug introduced in OVN-IC for these versions.
Impact is high: the customer’s CI/CD pipelines are blocked, ArgoCD cannot be used reliably, and user login flows intermittently fail.
Engineering support is needed to analyze ovn-controller behavior and determine whether this is a regression or a new OVN-Kubernetes issue.
Version-Release number of selected component (if applicable):
How reproducible:
The issue is consistently reproducible.
Any external HTTPS/API request made to the cluster (ArgoCD CLI, oc login --web, or console access) fails intermittently but predictably, with approximately 1 out of every 2–3 requests returning:
- EOF
- rpc error: code = Unknown desc = EOF
- SSL_ERROR_SYSCALL
- TCP connection resets (RST)
The behavior is reproducible from:
- Azure DevOps agents
- Customer local environment
- Multiple networks and clients
Reproduction is also observed when repeatedly running:
argocd login
argocd app get <app>
oc login --web
curl -vk https://<gitops-server-route>/
Thus, while each individual request may succeed, the intermittent failures occur every time a sequence of multiple requests is made, making the issue fully reproducible.
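To put a number on the "1 out of every 2–3 requests" observation, the repeated-request check can be sketched as a small Python 3 probe run from the external client. This is only a rough sketch under assumptions: the function names `classify` and `probe`, the failure labels, and the default attempt count are illustrative and not part of the customer's tooling; certificate verification is disabled to mirror `curl -k`.

```python
import ssl
import urllib.error
import urllib.request
from collections import Counter


def classify(exc):
    """Map an exception to a coarse failure label (labels are illustrative)."""
    if isinstance(exc, urllib.error.HTTPError):
        # The server returned an HTTP status, so the connection itself worked.
        return "http_response"
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        # urllib wraps connect/handshake errors; unwrap to the underlying cause.
        exc = exc.reason
    if isinstance(exc, ssl.SSLError):  # includes ssl.SSLEOFError
        return "tls_error"
    if isinstance(exc, ConnectionResetError):  # includes RemoteDisconnected
        return "tcp_reset"
    return "other"


def probe(url, attempts=30, timeout=10.0, fetch=None):
    """Request `url` `attempts` times and tally the outcomes by label."""
    if fetch is None:
        # Assumption: skip certificate verification, as `curl -k` does.
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE

        def fetch(u):
            with urllib.request.urlopen(u, timeout=timeout, context=ctx) as resp:
                resp.read()

    tally = Counter()
    for _ in range(attempts):
        try:
            fetch(url)
            tally["ok"] += 1
        except Exception as exc:
            tally[classify(exc)] += 1
    return tally
```

Calling `probe("https://<gitops-server-route>/")` with the real route substituted returns a `Counter` of outcomes; a healthy path would be expected to show only `ok` (or `http_response`) entries.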
Steps to Reproduce:
1. Deploy or upgrade an OpenShift cluster to 4.18.x or 4.19.x (ARO).
2. Ensure OVN-Kubernetes is configured with standard routing and external access patterns (default for ARO).
3. From an external client (Azure DevOps agent or local workstation), run repeated HTTPS/API requests against a cluster route or API endpoint, for example:
   argocd login <gitops-url>
   argocd app get <app>
   oc login --web
   curl -vk https://openshift-gitops-server-openshift-gitops.apps.<cluster>/
4. Observe that approximately every second or third request fails with:
   - EOF
   - rpc error: code = Unknown desc = Post ... EOF
   - SSL_ERROR_SYSCALL
   - TCP resets
5. Check OVN logs on an affected node and observe:
   - Frequent memory_trim events
   - ovn-controller releasing and reclaiming lports unexpectedly
   - lport state transitions (Setting lport ... down)
   - Signs of intermittent desync with Southbound DB
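The log check in the last step can be made less manual with a small scan over the ovn-controller log text. The following Python 3 sketch counts memory_trim lines and per-lport claim/release/down/up events. The message patterns are assumptions based on typical ovn-controller binding messages (only "Setting lport ... down" is quoted in this report) and may need adjusting to the actual log output; `summarize` and `released_and_claimed` are hypothetical helper names.

```python
import re
from collections import Counter, defaultdict

# Assumed message shapes; verify against the real ovn-controller log.
PATTERNS = {
    "claim": re.compile(r"Claiming lport (\S+) for this chassis"),
    "release": re.compile(r"Releasing lport (\S+) from this chassis"),
    "down": re.compile(r"Setting lport (\S+) down"),
    "up": re.compile(r"Setting lport (\S+) up"),
}
MEMORY_TRIM = re.compile(r"memory_trim")


def summarize(lines):
    """Return (memory_trim line count, per-lport Counter of event types)."""
    per_port = defaultdict(Counter)
    trims = 0
    for line in lines:
        if MEMORY_TRIM.search(line):
            trims += 1
            continue
        for event, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                per_port[match.group(1)][event] += 1
                break
    return trims, per_port


def released_and_claimed(per_port):
    """List lports that were both released and claimed within the log window."""
    return sorted(
        port
        for port, events in per_port.items()
        if events["release"] >= 1 and events["claim"] >= 1
    )
```

Feeding it the lines of a saved log file (for example `summarize(open(path))`, with `path` pointing at a copy of the ovn-controller log) gives a quick view of which lports were released and claimed while their pods stayed up.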
Removing and reapplying the k8s.ovn.org/egress-assignable label temporarily restores stability, but the issue returns after node or ovnkube-node restarts.
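For reference, the temporary workaround can be sketched as two `oc label node` invocations (remove the label, then set it again with an empty value). This Python 3 wrapper is an illustration only: the helper names `relabel_commands` and `reapply_label` are hypothetical, the node name is supplied by the caller, and it assumes `oc` is on the PATH and already logged in with rights to label nodes.

```python
import subprocess

EGRESS_LABEL = "k8s.ovn.org/egress-assignable"


def relabel_commands(node):
    """Build the two oc invocations: remove the label, then set it again."""
    return [
        # A trailing "-" after the key removes the label from the node.
        ["oc", "label", "node", node, f"{EGRESS_LABEL}-"],
        # "key=" sets the label with an empty value.
        ["oc", "label", "node", node, f"{EGRESS_LABEL}="],
    ]


def reapply_label(node, run=subprocess.run):
    """Run both commands in order, stopping if either one fails."""
    for argv in relabel_commands(node):
        run(argv, check=True)
```

As noted above, this only restores stability until the next node or ovnkube-node restart, so it is useful for confirming the symptom rather than as a fix.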