Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.16.0
Affects Version/s: 4.14, 4.15, 4.16
Component/s: Networking / multus
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
Important
Regression:
No

Target Backport Versions:
None
Target Version:

4.16.0
Release Blocker:
Rejected
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

When templating in-cluster resources in HyperShift, the cluster-network-operator does not use the node-local address of the client-side haproxy load balancer that runs on every node. This bypasses the health checks that the local kube-apiserver-proxy pods, which run on every node in a HyperShift environment, perform against the redundant backend apiserver addresses. In environments where the backend API servers are not fronted by an additional cloud load balancer, a percentage of requests from in-cluster components fail when a control plane endpoint goes down, even if other endpoints are available.
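The address selection described above can be sketched in Go as follows. This is only an illustration, not the operator's actual code: the `APIServerEndpoint` type, the `chooseEndpoint` function, the host name `api.example.invalid`, and the node-local address `172.20.0.1:6443` are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"net"
)

// APIServerEndpoint is a hypothetical holder for the host and port that get
// templated into in-cluster networking resources.
type APIServerEndpoint struct {
	Host string
	Port string
}

// URL renders the endpoint as an https URL.
func (e APIServerEndpoint) URL() string {
	return "https://" + net.JoinHostPort(e.Host, e.Port)
}

// chooseEndpoint prefers the node-local proxy address for a hosted cluster
// when one is configured, and falls back to the external endpoint otherwise.
func chooseEndpoint(external, nodeLocal APIServerEndpoint, hosted bool) APIServerEndpoint {
	if hosted && nodeLocal.Host != "" && nodeLocal.Port != "" {
		return nodeLocal
	}
	return external
}

func main() {
	// Both values are placeholders chosen for this sketch.
	external := APIServerEndpoint{Host: "api.example.invalid", Port: "30026"}
	nodeLocal := APIServerEndpoint{Host: "172.20.0.1", Port: "6443"}

	fmt.Println(chooseEndpoint(external, nodeLocal, true).URL())
	fmt.Println(chooseEndpoint(external, nodeLocal, false).URL())
}
```

With the node-local address selected, requests from in-cluster components pass through the health-checked proxy instead of depending on whichever IP the external DNS name resolves to.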

Version-Release number of selected component (if applicable):

  4.16 4.15 4.14

How reproducible:

    100%

Steps to Reproduce:

    1. Set up a HyperShift cluster in a bare-metal/non-cloud environment where redundant API servers sit behind a DNS name that points directly to the node IPs.
    2. Power down one of the control plane nodes.
    3. Schedule a workload into the cluster that depends on kube-proxy and/or multus to set up its networking configuration.
    4. Errors like the following appear:
```
add): Multus: [openshiftai/moe-8b-cmisale-master-0/9c1fd369-94f5-481c-a0de-ba81a3ee3583]: error getting pod: Get "https://[p9d81ad32fcdb92dbb598-6b64a6ccc9c596bf59a86625d8fa2202-c000.us-east.satellite.appdomain.cloud]:30026/api/v1/namespaces/openshiftai/pods/moe-8b-cmisale-master-0?timeout=1m0s": dial tcp 192.168.98.203:30026: connect: timeout
```
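To illustrate why only some requests fail, the following Go sketch compares clients that rotate through every address behind the DNS name with clients that go through a health-checking proxy. It is a simplified model under stated assumptions: the `backend` type, both functions, and the `192.0.2.x` addresses are hypothetical, and real DNS resolution order is not a strict rotation.

```go
package main

import "fmt"

// backend is a hypothetical control plane endpoint; up is what a health
// check against it would report.
type backend struct {
	addr string
	up   bool
}

// directFailures counts requests that land on a down backend when clients
// rotate through every address with no health checking.
func directFailures(backends []backend, requests int) int {
	if len(backends) == 0 {
		return requests
	}
	failed := 0
	for i := 0; i < requests; i++ {
		if !backends[i%len(backends)].up {
			failed++
		}
	}
	return failed
}

// proxiedFailures counts failures when a health-checking proxy forwards
// only to backends that are up. Requests fail only if no backend is up.
func proxiedFailures(backends []backend, requests int) int {
	for _, b := range backends {
		if b.up {
			return 0
		}
	}
	return requests
}

func main() {
	// Three redundant API servers, one powered down (placeholder addresses).
	backends := []backend{
		{addr: "192.0.2.1:30026", up: true},
		{addr: "192.0.2.2:30026", up: false},
		{addr: "192.0.2.3:30026", up: true},
	}
	fmt.Printf("direct: %d of 9 failed\n", directFailures(backends, 9))
	fmt.Printf("proxied: %d of 9 failed\n", proxiedFailures(backends, 9))
}
```

In this model one third of the direct requests fail while none of the proxied requests do, which matches the intermittent timeouts reported here.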

Actual results:

    When a control plane node fails, intermittent timeouts occur whenever kube-proxy/multus resolves the DNS name and receives the IP of the failed control plane node.

Expected results:

    No requests fail, which is the case if all traffic is routed through the node-local load balancer instance.

Additional info:

    Additionally, control plane components in the management cluster that live next to the apiserver add an unneeded dependency by using an external DNS entry to talk to the kube-apiserver. They could use the local kube-apiserver address instead, so that all of this traffic stays on cluster-local networking.
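    For the management-cluster side, a cluster-local address could be built roughly as in the Go sketch below. The service name `kube-apiserver`, the namespace `clusters-example`, the port `6443`, and the `inClusterAPIServerURL` helper are assumptions for illustration, not values confirmed by this issue.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// inClusterAPIServerURL builds a cluster-local URL for a Service, so that a
// component running next to the apiserver does not depend on external DNS.
// The function and its parameters are hypothetical.
func inClusterAPIServerURL(serviceName, namespace string, port int) string {
	host := fmt.Sprintf("%s.%s.svc", serviceName, namespace)
	return "https://" + net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	// Placeholder service name, namespace, and port.
	fmt.Println(inClusterAPIServerURL("kube-apiserver", "clusters-example", 6443))
}
```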

blocks

OCPBUGS-30927 Cluster-network-operator doesn't use node local kube-apiserver loadbalancer when templating in cluster resources

Closed

duplicates

OCPBUGS-31512 Cluster-network-operator doesn't use node local kube-apiserver loadbalancer when templating in cluster resources

Closed

is cloned by

OCPBUGS-30927 Cluster-network-operator doesn't use node local kube-apiserver loadbalancer when templating in cluster resources

Closed

OCPBUGS-31512 Cluster-network-operator doesn't use node local kube-apiserver loadbalancer when templating in cluster resources

Closed

links to

openshift/cluster-network-operator#2288: OCPBUGS-30103: ensure local networking deployments within hypershift use the client side load balancer to be resilient to control plane node failures

openshift/cluster-network-operator#2311: [release-4.14] OCPBUGS-30103: ensure local networking deployments within hypershift use the client side load balancer to be resilient to control plane node failures

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update


Assignee: Tyler Lisowski (Inactive)

Reporter: Tyler Lisowski (Inactive)

Need Info From: None

Votes: 0

Watchers: 4

Created: 2024/03/01 2:21 AM

Updated: 2025/07/23 11:45 AM

Resolved: 2024/06/27 11:40 AM
