
Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.16.0
Affects Version/s: 4.14.z, 4.15.z
Component/s: HyperShift
Labels:

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
Moderate
Regression:
No

Target Backport Versions:

4.14.z, 4.15.z, 4.17.z, 4.16.z
Target Version:

4.18.z
Release Blocker:
None
Sprint:
Hypershift Sprint 250, Hypershift Sprint 251, Hypershift Sprint 252, Hypershift Sprint 253, Hypershift Sprint 254, Hypershift Sprint 255, Hypershift Sprint 256, Hypershift Sprint 257, Hypershift Sprint 258, Hypershift Sprint 259, Hypershift Sprint 260, Hypershift Sprint 261, Hypershift Sprint 262, Hypershift Sprint 263
sprint_count:
14

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
Done
Release Note Type:
Release Note Not Required
Release Note Text:
N/A

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

HyperShift management clusters (MCs) use a Network Load Balancer (NLB) to route to their own openshift-ingress router pods for cluster ingress.
These NLBs are provisioned by cloud-provider-aws (https://github.com/openshift/cloud-provider-aws).

cloud-provider-aws uses the cluster tag on subnets to select the subnets for the NLB and for the SecurityGroup adjustments.

On management clusters, *all* subnets are tagged with the MC's cluster ID.
As a result, cloud-provider-aws can select the wrong subnet, because it breaks ties between multiple subnets in the same AZ by lexicographical comparison: https://github.com/openshift/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L3626
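
To illustrate the tie-break described above, here is a minimal sketch. It is not the cloud-provider-aws code. The `Subnet` type, the subnet names and CIDRs, the `mc-example` cluster ID, and the choice of the subnet name as the comparison key (following the wording of this report) are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical tag key for illustration; a real MC has its own cluster ID.
CLUSTER_TAG = "kubernetes.io/cluster/mc-example"


@dataclass(frozen=True)
class Subnet:
    name: str
    az: str
    cidr: str
    tags: frozenset


def select_subnets(subnets, cluster_tag):
    """Pick one tagged subnet per AZ, breaking ties by the smaller name."""
    chosen = {}
    for subnet in subnets:
        if cluster_tag not in subnet.tags:
            continue  # subnets without the cluster tag are never candidates
        current = chosen.get(subnet.az)
        # Assumed tie-break: the lexicographically smaller name wins.
        if current is None or subnet.name < current.name:
            chosen[subnet.az] = subnet
    return chosen


tagged = frozenset({CLUSTER_TAG})

# Current behaviour: every subnet carries the MC's cluster tag.
all_tagged = [
    Subnet("mc-default-a", "us-east-1a", "10.0.0.0/24", tagged),
    Subnet("hcp-extra-a", "us-east-1a", "10.0.64.0/24", tagged),
    Subnet("mc-default-b", "us-east-1b", "10.0.1.0/24", tagged),
]
selected = select_subnets(all_tagged, CLUSTER_TAG)
print(selected["us-east-1a"].name)

# Expected behaviour: only the MC's own subnets carry the tag.
only_mc_tagged = [
    Subnet("mc-default-a", "us-east-1a", "10.0.0.0/24", tagged),
    Subnet("hcp-extra-a", "us-east-1a", "10.0.64.0/24", frozenset()),
    Subnet("mc-default-b", "us-east-1b", "10.0.1.0/24", tagged),
]
fixed = select_subnets(only_mc_tagged, CLUSTER_TAG)
print(fixed["us-east-1a"].name)
```

With every subnet tagged, the sketch picks `hcp-extra-a` in `us-east-1a` because it sorts before `mc-default-a`. Once only the MC's own subnets are tagged, it picks `mc-default-a`.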

This can lead to a situation where a SecurityGroup only allows ingress from a subnet that is not actually part of the NLB. In this case the TargetGroup cannot perform a HealthCheck correctly in that AZ.

In certain cases this can lead to all targets reporting unhealthy, because the nodes hosting the ingress pods have incorrect SecurityGroup rules.

In that case, traffic can be routed to nodes in the target group that should not be chosen because they are not yet ready or no longer ready, which causes problems when accessing management cluster services (e.g. the console).

Version-Release number of selected component (if applicable):

4.14.z & 4.15.z

How reproducible:

Most MCs that are using NLBs will have some of the SecurityGroups misconfigured.

Steps to Reproduce:

1. Have cloud-provider-aws update the NLB while an AZ contains subnets with lexicographically smaller names than the MC's default subnets. The other subnets can then be chosen instead.
2. Check the SecurityGroups to see whether the source CIDRs are incorrect.
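
The check in step 2 could be sketched roughly as follows. This is only an illustration and does not call any AWS API. It assumes the source CIDRs of the health-check ingress rules and the CIDRs of the subnets actually attached to the NLB have already been collected by hand. The function names and the example CIDRs are hypothetical.

```python
import ipaddress


def unexpected_sources(sg_source_cidrs, nlb_subnet_cidrs):
    """Return SecurityGroup source CIDRs that lie outside every NLB subnet."""
    nlb_nets = [ipaddress.ip_network(c) for c in nlb_subnet_cidrs]
    return [
        cidr
        for cidr in sg_source_cidrs
        if not any(ipaddress.ip_network(cidr).subnet_of(n) for n in nlb_nets)
    ]


def uncovered_nlb_subnets(sg_source_cidrs, nlb_subnet_cidrs):
    """Return NLB subnet CIDRs that no SecurityGroup source CIDR contains."""
    sources = [ipaddress.ip_network(c) for c in sg_source_cidrs]
    return [
        cidr
        for cidr in nlb_subnet_cidrs
        if not any(ipaddress.ip_network(cidr).subnet_of(s) for s in sources)
    ]


# Hypothetical values: the rule for the first AZ points at the wrong subnet.
sg_sources = ["10.0.64.0/24", "10.0.1.0/24"]
nlb_subnets = ["10.0.0.0/24", "10.0.1.0/24"]

print(unexpected_sources(sg_sources, nlb_subnets))
print(uncovered_nlb_subnets(sg_sources, nlb_subnets))
```

A non-empty result from either function would match the misconfiguration described in this report: the first lists sources that do not belong to the NLB, the second lists NLB subnets whose health checks would be blocked.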

Actual results:

SecurityGroups can have incorrect source CIDRs for the MC's own NLB.

Expected results:

The MC should tag only its own subnets with the MC's cluster ID, so that cloud-provider-aws subnet selection is not affected by the HCP subnets in the same availability zones.

Additional info:

Related OHSS ticket from SREP: https://issues.redhat.com/browse/OSD-20289

blocks

OCPBUGS-32333 Correctly handle HCP subnet tagging for management clusters to not break cloud-provider-aws subnet selection

Closed

is cloned by

OCPBUGS-32333 Correctly handle HCP subnet tagging for management clusters to not break cloud-provider-aws subnet selection

Closed

is related to

OCPBUGS-30821 NLB not deploying to the correct subnets provided during installation

Closed

links to

openshift/hypershift#3746: OCPBUGS-30260: Support specifying AWS LB subnets

openshift/hypershift#3767: OCPBUGS-30260: Support subnet labels separated by periods

openshift/hypershift#3944: OCPBUGS-30260: Ignore subnet annotations for control plane load balancers

openshift/hypershift#3946: OCPBUGS-30260: Set load balancer target nodes in request serving isolation mode

RHBA-2025:8560 OpenShift Container Platform 4.18.17 bug fix update


Assignee: Cesar Wong

Reporter: Florian Bergmann

QA Contact: Jie Zhao

Need Info From: None

Votes: 0

Watchers: 17

Created: 2024/03/05 3:08 PM

Updated: 2025/07/23 11:41 AM

Resolved: 2025/06/10 6:26 AM
