Type: Bug
Resolution: Obsolete
Priority: Major
Fix Version/s: None
Affects Version/s: 4.14.z, 4.15.z
Component/s: HyperShift
Labels:

Severity:
Moderate
Regression:
No
Sprint:
Hypershift Sprint 252, Hypershift Sprint 253, Hypershift Sprint 254, Hypershift Sprint 255
sprint_count:
4
Blocked:
False
Blocked Reason:

None

Target Version:

4.15.z
Target Backport Versions:

4.14.z, 4.15.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue OCPBUGS-30260. The following is the description of the original issue:
—
Description of problem:

HyperShift management clusters (MCs) use a Network Load Balancer (NLB) to route traffic to their own openshift-ingress router pods for cluster ingress.
These NLBs are provisioned by cloud-provider-aws (https://github.com/openshift/cloud-provider-aws).

cloud-provider-aws uses the cluster tag on the subnets to select the subnets for the NLB and for the corresponding SecurityGroup adjustments.

On management clusters, *all* subnets are tagged with the MC's cluster ID.
As a result, cloud-provider-aws can select the wrong subnet, because it resolves conflicts between multiple subnets in the same AZ by lexicographical comparison: https://github.com/openshift/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L3626
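
The tie-break described above can be illustrated with a minimal sketch. This is a simplified model, not the cloud-provider-aws code: it assumes the comparison is made on a subnet identifier string, and the `Subnet` type, the tag key, and all sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Subnet:
    subnet_id: str
    az: str
    cidr: str
    tags: frozenset = field(default_factory=frozenset)


def select_subnets_per_az(subnets, cluster_tag):
    """Pick one subnet per AZ among those carrying cluster_tag.

    Simplified model: when several tagged subnets share an AZ, keep the
    one whose identifier is lexicographically smallest.
    """
    chosen = {}
    for s in subnets:
        if cluster_tag not in s.tags:
            continue
        current = chosen.get(s.az)
        if current is None or s.subnet_id < current.subnet_id:
            chosen[s.az] = s
    return chosen


MC_TAG = "kubernetes.io/cluster/mc-example"  # hypothetical cluster tag key

subnets = [
    # The MC's own subnet.
    Subnet("subnet-0bbb", "us-east-1a", "10.0.0.0/24", frozenset({MC_TAG})),
    # An HCP subnet in the same AZ that also carries the MC's tag.
    Subnet("subnet-0aaa", "us-east-1a", "10.1.0.0/24", frozenset({MC_TAG})),
]

selected = select_subnets_per_az(subnets, MC_TAG)
print(selected["us-east-1a"].subnet_id, selected["us-east-1a"].cidr)
```

In this model the HCP subnet wins the tie because its identifier sorts first, so any rule derived from the selected subnet would use the HCP subnet's CIDR rather than the MC's.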

This can lead to a situation where a SecurityGroup only allows ingress from a subnet that is not actually part of the NLB. In this case, the TargetGroup cannot perform its HealthCheck correctly in that AZ.

In certain cases, this can lead to all targets reporting unhealthy, because the nodes hosting the ingress pods have incorrect SecurityGroup rules.

In that case, traffic can be routed to nodes in the target group that should not be chosen because they are not yet ready or no longer ready, which causes problems when accessing management cluster services (e.g. the console).

Version-Release number of selected component (if applicable):

4.14.z & 4.15.z

How reproducible:

Most MCs that are using NLBs will have some of the SecurityGroups misconfigured.

Steps to Reproduce:

1. Have cloud-provider-aws update the NLB while an AZ contains subnets with lexicographically smaller names than the MC's default subnets. This can lead to the other subnets being chosen instead.
2. Check the SecurityGroups to see whether the source CIDRs are incorrect.
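
The check in step 2 can be sketched as a plain CIDR comparison. This is a minimal sketch that assumes the SecurityGroup source CIDRs and the CIDRs of the NLB's subnets have already been collected by other means. The function name and sample values are hypothetical, and the sketch only does exact matching, so broader rules (for example client-traffic ranges) would need to be filtered out first.

```python
import ipaddress


def compare_health_check_sources(sg_source_cidrs, nlb_subnet_cidrs):
    """Compare SecurityGroup source CIDRs with the NLB's subnet CIDRs.

    Returns (missing, unexpected):
      missing    - NLB subnet CIDRs with no matching SecurityGroup source
      unexpected - SecurityGroup sources that match no NLB subnet CIDR
    """
    allowed = {ipaddress.ip_network(c) for c in sg_source_cidrs}
    expected = {ipaddress.ip_network(c) for c in nlb_subnet_cidrs}
    missing = sorted(str(n) for n in expected - allowed)
    unexpected = sorted(str(n) for n in allowed - expected)
    return missing, unexpected


missing, unexpected = compare_health_check_sources(
    sg_source_cidrs=["10.1.0.0/24", "10.0.1.0/24"],   # hypothetical rule sources
    nlb_subnet_cidrs=["10.0.0.0/24", "10.0.1.0/24"],  # hypothetical NLB subnets
)
print("missing:", missing)
print("unexpected:", unexpected)
```

A non-empty `missing` list would correspond to an AZ whose NLB subnet is not allowed by the SecurityGroup, which is the misconfiguration this issue describes.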

Actual results:

SecurityGroups can have incorrect source CIDRs for the MC's own NLB.

Expected results:

The MC should tag only its own subnets with the MC's cluster ID, so that subnet selection by cloud-provider-aws is not affected by the HCP subnets in the same availability zones.
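
As a rough illustration of the expected state, the sketch below lists subnets that carry the MC's cluster tag without belonging to the MC. It assumes the set of MC-owned subnet IDs is known from elsewhere. The function name, the tag key, and the sample data are hypothetical and do not describe how the actual fix is implemented.

```python
def subnets_with_unexpected_tag(subnet_tags, mc_owned_ids, cluster_tag):
    """Return IDs of subnets that carry cluster_tag but are not MC-owned."""
    return sorted(
        subnet_id
        for subnet_id, tags in subnet_tags.items()
        if cluster_tag in tags and subnet_id not in mc_owned_ids
    )


MC_TAG = "kubernetes.io/cluster/mc-example"  # hypothetical cluster tag key

subnet_tags = {
    "subnet-0bbb": {MC_TAG},  # the MC's own subnet
    "subnet-0aaa": {MC_TAG},  # HCP subnet tagged with the MC's cluster ID
    "subnet-0ccc": set(),     # HCP subnet without the MC's tag
}

extra = subnets_with_unexpected_tag(
    subnet_tags, mc_owned_ids={"subnet-0bbb"}, cluster_tag=MC_TAG
)
print(extra)
```

An empty result would match the expected behavior: only the MC's own subnets carry its cluster tag.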

Additional info:

Related OHSS ticket from SREP: https://issues.redhat.com/browse/OSD-20289

clones

OCPBUGS-30260 Correctly handle HCP subnet tagging for management clusters to not break cloud-provider-aws subnet selection

POST

is blocked by

OCPBUGS-30260 Correctly handle HCP subnet tagging for management clusters to not break cloud-provider-aws subnet selection

POST

Assignee: Cesar Wong

Reporter: OpenShift Prow Bot

QA Contact: Jie Zhao

Votes: 0

Watchers: 3

Created: 2024/04/16 9:38 PM

Updated: 2024/07/02 1:57 PM

Resolved: 2024/07/02 1:57 PM
