OCPBUGS-30260

Correctly handle HCP subnet tagging for management clusters to not break cloud-provider-aws subnet selection


    • Moderate
    • No
    • Hypershift Sprint 250, Hypershift Sprint 251, Hypershift Sprint 252, Hypershift Sprint 253
    • 4
    • False

      Description of problem:

      HyperShift management clusters (MCs) use a Network Load Balancer (NLB) to route to their own openshift-ingress router pods for cluster ingress.
      These NLBs are provisioned by cloud-provider-aws (https://github.com/openshift/cloud-provider-aws).
      
      cloud-provider-aws uses the cluster tag on the subnets to select the subnets for the NLB and for the SecurityGroup adjustments.
      
      On management clusters, *all* subnets are tagged with the MC's cluster ID, including the HCP subnets.
      As a result, cloud-provider-aws can select the wrong subnet, because it resolves conflicts between multiple subnets in the same AZ by lexicographical comparison: https://github.com/openshift/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L3626
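The tie-break described above can be illustrated with a simplified model. This is only a sketch, not the cloud-provider-aws code: the `Subnet` type, the `pick_subnet_per_az` helper, and the subnet names are hypothetical, and the comparison is assumed to be a plain string comparison on a single identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    name: str             # identifier used for the tie-break (assumption)
    az: str               # availability zone
    cidr: str
    cluster_tagged: bool  # True if the subnet carries the MC's cluster tag


def pick_subnet_per_az(subnets):
    """Keep one cluster-tagged subnet per AZ; ties go to the smaller name."""
    chosen = {}
    for s in subnets:
        if not s.cluster_tagged:
            continue
        current = chosen.get(s.az)
        if current is None or s.name < current.name:
            chosen[s.az] = s
    return chosen


# Hypothetical data: both subnets in the AZ carry the MC's cluster tag.
subnets = [
    Subnet("subnet-mc-default", "us-east-1a", "10.0.0.0/24", True),
    Subnet("subnet-hcp-guest", "us-east-1a", "10.0.1.0/24", True),
]
picked = pick_subnet_per_az(subnets)
print(picked["us-east-1a"].name)
```

In this model the HCP subnet is selected only because its name sorts first. If it did not carry the MC's cluster tag, the MC's default subnet would be the only candidate in that AZ.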
      
      This can lead to a situation where a SecurityGroup allows ingress only from a subnet that is not actually part of the NLB. In that case, the TargetGroup cannot perform its health check correctly in that AZ.
      
      In certain cases, this can lead to all targets reporting unhealthy, because the nodes hosting the ingress pods have incorrect SecurityGroup rules.
      
      When that happens, traffic can be routed to target-group nodes that should not be chosen because they are not yet, or no longer, ready. This causes problems when accessing management cluster services (e.g. the console).

      Version-Release number of selected component (if applicable):

      4.14.z & 4.15.z

      How reproducible:

      Most MCs that use NLBs will have some misconfigured SecurityGroups.

      Steps to Reproduce:

      1. Have cloud-provider-aws update the NLB while an AZ contains subnets with lexicographically smaller names than the MC's default subnets. The other subnets can then be chosen instead.
      2. Check the SecurityGroups to see whether the source CIDRs are incorrect.
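The check in step 2 could be scripted roughly as follows. This is a minimal sketch that works on plain CIDR strings gathered by hand or by other tooling. The `unexpected_sources` helper is hypothetical, and it assumes the caller passes only the source CIDRs of the rules meant for the NLB's own subnets (not client-traffic rules).

```python
import ipaddress


def unexpected_sources(nlb_subnet_cidrs, sg_source_cidrs):
    """Return SecurityGroup source CIDRs that lie outside every NLB subnet."""
    nlb_nets = [ipaddress.ip_network(c) for c in nlb_subnet_cidrs]
    bad = []
    for src in sg_source_cidrs:
        net = ipaddress.ip_network(src)
        # subnet_of() raises TypeError across IP versions, so check first.
        inside = any(
            net.version == n.version and net.subnet_of(n) for n in nlb_nets
        )
        if not inside:
            bad.append(src)
    return bad


# Hypothetical values: the NLB sits in 10.0.0.0/24, but one rule
# allows ingress from a different subnet in the same AZ.
print(unexpected_sources(["10.0.0.0/24"], ["10.0.0.0/24", "10.0.1.0/24"]))
```

Any CIDR returned here points at a rule whose source subnet is not one of the subnets the NLB is attached to.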

      Actual results:

      SecurityGroups can have incorrect source CIDRs for the MC's own NLB.

      Expected results:

      Only the MC's own subnets should be tagged with the MC's cluster ID, so that subnet selection by cloud-provider-aws is not affected by the HCP subnets in the same availability zones.
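To make the expected state concrete, the sketch below lists subnets that carry the MC's cluster tag without being the MC's own. It is an illustration only: `subnets_to_untag` is a hypothetical helper, the subnet IDs and cluster IDs are made up, and the tag key format `kubernetes.io/cluster/<cluster-id>` is an assumption about how the cluster tag is spelled.

```python
def subnets_to_untag(subnet_tags, mc_cluster_id, mc_own_subnets):
    """List subnets tagged for the MC that are not the MC's own subnets."""
    # Assumed tag key format; adjust if the cluster tag is spelled differently.
    tag_key = f"kubernetes.io/cluster/{mc_cluster_id}"
    return sorted(
        subnet_id
        for subnet_id, tags in subnet_tags.items()
        if tag_key in tags and subnet_id not in mc_own_subnets
    )


# Hypothetical inventory: subnet ID -> tags.
subnet_tags = {
    "subnet-mc-a": {"kubernetes.io/cluster/mc-1234": "owned"},
    "subnet-hcp-a": {
        "kubernetes.io/cluster/mc-1234": "shared",
        "kubernetes.io/cluster/hcp-5678": "owned",
    },
    "subnet-other": {},
}
print(subnets_to_untag(subnet_tags, "mc-1234", {"subnet-mc-a"}))
```

An empty result would correspond to the expected state, where no HCP subnet competes with the MC's subnets during selection.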

      Additional info:

      Related OHSS ticket from SREP: https://issues.redhat.com/browse/OSD-20289

            cewong@redhat.com Cesar Wong
            fbergmann.openshift Florian Bergmann
            Jie Zhao Jie Zhao