OCPBUGS-32333

Correctly handle HCP subnet tagging for management clusters to not break cloud-provider-aws subnet selection

Details

    • Moderate
    • No
    • Hypershift Sprint 252, Hypershift Sprint 253
    • 2
    • False

    Description

      This is a clone of issue OCPBUGS-30260. The following is the description of the original issue:

      Description of problem:

      HyperShift management clusters (MCs) use a Network Load Balancer (NLB) to route traffic to their own openshift-ingress router pods for cluster ingress.
      These NLBs are provisioned by cloud-provider-aws (https://github.com/openshift/cloud-provider-aws).
      
      cloud-provider-aws uses the cluster tag on subnets to select the subnets for the NLB and for the SecurityGroup adjustments.
      
      On management clusters, *all* subnets are tagged with the MC's cluster ID.
      As a result, cloud-provider-aws can select the wrong subnet, because it resolves conflicts between multiple subnets in an AZ by lexicographical comparison: https://github.com/openshift/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L3626
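
The tie-breaking described above can be sketched roughly as follows. This is a simplified model, not the cloud-provider-aws code: it assumes the comparison is a plain string comparison on a subnet identifier, and it ignores any other criteria the real implementation may apply. The `Subnet` type, the `pick_subnet_per_az` function, and all subnet IDs are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Subnet(NamedTuple):
    subnet_id: str          # hypothetical identifier, for illustration only
    availability_zone: str


def pick_subnet_per_az(subnets: List[Subnet]) -> Dict[str, str]:
    """Keep one subnet per AZ; on conflict keep the lexicographically smaller ID."""
    chosen: Dict[str, str] = {}
    for subnet in subnets:
        current = chosen.get(subnet.availability_zone)
        if current is None or subnet.subnet_id < current:
            chosen[subnet.availability_zone] = subnet.subnet_id
    return chosen


# Every subnet carries the MC's cluster tag, so every subnet is a candidate.
candidates = [
    Subnet("subnet-0b22mc", "us-east-1a"),   # the MC's own subnet
    Subnet("subnet-0a11hcp", "us-east-1a"),  # an HCP subnet in the same AZ
    Subnet("subnet-0c33mc", "us-east-1b"),   # the MC's own subnet, no conflict
]

selection = pick_subnet_per_az(candidates)
print(selection)
```

In this sketch the HCP subnet wins in `us-east-1a` only because its identifier sorts first, which is the kind of accidental selection this issue describes.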
      
      This can lead to a situation where a SecurityGroup only allows ingress from a subnet that is not actually part of the NLB. In that case, the TargetGroup cannot perform a successful HealthCheck in that AZ.
      
      In certain cases, this can lead to all targets reporting unhealthy, because the nodes hosting the ingress pods have incorrect SecurityGroup rules.
      
      When that happens, traffic can be routed to nodes in the target group that should not be chosen because they are not yet, or no longer, ready. This causes problems when accessing management cluster services (e.g. the console).

      Version-Release number of selected component (if applicable):

      4.14.z & 4.15.z

      How reproducible:

      Most MCs that are using NLBs will have some of the SecurityGroups misconfigured.

      Steps to Reproduce:

      1. Have cloud-provider-aws update the NLB while an AZ contains subnets with lexicographically smaller names than the MC's default subnets. This can lead to the other subnets being chosen instead.
      2. Check the SecurityGroups to see whether the source CIDRs are incorrect.
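
The check in step 2 could be scripted along these lines. This is a minimal sketch under stated assumptions: the two CIDR lists are collected by hand (for example from the AWS console), only the health-check-related source CIDRs are passed in, and all addresses are IPv4. The function name `compare_cidrs` and the example CIDRs are hypothetical.

```python
import ipaddress
from typing import Iterable, List, Tuple


def compare_cidrs(
    sg_source_cidrs: Iterable[str],
    nlb_subnet_cidrs: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Compare SecurityGroup source CIDRs with the CIDRs of the NLB's subnets.

    Returns (unexpected, uncovered):
      unexpected: source CIDRs that overlap none of the NLB's subnets
      uncovered:  NLB subnet CIDRs not contained in any source CIDR
    """
    sources = [ipaddress.ip_network(c) for c in sg_source_cidrs]
    nlb_nets = [ipaddress.ip_network(c) for c in nlb_subnet_cidrs]
    unexpected = [str(s) for s in sources
                  if not any(s.overlaps(n) for n in nlb_nets)]
    uncovered = [str(n) for n in nlb_nets
                 if not any(n.subnet_of(s) for s in sources)]
    return unexpected, uncovered


# Hypothetical values: the SecurityGroup allows one correct subnet and one
# subnet the NLB is not attached to.
unexpected, uncovered = compare_cidrs(
    sg_source_cidrs=["10.0.128.0/20", "10.0.16.0/20"],
    nlb_subnet_cidrs=["10.0.0.0/20", "10.0.16.0/20"],
)
print("unexpected sources:", unexpected)
print("uncovered NLB subnets:", uncovered)
```

A non-empty `uncovered` list corresponds to the failure mode described above: the NLB has a subnet in an AZ whose health checks the SecurityGroup does not admit.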

      Actual results:

      SecurityGroups can have incorrect source CIDRs used for the MCs own NLB.

      Expected results:

      The MC should tag only its own subnets with the MC's cluster ID, so that subnet selection by cloud-provider-aws is not affected by HCP subnets in the same availability zones.
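
To illustrate the expected state, the sketch below filters subnets by the cluster tag key. It assumes the usual `kubernetes.io/cluster/<cluster-id>` tag key convention. The subnet records, cluster IDs, and the function `subnets_for_cluster` are hypothetical and only model the tagging, not the actual HyperShift or cloud-provider-aws code.

```python
from typing import Dict, List

CLUSTER_TAG_PREFIX = "kubernetes.io/cluster/"


def subnets_for_cluster(subnets: List[Dict], cluster_id: str) -> List[str]:
    """Return the IDs of subnets that carry the given cluster's tag key."""
    tag_key = CLUSTER_TAG_PREFIX + cluster_id
    return sorted(s["id"] for s in subnets if tag_key in s["tags"])


# Expected state: only the MC's own subnets carry the MC's cluster tag.
subnets = [
    {"id": "subnet-0b22mc", "az": "us-east-1a",
     "tags": {"kubernetes.io/cluster/mc-1234": "owned"}},
    {"id": "subnet-0a11hcp", "az": "us-east-1a",
     "tags": {"kubernetes.io/cluster/hcp-5678": "owned"}},
    {"id": "subnet-0c33mc", "az": "us-east-1b",
     "tags": {"kubernetes.io/cluster/mc-1234": "owned"}},
]

mc_candidates = subnets_for_cluster(subnets, "mc-1234")
print(mc_candidates)
```

With this tagging, the HCP subnet in `us-east-1a` is never a candidate for the MC's NLB, so there is no conflict left to resolve in that AZ.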

      Additional info:

      Related OHSS ticket from SREP: https://issues.redhat.com/browse/OSD-20289

            People

              cewong@redhat.com Cesar Wong
              openshift-crt-jira-prow OpenShift Prow Bot
              Jie Zhao Jie Zhao