OpenShift Bugs / OCPBUGS-17432

Service LB is adding all subnets from VPC in BYOVPC installations using limited subnets


      Description of problem:

The Service LB always attaches all subnets available in the VPC to the load balancer in BYOVPC installations, even when the installation sets a limited number of subnets in install-config.yaml.
      
The component directly affected is the service load balancer for the default ingress controller (router).
      
The installation itself completes successfully, in conformance with the documentation[0]:
      
      A) "Your VPC must meet the following characteristics:
      - [...] Each availability zone can contain no more than one public and one private subnet [...] 
      
      VPC validation
      - [...] Otherwise, provide exactly one public and private subnet for each availability zone."
      
      B) "Isolation between clusters
      - You can install multiple OpenShift Container Platform clusters in the same VPC."
      
But reading the use cases below, it seems the goal is not completely achieved:
      
  - Goal-1: As an OCP administrator, I want to use an existing VPC, created with many subnets across many zones, to install a cluster in a single zone (a single subnet added to install-config.yaml, paragraph [B] above), so I can make sure all my components use that target architecture, including ingress load balancers.
  - Goal-2: As an OCP administrator with a cluster created by the installer in a private subnet of an existing VPC, I would like to create and use public subnets on Day 2 to reconfigure my ingress and expose applications only in the subnets of the zones where my cluster was created and has worker nodes running.
  - Goal-3: As an OCP administrator, I would like to deploy one cluster per availability zone in a single VPC, so I can manage my network while each OpenShift cluster stays isolated and uses the resources selected at install time.
      
Example installing three clusters (C1, C2, and C3) in an existing VPC with 3 subnets (SB1, SB2, and SB3), one per zone/subnet, with this config (a minimal install-config excerpt is sketched after this list):
  - Cluster C1, with SB1 in platform.aws.subnets of install-config.yaml, is installed and the service LB attaches subnets SB1, SB2, and SB3
  - Cluster C2, with SB2 in platform.aws.subnets of install-config.yaml, is installed and the service LB attaches subnets SB2 and SB3
  - Cluster C3, with SB3 in platform.aws.subnets of install-config.yaml, is installed and the service LB attaches subnet SB3.
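
For reference, a minimal sketch of what the install-config.yaml excerpt for cluster C1 could look like (the subnet ID is a placeholder for SB1, the private subnet in us-east-1a; only SB1 is listed, so only that subnet/zone should be used by the cluster):
~~~
# Hypothetical excerpt of install-config.yaml for cluster C1.
# The subnet ID is a placeholder for SB1 (private subnet in us-east-1a).
platform:
  aws:
    region: us-east-1
    subnets:
    - subnet-0b60120c2f1a84786
~~~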
      
      > See more detailed steps in the linked document in the steps to reproduce section.
      
Looking at the cloud provider implementation[3], it discovers all subnets that carry the cluster tag or no cluster tag at all[4], which reproduces the behavior seen in the goals above and attaches undesired subnets to the LB. The well-known subnet selection tags[5] are not evaluated in this situation[6]; they are only used to choose between subnets when more than one subnet exists in the same zone. A rough CLI approximation of the discovery predicate is sketched below.
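
For illustration, the following query is a sketch approximating the controller's discovery predicate, under these assumptions: VPC_ID and INFRA_ID are exported by the reader, and a subnet matches when it has the cluster tag kubernetes.io/cluster/$INFRA_ID or no kubernetes.io/cluster/* tag at all:
~~~
# Sketch only: approximate which subnets the cloud provider would discover.
# Assumes VPC_ID and INFRA_ID are exported. Matches subnets with the cluster
# tag kubernetes.io/cluster/$INFRA_ID, or with no kubernetes.io/cluster/* tag.
$ aws ec2 describe-subnets --filters Name=vpc-id,Values=$VPC_ID \
  | jq -cr --arg tag "kubernetes.io/cluster/$INFRA_ID" '
      .Subnets[]
      | select(
          ([.Tags[]? | select(.Key | startswith("kubernetes.io/cluster/"))] | length == 0)
          or ([.Tags[]? | .Key] | index($tag) != null)
        )
      | [.AvailabilityZone, .SubnetId]'
~~~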
      
I can see two unexpected behaviors here:
- Issue 1) The OCP cluster is installed in a limited set of subnets from an existing VPC, but the controller does not correctly select only those subnets when creating the service load balancers.
- Issue 2) Regardless of the decision made for Issue 1, the description of the LB tags ("is the tag name used on a subnet to designate that it should be used for internal ELBs") does not match the current behavior ("used to choose between subnets in the same zone when more than one subnet exists; otherwise any subnet in the zone is selected when it has the cluster tag with the InfraID, or has no tags at all").
      
As a workaround, the user can tag the unwanted subnets with an "unmanaged" cluster tag (kubernetes.io/cluster/unmanaged=.*), preventing the controller from selecting subnets that do not yet host a cluster, or that should not be used by the service load balancer. Applying this to the example above, to make cluster C1 use only SB1 (while C2 and C3 are not yet installed):
      
      ~~~
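# SB2-* and SB3-* below are placeholders for the corresponding subnet IDs.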
      $ aws ec2 create-tags --resources SB2-private SB2-public SB3-private SB3-public --tags Key=kubernetes.io/cluster/unmanaged,Value=shared
      ~~~
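
Later, before installing cluster C2, the unmanaged tag can be removed from SB2 again (a sketch; SB2-* are the same placeholder subnet IDs):
~~~
# Sketch: remove the unmanaged tag from SB2 before installing cluster C2.
$ aws ec2 delete-tags --resources SB2-private SB2-public --tags Key=kubernetes.io/cluster/unmanaged
~~~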

      Version-Release number of selected component (if applicable):

Any. Most recently tested on 4.13.3.

      How reproducible:

      Always

      Steps to Reproduce:

1. Create a VPC and subnets in more than one zone.
2. Create the install-config.yaml to install a cluster in a single zone, setting the private (and optionally public) subnet in `platform.aws.subnets`.
3. Install the cluster. The installation is expected to finish successfully, with all nodes in a single zone.
4. Check the subnets attached to the ingress LB; it will have more than one zone/subnet (an elbv2 variant is noted after the block):
      ~~~
      $ ROUTER_LB_HOSTNAME=$(oc get svc -n openshift-ingress -o json | jq -r '.items[] | select (.spec.type=="LoadBalancer").status.loadBalancer.ingress[0].hostname')
      
      $ aws elb describe-load-balancers | jq -cr ".LoadBalancerDescriptions[] | select (.DNSName==\"${ROUTER_LB_HOSTNAME}\") | [.DNSName, .AvailabilityZones]"
      ~~~
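If the router Service was provisioned as an NLB rather than a Classic ELB, the equivalent query uses elbv2 (the same query shown in Actual results below):
~~~
$ aws elbv2 describe-load-balancers | jq -cr ".LoadBalancers[] | select (.DNSName==\"${ROUTER_LB_HOSTNAME}\") | [.DNSName, .AvailabilityZones]"
~~~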
5. Check the subnet tags. The subnets should carry only the cluster tags added by the installer:
      ~~~
$ aws ec2 describe-subnets --filters Name=vpc-id,Values=$VPC_ID | jq -cr '.Subnets[] | [.AvailabilityZone, .SubnetId, [ .Tags[]? | select(.Key | contains("kubernetes.io/cluster") ) ] ] '
      ["us-east-1b","subnet-0ea185a708c614460",[]]
      ["us-east-1c","subnet-05a0fd009d241519b",[]]
      ["us-east-1a","subnet-0b60120c2f1a84786",[{"Key":"kubernetes.io/cluster/byonetd2c-use1a-tlwhm","Value":"shared"}]]
      ["us-east-1c","subnet-076f0d446724ca0ec",[]]
      ["us-east-1b","subnet-015cadfbc60670fee",[]]
      ["us-east-1a","subnet-0e5b32c630126f94c",[{"Key":"kubernetes.io/cluster/byonetd2c-use1a-tlwhm","Value":"shared"}]]
      ~~~
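
To confirm that the well-known role tags[5] play no part here, the following sketch (assuming $VPC_ID is set, as above) lists any kubernetes.io/role/* tags on the VPC subnets; in this scenario none are present:
~~~
# Sketch: verify that no kubernetes.io/role/elb or kubernetes.io/role/internal-elb
# tags exist on the VPC subnets (assumes VPC_ID is exported).
$ aws ec2 describe-subnets --filters Name=vpc-id,Values=$VPC_ID \
  | jq -cr '.Subnets[] | [.SubnetId, [.Tags[]? | select(.Key | startswith("kubernetes.io/role/"))]]'
~~~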
      
Detailed hands-on steps reproducing this with three zones/subnets, installing three clusters, one per zone: https://github.com/mtulio/mtulio.labs/blob/37b092f0fa475b9e8f5c1944554b74e807f1629d/docs/guides/ocp-aws-vpc-single-az.md

      Actual results:

~~~
$ ROUTER_LB_HOSTNAME=$(oc get svc -n openshift-ingress -o json | jq -r '.items[] | select (.spec.type=="LoadBalancer").status.loadBalancer.ingress[0].hostname')

$ aws elbv2 describe-load-balancers | jq -r ".LoadBalancers[] | select (.DNSName==\"${ROUTER_LB_HOSTNAME}\") | [.DNSName, .AvailabilityZones]"
[
  "ac4c2574bdd6a41b5b13655cf4df678a-b5d2ea662f04c448.elb.us-east-1.amazonaws.com",
  [
    {
      "ZoneName": "us-east-1a",
      "SubnetId": "subnet-03d8721dd76527772",
      "LoadBalancerAddresses": []
    },
    {
      "ZoneName": "us-east-1b",
      "SubnetId": "subnet-034e4acc93f527772",
      "LoadBalancerAddresses": []
    },
    {
      "ZoneName": "us-east-1c",
      "SubnetId": "subnet-058247e779d805066",
      "LoadBalancerAddresses": []
    }
  ]
]
~~~

      Expected results:

Only the subnets/AZs provided to the installer attached to the load balancer. Or:
An option to select the subnets to be attached, as proposed by the subnet tags (a hypothetical sketch follows).
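
For illustration only: if the well-known role tags[5] were honored for cross-zone selection (today they are only consulted when a zone has multiple subnets), the desired selection could be expressed like this (SB1-public/SB1-private are placeholder subnet IDs):
~~~
# Hypothetical: express the desired LB subnet selection via the well-known
# role tags (SB1-public/SB1-private are placeholders).
$ aws ec2 create-tags --resources SB1-public --tags Key=kubernetes.io/role/elb,Value=1
$ aws ec2 create-tags --resources SB1-private --tags Key=kubernetes.io/role/internal-elb,Value=1
~~~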

      Additional info:

[0] Supported path for installing in an existing VPC: https://docs.openshift.com/container-platform/4.13/installing/installing_aws/installing-aws-vpc.html
[1] Customer failing when trying to install multiple clusters in a single VPC, one cluster per subnet.
      [2] Steps to reproduce and workaround: https://github.com/mtulio/mtulio.labs/blob/37b092f0fa475b9e8f5c1944554b74e807f1629d/docs/guides/ocp-aws-vpc-single-az.md
      [3] cloud provider implementation: https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L3526
[4] Find and continue: https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L3469-L3472 and https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L3559-L3562
[5] Well-known subnet selection tags: https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L92-L98
[6] https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L3565-L3591
      
      Additional references:
      - https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.4/deploy/subnet_discovery/
      
Slack threads discussing this topic:
      - https://redhat-internal.slack.com/archives/CCH60A77E/p1686337113958969
      - https://redhat-internal.slack.com/archives/CBZHF4DHC/p1686331622320159
      - Search for: "kubernetes.io/role/elb" and "/deploy/subnet_discovery/"
      
      
