OpenShift Bugs / OCPBUGS-32184

Upgrades failing in AWS when DHCP Option Set uses user defined 'domain-name'


Details

    • Critical

    Description

      Description of problem:

      I'm running on AWS and using IPI.
      The install is customized to allow me to install into a pre-created VPC.
      The VPC has a DHCP option set where the value of 'domain-name' is changed from the default of 'eu-west-2.compute.internal' to 'ice-aws.cloud'.
      The only other option set in the DHCP option set is 'domain-name-servers: AmazonProvidedDNS'.
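
      For reference, the option set is equivalent to something like this (a sketch; the dopt-/vpc- IDs are placeholders):

      # Custom DHCP option set with a user-defined domain-name
      aws ec2 create-dhcp-options \
        --dhcp-configurations \
          "Key=domain-name,Values=ice-aws.cloud" \
          "Key=domain-name-servers,Values=AmazonProvidedDNS"
      # Attach it to the pre-created VPC used for the install
      aws ec2 associate-dhcp-options --dhcp-options-id dopt-xxxxxxxx --vpc-id vpc-xxxxxxxx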
      
      During an upgrade from 4.13.0-0.okd-2023-10-28-065448 to 4.14.0-0.okd-2023-11-12-042703, one of the masters started complaining that it was unable to register its node with the API server.
      
      When I look at the kubernetes.io/hostname label on the nodes, they are set to ip-x-x-x-x.ice-aws.cloud. When the issue arises and the node can't register with the API server, the error messages look like this:
      
      kubelet_node_status.go:72] "Attempting to register node" node="ip-10-38-20-100.ice-aws.cloud"
      Jan 17 12:00:32.044690 ip-10-38-20-100 kubenswrapper[1408]: E0117 12:00:32.044672 1408 kubelet_node_status.go:94] "Unable to register node with API server" err="nodes "ip-10-38-20-100.ice-aws.cloud" is forbidden: node "ip-10-38-20-100.eu-west-2.compute.internal" is not allowed to modify node "ip-10-38-20-100.ice-aws.cloud"" node="ip-10-38-20-100.ice-aws.cloud"
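
      To see which hostname each node has registered with, and to pull the kubelet messages above, I use roughly the following (the node name is an example):

      oc get nodes -L kubernetes.io/hostname
      oc adm node-logs ip-10-38-20-100.eu-west-2.compute.internal -u kubelet | grep kubelet_node_status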
      
      I've destroyed and re-run this several times, testing different things, and have found the following:
      
      1. If I use a VPC with the default DHCP option set values of 'domain-name: eu-west-2.compute.internal' and 'domain-name-servers: AmazonProvidedDNS', all upgrades work fine.
      2. All upgrades on 4.13 work fine, even with the custom domain-name value in the DHCP option set.
      3. When describing the failing master node during the upgrade to 4.14, the 'Addresses' values of 'Hostname' and 'InternalDNS' change several times, flipping between the default value of eu-west-2.compute.internal and the custom value of ice-aws.cloud (the commands I use to watch this are shown after the node list below).
      4. When the upgrade starts to fail, there are some pending CSRs for a new master node.  Approving them creates a 4th master with the custom domain name as its suffix, shown below:
      
      oc get nodes
      NAME                                          STATUS                        ROLES                  AGE     VERSION
      ip-10-115-20-120.eu-west-2.compute.internal   Ready                         worker                 3h2m    v1.26.9+636f2be
      ip-10-115-20-24.eu-west-2.compute.internal    NotReady,SchedulingDisabled   control-plane,master   3h13m   v1.26.9+636f2be
      ip-10-115-20-24.ice-aws.cloud                 Ready                         control-plane,master   2m58s   v1.27.6+b49f9d1
      ip-10-115-21-158.eu-west-2.compute.internal   Ready                         worker                 3h4m    v1.26.9+636f2be
      ip-10-115-21-62.eu-west-2.compute.internal    Ready                         control-plane,master   3h13m   v1.26.9+636f2be
      ip-10-115-22-124.eu-west-2.compute.internal   Ready                         control-plane,master   3h13m   v1.26.9+636f2be
      ip-10-115-22-137.eu-west-2.compute.internal   Ready                         worker                 3h4m    v1.26.9+636f2be
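
      For reference, the commands I use for points 3 and 4 above (the node/CSR names are examples):

      # Watch the Hostname/InternalDNS addresses flip on the failing master
      oc describe node ip-10-115-20-24.eu-west-2.compute.internal | grep -A 6 "Addresses:"
      # List the pending CSRs and approve them, which lets the new node name register
      oc get csr
      oc adm certificate approve <csr-name>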

      Version-Release number of selected component (if applicable):

      4.14.0-0.okd-2023-11-12-042703

      How reproducible:

      100%

      Steps to Reproduce:

      1.  Create a VPC using a DHCP option set with the domain-name value set to some custom value.
      2.  Install version 4.13.0-0.okd-2023-10-28-065448 as a customized IPI install, adjusting the install-config.yaml to use your custom VPC/subnets etc.
      3.  Once the install is complete, upgrade to 4.14.0-0.okd-2023-11-12-042703.  This can either be a full upgrade or control-plane first (sketched below).  Both fail.
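
      By 'control-plane first' I mean pausing the worker pool before upgrading, roughly as follows (a sketch; for OKD the exact oc adm upgrade flags may differ):

      # Control-plane first: pause the worker MachineConfigPool, then upgrade
      oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'
      oc adm upgrade --to=4.14.0-0.okd-2023-11-12-042703
      # A full upgrade is the same oc adm upgrade command without pausing the pool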

      Actual results:

      Cluster upgrade won't go past 32 of 33 cluster operators upgraded.
      
      Many of the cluster operators are either in a 'Degraded' state or a 'Cannot update' state.

      Expected results:

      Cluster should upgrade to version 4.14.0-0.okd-2023-11-12-042703 successfully.

      Additional info:

      In install-config.yaml I have the following set:
      
      networkType: OpenShiftSDN
      
      and
      
      publish: Internal
      
      I'm also setting my subnets and machineNetwork to match my pre-created VPC.
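
      Putting those together, the relevant part of my install-config.yaml looks roughly like this (a sketch; the region, CIDR and subnet IDs are placeholders):

      networking:
        networkType: OpenShiftSDN
        machineNetwork:
          - cidr: 10.115.20.0/22
      publish: Internal
      platform:
        aws:
          region: eu-west-2
          subnets:
            - subnet-aaaaaaaa
            - subnet-bbbbbbbb
            - subnet-cccccccc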
      
      In the manifest file cluster-network-03-config.yaml I have the following set:
      
      apiVersion: operator.openshift.io/v1
      
      I also have the following set in the same file:
      
        networkType: OpenShiftSDN
        defaultNetwork:
          type: OpenShiftSDN
          openshiftSDNConfig:
            mode: Multitenant
      
      In cluster-ingress-default-ingresscontroller.yaml I have the following set:
      
        endpointPublishingStrategy:
          loadBalancer:
            scope: Internal
            providerParameters:
              type: AWS
              aws:
                type: NLB
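
      Put together, that manifest is roughly the following (a sketch; I'm assuming the standard metadata for the default IngressController):

      apiVersion: operator.openshift.io/v1
      kind: IngressController
      metadata:
        name: default
        namespace: openshift-ingress-operator
      spec:
        endpointPublishingStrategy:
          type: LoadBalancerService
          loadBalancer:
            scope: Internal
            providerParameters:
              type: AWS
              aws:
                type: NLB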
      
      I have also registered this issue here: https://github.com/okd-project/okd/issues/1921.

      People

        joelspeed Joel Speed
        brynjellis_iit Bryn Ellis