Details
- Bug
- Resolution: Duplicate
- Major
- None
- 4.14
- None
- Critical
- No
- False
Description
Description of problem:
I'm running on AWS and using IPI. The install is customized to allow me to install into pre-created VPC. The VPC has a DHCP option set where the value of 'domain-name' is changed from the default of 'eu-west-2.compute.internal' to 'ice-aws.cloud'. Only other option set in the dhcp option set is 'domain-name-servers: AmazonProvidedDNS'. During an upgrade from 4.13.0-0.okd-2023-10-28-065448 to 4.14.0-0.okd-2023-11-12-042703 one of the masters started complaining that it was unable to register the node with the API server. When I look at the [kubernetes.io/hostname] label on the nodes they are set to ip-x-x-x-x.ice-aws.cloud. When the issue arises where the node can't register with the API server, the error message says:kubelet_node_status.go:72] "Attempting to register node" node="ip-10-38-20-100.ice-aws.cloud" 1623Jan 17 12:00:32.044690 ip-10-38-20-100 kubenswrapper[1408]: E0117 12:00:32.044672 1408 kubelet_node_status.go:94] "Unable to register node with API server" err="nodes "ip-10-38-20-100.ice-aws.cloud" is forbidden: node "ip-10-38-20-100.eu-west-2.compute.internal" is not allowed to modify node "ip-10-38-20-100.ice-aws.cloud"" node="ip-10-38-20-100.ice-aws.cloud" I've destroyed and run this through several times testing different things and have found the following: 1. If I use a VPC with the default DHCP option set values of 'domain-name: eu-west-2.compute.internal' and 'domain-name-servers: AmazonProvidedDNS' all upgrades work fine. 2. All upgrades on 4.13 work fine even with the custom domain-name value in the dhcp option set. 3. When describing the failing master node during upgrade to 4.14, the 'Addresses' values of 'Hostname' and 'InternalDNS' change several times. It flips between the default value of eu-west-2.compute.internal and the custom value of ice-aws.cloud. 4. When the upgrade starts to fail, there are some Pending CSR's for a new master node. 
Approving them creates a 4th master with the custom domain name as its suffix, shown below:

    oc get nodes
    NAME                                          STATUS                        ROLES                  AGE     VERSION
    ip-10-115-20-120.eu-west-2.compute.internal   Ready                         worker                 3h2m    v1.26.9+636f2be
    ip-10-115-20-24.eu-west-2.compute.internal    NotReady,SchedulingDisabled   control-plane,master   3h13m   v1.26.9+636f2be
    ip-10-115-20-24.ice-aws.cloud                 Ready                         control-plane,master   2m58s   v1.27.6+b49f9d1
    ip-10-115-21-158.eu-west-2.compute.internal   Ready                         worker                 3h4m    v1.26.9+636f2be
    ip-10-115-21-62.eu-west-2.compute.internal    Ready                         control-plane,master   3h13m   v1.26.9+636f2be
    ip-10-115-22-124.eu-west-2.compute.internal   Ready                         control-plane,master   3h13m   v1.26.9+636f2be
    ip-10-115-22-137.eu-west-2.compute.internal   Ready                         worker                 3h4m    v1.26.9+636f2be
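For anyone trying to confirm the same symptoms, the observations above can be checked with a few commands. This is a sketch against a live cluster and AWS account; the VPC ID, DHCP option set ID, and node name are placeholders, not the actual IDs from this environment.

    # Which DHCP option set is attached to the VPC, and what does it contain?
    # (vpc-0123456789abcdef0 / dopt-0123456789abcdef0 are placeholder IDs)
    aws ec2 describe-vpcs --vpc-ids vpc-0123456789abcdef0 \
      --query 'Vpcs[0].DhcpOptionsId' --output text
    aws ec2 describe-dhcp-options --dhcp-options-ids dopt-0123456789abcdef0 \
      --query 'DhcpOptions[0].DhcpConfigurations'

    # Watch the Hostname/InternalDNS addresses on the failing master flip
    # between the default and custom suffixes (point 3 above):
    oc get node ip-10-38-20-100.eu-west-2.compute.internal \
      -o jsonpath='{.status.addresses}'

    # Look for the Pending CSRs that appear when the upgrade starts to fail
    # (point 4 above):
    oc get csr | grep Pending

Note that approving those Pending CSRs is what produces the duplicate 4th master shown in the node list above.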
Version-Release number of selected component (if applicable):
4.14.0-0.okd-2023-11-12-042703
How reproducible:
100%
Steps to Reproduce:
1. Create a VPC using a DHCP option set with the domain-name value set to some custom value.
2. Install version 4.13.0-0.okd-2023-10-28-065448 as a customized IPI install, adjusting the install-config.yaml to use your custom VPC/subnets etc.
3. Once the install is complete, upgrade to 4.14.0-0.okd-2023-11-12-042703. This can be either a full upgrade or control-plane first. Both fail.
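Step 1 can be sketched with the AWS CLI as follows. This assumes the same option values described above; the option set and VPC IDs returned/used here are placeholders, and in practice the VPC and subnets would typically be created by your usual tooling (CloudFormation, Terraform, etc.).

    # Create a DHCP option set with a custom domain-name
    # (mirrors the values from the description above)
    aws ec2 create-dhcp-options \
      --dhcp-configurations \
        'Key=domain-name,Values=ice-aws.cloud' \
        'Key=domain-name-servers,Values=AmazonProvidedDNS'

    # Associate it with the pre-created VPC (placeholder IDs)
    aws ec2 associate-dhcp-options \
      --dhcp-options-id dopt-0123456789abcdef0 \
      --vpc-id vpc-0123456789abcdef0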
Actual results:
Cluster upgrade won't go past 32 of 33 cluster operators upgraded. Many of the cluster operators are in either a 'Degraded' or a 'Cannot update' state.
Expected results:
Cluster should upgrade to version 4.14.0-0.okd-2023-11-12-042703 successfully.
Additional info:
In install-config.yaml I have the following set:

    networkType: OpenShiftSDN
    publish: Internal

I'm also setting my subnets and machineNetwork to match my pre-created VPC.

In the manifest file cluster-network-03-config.yaml I have the following set:

    apiVersion: operator.openshift.io/v1
    networkType: OpenShiftSDN
    defaultNetwork:
      type: OpenShiftSDN
      openshiftSDNConfig:
        mode: Multitenant

In cluster-ingress-default-ingresscontroller.yaml I have the following set:

    endpointPublishingStrategy:
      loadBalancer:
        scope: Internal
        providerParameters:
          type: AWS
          aws:
            type: NLB

I have also registered this issue here: https://github.com/okd-project/okd/issues/1921.