OpenShift Bugs / OCPBUGS-32184

Upgrades failing in AWS when DHCP Option Set uses user defined 'domain-name'


Details

    • Critical

    Description

      Description of problem:

      I'm running on AWS and using IPI.
      The install is customized to allow me to install into a pre-created VPC.
      The VPC has a DHCP option set where the value of 'domain-name' is changed from the default of 'eu-west-2.compute.internal' to 'ice-aws.cloud'.
      The only other option set in the DHCP option set is 'domain-name-servers: AmazonProvidedDNS'.
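
      For reference, the option set is equivalent to something like this (a sketch; the dopt-/vpc- IDs are placeholders):

      # Custom DHCP option set with a user-defined domain-name
      aws ec2 create-dhcp-options \
        --dhcp-configurations \
          "Key=domain-name,Values=ice-aws.cloud" \
          "Key=domain-name-servers,Values=AmazonProvidedDNS"
      # Attach it to the pre-created VPC used for the install
      aws ec2 associate-dhcp-options --dhcp-options-id dopt-xxxxxxxx --vpc-id vpc-xxxxxxxx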
      
      During an upgrade from 4.13.0-0.okd-2023-10-28-065448 to 4.14.0-0.okd-2023-11-12-042703, one of the masters started complaining that it was unable to register its node with the API server.
      
      When I look at the kubernetes.io/hostname label on the nodes, they are set to ip-x-x-x-x.ice-aws.cloud. When the issue arises and the node can't register with the API server, the error messages look like this:
      
      kubelet_node_status.go:72] "Attempting to register node" node="ip-10-38-20-100.ice-aws.cloud"
      Jan 17 12:00:32.044690 ip-10-38-20-100 kubenswrapper[1408]: E0117 12:00:32.044672 1408 kubelet_node_status.go:94] "Unable to register node with API server" err="nodes "ip-10-38-20-100.ice-aws.cloud" is forbidden: node "ip-10-38-20-100.eu-west-2.compute.internal" is not allowed to modify node "ip-10-38-20-100.ice-aws.cloud"" node="ip-10-38-20-100.ice-aws.cloud"
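
      To see which hostname each node has registered with, and to pull the kubelet messages above, I use roughly the following (the node name is an example):

      oc get nodes -L kubernetes.io/hostname
      oc adm node-logs ip-10-38-20-100.eu-west-2.compute.internal -u kubelet | grep kubelet_node_status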
      
      I've destroyed and re-run this several times, testing different things, and have found the following:
      
      1. If I use a VPC with the default DHCP option set values of 'domain-name: eu-west-2.compute.internal' and 'domain-name-servers: AmazonProvidedDNS', all upgrades work fine.
      2. All upgrades on 4.13 work fine, even with the custom domain-name value in the DHCP option set.
      3. When describing the failing master node during the upgrade to 4.14, the 'Addresses' values of 'Hostname' and 'InternalDNS' change several times, flipping between the default value of eu-west-2.compute.internal and the custom value of ice-aws.cloud (the commands I use to watch this are shown after the node list below).
      4. When the upgrade starts to fail, there are some pending CSRs for a new master node.  Approving them creates a 4th master with the custom domain name as its suffix, shown below:
      
      oc get nodes
      NAME                                          STATUS                        ROLES                  AGE     VERSION
      ip-10-115-20-120.eu-west-2.compute.internal   Ready                         worker                 3h2m    v1.26.9+636f2be
      ip-10-115-20-24.eu-west-2.compute.internal    NotReady,SchedulingDisabled   control-plane,master   3h13m   v1.26.9+636f2be
      ip-10-115-20-24.ice-aws.cloud                 Ready                         control-plane,master   2m58s   v1.27.6+b49f9d1
      ip-10-115-21-158.eu-west-2.compute.internal   Ready                         worker                 3h4m    v1.26.9+636f2be
      ip-10-115-21-62.eu-west-2.compute.internal    Ready                         control-plane,master   3h13m   v1.26.9+636f2be
      ip-10-115-22-124.eu-west-2.compute.internal   Ready                         control-plane,master   3h13m   v1.26.9+636f2be
      ip-10-115-22-137.eu-west-2.compute.internal   Ready                         worker                 3h4m    v1.26.9+636f2be
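
      For reference, the commands I use for points 3 and 4 above (the node/CSR names are examples):

      # Watch the Hostname/InternalDNS addresses flip on the failing master
      oc describe node ip-10-115-20-24.eu-west-2.compute.internal | grep -A 6 "Addresses:"
      # List the pending CSRs and approve them, which lets the new node name register
      oc get csr
      oc adm certificate approve <csr-name>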

      Version-Release number of selected component (if applicable):

      4.14.0-0.okd-2023-11-12-042703

      How reproducible:

      100%

      Steps to Reproduce:

      1.  Create a VPC using a DHCP option set with the domain-name value set to some custom value.
      2.  Install version 4.13.0-0.okd-2023-10-28-065448 as a customized IPI install, adjusting the install-config.yaml to use your custom VPC/subnets etc.
      3.  Once the install is complete, upgrade to 4.14.0-0.okd-2023-11-12-042703.  This can either be a full upgrade or control-plane first (sketched below).  Both fail.
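
      By 'control-plane first' I mean pausing the worker pool before upgrading, roughly as follows (a sketch; for OKD the exact oc adm upgrade flags may differ):

      # Control-plane first: pause the worker MachineConfigPool, then upgrade
      oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'
      oc adm upgrade --to=4.14.0-0.okd-2023-11-12-042703
      # A full upgrade is the same oc adm upgrade command without pausing the pool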

      Actual results:

      Cluster upgrade won't go past 32 of 33 cluster operators upgraded.
      
      Many of the cluster operators are either in a 'Degraded' state or a 'Cannot update' state.

      Expected results:

      Cluster should upgrade to version 4.14.0-0.okd-2023-11-12-042703 successfully.

      Additional info:

      In install-config.yaml I have the following set:
      
      networkType: OpenShiftSDN
      
      and
      
      publish: Internal
      
      I'm also setting my subnets and machineNetwork to match my pre-created VPC.
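
      Putting those together, the relevant part of my install-config.yaml looks roughly like this (a sketch; the region, CIDR and subnet IDs are placeholders):

      networking:
        networkType: OpenShiftSDN
        machineNetwork:
          - cidr: 10.115.20.0/22
      publish: Internal
      platform:
        aws:
          region: eu-west-2
          subnets:
            - subnet-aaaaaaaa
            - subnet-bbbbbbbb
            - subnet-cccccccc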
      
      In the manifest file cluster-network-03-config.yaml I have the following set:
      
      apiVersion: operator.openshift.io/v1
      
      I also have the following set in the same file:
      
        networkType: OpenShiftSDN
        defaultNetwork:
          type: OpenShiftSDN
          openshiftSDNConfig:
            mode: Multitenant
      
      In cluster-ingress-default-ingresscontroller.yaml I have the following set:
      
        endpointPublishingStrategy:
          loadBalancer:
            scope: Internal
            providerParameters:
              type: AWS
              aws:
                type: NLB
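
      Put together, that manifest is roughly the following (a sketch; I'm assuming the standard metadata for the default IngressController):

      apiVersion: operator.openshift.io/v1
      kind: IngressController
      metadata:
        name: default
        namespace: openshift-ingress-operator
      spec:
        endpointPublishingStrategy:
          type: LoadBalancerService
          loadBalancer:
            scope: Internal
            providerParameters:
              type: AWS
              aws:
                type: NLB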
      
      I have also registered this issue here: https://github.com/okd-project/okd/issues/1921.

      People

        joelspeed Joel Speed
        brynjellis_iit Bryn Ellis