[OCPBUGS-22200] Workers fail to join cluster if metadata service is temporarily unavailable on first boot - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.15.0
Affects Version/s: 4.14
Component/s: Cloud Compute / OpenStack Provider
Labels:
- Triaged

Test Coverage:

?
Severity:
Moderate
Regression:
No
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None
Release Note Text:

Previously, worker nodes on {rh-openstack-first} were named with domain components if the Nova metadata service was unavailable the first time the worker nodes booted. {product-title} expects the node names to be the same as the Nova instance names. The name discrepancy caused the nodes' certificate signing requests to be rejected, and the nodes could not join the cluster. With this update, the worker nodes wait and retry the metadata service indefinitely on first boot, ensuring that the nodes are correctly named. (link:https://issues.redhat.com/browse/OCPBUGS-22200[*OCPBUGS-22200*])

Fixes a bug where workers could fail to join the cluster on creation if there was a temporary outage of the Nova metadata service.

OpenShift expects OpenStack worker nodes to have the same name as the Nova instance they run on, with no domain component. If the Nova metadata service was temporarily unavailable the first time the worker booted, the worker would instead use its hostname, which by default includes a domain component. This would cause the node's certificate signing request to be rejected, and the node would not be able to join the cluster.

With this fix, the worker node waits and retries indefinitely for a response from the metadata service on first boot, so it always gets the correct name.

Release Note Type:
Bug Fix
Release Note Status:
Proposed
Target Version:

4.15.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This was originally reported in AWS (details below), but the OpenStack configuration suffers the same issue. If the metadata query for the instance name fails on initial boot, kubelet will start with an invalid nodename and will fail to come up.
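The wait-and-retry behaviour described in the release note could look roughly like the sketch below. This is an illustration only, not the code that shipped with the fix: the function names, the retry delay, and the choice of Python are assumptions, and the URL is the standard OpenStack metadata endpoint, assumed to be reachable from the instance.

```python
import json
import time
import urllib.request

# Standard OpenStack metadata endpoint (assumed reachable from the instance).
METADATA_URL = "http://169.254.169.254/openstack/latest/meta_data.json"


def fetch_name_from_metadata(url=METADATA_URL, timeout=5.0):
    """Make one attempt to read the Nova instance name from the metadata service."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["name"]


def wait_for_instance_name(fetch=fetch_name_from_metadata, delay=5.0, sleep=time.sleep):
    """Retry `fetch` until it returns a name; never fall back to the local hostname.

    `fetch`, `delay`, and `sleep` are hypothetical parameters added so the loop
    can be exercised without a real metadata service.
    """
    while True:
        try:
            name = fetch()
            if name:
                return name
        except (OSError, ValueError, KeyError):
            # OSError covers connection errors and timeouts (URLError is a subclass),
            # ValueError covers malformed JSON, KeyError a missing "name" field.
            pass
        sleep(delay)
```

The point of the design is the absence of a fallback: a node that blocks until the metadata service answers starts late, whereas a node that falls back to its hostname starts with a name the cluster will not accept.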

Description of problem:

Worker CSRs are pending, so no worker nodes are available.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-10-06-234925

How reproducible:

Always

Steps to Reproduce:

Create a cluster with profile - aws-c2s-ipi-disconnected-private-fips

Actual results:

Worker CSRs are pending.

Expected results:

Workers should be up and running, with all CSRs approved.

Additional info:

“failed to find machine for node ip-10-143-1-120” appears in the logs of cluster-machine-approver.

It seems the node names should look like “ip-10-143-1-120.ec2.internal”.

failing here - https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L263
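To illustrate why a node name without its domain component would produce the "failed to find machine" message, here is a simplified sketch. It assumes the approver looks for a Machine whose recorded internal DNS name exactly equals the node name in the CSR; the `Machine` class, the `find_machine_for_node` helper, and the machine name `example-worker` are hypothetical, and the real logic in the linked csr_check.go may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Hypothetical stand-in for a Machine object and its internal DNS names."""
    name: str
    internal_dns: list = field(default_factory=list)


def find_machine_for_node(node_name, machines):
    """Return the machine with an internal DNS name equal to node_name, else None."""
    for machine in machines:
        if node_name in machine.internal_dns:
            return machine
    return None


machines = [Machine(name="example-worker", internal_dns=["ip-10-143-1-120.ec2.internal"])]

# Short name, as seen in the approver log: exact comparison finds no machine.
short_match = find_machine_for_node("ip-10-143-1-120", machines)

# Fully qualified name: matches the recorded internal DNS name.
fqdn_match = find_machine_for_node("ip-10-143-1-120.ec2.internal", machines)
```

Under this assumption, a CSR carrying the short name is never matched to a machine and so stays pending, which is consistent with the symptom reported above.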

Must-gather - https://drive.google.com/file/d/15tz9TLdTXrH6bSBSfhlIJ1l_nzeFE1R3/view?usp=sharing

cluster - https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/238922/

template for installation - https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_14/ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-fips-c2s-ci

cc yunjiang-1 rhn-support-zhsun