OCPBUGS-22200

Workers fail to join cluster if metadata service is temporarily unavailable on first boot

Details

    • ?
    • Moderate
    • No
    • Rejected
    • False
    • Previously, worker nodes on {rh-openstack-first} were named with domain components if the Nova metadata service was unavailable the first time the worker nodes booted. {product-title} expects the node names to be the same as the Nova instance names. The name discrepancy caused the nodes' certificate signing requests to be rejected, and the nodes could not join the cluster. With this update, the worker nodes wait and retry the metadata service indefinitely on first boot, ensuring that the nodes are correctly named. (link:https://issues.redhat.com/browse/OCPBUGS-22200[*OCPBUGS-22200*])

      Fixes a bug where workers could fail to join the cluster on creation if there was a temporary outage of the Nova metadata service.

      OpenShift expects OpenStack worker nodes to have the same name as the Nova instance they run on, with no domain component. If the Nova metadata service was temporarily unavailable the first time the worker booted, the worker would instead use its hostname, which by default includes a domain component. This caused the node's certificate signing request to be rejected, and the node could not join the cluster.

      With this fix, the worker node waits and retries indefinitely for a response from the metadata service on first boot, so it always gets the correct name.
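
      The wait-and-retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the code shipped in the fix: the function names, the retry delay, and the use of the link-local metadata URL and its "name" field are assumptions made for the example.

```python
import json
import time
import urllib.request

# Assumed endpoint: the conventional link-local address of the Nova metadata service.
METADATA_URL = "http://169.254.169.254/openstack/latest/meta_data.json"


def fetch_instance_name(url=METADATA_URL, timeout=5):
    """Return the instance name reported by the metadata service (assumed "name" field)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["name"]


def wait_for_instance_name(fetch=fetch_instance_name, delay=5.0, sleep=time.sleep):
    """Retry `fetch` until it succeeds; never fall back to the local hostname."""
    while True:
        try:
            name = fetch()
        except (OSError, ValueError, KeyError):
            # Network errors are OSError subclasses; ValueError covers malformed JSON;
            # KeyError covers a response without a "name" field.
            sleep(delay)
            continue
        if name:
            return name
        # An empty name is treated like a failure and retried.
        sleep(delay)
```

      The point of the sketch is that there is no timeout and no hostname fallback: the caller blocks until the metadata service answers, so the node name can only ever come from Nova.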
    • Bug Fix
    • Proposed

    Description

      This was originally reported in AWS (details below), but the OpenStack configuration suffers the same issue. If the metadata query for the instance name fails on initial boot, kubelet will start with an invalid nodename and will fail to come up.

      Description of problem:

      Worker CSRs are pending, so no worker nodes are available.

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-2023-10-06-234925

      How reproducible:

      Always

      Steps to Reproduce:

      Create a cluster with the profile aws-c2s-ipi-disconnected-private-fips.

      Actual results:

      Worker CSRs are pending.

      Expected results:

      Workers should be up and running, with all CSRs approved.

      Additional info:

      “failed to find machine for node ip-10-143-1-120” appears in the logs of cluster-machine-approver.
      
      It seems the node names should look like “ip-10-143-1-120.ec2.internal”.
      
      Failing here: https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L263
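
      To illustrate why a missing or extra domain component is enough to break the lookup, here is a minimal sketch. It assumes the approver matches the node name from the CSR against the DNS names recorded for each machine by exact string comparison; the data layout, the function name, and the machine name "worker-a" are hypothetical and are not taken from cluster-machine-approver.

```python
def find_machine_for_node(node_name, machines):
    """Return the name of the first machine whose recorded DNS names include node_name."""
    for machine in machines:
        # Exact string match: "ip-10-143-1-120" does not match
        # "ip-10-143-1-120.ec2.internal".
        if node_name in machine["internal_dns"]:
            return machine["name"]
    return None


# Hypothetical machine record using the names quoted in this report.
machines = [{"name": "worker-a", "internal_dns": ["ip-10-143-1-120.ec2.internal"]}]

short = find_machine_for_node("ip-10-143-1-120", machines)
full = find_machine_for_node("ip-10-143-1-120.ec2.internal", machines)
```

      Under these assumptions, the short name finds no machine while the fully qualified name does, so a CSR submitted under the short name would stay pending.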

       

      Must-gather - https://drive.google.com/file/d/15tz9TLdTXrH6bSBSfhlIJ1l_nzeFE1R3/view?usp=sharing

      cluster - https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/238922/

      template for installation - https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_14/ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-fips-c2s-ci

       

      cc yunjiang-1 rhn-support-zhsun 

            People

              rhn-gps-mbooth Matthew Booth
              rh-ee-miyadav Milind Yadav
              Zhaohua Sun