  1. OpenShift Bugs
  2. OCPBUGS-29252

OS Provisioning Timeout Getting Azure Instance Metadata


    • Bug
    • Resolution: Unresolved
    • Critical
    • 4.16.0
    • 4.13, 4.12, 4.11, 4.14, 4.15, 4.16
    • RHCOS
    • None
    • No
    • False
    • Cause (the user action or circumstances that trigger the bug):
      The Azure Instance Metadata Service (IMDS) started returning the 410 HTTP status code for metadata requests more frequently than it used to.

      Consequence (what the user experience is when the bug occurs):
      When it receives an HTTP error status code below 500, Ignition stops retrying the userdata fetch, which leads to RHCOS node boot failures and OCP cluster installation failures.

      Fix (what has changed to fix the bug; do not include overly technical details):
      Ignition now ignores some HTTP status codes and retries fetching the Ignition config from the metadata server.

      Result (what happens now that the patch is applied):
      Ignition now waits for the metadata to become available, after which the RHCOS nodes boot and the cluster installation succeeds.
    • Bug Fix
    • In Progress

      Description of problem: VMs are receiving `410 Gone` errors and failing to provision.

      According to Microsoft's documented recommendations, and their direct recommendation during our outage bridge call, the request must be retried after 70s to succeed.

      Version-Release number of selected component (if applicable): 4.12

      Steps to Reproduce:
          1. Attempt to provision an ARO cluster in any of "eastus", "australiaeast", "japaneast", "uswest"
          2. Monitor node provisioning for 410 Gone errors
          3. Node(s) should fail to provision
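      Step 2 can be partly automated. The following is a minimal Go sketch, assuming the serial console log is available as text and that Ignition logs each request as an `attempt #N` line followed by a `GET result:` line, as in the excerpts in this report. The helper name `goneAttempts` is hypothetical and not part of Ignition.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// attemptRe matches the attempt counter in Ignition's GET log lines,
// e.g. "...&format=text: attempt #2".
var attemptRe = regexp.MustCompile(`attempt #(\d+)`)

// goneAttempts is a hypothetical helper: it returns the attempt numbers
// whose following "GET result:" line reports Gone (HTTP 410).
func goneAttempts(serialLog string) []string {
	var gone []string
	last := "" // most recent attempt number seen so far
	for _, line := range strings.Split(serialLog, "\n") {
		if m := attemptRe.FindStringSubmatch(line); m != nil {
			last = m[1]
			continue
		}
		if strings.Contains(line, "GET result: Gone") && last != "" {
			gone = append(gone, last)
		}
	}
	return gone
}

func main() {
	// Sample lines copied from the master-0 serial log in this report.
	sample := `Feb 08 12:36:15 ignition[1013]: GET http://169.254.169.254/metadata/instance/compute/userData?api-version=2021-01-01&format=text: attempt #2
Feb 08 12:36:15 ignition[1013]: GET result: Gone`
	fmt.Println(goneAttempts(sample))
}
```

      A non-empty result means at least one metadata request came back with 410 on that node.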

      Actual results:

          In jmilhau-test3, master-0 failed to download its Ignition config after receiving a 410 on the second attempt (extracted from the serial logs; full serial logs here):

      Feb 08 12:36:15 ignition[1013]: GET error: Get "http://169.254.169.254/metadata/instance/compute/userData?api-version=2021-01-01&format=text": dial tcp 169.254.169.254:80: connect: network is unreachable
      Feb 08 12:36:15 ignition[1013]: GET http://169.254.169.254/metadata/instance/compute/userData?api-version=2021-01-01&format=text: attempt #2
      Feb 08 12:36:15 ignition[1013]: GET result: Gone

          master-1 is able to GET the same resource on the 3rd attempt (full serial logs):

      [ 6.644027] ignition[979]: GET http://169.254.169.254/metadata/instance/compute/userData?api-version=2021-01-01&format=text: attempt #3
      [ 6.729304] ignition[979]: GET result: OK

          MSFT pointed us to their docs, which specify that after receiving a 410, the request can be retried after 70s: Azure Instance Metadata Service for virtual machines - Azure Virtual Machines | Microsoft Learn

          They insisted that even though the standard HTTP spec says a 410 should not be retried, we should/must retry for this specific use case. The Ignition service, however, stops retrying after receiving a "410: Gone" error, in line with the HTTP spec (it retries on other errors).

      Expected results:

          Node OSs to provision successfully.

      Additional info:

      
      
          

      forum-rhel-coreos

            travier@redhat.com Timothée Ravier
            rh-ee-sfairchi Steven Fairchild
            Michael Nguyen Michael Nguyen