[OCPBUGS-29252] OS Provisioning Timeout Getting Azure Instance Metadata - Red Hat Issue Tracker

Type: Bug
Resolution: Obsolete
Priority: Critical
Fix Version/s: 4.16.0
Affects Version/s: 4.13, 4.12, 4.11, 4.14, 4.15, 4.16
Component/s: RHCOS
Labels:
None

Regression:
No
Blocked:
False
Blocked Reason:


None

Release Note Text:

Cause (the user action or circumstances that trigger the bug):
The Azure Instance Metadata Service (IMDS) started returning the 410 HTTP status code for metadata requests at a higher frequency than it used to.

Consequence (what the user experience is when the bug occurs):
When Ignition receives an HTTP error status code below 500, it stops retrying the userdata fetch, which leads to RHCOS node boot failures and OCP cluster installation failures.

Fix (what has changed to fix the bug; do not include overly technical details):
Ignition now ignores some HTTP status codes and retries fetching the Ignition config from the metadata server.

Result (what happens now that the patch is applied):
Ignition now waits for the metadata to become available, after which the RHCOS nodes boot and cluster installation succeeds.

Release Note Type:
Bug Fix
Release Note Status:
In Progress
Target Backport Versions:

4.13.z, 4.12.z, 4.14.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem: VMs are receiving `410 Gone` errors and failing to provision.

According to Microsoft's documentation, and their direct recommendation during our outage bridge call, the request must be retried after 70 seconds to succeed.
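
The retry behavior Microsoft recommends could be sketched roughly as follows. This is only an illustration, not Ignition's actual code: `fetch_userdata`, `RETRY_DELAY_410`, `RETRY_DELAY_DEFAULT`, and `max_attempts` are hypothetical names, and `fetch` and `sleep` are passed in so the loop can be exercised without a real IMDS endpoint. Treating 5xx responses as retryable after a short delay, and all other client errors as fatal, is an assumption made for the sketch.

```python
import time
from typing import Callable, Tuple

# Delay Microsoft recommends before retrying after a 410 (per this report).
RETRY_DELAY_410 = 70.0
# Hypothetical short delay for other retryable errors (assumption).
RETRY_DELAY_DEFAULT = 1.0


def fetch_userdata(
    fetch: Callable[[], Tuple[int, bytes]],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = 5,
) -> bytes:
    """Call fetch() until it returns HTTP 200 or the attempts run out.

    fetch returns (status_code, body). A 410 is retried after 70 seconds
    instead of being treated as a permanent failure.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status == 410:
            delay = RETRY_DELAY_410
        elif status >= 500:
            delay = RETRY_DELAY_DEFAULT
        else:
            # Other client errors stay fatal in this sketch.
            raise RuntimeError(f"non-retryable HTTP status {status}")
        # Only wait if another attempt will follow.
        if attempt < max_attempts:
            sleep(delay)
    raise TimeoutError(f"no successful response after {max_attempts} attempts")
```

With five attempts, a node in this sketch would tolerate up to four consecutive 410 responses, waiting 70 seconds after each, before giving up.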

    Version-Release number of selected component (if applicable): 4.12

    Steps to Reproduce:
    1. Attempt to provision an ARO cluster in any of "eastus", "australiaeast", "japaneast", "uswest"
    2. Monitor node provisioning for 410 Gone errors
    3. Node(s) should fail to provision

Actual results:

    In jmilhau-test3, master-0 failed to download its Ignition config after receiving a 410 on the second attempt (extracted from the serial logs):

Feb 08 12:36:15 ignition[1013]: GET error: Get "http://169.254.169.254/metadata/instance/compute/userData?api-version=2021-01-01&format=text": dial tcp 169.254.169.254:80: connect: network is unreachable
Feb 08 12:36:15 ignition[1013]: GET http://169.254.169.254/metadata/instance/compute/userData?api-version=2021-01-01&format=text: attempt #2
Feb 08 12:36:15 ignition[1013]: GET result: Gone

    master-1 is able to GET the same resource on the 3rd attempt (also from the serial logs):

[ 6.644027] ignition[979]: GET http://169.254.169.254/metadata/instance/compute/userData?api-version=2021-01-01&format=text: attempt #3
[ 6.729304] ignition[979]: GET result: OK

    MSFT pointed us to their docs, which specify that after receiving a 410 the request can be retried after 70s: Azure Instance Metadata Service for virtual machines - Azure Virtual Machines | Microsoft Learn

    They insisted that even though the standard HTTP specs say a 410 should not be retried, we should/must retry for this specific use case. The Ignition service, however, stops retrying after receiving a "410: Gone" error, in line with the HTTP specs (it retries on other errors).
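
To check serial logs for this failure, a small scanner along the following lines may help. It is a sketch that assumes the Ignition log line format seen in the excerpts above (`ignition[PID]: GET result: <text>`); the names `GET_RESULT`, `ignition_get_results`, and `hit_410` are hypothetical and not part of any existing tool.

```python
import re
from typing import Iterable, List

# Matches lines such as "ignition[1013]: GET result: Gone".
GET_RESULT = re.compile(r"ignition\[\d+\]: GET result: (?P<result>.+?)\s*$")


def ignition_get_results(lines: Iterable[str]) -> List[str]:
    """Return the result text of every Ignition GET found in the log lines."""
    results = []
    for line in lines:
        match = GET_RESULT.search(line)
        if match:
            results.append(match.group("result"))
    return results


def hit_410(lines: Iterable[str]) -> bool:
    """True if any Ignition GET in the log ended with 'Gone' (HTTP 410)."""
    return "Gone" in ignition_get_results(lines)
```

Passing the lines of a node's serial log to `hit_410` gives a quick yes/no on whether that node ran into the 410 response.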

Expected results:

    Node OSs to provision successfully.

Additional info:

forum-rhel-coreos

is cloned by

RHEL-24950 OS Provisioning Timeout Getting Azure Instance Metadata

Closed

relates to

OCPBUGS-29441 [4.16] Bootimage bump tracker

Closed

OCPBUGS-29442 [4.15] Bootimage bump tracker

Closed

OCPBUGS-29626 [4.14] Bootimage bump tracker

Closed

OCPBUGS-29627 [4.13] Bootimage bump tracker

Closed

OCPBUGS-30768 [4.12] Bootimage bump tracker

Closed

(1 relates to)

Assignee: Timothée Ravier

Reporter: Steven Fairchild

QA Contact: Michael Nguyen

Votes: 0

Watchers: 12

Created: 2024/02/08 4:00 PM

Updated: 2024/09/10 7:39 PM

Resolved: 2024/09/10 7:39 PM
