OCPBUGS-62985

CAPI AWSMachine creation race condition leads to 2 EC2 instances for 1 machine


    • Quality / Stability / Reliability
    • Important

      Description of problem:

      The CAPI AWSMachine controller can create two EC2 instances for a single Machine. Only one instance ID is recorded on the Machine object, so the other instance is leaked and keeps running unmanaged, incurring additional cost.
          

      Version-Release number of selected component (if applicable):

      Observed in 4.17.39; the current upstream main branch does not appear to contain a fix.
          

      How reproducible:

      Rarely.
          

      Steps to Reproduce:

          1. Create an instance for a Machine.
          2. Instance creation fails for a while because the DescribeInstanceTypes permission is missing. This may contribute to triggering the race condition, but it is unknown whether it is necessary; after a few retries, creation continues without the permission.
          3. Tags are created from the resource without errors ([log message in GitHub | https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/46ee041ed5c5b826d55d0ac4acb430c3965e6035/pkg/cloud/services/ec2/instances.go#L306 ]).
          4. The controller requeues at [this line of the AWSMachine controller (GitHub) | https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/46ee041ed5c5b826d55d0ac4acb430c3965e6035/controllers/awsmachine_controller.go#L666 ] before the tags are applied in AWS.
          5. On the next loop, the controller tries to find the EC2 instance ("Looking for existing machine instance by tags"), finds none, and creates a new EC2 instance.
      
      I'm not sure how to trigger a delay in tag application in AWS to test a solution, but the AWS documentation gives no guarantee that new tags are visible to read calls immediately after the update call.
          

      Actual results:

      Reconciliation of the Machine is requeued before the tags are confirmed to be present.
          

      Expected results:

      Before requeuing, the controller verifies that the tags are visible, reading them the same way instances.go does when "Looking for existing machine instance by tags".
          

      Additional info:

      
          

              rh-ee-cschlott Christian Schlotter
              ljakubow2.openshift Leszek Jakubowski
              Huali Liu Huali Liu