OpenShift Bugs / OCPBUGS-45184

Shared VPC: AWS client fails to assume role when token creation is delayed


    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • 4.17.z, 4.18
    • HyperShift
    • Important
    • None
    • Proposed
    • False
    • *Cause*: A hosted cluster is created in a shared VPC.
      *Consequence*: In some cases, the private link controller fails to assume the shared VPC role that it needs to manage VPC endpoints in the shared VPC.
      *Fix*: Ensure that a client is created on every reconciliation in the private link controller so that it can recover from invalid clients.
      *Result*: The hosted cluster's VPC endpoints can be created reliably and the hosted cluster comes up.
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-44698. The following is the description of the original issue:

      Description of problem:

          In the integration environment, creating a ROSA HostedCluster with a shared VPC results in a VPC endpoint that is not available.

      Version-Release number of selected component (if applicable):

          4.17.3

      How reproducible:

          Sometimes (currently every time in integration, but could be due to timing)

      Steps to Reproduce:

          1. Create a HostedCluster with shared VPC
          2. Wait for HostedCluster to come up
          

      Actual results:

      VPC endpoint never gets created due to errors like:
      {"level":"error","ts":"2024-11-18T20:37:51Z","msg":"Reconciler error","controller":"awsendpointservice","controllerGroup":"hypershift.openshift.io","controllerKind":"AWSEndpointService","AWSEndpointService":{"name":"private-router","namespace":"ocm-int-2f4labdgi2grpumbq5ufdsfv7nv9ro4g-cse2etests-gdb"},"namespace":"ocm-int-2f4labdgi2grpumbq5ufdsfv7nv9ro4g-cse2etests-gdb","name":"private-router","reconcileID":"bc5d8a6c-c9ad-4fc8-8ead-6b6c161db097","error":"failed to create vpc endpoint: UnauthorizedOperation","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222"}
          

      Expected results:

          VPC endpoint gets created

      Additional info:

          Deleting the control plane operator pod gets things working again.
      The theory is that if the control plane operator pod is delayed in obtaining a web identity token, then the client will not assume the role that was passed to it.
      
      Currently the client is created only once, at startup; it should be created on every reconcile.

              cewong@redhat.com Cesar Wong
              openshift-crt-jira-prow OpenShift Prow Bot
              Ohad Aharoni
