
Migrate HyperShift to AWS SDK for Go v2


      Migrate HyperShift's AWS SDK dependencies from v1 to v2 to ensure continued security support, vendor maintenance, and compliance before the July 31, 2025 end-of-life deadline.

      Market Problem

      AWS is ending support for AWS SDK for Go v1 on July 31, 2025. After this date, SDK v1 will no longer receive:

      • Security patches for discovered vulnerabilities
      • Bug fixes for defects
      • Updates for new AWS services or features
      • Vendor support from AWS

      HyperShift currently depends on AWS SDK v1 across 36 production code files representing approximately 10,000 lines of AWS-dependent code. This creates critical risks:

      Who is affected:

      • ROSA HCP (Red Hat OpenShift Service on AWS - Hosted Control Planes) customers
      • ARO HCP (Azure Red Hat OpenShift - Hosted Control Planes) customers using AWS dependencies
      • Self-managed HyperShift deployments on AWS

      Impact of not migrating:

      • Security risk: Vulnerabilities discovered in AWS SDK v1 after EOL will not be patched
      • Compliance risk: Security audits will flag unsupported dependencies
      • Technical debt: Migration becomes increasingly costly as the codebase evolves
      • Support burden: AWS will not address SDK v1 issues after EOL

      This is a time-critical engineering investment required to maintain security posture and vendor support for all AWS-based hosted control plane offerings.

      Proposed Solution

      Complete migration of HyperShift's AWS SDK usage from v1 to v2 before the July 31, 2025 EOL deadline, including:

      Core Migration:

      • Migrate all 36 production code files using AWS SDK v1 to v2 equivalents
      • Update configuration patterns from session-based to context-based
      • Modernize service clients to use v2 modular design
      • Refactor error handling to use v2 error types (smithy.APIError)
      • Update credential providers to v2 interfaces
      • Migrate paginators and waiters to v2 explicit constructors
      • Update testing infrastructure and mocks to v2 patterns

      Dependency Updates:

      • Update Cluster API Provider AWS (CAPA) to v2-compatible version
      • Verify controller-runtime AWS SDK compatibility

      Quality Assurance:

      • Comprehensive unit test updates across all affected components
      • Full E2E test validation on AWS platform
      • Performance benchmarking to ensure no regressions
      • Security scanning of new SDK version

      Strategic Value

      Customer Value

      • Security assurance: Continued security patching protects customer infrastructure
      • Compliance: Meets security audit requirements for supported dependencies
      • Reliability: Access to AWS bug fixes maintains cluster stability
      • Innovation: Enables future use of new AWS services and features

      Business Impact

      • Risk mitigation: Prevents security incidents from unpatched SDK vulnerabilities
      • Compliance: Maintains SOC2, ISO27001, FedRAMP compliance postures
      • Support efficiency: AWS support available for SDK-related issues
      • Cost avoidance: Prevents emergency migration costs if a security issue is discovered post-EOL

      Success Criteria

      Completion

      • Zero AWS SDK v1 imports remain in HyperShift codebase
      • All CAPA dependencies updated to v2-compatible versions
      • Migration completed and shipped before July 31, 2025
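
The "zero v1 imports" criterion can be checked mechanically. The following is a minimal sketch using only the Go standard library: it parses a source file's import block and reports paths under the v1 module path `github.com/aws/aws-sdk-go/`. The function name `findV1Imports` and the sample source are illustrative assumptions, not part of the HyperShift repository.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// v1Prefix is the module path of AWS SDK for Go v1. The trailing slash keeps
// the v2 module path ("github.com/aws/aws-sdk-go-v2") from matching.
const v1Prefix = "github.com/aws/aws-sdk-go/"

// findV1Imports parses Go source and returns the SDK v1 import paths it uses.
func findV1Imports(filename, src string) ([]string, error) {
	fset := token.NewFileSet()
	// ImportsOnly stops parsing after the import declarations.
	f, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var hits []string
	for _, imp := range f.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		if strings.HasPrefix(path, v1Prefix) {
			hits = append(hits, path)
		}
	}
	return hits, nil
}

func main() {
	// Sample file mixing a v1 import and a v2 import.
	src := `package demo

import (
	"context"

	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go-v2/config"
)
`
	hits, err := findV1Imports("demo.go", src)
	if err != nil {
		panic(err)
	}
	fmt.Println(hits)
}
```

A check like this could run in CI over every non-vendored `.go` file, failing the build once the migration is declared complete and a v1 import reappears.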

      Quality

      • 100% unit test pass rate with AWS SDK v2
      • 100% E2E test pass rate for AWS platform (ROSA HCP, standalone)
      • Zero performance regression in AWS operations (EC2, S3, Route53, etc.)
      • Zero functional regressions reported in QE or production

      Adoption

      • Target OpenShift release (4.20 or earlier) ships with migrated code
      • All customer deployments receive migrated version through normal upgrade path
      • Documentation updated with v2 usage patterns

      Security

      • Security scan shows zero critical/high vulnerabilities in AWS SDK dependency
      • Compliance teams confirm supported dependency status
      • No security incidents related to AWS SDK dependencies post-migration

      Scope

      Affected Components (36 files)

      Critical Production Components:

      1. CLI Infrastructure Management (cmd/infra/aws/)

      • Infrastructure creation/destruction workflows
      • AWS services: EC2, Route53, S3, IAM, ELB/ELBv2, RAM, STS
      • Components: VPC/subnet/security group creation, NAT gateways, DNS zone management, IAM role/policy provisioning, load balancer setup
      • Impact: Core hypershift CLI commands for AWS infrastructure provisioning

      2. HyperShift Operator - AWS Platform Controller

      • hypershift-operator/controllers/platform/aws/ - AWS Endpoint Service management
      • hypershift-operator/controllers/hostedcluster/ - S3 OIDC bucket operations
      • hypershift-operator/controllers/nodepool/ - EC2 subnet validation and instance metrics
      • AWS services: EC2, S3, ELBv2
      • Impact: Core reconciliation logic for AWS-hosted clusters (ROSA HCP, standalone AWS)

      3. Control Plane Operator - Private Link Controller

      • control-plane-operator/controllers/awsprivatelink/ - VPC Endpoint Service management
      • control-plane-operator/controllers/awsprivatelink/route53.go - Private DNS integration
      • AWS services: EC2, Route53, KMS
      • Impact: Critical for Private cluster functionality - affects PrivateLink-based customer deployments

      4. Shared AWS Utilities (support/awsutil/)

      • STS role assumption with web identity federation
      • AWS error code handling and retry logic
      • Security group operations
      • Impact: Foundation utilities used across all AWS components

      5. Karpenter Operator - Machine Approver

      • karpenter-operator/controllers/karpenter/machine_approver.go
      • EC2 instance verification for auto-scaling
      • Impact: NodePool auto-scaling with Karpenter

      6. Etcd Backup Service

      • etcd-backup/etcdbackup.go - S3 snapshot uploads
      • Impact: Cluster backup and disaster recovery

      7. CLI Utilities

      • Bastion host creation/destruction
      • Console log retrieval from EC2 instances
      • Impact: Debugging and troubleshooting workflows

      8. E2E Test Infrastructure (9 files)

      • AWS resource creation/cleanup for tests
      • Shared OIDC provider setup
      • KMS encryption validation, tag propagation testing
      • Impact: CI/CD test reliability and coverage

      9. Contrib Tools

      • Zone cleanup utilities
      • Impact: Operations and maintenance scripts

      AWS Services Utilized (7 services)

      • EC2 (16 files) - Instance management, VPC/networking, security groups, volumes
      • Route53 (6 files) - DNS zones, record sets, private hosted zones
      • S3 (5 files) - OIDC provider buckets, etcd backup storage, multipart uploads
      • IAM (4 files) - Roles, policies, OIDC identity providers
      • ELBv2 (4 files) - Application/Network Load Balancers, Endpoint Services
      • ELB Classic (2 files) - Legacy load balancer operations
      • Additional: KMS (encryption), RAM (shared VPC), STS (role assumption), Resource Groups Tagging API

      Technical Patterns that might require migration

      • Session Management: aws/session package usage across all components
      • Credentials: Static credentials, STS web identity, role assumption patterns
      • Error Handling: awserr package for error code extraction and retry logic
      • Testing Interfaces: Heavy use of _iface packages (ec2iface, s3iface, etc.) for mocking
      • Request Customization: Custom user agent headers ("openshift.io/hypershift"), retry handlers
      • S3 Multipart: s3/s3manager for efficient large file uploads

      Epics (Planned)

      Phase 1: Test Infrastructure & CLI tools

      • Epic 1: Migrate E2E test infrastructure to validate patterns and create examples
      • Epic 2: Migrate CLI infrastructure commands (cmd/infra/aws/) - largest code surface

      Phase 2: Operators & supporting components

      • Epic 3: Migrate HyperShift operator AWS controllers
      • Epic 4: Migrate Control Plane operator AWS controllers (PrivateLink, Route53)
      • Epic 5: etcd-backup, contrib tools, aws-encryption-provider

      Phase 3: Docs

      • Epic 6: Update dev documentation, migration guides, developer onboarding materials

      Out of Scope

      • Feature additions beyond migration requirements
      • Performance optimizations not directly related to SDK migration
      • Changes to non-AWS platform support (Azure, PowerVS, KubeVirt, OpenStack, Agent)
      • Refactoring of AWS infrastructure management patterns (keep existing patterns, just update SDK)

      Dependencies

       

      Dependency         | SDKv1      | SDKv2             | Status        | Impact
      CAPA               | ❌ None    | ✅ Fully migrated | ✅ Complete   | No blocker - can use as reference
      Karpenter          | ❌ None    | ✅ Fully migrated | ✅ Complete   | No blocker - can use as reference
      controller-runtime | ❌ None    | ❌ None           | ✅ N/A        | No impact - cloud-agnostic
      HyperShift         | ⚠️ v1.55.7 | Indirect only     | 🔄 To migrate | Primary work needed

       

      Hard Deadline:

      • July 31, 2025: AWS SDK v1 end-of-life - MUST ship before this date

      References

              rhn-support-yli2 Yu Li
              asegurap1@redhat.com Antoni Segura Puimedon
              Salvatore Dario Minonne Salvatore Dario Minonne