OpenShift Windows Containers / WINC-1508

Draft: Optimize BYOH tests to use pre-provisioned nodes in Prow CI for faster execution


    • Type: Story
    • Resolution: Unresolved
    • Priority: Major
    • Quality / Stability / Reliability

      Story Summary

      Enable BYOH E2E tests to run in Prow CI by implementing pre-provisioned node support and parallel test execution.

      Current State: BYOH tests cannot run in Prow CI due to excessive runtime (~100 minutes) and resource constraints.

      Target State: BYOH tests run in Prow CI in ~20 minutes using pre-provisioned infrastructure with partial parallelization.

      Problem Statement

      BYOH tests are currently blocked from Prow CI migration because:

      Runtime Issues:

      • Each test provisions Windows VMs for 15+ minutes via MachineSet creation
      • 5 tests running serially = ~100 minutes total runtime
      • Exceeds Prow job time limits and resource quotas

      Infrastructure Issues:

      • Tests are marked [Disruptive] requiring serial execution
      • Node provisioning and destruction create resource churn
      • No mechanism to reuse nodes between test runs

      Impact:

      • BYOH tests remain in legacy Jenkins infrastructure
      • No automated CI coverage for BYOH scenarios
      • Slower feedback loop for Windows Container development

      Solution Approach

      Architecture Changes

      Pre-Provisioned Infrastructure:

      • Terraform provisions 2 Windows nodes before test execution
      • Nodes are registered in windows-instances ConfigMap
      • Tests detect and use existing nodes instead of provisioning new ones

      Framework Enhancements:

      • extractAddressesFromConfigMap() - Retrieve pre-provisioned node addresses
      • Refactored setBYOH() - Dual-mode execution (pre-provisioned vs on-demand)
      • remediateBYOHNodes() - Fast cleanup without node destruction

      Execution Strategy:

      • 1 test runs in parallel (OCP-42484)
      • 3 tests run serially (OCP-42496, OCP-44099, OCP-82694)
      • Total runtime: ~20 minutes

      Implementation Details

      Modified Components

      Repository: openshift-tests-private
      Primary File: test/extended/winc/utils.go

      New Functions

        // Extract addresses from pre-provisioned ConfigMap
        func extractAddressesFromConfigMap(oc *exutil.CLI) ([]string, error)
      
        // Clean nodes between test runs without deconfiguration
        func remediateBYOHNodes(oc *exutil.CLI, addresses []string, privateKey string, iaasPlatform string)
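
To make the intent of extractAddressesFromConfigMap() concrete, here is a minimal sketch of the data-handling step only. It assumes the ConfigMap data has already been fetched into a map; the helper name addressesFromData is hypothetical and not part of the repository. Sorting is an added assumption so that node assignment stays stable between runs.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// addressesFromData returns the ConfigMap keys (node addresses) in sorted
// order. Hypothetical stand-in for the parsing part of
// extractAddressesFromConfigMap; fetching the ConfigMap is omitted.
func addressesFromData(data map[string]string) ([]string, error) {
	if len(data) == 0 {
		return nil, errors.New("windows-instances ConfigMap has no entries")
	}
	addresses := make([]string, 0, len(data))
	for addr := range data {
		addresses = append(addresses, addr)
	}
	// Map iteration order is random, so sort for a stable result.
	sort.Strings(addresses)
	return addresses, nil
}

func main() {
	data := map[string]string{
		"10.0.1.101": "username=Administrator",
		"10.0.1.100": "username=Administrator",
	}
	addresses, err := addressesFromData(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(addresses)
}
```

Returning an error for an empty map matches the acceptance criterion that an existing but empty ConfigMap should fail fast rather than fall back silently.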
        

      Refactored Functions

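        // Now supports dual-mode execution (simplified; elided code shown as ...)
        func setBYOH(oc *exutil.CLI, iaasPlatform string, addressesType []string, machinesetName string, winVersion string) []string {
          // The presence of the windows-instances ConfigMap selects the mode
          if configMapExists() {
            // Pre-Provisioned Mode (Prow CI): reuse the nodes Terraform created
            addresses, err := extractAddressesFromConfigMap(oc)
            if err != nil {
              // Fail the test immediately with a clear error
              ...
            }
            // Nodes already exist, so the readiness wait is capped at 11 minutes
            waitWindowsNodeReady(oc, nodeName, 11*time.Minute)
            return addresses
          }
          // Provisioning Mode (Local/Jenkins): create nodes via MachineSet
          configureMachineset(...)
          waitForMachinesetReady(...)
          return addresses
        }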
        

      ConfigMap Structure

        # Created by Terraform in Prow CI
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: windows-instances
          namespace: openshift-windows-machine-config-operator
        data:
          "10.0.1.100": "username=Administrator"
          "10.0.1.101": "username=Administrator"
          "10.0.1.102": "username=Administrator"
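
Each data value in the ConfigMap above has the form username=<user>. As an illustration, a test helper could validate and extract the user as sketched below. The function name parseUsername is hypothetical, and the strict validation is an assumption rather than existing framework behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// parseUsername extracts the user from a value such as
// "username=Administrator". Hypothetical helper for illustration.
func parseUsername(value string) (string, error) {
	parts := strings.SplitN(value, "=", 2)
	if len(parts) != 2 || parts[0] != "username" || parts[1] == "" {
		return "", fmt.Errorf("malformed windows-instances value %q", value)
	}
	return parts[1], nil
}

func main() {
	user, err := parseUsername("username=Administrator")
	fmt.Println(user, err)
}
```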
        

      Test Analysis

      Affected Tests

      Test ID    | Name                        | Parallel? | Runtime | Notes
      OCP-42484  | BYOH Configure with IP      | YES       | 5 min   | Isolated namespace, parallel-safe
      OCP-42496  | BYOH Deconfiguration        | NO        | 5 min   | Deletes ConfigMap mid-test
      OCP-44099  | SSH Key Rotation            | NO        | 5 min   | Modifies cluster-wide secret
      OCP-82694  | Container Image Mirroring   | NO        | 5 min   | Creates cluster-wide IDMS
      OCP-42516  | BYOH IP+DNS Dual Addressing | N/A       | N/A     | Candidate for deprecation (redundant)

      Test Execution Flow

      Pre-Provisioned Mode (Prow CI):

        Terraform: Provision 2 Windows nodes → 15 minutes (one-time)
          ↓
        Parallel Phase:
          - OCP-42484 (node 1) → 5 minutes
          ↓
        Serial Phase:
          - OCP-42496 (node 2) → 5 minutes
          - OCP-44099 (node 2) → 5 minutes (after remediation)
          - OCP-82694 (node 3) → 5 minutes
      
        Total: ~35 minutes (15 min provision + 20 min tests)
        Per-run: ~20 minutes (provisioning amortized across runs)
        

      Provisioning Mode (Local/Jenkins):

        Each test provisions its own node:
          - OCP-42484: 15 min provision + 5 min test
          - OCP-42496: 15 min provision + 5 min test
          - OCP-44099: 15 min provision + 5 min test
          - OCP-82694: 15 min provision + 5 min test
      
        Total: ~80 minutes (sequential)
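
The totals in the two flows above follow from simple arithmetic, which the sketch below reproduces. The 15-minute and 5-minute figures come from this story; the function names are illustrative, and the pre-provisioned total counts the parallel test's 5 minutes in full, as the flow above does.

```go
package main

import "fmt"

const (
	provisionMinutes = 15 // provisioning time per Windows node, as quoted above
	testMinutes      = 5  // runtime per test, as quoted above
)

// preProvisionedTotal models one up-front provisioning step followed by the tests.
func preProvisionedTotal(tests int) int {
	return provisionMinutes + tests*testMinutes
}

// onDemandTotal models every test provisioning its own node, run sequentially.
func onDemandTotal(tests int) int {
	return tests * (provisionMinutes + testMinutes)
}

func main() {
	fmt.Println("pre-provisioned, 4 tests:", preProvisionedTotal(4))
	fmt.Println("on-demand, 4 tests:", onDemandTotal(4))
	fmt.Println("on-demand, 5 tests:", onDemandTotal(5))
}
```

With four tests this gives 35 and 80 minutes. With the original five tests the on-demand figure is 100 minutes, which matches the runtime quoted in the Problem Statement.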
        

      Acceptance Criteria

      Framework Functionality

      • [ ] extractAddressesFromConfigMap() retrieves all node addresses from ConfigMap
      • [ ] setBYOH() detects pre-provisioned nodes when ConfigMap exists
      • [ ] setBYOH() falls back to MachineSet provisioning when ConfigMap doesn't exist
      • [ ] setBYOH() fails immediately with clear error if ConfigMap exists but is empty
      • [ ] remediateBYOHNodes() completes cleanup in <1 minute
      • [ ] ConfigMap is preserved in pre-provisioned mode (nodes stay configured)
      • [ ] ConfigMap is deleted in provisioning mode (nodes get deconfigured)

      Performance Targets

      • [ ] Pre-provisioned node wait timeout: 11 minutes (reduced from 15)
      • [ ] Total test suite runtime in Prow CI: ≤20 minutes
      • [ ] Provisioning time per test in Prow: <1 minute
      • [ ] Node remediation time: <1 minute

      Parallel Execution

      • [ ] OCP-42484 can run in parallel with other compatible tests
      • [ ] Node assignment prevents conflicts between parallel tests
      • [ ] Remediation is node-isolated (doesn't affect other tests)
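
The node-assignment criterion above could be met with a static plan computed before the tests start. The following sketch is one possible scheme, not the framework's actual mechanism: each parallel test gets a dedicated node and the serial tests share whatever remains. The names nodePlan and planNodes are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// nodePlan maps each test ID to the address of the node it may touch.
type nodePlan map[string]string

// planNodes gives every parallel test a dedicated node and lets the serial
// tests share the remaining nodes round-robin. Illustrative only.
func planNodes(addresses, parallelTests, serialTests []string) (nodePlan, error) {
	if len(addresses) <= len(parallelTests) {
		return nil, errors.New("need at least one more node than there are parallel tests")
	}
	plan := nodePlan{}
	for i, id := range parallelTests {
		plan[id] = addresses[i]
	}
	shared := addresses[len(parallelTests):]
	for i, id := range serialTests {
		plan[id] = shared[i%len(shared)]
	}
	return plan, nil
}

func main() {
	plan, err := planNodes(
		[]string{"10.0.1.100", "10.0.1.101"},
		[]string{"OCP-42484"},
		[]string{"OCP-42496", "OCP-44099", "OCP-82694"},
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, id := range []string{"OCP-42484", "OCP-42496", "OCP-44099", "OCP-82694"} {
		fmt.Println(id, "->", plan[id])
	}
}
```

Because the parallel test never shares an address with a serial test, remediation of a serial test's node cannot disturb it.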

      Backward Compatibility

      • [ ] All BYOH tests run successfully in local environments (provisioning mode)
      • [ ] No changes required to existing test code in initial implementation
      • [ ] Clear logging indicates execution mode (Pre-Provisioned vs Provisioning)

      Prow CI Integration

      • [ ] BYOH tests execute successfully in Prow CI environment
      • [ ] Tests use pre-provisioned infrastructure correctly
      • [ ] Runtime is within Prow CI acceptable limits
      • [ ] Full Windows Containers test suite migrated to Prow CI

      Dependencies

      Infrastructure

      Terraform Provisioner:

      • PR #6
      • Provisions Windows nodes before test execution
      • Creates windows-instances ConfigMap with node addresses

      Prow Configuration:

      • PR #71002
      • Defines Prow job for BYOH tests
      • Integrates Terraform provisioning step

      Parent Epic

      Technical Notes

      Detection Mechanism:

      • ConfigMap-based only (no environment variables)
      • Presence of windows-instances ConfigMap = pre-provisioned mode
      • Absence of ConfigMap = provisioning mode

      Timeout Adjustments:

      • Pre-provisioned nodes: 11 minutes (nodes already exist)
      • Provisioned nodes: 15 minutes (includes VM creation)

      Critical Behavior:

      • Deleting windows-instances ConfigMap triggers node deconfiguration
      • In pre-provisioned mode, ConfigMap must be preserved for node reuse
      • In provisioning mode, ConfigMap deletion is part of cleanup
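
The cleanup rule above amounts to a single branch on the execution mode. A minimal sketch follows, with the two cleanup actions passed in as callbacks so the example stays self-contained. The function name cleanupBYOH and the callback shape are assumptions.

```go
package main

import "fmt"

// cleanupBYOH picks the cleanup action for the current mode.
// Hypothetical helper for illustration.
func cleanupBYOH(preProvisioned bool, remediate, deleteConfigMap func() error) error {
	if preProvisioned {
		// Keep the ConfigMap: deleting it would deconfigure the reusable nodes.
		return remediate()
	}
	// On-demand nodes are disposable, so deconfiguration is the intended cleanup.
	return deleteConfigMap()
}

func main() {
	var calls []string
	remediate := func() error { calls = append(calls, "remediate"); return nil }
	deleteCM := func() error { calls = append(calls, "delete-configmap"); return nil }

	_ = cleanupBYOH(true, remediate, deleteCM)  // Prow CI run
	_ = cleanupBYOH(false, remediate, deleteCM) // local or Jenkins run
	fmt.Println(calls)
}
```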

      Parallelization Limitations:

      • 3 tests cannot run in parallel due to cluster-wide resource modifications
      • Future work: Refactor these tests to enable full parallelization

      Out of Scope

      This Story Does NOT Include:

      • Test code modifications (tests remain unchanged initially)
      • Creating new Prow job definitions (separate infrastructure story)
      • Refactoring non-parallel-safe tests (future enhancement)
      • Deprecating OCP-42516 (separate cleanup story)

      Performance Impact

      Metric                | Before (Serial) | After (Prow CI) | Improvement
      Provisioning per Test | 15 min          | <1 min          | 93% reduction
      Total Runtime         | ~100 min        | ~20 min         | 80% reduction
      Parallel Tests        | 0/5 (0%)        | 1/4 (25%)       | Partial parallelization
      Node Wait Timeout     | 15 min          | 11 min          | 4 min faster
      Cleanup Time          | 2 min           | <1 min          | 50% faster
      Prow CI Ready         | Blocked         | Enabled         | Migration unblocked
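
The percentage figures in the table above can be checked with a short calculation. The sketch treats "<1 min" as an upper bound of 1 minute, which is an assumption; the actual reduction would be at least the value computed.

```go
package main

import "fmt"

// reductionPercent returns how much smaller after is than before, in percent.
func reductionPercent(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	fmt.Printf("provisioning per test: %.0f%%\n", reductionPercent(15, 1))
	fmt.Printf("total runtime: %.0f%%\n", reductionPercent(100, 20))
	fmt.Printf("cleanup time: %.0f%%\n", reductionPercent(2, 1))
}
```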

      Testing Strategy

      Local Verification

      • [ ] Run all BYOH tests locally (provisioning mode)
      • [ ] Verify backward compatibility
      • [ ] Confirm no regressions in existing functionality

      Prow CI Verification

      • [ ] Simulate pre-provisioned environment locally
      • [ ] Verify ConfigMap detection works correctly
      • [ ] Confirm node remediation leaves nodes in clean state
      • [ ] Test parallel execution of OCP-42484
      • [ ] Validate serial execution of remaining tests

      Integration Testing

      • [ ] End-to-end Prow CI job execution
      • [ ] Terraform provisioning → test execution → cleanup
      • [ ] Multi-run validation (verify node reuse works)

      Definition of Done

      • [ ] Code changes merged to openshift-tests-private
      • [ ] All acceptance criteria met
      • [ ] BYOH tests running successfully in Prow CI
      • [ ] Runtime within acceptable limits (~20 minutes)
      • [ ] Documentation updated (inline code comments)
      • [ ] Local execution still works (backward compatible)
      • [ ] Full Windows Containers test suite migrated to Prow CI

      Follow-up Work

      Future Enhancements:

      • Refactor OCP-42496, OCP-44099, OCP-82694 for parallel execution
      • Evaluate deprecation of OCP-42516 (redundant test)
      • Optimize remediation for <30 second cleanup
      • Enable full parallelization (all 4 tests concurrent)

              rhn-support-weinliu Weinan Liu
              rrasouli Aharon Rasouli