-
Story
-
Resolution: Done
-
Major
-
None
-
None
-
Future Sustainability
-
False
-
-
False
-
None
-
None
-
None
-
None
User Story
As a managed service operator, I want periodic jobs that validate clusters can upgrade from CPO version 4.Y.0 to 4.Y.latest without triggering NodePool rollouts, so that I can plan customer upgrades safely from initial minor releases.
Acceptance Criteria
- Periodic jobs created for each supported OCP minor version (4.16+)
- Jobs run existing TestUpgradeControlPlane test with .0 → latest parameters
- Jobs run daily
- Tests validate MachineDeployment generation remains unchanged (no NodePool rollout)
- Tests validate control plane successfully upgrades
- Tests validate cluster remains functional post-upgrade
- Tests validate no crashing pods
- Failed upgrades provide detailed diagnostics (logs, events, resource states)
Technical Details
Existing Test Being Used
TestUpgradeControlPlane (test/e2e/control_plane_upgrade_test.go:16) already:
- Creates cluster with --e2e.previous-release-image
- Upgrades to --e2e.latest-release-image
- Validates MachineDeployment.Generation == 1 (no rollout)
- Validates EnsureNoCrashingPods
- Validates cluster functionality
We just need to configure periodic jobs to run this test with the right version parameters.
CI Operator Config
- Path: ci-operator/config/openshift/hypershift/openshift-hypershift-release-4.Y__periodics-hcm-upgrade.yaml
- Workflow: hypershift-aws-e2e-upgrade or create new workflow
- Environment variables:
- PREVIOUS_RELEASE_IMAGE: 4.Y.0 release pullspec
- LATEST_RELEASE_IMAGE: 4.Y.latest release pullspec
Periodic Job
- Name: periodic-ci-openshift-hypershift-release-4.Y-periodics-hcm-upgrade-dot-zero-to-latest-aws-ovn
- Interval: Daily (0 2 * * *)
- Timeout: 2 hours
- Target: Run TestUpgradeControlPlane
Test Execution
The job will:
1. Install HO (Konflux image from release-4.Y branch)
2. Create cluster with CPO 4.Y.0 (via PREVIOUS_RELEASE_IMAGE)
3. Wait for cluster ready
4. Trigger upgrade to CPO 4.Y.latest (via LATEST_RELEASE_IMAGE)
5. Validate upgrade completes without NodePool rollout
6. Cleanup
All validation logic is already in TestUpgradeControlPlane:
- Line 74: e2eutil.EnsureMachineDeploymentGeneration(t, ctx, mgtClient, hostedCluster, 1)
- Line 73: e2eutil.EnsureNoCrashingPods(t, ctx, mgtClient, hostedCluster)
- Line 72: e2eutil.EnsureNodeCountMatchesNodePoolReplicas(...)
Document Version Update Process (should really be automated)
When new Z-stream is released (e.g., 4.20.16):
1. Update LATEST_RELEASE_IMAGE in job configuration
2. PREVIOUS_RELEASE_IMAGE stays at .0
3. Submit PR to openshift/release
4. Run pj-rehearse to validate
5. Monitor for any new upgrade issues
Diagnostic Collection on Failure
TestUpgradeControlPlane already collects diagnostics via the test framework. Additional artifacts from periodic job:
- Job logs
- Cluster resources (captured by framework)
- Must-gather equivalent
Success Metrics
- <5% false failure rate
- Upgrade completes within 30 minutes
- All validation checks pass (built into test)
- Clear diagnostics available on failure
- Results visible in CI dashboards
Reference
- Existing test: test/e2e/control_plane_upgrade_test.go:16
- Test validates: test/e2e/util/util.go:1007 (EnsureMachineDeploymentGeneration)