  1. Red Hat Internal Developer Platform
  2. RHIDP-12290

[QE] RHDH CI Pipeline Optimization: Reduce PR Feedback from ~31 min to <15 min

    • RHDH CI Optimization
    • To Do
    • QE Needed, Docs Needed, TE Needed, Customer Facing, PX Needed
    • 88% To Do, 13% In Progress, 0% Done

      EPIC Goal

      Reduce RHDH PR presubmit CI feedback time from ~31 minutes to <15 minutes (P50) by optimizing the existing OCP Helm deploy pipeline. PR checks continue to use a full OCP deployment with an RHDH build; all optimizations target pipeline efficiency.

      Background

      Evidence from Prow run Build ID 2023822693903634432 (job: pull-ci-redhat-developer-rhdh-main-e2e-ocp-helm):

      Phase                                       Duration   % of Total
      Pre                                         25s        1%
      Test (deploy + Playwright)                  20m 06s    65%
      Post (gather-extra + must-gather)           8m 26s     27%
      Playwright only (showcase + showcase-rbac)  ~10m       32%
      Total                                       ~30m 59s   100%
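
      The percentages in the table can be re-derived from the durations. The following is a minimal sketch, not part of the pipeline: the helper names (`parse_duration`, `phase_shares`) are hypothetical, and the "Playwright only" row is left out because it is a subset of the Test phase.

```python
import re

def parse_duration(text: str) -> int:
    """Convert a duration such as '20m 06s' or '25s' to seconds."""
    minutes = re.search(r"(\d+)m", text)
    seconds = re.search(r"(\d+)s", text)
    return (int(minutes.group(1)) if minutes else 0) * 60 + (
        int(seconds.group(1)) if seconds else 0
    )

# Durations copied from the table above.
PHASES = {
    "Pre": "25s",
    "Test (deploy + Playwright)": "20m 06s",
    "Post (gather-extra + must-gather)": "8m 26s",
}
TOTAL = parse_duration("30m 59s")

def phase_shares(phases: dict, total_seconds: int) -> dict:
    """Return each phase's share of the total, rounded to a whole percent."""
    return {
        name: round(100 * parse_duration(duration) / total_seconds)
        for name, duration in phases.items()
    }

shares = phase_shares(PHASES, TOTAL)
# Time not attributed to any of the three phases.
unaccounted = TOTAL - sum(parse_duration(d) for d in PHASES.values())

print(shares)
print(f"unaccounted: {unaccounted}s")
```

      The three phases sum to 28m 57s, about 2 minutes short of the reported total; the table does not say where the remainder is spent.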

      Key findings:

      • 8m 26s on post-phase artifact collection even on success (gather-extra: 6m, must-gather: 1m43s)
      • ~5 min waiting for Backstage readiness after Helm deploy (HTTP 503 loop)
      • ~1-2 min installing the OpenShift Pipelines operator, which is only needed for Tekton nightly tests, not for PR checks
      • ~10 min running Playwright suites sequentially (showcase + showcase-rbac in separate namespaces)
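
      As a rough back-of-the-envelope check, the findings above can be turned into a projected run time. This sketch rests on two assumptions that are not stated in the issue: the savings are treated as simply additive, and the post phase on success is assumed to land exactly at its 1-minute target. Readiness wait and Playwright runtime are left unchanged because no target is given for them.

```python
TOTAL_S = 30 * 60 + 59      # ~30m 59s total from the Prow run
POST_S = 8 * 60 + 26        # post phase on a passing run
POST_TARGET_S = 60          # assumption: post phase lands exactly at the <1 min target
OPERATOR_S = (60, 120)      # ~1-2 min Pipelines operator install

def projected_total(total_s: int, post_s: int, post_target_s: int, operator_s: int) -> int:
    """Subtract the post-phase saving and the operator install from the total."""
    return total_s - (post_s - post_target_s) - operator_s

def fmt(seconds: int) -> str:
    """Format seconds as 'Xm YYs'."""
    return f"{seconds // 60}m {seconds % 60:02d}s"

best = projected_total(TOTAL_S, POST_S, POST_TARGET_S, OPERATOR_S[1])
worst = projected_total(TOTAL_S, POST_S, POST_TARGET_S, OPERATOR_S[0])
print(f"projected total: {fmt(best)} to {fmt(worst)}")
```

      Under these assumptions the estimate lands between roughly 21.5 and 22.5 minutes, which suggests that skipping the gather steps and the operator install would not be enough on their own to reach 15 minutes; the readiness wait and the sequential Playwright suites would also need to shrink.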

      Why is this important?

      • Developer velocity: 31-min feedback loops slow iteration
      • CI resource waste: cluster time, compute, and operator installs repeated unnecessarily
      • Flake impact: there is no structured quarantine process, so flaky failures waste additional re-run time
      • Test pyramid is inverted: ~80% E2E, ~15% integration, ~5% unit, with no coverage metrics

      Target Outcomes

      Metric              Current   Target
      PR feedback P50     ~31 min   <15 min
      PR feedback P90     ~45 min   <20 min
      Post-phase (pass)   8m 26s    <1 min
      Nightly pass rate   TBD       >90%
      Flake rate          TBD       <5%
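
      One way to track the P50 and P90 targets is a nearest-rank percentile over recorded PR job durations. This is a sketch only: the sample durations below are made up for illustration and are not measurements from the RHDH pipeline, and the issue does not specify which percentile method will be used.

```python
def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]

def meets_targets(durations_min: list, p50_target: float = 15, p90_target: float = 20) -> bool:
    """Check durations (in minutes) against the P50 and P90 targets from the table."""
    return (
        percentile(durations_min, 50) < p50_target
        and percentile(durations_min, 90) < p90_target
    )

# Hypothetical sample of PR job durations in minutes (illustrative only).
sample = [28, 29, 30, 31, 31, 32, 33, 35, 40, 45]
print(percentile(sample, 50), percentile(sample, 90), meets_targets(sample))
```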

      Approach

      Three phases:

      1. Quick Wins (Weeks 1-4): Conditional gather on failure only, skip Pipelines operator for PR, Playwright parallelism tuning, flake quarantine mechanism
      2. Medium-term (Weeks 5-10): Pre-warmed cluster pools with operators pre-installed, parallel deployment and testing of showcase + showcase-rbac
      3. Strategic (Weeks 11-20): Optional test impact selection, coverage pipeline with ReportPortal/Codecov integration
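
      The Phase 1 gating rules can be sketched as simple decision functions. This illustrates the intended logic only and is not the actual Prow or step-registry configuration: the job type labels, the `install-pipelines-operator`, `helm-deploy`, and `playwright` step names, and the quarantine set are hypothetical; `gather-extra` and `must-gather` are the post steps named in the background section.

```python
def pre_test_steps(job_type: str) -> list:
    """Install the Pipelines operator only for nightly jobs, not for PR presubmits."""
    steps = []
    if job_type == "nightly":
        steps.append("install-pipelines-operator")  # hypothetical step name
    steps += ["helm-deploy", "playwright"]          # hypothetical step names
    return steps

def post_test_steps(tests_passed: bool) -> list:
    """Run artifact collection only when the test phase failed."""
    return [] if tests_passed else ["gather-extra", "must-gather"]

def select_tests(all_tests: list, quarantined: set) -> list:
    """Exclude quarantined (known-flaky) tests from the blocking PR run."""
    return [name for name in all_tests if name not in quarantined]

print(pre_test_steps("presubmit"))
print(post_test_steps(tests_passed=True))
print(select_tests(["login", "catalog", "rbac"], quarantined={"catalog"}))
```

      Keeping the decisions in small pure functions like these makes the rules easy to test in isolation before they are translated into pipeline configuration.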

      Acceptance Criteria

      • PR feedback time reduced to <15 min at P50
      • Post-phase on success completes in <1 min
      • Flake quarantine mechanism operational
      • Pre-warmed cluster pool operational for at least one pool
      • Coverage pipeline integrated with ReportPortal

      References

      • Prow run: Build ID 2023822693903634432
      • Prow build log: artifacts/e2e-ocp-helm/build-log.txt
      • Prow gather-extra log: artifacts/e2e-ocp-helm/gather-extra/build-log.txt
      • Coverage Metrics Jira: RHDHPLAN-851

              Unassigned Unassigned
              gustavolira Gustavo Lira Silva