OpenShift Bugs / OCPBUGS-62075

[commatrix] test permafailing on aws serial jobs, and should be deleted for re-try in 4.21


    • Quality / Stability / Reliability
    • False
    • Approved
    • None
    • In Progress
    • Release Note Not Required

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [sig-network][Feature:commatrix][apigroup:config.openshift.io][Serial] generated communication matrix should be equal to documented communication matrix [Suite:openshift/conformance/serial]

      Test has a 72.22% pass rate, but 95.00% is required.

      Sample (being evaluated) Release: 4.20
      Start Time: 2025-09-12T00:00:00Z
      End Time: 2025-09-19T08:00:00Z
      Success Rate: 72.22%
      Successes: 13
      Failures: 5
      Flakes: 0
      Base (historical) Release: 4.19
      Start Time: 2025-05-18T00:00:00Z
      End Time: 2025-06-17T23:59:59Z
      Success Rate: 0.00%
      Successes: 0
      Failures: 0
      Flakes: 0
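
      The sample figures above can be reproduced with simple arithmetic. The following is a minimal sketch of that calculation only, not of Component Readiness's actual regression analysis. The function names, the treatment of flakes as passes, and the handling of an empty base window are assumptions made for illustration.

```python
# Illustrative only: reproduces the pass-rate arithmetic quoted in the report.
# How Component Readiness really evaluates regressions is not shown here.

REQUIRED_PASS_RATE = 95.0  # percent, as stated in the report


def pass_rate(successes: int, failures: int, flakes: int = 0) -> float:
    """Return the pass rate as a percentage.

    Assumption: flakes are counted as passes. With zero runs the rate is
    reported as 0.0, matching the empty 4.19 base window in the report.
    """
    total = successes + failures + flakes
    if total == 0:
        return 0.0
    return 100.0 * (successes + flakes) / total


def meets_requirement(rate: float, required: float = REQUIRED_PASS_RATE) -> bool:
    """True when the observed pass rate reaches the required threshold."""
    return rate >= required


sample = pass_rate(successes=13, failures=5, flakes=0)  # 4.20 sample window
base = pass_rate(successes=0, failures=0, flakes=0)     # 4.19 base window (no runs)

print(f"sample: {sample:.2f}%  meets {REQUIRED_PASS_RATE:.2f}%? {meets_requirement(sample)}")
print(f"base:   {base:.2f}%")
```

      The 4.19 base window contains no runs, so the comparison is made against the fixed 95.00% requirement rather than against a historical pass rate.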

      View the test details report for additional context.

      A few weeks ago, this test failing on port 10357 broke all jobs and payloads. That was followed by a mix of fixes in the commatrix repo, possibly multiple vendor bumps into origin, and backports to 4.20. This morning we discovered that all serial jobs are failing again on at least AWS, and payloads are blocked again on the exact same error.

      This appears to be because a PR was merged to commatrix while a dependent origin 4.20 PR, which has also merged, has not yet made it through the build system into a payload. As a result, we are back in a full-fail scenario.

      The original requirement for a mix of commatrix changes plus vendoring had me concerned about the maintenance of this test. Discovering that the test also hits raw URLs directly in the commatrix repo is another clear indication that this test needs a revamp and a rethink. A test ensuring ports are documented should not be this difficult to maintain, nor should it be possible for it to block presubmits and payloads, effectively grinding the entire organization to a halt.

      At this point I believe we should delete the test, backport the deletion to 4.20, and try again in the 4.21 release. My recommendation would be to move this test to its own suite, run it in its own job on a schedule (perhaps daily), and set up alerts that are sent to a Slack channel when it begins failing. It will also appear in Component Readiness.

      The confusion around the current workflow (commit to commatrix, vendor to origin, and also hit a URL) should also be addressed. A better system would probably be to define the test in your own repo using a standalone OTE binary. (You can seek help in #wg-openshift-test-extensions; reportedly HCM is already doing something like this.) This would give you access to the utilities you need to complete the test, and the ability to run it and generate conformant JUnit XML in your own periodic job. Fixes would then require only one PR.

      Filed by: dgoodwin@redhat.com

              aabugosh amal abu gosh
              openshift-trt OpenShift Technical Release Team
              None
              None
              Yang Liu Yang Liu
              None