Story
Resolution: Unresolved
Major
Future Sustainability
Our current aggregation logic is too sensitive, causing payloads to be rejected for non-regressions. Analysis shows that a significant portion of rejected payloads fail due to infrastructure noise or existing flakes rather than genuine code regressions.
I suggest we try using a "pity factor" and setting the minimum to 3 across the board (a rough sketch of this rule follows the query below). We have Component Readiness (CR) as a backstop to identify regressions with larger sample sizes.
Analysis shows that roughly 15% of aggregated job runs fail with only 2 failures on every failed test. This is a simple change we can implement immediately while we figure out alternatives that achieve our goal: blocking CI jobs should indicate the payload is likely good enough, with more subtle regression detection shifted to CR.
Aggregation should still detect significant problems.
Results:
 total_job_runs_analyzed | job_runs_meeting_criteria | job_runs_not_meeting_criteria | percentage_meeting_criteria
-------------------------+---------------------------+-------------------------------+-----------------------------
                    2359 |                       344 |                          2015 |                       14.58
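That is, 344 of the 2,359 aggregated job runs from the last 3 months (344 / 2359 ≈ 14.58%) failed only on tests whose recorded output reported exactly 2 failures.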
Query:
SELECT
    sub.total_job_runs_analyzed,
    sub.job_runs_meeting_criteria,
    sub.job_runs_not_meeting_criteria,
    sub.percentage_meeting_criteria
FROM (
    -- Collect every failed test from aggregated job runs over the last 3
    -- months, along with its recorded output, excluding sig-sippy synthetic
    -- tests.
    WITH RelevantFailedTests AS (
        SELECT
            pjr.id AS prow_job_run_id,
            pj.name AS job_name,
            t.name AS test_name,
            pjrt_out.output AS test_output,
            pjrt.id AS prow_job_run_test_id
        FROM prow_job_runs pjr
        JOIN prow_jobs pj ON pjr.prow_job_id = pj.id
        JOIN prow_job_run_tests pjrt ON pjr.id = pjrt.prow_job_run_id
        JOIN tests t ON pjrt.test_id = t.id
        LEFT JOIN prow_job_run_test_outputs pjrt_out ON pjrt.id = pjrt_out.prow_job_run_test_id
        WHERE pj.name LIKE 'aggregated-%'
          AND pjr.timestamp > NOW() - INTERVAL '3 months'
          AND pjrt.status = 12              -- failed test results only
          AND t.name NOT ILIKE '%sig-sippy%'
    ),
    -- For each aggregated job run, true only when every failed test's output
    -- reports the test "failed 2 times".
    JobRunOutputCheck AS (
        SELECT
            prow_job_run_id,
            job_name,
            BOOL_AND(test_output ILIKE '%failed 2 times%') AS all_failed_tests_match_pattern
        FROM RelevantFailedTests
        GROUP BY prow_job_run_id, job_name
    )
    -- Summarize how many job runs meet the "only 2 failures" criteria.
    SELECT
        COUNT(prow_job_run_id) AS total_job_runs_analyzed,
        COUNT(prow_job_run_id) FILTER (WHERE all_failed_tests_match_pattern = TRUE) AS job_runs_meeting_criteria,
        COUNT(prow_job_run_id) FILTER (WHERE all_failed_tests_match_pattern = FALSE) AS job_runs_not_meeting_criteria,
        CASE
            WHEN COUNT(prow_job_run_id) = 0 THEN 0.0
            ELSE ROUND((COUNT(prow_job_run_id) FILTER (WHERE all_failed_tests_match_pattern = TRUE))::NUMERIC * 100 / COUNT(prow_job_run_id), 2)
        END AS percentage_meeting_criteria
    FROM JobRunOutputCheck
) AS sub;
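To make the pity-factor suggestion above concrete, here is a minimal sketch in Go, assuming "the minimum" means the smallest number of passing runs a test needs across the ~10 aggregated job runs before it can reject the payload on its own. The names (pityMinimumPasses, testRejectsPayload) are hypothetical and are not the aggregator's actual code.

package main

import "fmt"

// Hypothetical "pity" floor: a test only needs to pass in at least this many
// of the ~10 aggregated job runs, regardless of its historical pass rate.
const pityMinimumPasses = 3

// testRejectsPayload reports whether a single test should reject the payload
// under the proposed rule, given how many aggregated runs it passed in.
func testRejectsPayload(passes int) bool {
	return passes < pityMinimumPasses
}

func main() {
	// A test that failed 2 of 10 runs (8 passes), the ~15% case measured by
	// the query above, would no longer reject the payload on its own.
	fmt.Println(testRejectsPayload(8)) // false: payload still viable
	// A test that passed only 2 of 10 runs would still reject it.
	fmt.Println(testRejectsPayload(2)) // true: payload rejected
}

If that reading is right, the 344 job runs counted above would have passed aggregation, regressions severe enough to drop a test below 3 passes would still be caught, and anything subtler would be left to Component Readiness.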
relates to: SHIPSTRAT-3 A successful nightly most nights
Refinement