Story
Resolution: Unresolved
Normal
Future Sustainability
The monitortest:
[sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]
causes a bit of chaos in OpenShift CI. The test was written to prevent explosive growth in apiserver watches, which is a threat to performance and scalability. The approach at the time was to do some analysis (which cannot easily be repeated now), generate a static file with the counts, and fail the test when observed counts are more than 10% beyond those numbers.
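For illustration only, here is a minimal sketch of the kind of check involved, with made-up operator names and numbers; the real static data and check live in openshift/origin and may differ:

```go
// Illustrative only: a hand-maintained static map plus a 10% threshold check.
package watchcheck

import (
	"fmt"
	"strings"
)

// staticWatchCounts stands in for the static file in origin; the numbers were
// produced by a one-time analysis and are only ever updated by hand.
var staticWatchCounts = map[string]int{
	"openshift-etcd-operator":           100, // made-up values
	"openshift-authentication-operator": 60,
}

// CheckWatchCounts returns a single aggregate error covering every operator
// that is more than 10% over its static count, which is part of why unrelated
// regressions are hard to tell apart from one generic test result.
func CheckWatchCounts(observed map[string]int) error {
	var violations []string
	for operator, count := range observed {
		expected, ok := staticWatchCounts[operator]
		if !ok {
			continue // no recorded expectation, nothing to compare against
		}
		if float64(count) > float64(expected)*1.10 {
			violations = append(violations,
				fmt.Sprintf("%s: observed %d watch requests, static count %d", operator, count, expected))
		}
	}
	if len(violations) > 0 {
		return fmt.Errorf("operators exceeded their expected watch counts by more than 10%%:\n%s",
			strings.Join(violations, "\n"))
	}
	return nil
}
```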
The problem is that watch counts often grow over time. As a result, most of what the test finds is noise: someone has to follow an undocumented process to update the fixed counts for their operator, ideally with enough due diligence to understand what caused the increase.
Occasionally, though, it catches a real and serious problem that threatens the apiserver and would otherwise be invisible.
Actual watch count data is now automatically scraped and stored in BigQuery for each job run. This uses the autodl framework, and you can see an example of the scraped data for a job run in the autodl file that is uploaded here. There is an interactive dashboard that can query that BigQuery database. However, none of this data automatically feeds into the static files in origin.
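As a rough illustration of pulling those counts programmatically: the project, dataset, table, and column names below are placeholders, not the real autodl schema, which would need to be looked up first.

```go
// Placeholder sketch of querying per-operator watch counts from BigQuery.
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()

	client, err := bigquery.NewClient(ctx, "my-ci-data-project") // placeholder project
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	// Hypothetical layout: one row per (job run, operator) with a watch count.
	q := client.Query(`
		SELECT operator, MAX(watch_requests) AS max_watches
		FROM ci_data.watch_requests
		WHERE job_run_start > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 14 DAY)
		GROUP BY operator`)

	it, err := q.Read(ctx)
	if err != nil {
		log.Fatalf("query: %v", err)
	}
	for {
		var row struct {
			Operator   string `bigquery:"operator"`
			MaxWatches int64  `bigquery:"max_watches"`
		}
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatalf("reading results: %v", err)
		}
		fmt.Printf("%s: %d\n", row.Operator, row.MaxWatches)
	}
}
```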
Because we only generate one generic test result, it is very difficult to track multiple ongoing regressions for this test name in component readiness. Someone has to watch the actual job runs and the test failure messages to know whether a given failure is the issue they are tracking.
This card requests some combination of the following:
1. Generate one test per operator observed. The trick is that we must also generate success junits for a set of operators that can change over time. Looking at the sample autodl file from earlier, it is not a huge set. We would essentially emit a junit result whose test name contains each operator name we see in that file: every operator observed making watch requests gets a junit. The success junits are required so that pass rates are accurate; a test that can only fail causes problems in all of our tooling. (See the sketch after this list.)
1. Have the test output explain the test's intention and how to maintain it. Self-document the failure and how to respond to it safely.
1. Consider creating an ai-helper to maintain this data. It could query the latest data from BigQuery and compare it to the static data in origin. If the counts just need a slight bump, it could update the file in origin; if it is a major bump, it could highlight this and help file a bug.
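A rough sketch of what item 1 could look like, using simplified stand-in types rather than origin's actual monitortest/junit structures, and an illustrative test naming scheme:

```go
// Simplified stand-ins, not origin's real monitortest/junit types.
package main

import "fmt"

// junitResult is a minimal stand-in for a junit test case result.
type junitResult struct {
	Name    string
	Failure string // empty means the test passed
}

// watchCountResults emits one junit per observed operator. Operators within
// the allowed threshold (or without a recorded expectation yet) get an
// explicit success so that pass rates stay meaningful.
func watchCountResults(observed, static map[string]int) []junitResult {
	var results []junitResult
	for operator, count := range observed {
		// The naming scheme is illustrative; the real name would need to be
		// stable and recognizable in component readiness.
		name := fmt.Sprintf("[sig-arch][Late] operator %s should not create watch channels very often", operator)

		expected, ok := static[operator]
		switch {
		case !ok:
			// New operator with no static count yet: pass, but a real
			// implementation should surface a message asking for a count to be added.
			results = append(results, junitResult{Name: name})
		case float64(count) > float64(expected)*1.10:
			results = append(results, junitResult{
				Name: name,
				Failure: fmt.Sprintf("observed %d watch requests, static count %d (+10%% allowed); "+
					"explain what changed and update the static data in origin if the increase is expected",
					count, expected),
			})
		default:
			results = append(results, junitResult{Name: name})
		}
	}
	return results
}

func main() {
	// Made-up numbers purely for demonstration.
	observed := map[string]int{"openshift-etcd-operator": 130, "openshift-authentication-operator": 40}
	static := map[string]int{"openshift-etcd-operator": 100, "openshift-authentication-operator": 60}
	for _, r := range watchCountResults(observed, static) {
		fmt.Printf("%s  failure=%q\n", r.Name, r.Failure)
	}
}
```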