-
Story
-
Resolution: Unresolved
-
Normal
Today we have imperfect panic detection that catches quite a few problems in the gather extra step by grepping the logs we pull down. However, if the panic was not in the current or previous pod log, it would not be detected.
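For reference, a minimal sketch of the kind of matching the grep does today. The patterns and the `find_panics` / `PanicHit` names are illustrative assumptions, not what the gather extra step actually uses. It takes any iterable of lines, so the same function could be fed from a file we pulled down or from a streamed log.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Illustrative patterns only (assumption): the Go runtime's "panic: " and
# "fatal error: " prefixes, plus the "Observed a panic" message that
# Kubernetes components log when they recover from a panic.
PANIC_PATTERN = re.compile(r"panic: |fatal error: |Observed a panic")


class PanicHit(NamedTuple):
    source: str       # which log the line came from, e.g. a file path
    line_number: int  # 1-based position within that log
    text: str         # the matching line, without its trailing newline


def find_panics(source: str, lines: Iterable[str]) -> Iterator[PanicHit]:
    """Yield one PanicHit for every line that looks like a panic."""
    for number, line in enumerate(lines, start=1):
        if PANIC_PATTERN.search(line):
            yield PanicHit(source, number, line.rstrip("\n"))
```

Something like `with open(path) as f: hits = list(find_panics(path, f))` would cover a captured artifact. Whatever we end up with only sees the logs it is handed, which is the gap described above.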
Deads requested that we try a better approach. His suggestion was to have origin stream all pod logs.
I fear this is too heavyweight and think Loki might be a better option, but it needs investigation. Auth needs solving (auth to our Grafana, or to Loki directly?), and there could be a delay before logs show up in queries (not sure how this works yet).
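A rough sketch of what the Loki side could look like, assuming we query Loki's `query_range` HTTP endpoint directly with a bearer token. The base URL, the token handling, the `namespace` label, and the `build_panic_query` name are all assumptions; the real auth story is still open. It only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_panic_query(base_url: str, token: str, start_ns: int, end_ns: int,
                      namespace_regex: str = "openshift-.*",
                      limit: int = 100) -> Request:
    """Build (but do not send) a Loki query_range request for panic lines.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    # LogQL: select streams by label, then keep lines containing "panic: ".
    # The "namespace" label name is an assumption about how logs are labeled.
    logql = '{namespace=~"%s"} |= "panic: "' % namespace_regex
    params = urlencode({
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "forward",
    })
    url = f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"
    # Bearer auth is an assumption; going through Grafana may need something else.
    return Request(url, headers={"Authorization": f"Bearer {token}"})
```

If ingestion lags, the query window would probably need to end some time after the job finishes, or the query would need a retry, but that depends on the delay question above.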
Additionally, this would need to be done in a way that keeps origin usable for external users. We could disable the check when Loki is not enabled, but then those origin users would not get panic detection. Pod log streaming should work for anyone, but it likely dramatically increases the memory/CPU/network requirements for running the origin tests, in ways that could majorly impact the CI clusters and possibly the cluster under test as well.
We might be able to use Loki once OAuth secrets via TRT-1933 and the PostAnalysis Framework are in place. This could run in addition to the existing panic detection, which currently only matches captured artifacts.