AI analysis for CoreOS pipeline failures

Type: Epic
Resolution: Done
Status: Done
Progress: 0% To Do, 0% In Progress, 100% Done
There is interest in the team in investigating the use of AI to help us analyze pipeline failures. A small group of us met today to sketch out what this would look like and to identify some steps to take for the investigation.
We'd like to emphasize that this is a learning experience. It may yield something useful or it may not. The goal is to learn.
We created some bullets for this work:
# The actual thing that processes logs -> nickname=thebeast
- This is the AI component
- Might be able to learn some lessons from log-detective
  - https://log-detective.com/
  - https://github.com/fedora-copr/logdetective
  - https://github.com/fedora-copr/logdetective-website
  - Initial investigation -> log-detective looks to be a frontend for interacting with the LLM/SLM
  - Right now they are focused on rpm build failures
  - We might be able to use their work to build something similar but for our own logs
- Where do we run this thing?
  - Might be able to use Red Hat's internal AI lab stuff (models.corp)
  - Might be able to just run an extra VM (kubevirt) in our pipeline
  - What do we do for upstream?
    - Are we able to use the same instance for both upstream and downstream?
    - If we have to run separate instances, how do we combine the training to make things more useful?
- What model do we use?
  - Do we care about open source?
  - Granite because we have overlords??

# What do we feed to thebeast?
- At a minimum: kola test failure logs
- Additionally? jenkins run logs
  - pushing to quay failure -> 502
  - failed signing
    - https://github.com/coreos/fedora-coreos-pipeline/issues/1079
  - package failures, missing packages

# Can we pre-train thebeast with historical data?
- May be able to go through all of our old slack messages to pull pasted error messages to do some pre-training for thebeast.

# How do we interact with thebeast?
- Where do we investigate failures today?
  - slack
  - matrix
  - GitHub PRs
- How do we integrate AI into our existing workflow?
  - options:
    1. API to interact with the model/log-detective
    2. web front end to supply logs and a prompt (like log-detective)
    3. slack bot that can take commands in replies to notifications about failures
       - /bot analyze 'dns resolution'
       - slack bot identifies links from the threaded message chain
       - interacts with thebeast to register logs and the failure annotation
       - open question: does the bot retrieve the logs and upload them, or does the bot tell thebeast where to retrieve the logs from?
       - pros: it's where we are today when we investigate most failures
       - cons: slack isn't the only place where we investigate failures
    4. have jenkins send the logs to the model on failure
       - DWM: I think jenkins uploading automatically is a future state, i.e. once our model is trained and gives useful analysis we can then automatically submit failures AND potentially take action based on the analysis

# First Steps
- Grab access to the granite models and play around with them (a rough sketch of one way to start is below these notes)
- Questions to answer in the investigation:
  - is Granite the appropriate thing for this?
  - LLM vs SLM (small language model): what's appropriate?
- Look a little more in-depth at log-detective
- Start investigating user interaction with thebeast

# Next Steps
- ID/provision dedicated infra for this effort
- Establish an RPC mechanism for interacting with thebeast
- Start to build out the workflow for interaction (training) and (eventually) analysis
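As a concrete way to start on the first step (grab access to the granite models and play around with them), here is a minimal sketch of feeding a kola failure log to a model over an API. It assumes the model is served behind an OpenAI-compatible endpoint (as vLLM and similar model servers provide); the endpoint URL, environment variable names, model name, and prompt below are placeholders for the investigation, not anything we've settled on.

```python
#!/usr/bin/env python3
# Exploratory sketch: send the tail of a kola test failure log to an
# OpenAI-compatible model endpoint and ask for a triage summary.
# THEBEAST_API_URL / THEBEAST_API_KEY and the model name are placeholders.

import os
import sys

from openai import OpenAI  # pip install openai

# Assumption: the model (Granite or otherwise) sits behind an
# OpenAI-compatible API, e.g. served by vLLM on a pipeline VM.
client = OpenAI(
    base_url=os.environ.get("THEBEAST_API_URL", "http://localhost:8000/v1"),
    api_key=os.environ.get("THEBEAST_API_KEY", "not-needed-for-local"),
)

SYSTEM_PROMPT = (
    "You are helping the Fedora CoreOS pipeline team triage CI failures. "
    "Summarize the most likely root cause of the failure in this log and "
    "point at the relevant lines."
)


def analyze_log(log_text: str, model: str = "granite-3-8b-instruct") -> str:
    """Ask the model for an analysis of a single failure log."""
    # Keep only the tail of very large logs so we stay inside the context window.
    max_chars = 20_000
    truncated = log_text[-max_chars:]
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": truncated},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Usage: ./analyze.py path/to/kola-failure.log
    with open(sys.argv[1], "r", errors="replace") as f:
        print(analyze_log(f.read()))
```

Whatever the eventual RPC mechanism or slack bot ends up looking like, something this small should be enough to run real kola and jenkins logs through a model and help answer the "is Granite appropriate?" and "LLM vs SLM?" questions.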