AI analysis for CoreOS pipeline failures
Type: Epic
Status: Done
Resolution: Done
Progress: 0% To Do, 0% In Progress, 100% Done
There is interest in the team in investigating the use of AI to help us analyze pipeline failures. A small group of us met today to stencil out what this would look like and to identify some steps to take to investigate.
We'd like to emphasize that this is a learning experience. It may yield something useful or it may not. The goal is to learn.
We created some bullets for this work:
# The actual thing that processes logs -> nickname=thebeast
- This is the AI component
- Might be able to learn some lessons from log-detective
- https://log-detective.com/
- https://github.com/fedora-copr/logdetective
- https://github.com/fedora-copr/logdetective-website
- Initial investigation -> log-detective looks to be a frontend for interacting with the LLM/SLM
- Right now they are focused on rpm build failures
- We might be able to use their work to build something similar but for our own logs (see the sketch of the core model call after this list)
- Where do we run this thing?
- Might be able to use Red Hat's internal AI lab stuff (models.corp)
- Might be able to just run an extra VM (kubevirt) in our pipeline
- What do we do for upstream?
- Are we able to use the same instance for both upstream and downstream?
- If we have to run separate instances how do we combine the training to make things more useful?
- What model do we use?
- Do we care about open source?
- Granite because we have overlords??
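To make the core idea concrete, here's a minimal sketch of what "thebeast" boils down to: send a trimmed failure log to an OpenAI-compatible chat endpoint and ask for an analysis. The endpoint URL, model name, and token variables are placeholders; whatever we end up running (models.corp, a kubevirt VM serving Granite, etc.) would slot in behind them.

```python
# Minimal sketch: send a failure log excerpt to an OpenAI-compatible chat
# endpoint and ask for an analysis. THEBEAST_URL, THEBEAST_MODEL, and
# THEBEAST_TOKEN are placeholders, not anything that exists today.
import os
import requests

THEBEAST_URL = os.environ.get("THEBEAST_URL", "http://thebeast.example.com/v1/chat/completions")
MODEL = os.environ.get("THEBEAST_MODEL", "granite-placeholder")


def analyze_log(log_excerpt: str, hint: str = "") -> str:
    """Ask the model for a probable root cause of a pipeline failure."""
    prompt = (
        "You are analyzing a CoreOS pipeline failure log.\n"
        f"Operator hint: {hint or 'none'}\n"
        "Summarize the most likely root cause and suggest next steps.\n\n"
        f"{log_excerpt}"
    )
    resp = requests.post(
        THEBEAST_URL,
        headers={"Authorization": f"Bearer {os.environ.get('THEBEAST_TOKEN', '')}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

The call shape stays the same whether the model runs on models.corp or on a VM in the pipeline, as long as whatever serves it exposes an OpenAI-compatible API.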
# What do we feed to thebeast?
- At a minimum: kola test failure logs
- Additionally? jenkins run logs (see the trimming sketch after this list)
- pushing to quay failure -> 502
- failed signing
- https://github.com/coreos/fedora-coreos-pipeline/issues/1079
- package failures, missing packages
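Whatever we feed in, full jenkins console logs are probably too big to send wholesale, so we'd likely trim them first. A minimal sketch, assuming we just keep error-looking lines plus a little surrounding context; the markers and limits are guesses we'd tune against real kola/jenkins logs.

```python
# Minimal sketch: trim a big console log down to the lines most likely to
# matter before handing it to thebeast. Error markers, context size, and the
# overall cap are assumptions to be tuned.
import re
from pathlib import Path

ERROR_MARKERS = re.compile(r"(error|fail(ed|ure)?|timed? ?out|traceback|502)", re.IGNORECASE)


def trim_log(path: str, context: int = 5, max_lines: int = 400) -> str:
    """Keep lines that look like errors plus a little surrounding context."""
    lines = Path(path).read_text(errors="replace").splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if ERROR_MARKERS.search(line):
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    selected = [lines[i] for i in sorted(keep)]
    return "\n".join(selected[-max_lines:])  # cap what we send to the model


# e.g. trimmed = trim_log("console.txt"), then pass it to the analyze call above
```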
# Can we pre-train thebeast with historical data?
- We may be able to go through all of our old slack messages and pull pasted error messages to do some pre-training for thebeast (see the extraction sketch below).
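A minimal sketch of that mining step, assuming a standard Slack export (per-day JSON files) and that pasted errors mostly live inside triple-backtick code blocks; the output JSONL would still need manual curation before any training use.

```python
# Minimal sketch: walk a Slack export (a directory of per-day JSON files) and
# pull out text pasted inside ``` code fences, on the assumption that those
# are mostly error logs. The export path and output file are placeholders.
import json
import re
from pathlib import Path

CODE_FENCE = re.compile(r"```(.*?)```", re.DOTALL)


def extract_pasted_errors(export_dir: str, out_path: str = "historical-errors.jsonl") -> int:
    count = 0
    with open(out_path, "w") as out:
        for day_file in Path(export_dir).rglob("*.json"):
            messages = json.loads(day_file.read_text())
            if not isinstance(messages, list):
                continue  # skip non-message files in the export
            for msg in messages:
                for snippet in CODE_FENCE.findall(msg.get("text", "")):
                    if snippet.strip():
                        out.write(json.dumps({"source": day_file.name, "snippet": snippet.strip()}) + "\n")
                        count += 1
    return count
```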
# How do we interact with thebeast?
- Where do we investigate failures today?
- slack
- matrix
- GitHub PRs
- How do we integrate AI into our existing workflow?
- options:
- 1. API to interact with model/log detective
- 2. web front end to supply logs and prompt (like log detective)
- 3. slack bot that can take commands in replies to notifications about failures (sketched after this list)
- /bot analyze 'dns resolution'
- slack bot identifies links from threaded message chain
- interacts with thebeast to register logs and failure annotation
- open question: does the bot retrieve the logs and upload them or does the bot tell thebeast where to retrieve the logs from?
- pros: it's where we are today when we investigate most failures
- cons: slack isn't the only place where we investigate failures
- 4. have jenkins send the logs to the model on failure
- DWM: I think jenkins uploading automatically is a future state, i.e. once our model is trained and gives useful analysis we can then automatically submit failures AND potentially take action based on the analysis
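A minimal sketch of option 3, assuming slack_bolt and a slash command named /analyze; the command name, tokens, and the thread-scraping step are all placeholders.

```python
# Minimal sketch of option 3: a slash command handler built on slack_bolt.
# It acknowledges the command, grabs any URLs from the command text, and
# (placeholder) would hand them plus the annotation to thebeast.
import os
import re

from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

URL = re.compile(r"https?://\S+")


@app.command("/analyze")
def handle_analyze(ack, command, respond):
    ack()  # Slack requires an acknowledgement within 3 seconds
    annotation = command.get("text", "")  # e.g. 'dns resolution'
    log_links = URL.findall(annotation)   # links pasted into the command itself
    # TODO: also pull links out of the parent thread's failure notification,
    # then register the logs + annotation with thebeast and post its analysis.
    respond(f"Queued analysis (annotation: {annotation or 'none'}, links found: {len(log_links)})")


if __name__ == "__main__":
    app.start(port=int(os.environ.get("PORT", 3000)))
```

This leaves the open question above unresolved on purpose: the TODO is where we'd decide whether the bot uploads the logs itself or just tells thebeast where to fetch them.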
# First Steps
- Grab access to the Granite models and play around with them (see the local experiment sketch after this list)
- Question to answer in investigation:
- is Granite the appropriate thing for this?
- LLM vs. SLM (small language model): which is appropriate?
- Look a little more in-depth at log-detective
- Start investigating user interaction with thebeast
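For the "play around with Granite" step, a minimal local-experiment sketch using Hugging Face transformers. The model id is a guess (check the ibm-granite org on Hugging Face for whatever is current), the log snippet is made up, and a GPU or a small/quantized variant is probably needed for reasonable speed.

```python
# Minimal sketch for local experimentation: load a Granite instruct model and
# prompt it with a small (made-up) log snippet. Model id is a placeholder.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="ibm-granite/granite-3.0-2b-instruct",  # placeholder model id
    device_map="auto",
)

snippet = "dns resolution failed: lookup quay.io: i/o timeout"  # made-up example

prompt = (
    "Analyze this CoreOS kola test failure and suggest the most likely cause:\n\n"
    + snippet
)

result = generator(prompt, max_new_tokens=200, do_sample=False)
print(result[0]["generated_text"])
```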
# Next Steps
- ID/provision dedicated infra for this effort
- Establish an RPC mechanism for interacting with thebeast (see the API sketch below)
- Start to build out the workflow for interaction (training) and (eventually) analysis
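If the RPC mechanism ends up being plain HTTP+JSON, it might look something like the sketch below. FastAPI is used here only as an example; the endpoint name and fields are placeholders, and they deliberately mirror the open question above about uploading log text versus pointing thebeast at a URL.

```python
# Minimal sketch of an HTTP+JSON surface for thebeast, assuming FastAPI.
# Endpoint names and fields are placeholders, not a settled design.
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="thebeast")


class AnalyzeRequest(BaseModel):
    log_text: Optional[str] = None    # caller uploads the (trimmed) log ...
    log_url: Optional[str] = None     # ... or tells thebeast where to fetch it
    annotation: Optional[str] = None  # human hint, e.g. "dns resolution"


class AnalyzeResponse(BaseModel):
    summary: str
    confidence: float


@app.post("/analyze", response_model=AnalyzeResponse)
def analyze(req: AnalyzeRequest) -> AnalyzeResponse:
    # TODO: fetch req.log_url if no log_text was supplied, trim it, hand it to
    # the model, and return its analysis.
    return AnalyzeResponse(summary="not implemented yet", confidence=0.0)


# run with: uvicorn thebeast_api:app --reload   (module name is a placeholder)
```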