CoreOS OCP / COS-3227

AI analysis for CoreOS pipeline failures


    • Type: Epic
    • Resolution: Done
    • Status: Done (0% To Do, 0% In Progress, 100% Done)

      There is interest in the team in investigating the use of AI to help us analyze pipeline failures. A small group of us met today to stencil out what this would look like and to identify some steps to take to investigate.

      We'd like to emphasize that this is a learning experience. It may yield something useful or it may not. The goal is to learn.

      We created some bullets for this work:

      # The actual thing that processes logs -> nickname=thebeast
      
      - This is the AI component
      - Might be able to learn some lessons from log-detective 
          - https://log-detective.com/
          - https://github.com/fedora-copr/logdetective
          - https://github.com/fedora-copr/logdetective-website
          - Initial investigation -> log detective looks to be a frontend for interacting with the LLM/SLM
          - Right now they are focused on rpm build failures
              - We might be able to use their work to build something similar but for our own logs
      - Where do we run this thing?
          - Might be able to use Red Hat's internal AI lab stuff (models.corp)
          - Might be able to just run an extra VM (kubevirt) in our pipeline
          - What do we do for upstream?
              - Are we able to use the same instance for both upstream and downstream?
              - If we have to run separate instances how do we combine the training to make things more useful?
      - What model do we use?
          - Do we care about open source?
          - Granite because we have overlords??
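
      As a concrete starting point, thebeast's core loop could be as small as: trim a log to its tail (failures usually surface near the end), wrap it in a prompt, and POST it to whatever model server we land on. This is a sketch only; the endpoint URL, payload shape, and `analysis` response field below are placeholders, not a real API:

      ```python
      # Hedged sketch of thebeast's core loop; endpoint and payload are hypothetical.
      import json
      import urllib.request


      def build_prompt(log_text: str, max_lines: int = 50) -> str:
          """Keep only the tail of the log; failures usually surface near the end."""
          tail = "\n".join(log_text.splitlines()[-max_lines:])
          return (
              "You are a CI log analyst. Identify the most likely root cause "
              "of the failure in this pipeline log:\n\n" + tail
          )


      def analyze(log_text: str, endpoint: str = "http://localhost:8000/v1/analyze") -> str:
          """POST the prompt to a hypothetical model server and return its answer."""
          payload = json.dumps({"prompt": build_prompt(log_text)}).encode()
          req = urllib.request.Request(
              endpoint, data=payload, headers={"Content-Type": "application/json"}
          )
          with urllib.request.urlopen(req) as resp:
              return json.load(resp)["analysis"]
      ```

      Truncating to the tail keeps us under whatever context window the chosen model (Granite or otherwise) ends up having.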
          
          
      # What do we feed to thebeast?
      
      - At a minimum: kola test failure logs
      - Additionally? jenkins run logs
          - pushing to quay failure -> 502
          - failed signing
          - https://github.com/coreos/fedora-coreos-pipeline/issues/1079
          - package failures, missing packages
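
      A sketch of the collection side, assuming each failed kola test leaves its logs in its own directory next to some status marker file. The `status.txt`/`*.log` layout here is an assumption to illustrate the shape of the code; the actual kola output layout needs to be verified:

      ```python
      # Hedged sketch: gather log files for failed tests from a kola-style run dir.
      # The status.txt marker and *.log naming are assumptions, not kola's real layout.
      from pathlib import Path


      def collect_failure_logs(run_dir: str, marker: str = "FAIL") -> list[Path]:
          """Return log files from test directories whose status file records a failure."""
          logs = []
          for status in Path(run_dir).rglob("status.txt"):  # hypothetical marker file
              if marker in status.read_text():
                  logs.extend(status.parent.glob("*.log"))
          return sorted(logs)
      ```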
      
      
      # Can we pre-train thebeast with historical data?
      
      - May be able to go through all of our old slack messages to pull pasted error messages to do some pre-training for thebeast. 
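
      Slack exports are JSON message lists, and pasted errors usually arrive as fenced code blocks, so a first mining pass could be as simple as pulling out everything between triple backticks (whether that alone yields useful training pairs is an open question):

      ```python
      # Sketch: extract fenced pastes (likely error output) from exported Slack messages.
      import re

      # `{3} matches a literal triple backtick; DOTALL lets the paste span lines.
      CODE_BLOCK = re.compile(r"`{3}(.*?)`{3}", re.DOTALL)


      def extract_pastes(messages: list[dict]) -> list[str]:
          """Return non-empty fenced blocks from a list of Slack-export message dicts."""
          pastes = []
          for msg in messages:
              for block in CODE_BLOCK.findall(msg.get("text", "")):
                  block = block.strip()
                  if block:
                      pastes.append(block)
          return pastes
      ```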
      
      
      # How do we interact with thebeast?
      
      - Where do we investigate failures today?
          - slack
          - matrix
          - GitHub PRs
      - How do we integrate AI into our existing workflow?
          - options:
              - 1. API to interact with model/log detective
              - 2. web front end to supply logs and prompt (like log detective)
              - 3. slack bot that can take commands in replies to notifications about failures
                  - /bot analyze 'dns resolution'
                      - slack bot identifies links from threaded message chain
                      - interacts with thebeast to register logs and failure annotation
                      - open question: does the bot retrieve the logs and upload them or does the bot tell thebeast where to retrieve the logs from?
                  - pros: it's where we are today when we investigate most failures
                  - cons: slack isn't the only place where we investigate failures
              - 4. have jenkins send the logs to the model on failure
                  - DWM: I think jenkins uploading automatically is a future state, i.e. once our model is trained and gives useful analysis we can then automatically submit failures AND potentially take action based on the analysis
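
      Option 3 could start as plain text parsing before any Slack SDK is involved: recognize the `/bot analyze '...'` command in a reply, then harvest links from the thread for thebeast to chase. The command syntax and return shape here are illustrative only:

      ```python
      # Hedged sketch: parse a "/bot analyze '<annotation>'" reply and collect
      # log URLs from the thread. Command syntax is a strawman, not decided.
      import re

      CMD = re.compile(r"/bot\s+analyze\s+'([^']+)'")
      URL = re.compile(r"https?://\S+")


      def parse_analyze_command(reply_text: str, thread_texts: list[str]):
          """Return (annotation, urls) if the reply is an analyze command, else None."""
          m = CMD.search(reply_text)
          if not m:
              return None
          urls = [u for text in thread_texts for u in URL.findall(text)]
          return m.group(1), urls
      ```

      Whether the bot then downloads those URLs itself or just hands them to thebeast is the open question noted above.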
              
              
      # First Steps
      
      - Grab access to the granite models and play around with them
      - Question to answer in investigation:
          - is Granite the appropriate thing for this?
          - LLM vs. SLM (small language model): what's appropriate?
      - Look a little more in-depth at log-detective
      - Start investigating user interaction with thebeast
      
      # Next Steps
      
      - ID/provision dedicated infra for this effort
      - Establish RPC mechanism for interacting with thebeast
      - Start to build out the workflow for interaction (training) and (eventually) analysis
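
      For the RPC mechanism, a JSON-over-HTTP surface may be the simplest thing that works for slack bot, Jenkins, and web-frontend callers alike. A stdlib-only sketch with a stubbed-out model call; the `/analyze` route and response shape are assumptions, not a settled interface:

      ```python
      # Hedged sketch: a minimal JSON-over-HTTP "RPC" front for thebeast.
      # Route, payload, and response fields are placeholders for discussion.
      import json
      from http.server import BaseHTTPRequestHandler, HTTPServer


      def fake_analysis(log_text: str) -> str:
          """Stand-in for the real model call."""
          return "possible network failure" if "502" in log_text else "unknown"


      class BeastHandler(BaseHTTPRequestHandler):
          def do_POST(self):
              if self.path != "/analyze":
                  self.send_error(404)
                  return
              length = int(self.headers.get("Content-Length", 0))
              payload = json.loads(self.rfile.read(length))
              body = json.dumps({"analysis": fake_analysis(payload.get("log", ""))}).encode()
              self.send_response(200)
              self.send_header("Content-Type", "application/json")
              self.send_header("Content-Length", str(len(body)))
              self.end_headers()
              self.wfile.write(body)

          def log_message(self, *args):
              pass  # keep request logging quiet


      def serve(port: int = 0) -> HTTPServer:
          """Bind the handler; port 0 lets the OS pick a free port."""
          return HTTPServer(("127.0.0.1", port), BeastHandler)
      ```

      The same route could back either the slack bot or a Jenkins post-failure step, so the interaction options above don't have to be mutually exclusive.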
      

              Assignee: Unassigned
              Reporter: Dusty Mabe (rhn-gps-dmabe)
              Votes: 0
              Watchers: 5