CoreOS OCP / COS-3227

AI analysis for CoreOS pipeline failures


    • Type: Epic
    • Resolution: Done
    • Status: Done (0% To Do, 0% In Progress, 100% Done)

      There is interest in the team in investigating the use of AI to help us analyze pipeline failures. A small group of us met today to stencil out what this would look like and to identify some steps to take to investigate.

      We'd like to emphasize that this is a learning experience. It may yield something useful or it may not. The goal is to learn.

      We created some bullets for this work:

      # The actual thing that processes logs -> nickname=thebeast
      
      - This is the AI component
      - Might be able to learn some lessons from log-detective 
          - https://log-detective.com/
          - https://github.com/fedora-copr/logdetective
          - https://github.com/fedora-copr/logdetective-website
          - Initial investigation -> log detective looks to be a frontend for interacting with the LLM/SLM
          - Right now they are focused on rpm build failures
              - We might be able to use their work to build something similar but for our own logs
      - Where do we run this thing?
          - Might be able to use Red Hat's internal AI lab stuff (models.corp)
          - Might be able to just run an extra VM (kubevirt) in our pipeline
          - What do we do for upstream?
              - Are we able to use the same instance for both upstream and downstream?
              - If we have to run separate instances how do we combine the training to make things more useful?
      - What model do we use?
          - Do we care about open source?
          - Granite because we have overlords??
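
      As a concrete starting point, thebeast's core loop could be as small as: trim a log to its tail (failures usually surface near the end), wrap it in a prompt, and POST it to whatever model server we land on. This is a sketch only; the endpoint URL, payload shape, and `analysis` response field below are placeholders, not a real API:

      ```python
      # Hedged sketch of thebeast's core loop; endpoint and payload are hypothetical.
      import json
      import urllib.request


      def build_prompt(log_text: str, max_lines: int = 50) -> str:
          """Keep only the tail of the log; failures usually surface near the end."""
          tail = "\n".join(log_text.splitlines()[-max_lines:])
          return (
              "You are a CI log analyst. Identify the most likely root cause "
              "of the failure in this pipeline log:\n\n" + tail
          )


      def analyze(log_text: str, endpoint: str = "http://localhost:8000/v1/analyze") -> str:
          """POST the prompt to a hypothetical model server and return its answer."""
          payload = json.dumps({"prompt": build_prompt(log_text)}).encode()
          req = urllib.request.Request(
              endpoint, data=payload, headers={"Content-Type": "application/json"}
          )
          with urllib.request.urlopen(req) as resp:
              return json.load(resp)["analysis"]
      ```

      Truncating to the tail keeps us under whatever context window the chosen model (Granite or otherwise) ends up having.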
          
          
      # What do we feed to thebeast?
      
      - At a minimum: kola test failure logs
      - Additionally? jenkins run logs
          - pushing to quay failure -> 502
          - failed signing
          - https://github.com/coreos/fedora-coreos-pipeline/issues/1079
          - package failures, missing packages
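
      A sketch of the collection side, assuming each failed kola test leaves its logs in its own directory next to some status marker file. The `status.txt`/`*.log` layout here is an assumption to illustrate the shape of the code; the actual kola output layout needs to be verified:

      ```python
      # Hedged sketch: gather log files for failed tests from a kola-style run dir.
      # The status.txt marker and *.log naming are assumptions, not kola's real layout.
      from pathlib import Path


      def collect_failure_logs(run_dir: str, marker: str = "FAIL") -> list[Path]:
          """Return log files from test directories whose status file records a failure."""
          logs = []
          for status in Path(run_dir).rglob("status.txt"):  # hypothetical marker file
              if marker in status.read_text():
                  logs.extend(status.parent.glob("*.log"))
          return sorted(logs)
      ```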
      
      
      # Can we pre-train thebeast with historical data?
      
      - May be able to go through all of our old slack messages to pull pasted error messages to do some pre-training for thebeast. 
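
      Slack exports are JSON message lists, and pasted errors usually arrive as fenced code blocks, so a first mining pass could be as simple as pulling out everything between triple backticks (whether that alone yields useful training pairs is an open question):

      ```python
      # Sketch: extract fenced pastes (likely error output) from exported Slack messages.
      import re

      # `{3} matches a literal triple backtick; DOTALL lets the paste span lines.
      CODE_BLOCK = re.compile(r"`{3}(.*?)`{3}", re.DOTALL)


      def extract_pastes(messages: list[dict]) -> list[str]:
          """Return non-empty fenced blocks from a list of Slack-export message dicts."""
          pastes = []
          for msg in messages:
              for block in CODE_BLOCK.findall(msg.get("text", "")):
                  block = block.strip()
                  if block:
                      pastes.append(block)
          return pastes
      ```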
      
      
      # How do we interact with thebeast?
      
      - Where do we investigate failures today?
          - slack
          - matrix
          - GitHub PRs
      - How do we integrate AI into our existing workflow?
          - options:
              - 1. API to interact with model/log detective
              - 2. web front end to supply logs and prompt (like log detective)
              - 3. slack bot that can take commands in replies to notifications about failures
                  - /bot analyze 'dns resolution'
                      - slack bot identifies links from threaded message chain
                      - interacts with thebeast to register logs and failure annotation
                      - open question: does the bot retrieve the logs and upload them or does the bot tell thebeast where to retrieve the logs from?
                  - pros: it's where we are today when we investigate most failures
                  - cons: slack isn't the only place where we investigate failures
              - 4. have jenkins send the logs to the model on failure
                  - DWM: I think jenkins uploading automatically is a future state, i.e. once our model is trained and gives useful analysis we can then automatically submit failures AND potentially take action based on the analysis
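
      Option 3 could start as plain text parsing before any Slack SDK is involved: recognize the `/bot analyze '...'` command in a reply, then harvest links from the thread for thebeast to chase. The command syntax and return shape here are illustrative only:

      ```python
      # Hedged sketch: parse a "/bot analyze '<annotation>'" reply and collect
      # log URLs from the thread. Command syntax is a strawman, not decided.
      import re

      CMD = re.compile(r"/bot\s+analyze\s+'([^']+)'")
      URL = re.compile(r"https?://\S+")


      def parse_analyze_command(reply_text: str, thread_texts: list[str]):
          """Return (annotation, urls) if the reply is an analyze command, else None."""
          m = CMD.search(reply_text)
          if not m:
              return None
          urls = [u for text in thread_texts for u in URL.findall(text)]
          return m.group(1), urls
      ```

      Whether the bot then downloads those URLs itself or just hands them to thebeast is the open question noted above.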
              
              
      # First Steps
      
      - Grab access to the granite models and play around with them
      - Question to answer in investigation:
          - is Granite the appropriate thing for this?
          - LLM vs. SLM (small language model): what's appropriate?
      - Look a little more in-depth at log-detective
      - Start investigating user interaction with thebeast
      
      # Next Steps
      
      - ID/provision dedicated infra for this effort
      - Establish RPC mechanism for interacting with thebeast
      - Start to build out the workflow for interaction (training) and (eventually) analysis
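
      For the RPC mechanism, a JSON-over-HTTP surface may be the simplest thing that works for slack bot, Jenkins, and web-frontend callers alike. A stdlib-only sketch with a stubbed-out model call; the `/analyze` route and response shape are assumptions, not a settled interface:

      ```python
      # Hedged sketch: a minimal JSON-over-HTTP "RPC" front for thebeast.
      # Route, payload, and response fields are placeholders for discussion.
      import json
      from http.server import BaseHTTPRequestHandler, HTTPServer


      def fake_analysis(log_text: str) -> str:
          """Stand-in for the real model call."""
          return "possible network failure" if "502" in log_text else "unknown"


      class BeastHandler(BaseHTTPRequestHandler):
          def do_POST(self):
              if self.path != "/analyze":
                  self.send_error(404)
                  return
              length = int(self.headers.get("Content-Length", 0))
              payload = json.loads(self.rfile.read(length))
              body = json.dumps({"analysis": fake_analysis(payload.get("log", ""))}).encode()
              self.send_response(200)
              self.send_header("Content-Type", "application/json")
              self.send_header("Content-Length", str(len(body)))
              self.end_headers()
              self.wfile.write(body)

          def log_message(self, *args):
              pass  # keep request logging quiet


      def serve(port: int = 0) -> HTTPServer:
          """Bind the handler; port 0 lets the OS pick a free port."""
          return HTTPServer(("127.0.0.1", port), BeastHandler)
      ```

      The same route could back either the slack bot or a Jenkins post-failure step, so the interaction options above don't have to be mutually exclusive.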
      

              Assignee: Unassigned
              Reporter: Dusty Mabe (rhn-gps-dmabe)
              Votes: 0
              Watchers: 5