- Epic
- Resolution: Unresolved
- Major
- ai-for-ovnk-troubleshooting
- Product / Portfolio Work
- 81% To Do, 15% In Progress, 4% Done
Template:
Networking Definition of Planned
Epic Template descriptions and documentation
Epic Goal
The OCP Core Networking team did spike work during our shift-week learning to see whether this would be fruitful for us: https://docs.google.com/presentation/d/1glNUCcA8zpNY-ckwLIRXyedyY1WPUW-tkI17Jtjo3sg/edit?slide=id.g36b12eb63d6_0_0#slide=id.g36b12eb63d6_0_0
The results were good enough that we decided to make this our first AI project for improving the team's efficiency.
MCP for Troubleshooting OVN-Kubernetes Issues:
The main question to agree on is:
- Should we build our own locally trained model that retains the context of what we teach it?
- Should we continue to rely on external models such as Gemini, Claude, etc.?
In our shift-week experience, we saw that with external models we lose the context window, the model starts hallucinating, and sometimes we need to re-teach it.
For this effort, let's pick 1-3 specific features (say EgressIPs, UDNs, or BGP; anything is fine as long as the scope stays narrow). Take EgressIPs as an example: it has the most unstable code and generates the most escalations. We pick that feature and teach the model everything it needs to know about EgressIPs in a layered fashion: start with OVN-Kubernetes, then OVN, then OVS. Then we see how well it performs in failure scenarios and whether it can debug the issues our customers commonly hit.
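To make the layered idea concrete, here is a minimal sketch of a top-down triage walk for a single feature. Everything in it is an assumption for illustration: the `Layer` type, the `first_suspect_layer` helper, and the specific commands listed per layer are hypothetical choices, not an agreed design or the output of the spike.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Layer:
    """One layer of the stack and the commands used to collect evidence there."""
    name: str
    commands: List[str]


# Ordered top-down, matching the OVN-Kubernetes -> OVN -> OVS teaching order.
# The commands are illustrative assumptions; the real set is still to be decided.
EGRESSIP_LAYERS = [
    Layer("OVN-Kubernetes", ["oc get egressip -o yaml"]),
    Layer("OVN", ["ovn-nbctl lr-policy-list ovn_cluster_router"]),
    Layer("OVS", ["ovs-ofctl dump-flows br-ex"]),
]


def first_suspect_layer(
    layers: List[Layer],
    looks_healthy: Callable[[Layer], bool],
) -> Optional[str]:
    """Walk the layers in order and return the first one that fails its check.

    `looks_healthy` is a hypothetical callback: in practice it would be the
    model (or a human) judging the evidence collected with `layer.commands`.
    Returns None when every layer looks healthy.
    """
    for layer in layers:
        if not looks_healthy(layer):
            return layer.name
    return None
```

For example, `first_suspect_layer(EGRESSIP_LAYERS, check)` would point at "OVN" if the Kubernetes-level objects look fine but the OVN-level check fails, which is where the next round of questions to the model would focus.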
Another thing to decide is which data source to target:
- Real-time on-cluster data, versus
- Simulating customer clusters from must-gather data, as if it were a real cluster
Which use case must we hit first? The first is mostly useful to the team itself. The second is the better target, because the end goal is to help debug customer issues where we won't have direct access to the clusters. What we are after is a tool that helps first-line support and reduces toil on the core networking team.
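One way to avoid locking in either choice early is to hide the data source behind a small interface, so the same troubleshooting logic can run against a live cluster or a must-gather archive. The sketch below is only an illustration under stated assumptions: the class names are hypothetical, the live backend assumes an `oc` binary that is already logged in, and the must-gather path `cluster-scoped-resources/k8s.ovn.org/egressips` is an assumed layout that should be verified against a real archive.

```python
import subprocess
from pathlib import Path
from typing import Protocol


class ClusterData(Protocol):
    """Anything that can hand back the EgressIP objects as YAML text."""

    def egressip_yaml(self) -> str: ...


class LiveCluster:
    """Reads from a running cluster (assumes `oc` is on PATH and logged in)."""

    def egressip_yaml(self) -> str:
        result = subprocess.run(
            ["oc", "get", "egressip", "-o", "yaml"],
            check=True, capture_output=True, text=True,
        )
        return result.stdout


class MustGather:
    """Reads from an unpacked must-gather directory."""

    # Assumed location of cluster-scoped EgressIP dumps; verify before relying on it.
    SUBDIR = Path("cluster-scoped-resources") / "k8s.ovn.org" / "egressips"

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def egressip_yaml(self) -> str:
        files = sorted((self.root / self.SUBDIR).glob("*.yaml"))
        # Join per-object files into one multi-document YAML stream.
        return "\n---\n".join(f.read_text() for f in files)


def mentions_node(source: ClusterData, node_name: str) -> bool:
    """Toy check that works the same way on either backend."""
    return node_name in source.egressip_yaml()
```

With this shape, starting with the must-gather backend for customer cases does not rule out adding the live backend later for the team's own use.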
The key indicator of success would be, say, integrating it with the OCP must-gather or other information we get from live bugs and testing it out, i.e. trying the PoC in the field and seeing what results we get on live customer bugs. That, to me, is the measure of success.
Kubernetes-MCP-Server was pretty good by itself. Beyond that, it would be good to investigate the new must-gather MCP server and, more importantly, decide on the model we want to build or reuse: something that is easy for each team member to teach and interact with over time, and something maintainable.
Why is this important?
Planning Done Checklist
The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status
Priority is set by engineering
Epic must be linked to a Parent Feature
Target version must be set
Assignee must be set
Enhancement Proposal is Implementable
No outstanding questions about major work breakdown
Are all Stakeholders known? Have they all been notified about this item?
Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
- Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
- The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.
Additional information on each of the above items can be found here: Networking Definition of Planned
Acceptance Criteria
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
...
Dependencies (internal and external)
1.
...
Previous Work (Optional):
1. …
Open questions:
1. …
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>