Feature Overview
tl;dr: Build a background agent that simplifies how customers interact with ACM.
Users are already starting to interact with technology differently. The advent of concierge apps on mobile devices and in the home has changed user-interaction patterns over the past decade. Software is a key enabler here, providing interactions with AI chatbots, AI search, AI assistants, and more.
Agent-based user interactions are not new: ServiceNow and other IT ticketing systems have long used machine learning to give users in-context support for the task at hand, whether that is an insurance claim, a support case, a travel itinerary change, etc. There is nothing novel or provocative in saying that a support agent - ML, AI, or otherwise - can be warranted within IT operations teams.
So it is natural that such interactions will be expected from Red Hat Advanced Cluster Management (ACM), a tool designed to lower the overall operational costs of platform engineering (IT spend reduction) and speed the delivery of systems to end users (10x the DevEx).
There are three axes along which we can analyze this:
Axis 1
Agentic technology has typically been used to produce:
- Q&A systems or Chatbots
- Personal assistants that can also do things - more sophisticated than just chatbots
- Daemon-like agents running in the background, silently doing things
Initial goals: 2 and 3 are the ones we should target.
Axis 2
Agentic technology can:
- read data from databases and APIs, which could be Kubernetes or non-Kubernetes (think edge)
- produce outputs that bring a human (user) into the loop - e.g., generate Git PRs or events requiring approval, either of these before running a command
- mutate the state of the system - e.g., delete a pod, shut down a cluster, enforce a policy
Practically speaking, we are not targeting 3; we are keeping focus on 1 and 2.
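The human-in-the-loop output from Axis 2, point 2 can be sketched as a simple approval gate: read-only actions run immediately, while anything that would mutate state is queued until a human explicitly approves it. This is a minimal illustrative sketch - the class names, in-memory queue, and lambda actions are assumptions for demonstration, not an ACM API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ProposedAction:
    description: str            # e.g. "delete pod foo in namespace bar"
    mutates_state: bool         # Axis 2, point 3: requires approval if True
    execute: Callable[[], str]  # the action itself, deferred until allowed

@dataclass
class ApprovalGate:
    """Read-only actions run directly; mutations wait for a human."""
    pending: List[ProposedAction] = field(default_factory=list)

    def submit(self, action: ProposedAction) -> str:
        if not action.mutates_state:
            return action.execute()      # reads are safe to run immediately
        self.pending.append(action)      # mutations are parked for approval
        return f"PENDING APPROVAL: {action.description}"

    def approve(self, index: int) -> str:
        # A human has explicitly signed off; now the mutation may run.
        return self.pending.pop(index).execute()

gate = ApprovalGate()
print(gate.submit(ProposedAction("list idle VMs", False, lambda: "3 idle VMs")))
print(gate.submit(ProposedAction("delete pod foo", True, lambda: "pod deleted")))
print(gate.approve(0))  # runs only after the explicit approval
```

In a real agent the "approval" step would surface as a Git PR or an event requiring sign-off, per the requirement that AI never mutates state without explicit human permission.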
Axis 3
AI/ML/Analytics can be used to:
- Generally interrogate system states
- Do problem determination
- Preempt problems
3 can be the most valuable because it prevents problems in the first place. This is what our partner was doing with their AI. But it usually relies on the AI to mutate the system state in some way - back to Axis 2, point 3 - so we have to approach this in a balanced way.
Goals
This Section: Provide a high-level goal statement, providing user context
and expected user outcome(s) for this feature. Before we execute an epic, we must pick one goal.
- Produce an ACM Personal Assistant that makes customer interaction with ACM delightful. For example:
  - create a policy, generate a PR, and interact with the PR
  - guide the user through creating a placement (including a managed ClusterSet, etc.), generate a PR, and interact with the PR
  - do deep searches to answer a question by looking at data from Search, the hub Kube API, or a managed cluster's Kube API, and/or metric data as the question demands. Typical questions: which of my VMs have sat idle for the last month across a set of labels? Why is policy foo flip-flopping?
  - perhaps have the knowledge to check current state - why is my Search collector crashing? Why is my add-on not connecting?
  - other?!?!
- Produce an ACM agent running in the background that can prevent problems from happening, or open Jira tickets with context. For example:
  - adding a managed cluster to a hub cluster that is already loaded
  - placing a workload on a managed cluster that is already loaded
  - etc. The point is to catch things before they go bad. Repairing things after they have gone bad is what we do today; the goal is to prevent things from going bad in the first place - reducing cost for the customer and for Red Hat
  - open a Jira if certificates will expire within x days
  - etc.
- It is assumed that the Personal Assistant may integrate with OpenShift Lightspeed down the line. But our immediate focus is not how to integrate - that will inevitably happen - but what to integrate: the niche functionality the customer needs that only ACM know-how can produce.
- We will have to get early versions of this into the demo systems and enhance it as we iterate.
  - This will allow it to be demoed.
  - This will prioritize the work to gather feedback on how the agent is performing, along with capturing user feedback.
  - This will also allow us to run it in a customer-like environment and force us to consider upfront where to run the inference server (LLM), etc.
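The background-agent goal of opening a Jira when certificates will expire within x days can be sketched as a periodic check. In this sketch the certificate inventory is a plain dict and the "open Jira" step is a stand-in string; a real version would read cluster secrets and call a ticketing API, and all names here are hypothetical.

```python
from datetime import datetime, timedelta

def expiring_certs(certs: dict, now: datetime, within_days: int) -> list:
    """Return names of certs whose notAfter falls inside the warning window."""
    cutoff = now + timedelta(days=within_days)
    return sorted(name for name, not_after in certs.items() if not_after <= cutoff)

def run_check(certs: dict, now: datetime, within_days: int = 30) -> list:
    """One pass of the background agent: one ticket per expiring cert."""
    cutoff = now + timedelta(days=within_days)
    tickets = []
    for name in expiring_certs(certs, now, within_days):
        # Stand-in for a Jira API call, carrying context for the operator.
        tickets.append(f"JIRA: cert '{name}' expires before {cutoff:%Y-%m-%d}")
    return tickets

now = datetime(2024, 6, 1)
certs = {
    "hub-api": now + timedelta(days=10),      # inside the 30-day window
    "addon-agent": now + timedelta(days=90),  # safe for now
}
print(run_check(certs, now))
```

This is the "catch things before they go bad" pattern: the agent only reads state and files a ticket, leaving any remediation to a human, consistent with the approval requirement below.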
Requirements
This Section: A list of specific needs or objectives that a Feature must
deliver to satisfy the Feature. Some requirements will be flagged as MVP.
If an MVP requirement slips, the feature shifts. If a non-MVP requirement
slips, it does not shift the feature.
| Requirement | Notes | isMvp? |
|---|---|---|
| Protection must be in place so that the AI cannot mutate state without explicit human permission | | YES |
| We must not bypass the GitOps mantra; AI is not meant to bypass the best principles of managing a fleet | | YES |
| Feedback collection: we must keep a log of the questions asked, the answers given, and user feedback, and be able to examine it offline (perhaps ask the customer for a data dump) to improve | | YES |
| CI must be running successfully with test automation | This is a requirement for ALL features. | YES |
| Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
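The feedback-collection requirement could be as simple as an append-only JSONL log of question/answer/feedback records that a customer can dump for offline review. The format and class below are an assumption for illustration, not a mandated ACM interface.

```python
import io
import json

class FeedbackLog:
    """Append-only log of agent interactions, one JSON object per line."""

    def __init__(self, stream):
        self.stream = stream  # any writable text stream (file, StringIO, ...)

    def record(self, question: str, answer: str, feedback: str = ""):
        entry = {"question": question, "answer": answer, "feedback": feedback}
        self.stream.write(json.dumps(entry) + "\n")  # JSONL: easy to dump and grep

buf = io.StringIO()
log = FeedbackLog(buf)
log.record("why is policy foo flip-flopping?",
           "policy foo toggles because two placements contend for it",
           "helpful")
entries = [json.loads(line) for line in buf.getvalue().splitlines()]
print(entries[0]["feedback"])
```

Because each line is an independent JSON object, the log can be examined offline with standard tooling when a customer provides a data dump.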
(Optional) Use Cases
This Section:
- Main success scenarios - high-level user stories
- Alternate flow/scenarios - high-level user stories
- ...
Questions to answer
- Do we need to fine-tune models, or can we just use prompts and agentic technology to solve the problems? We will learn as we go deeper.
- As we iterate, we will see an overlap of tools (not necessarily agents) being created by different teams, and will need to adjust accordingly.
- We will need RAG. Will our RAG be done at the OpenShift Lightspeed level automatically? Can we get a handle to that agent?
Out of Scope
- Whether these agents logically belong at the Global Hub level or the ACM level.
- This is separate from the effort to use ACM to deploy AI workloads such as Federated Learning, Multi Kueue, etc.
Background and strategic fit
This Section: What does the person writing code, testing, documenting
need to know? What context can be provided to frame this feature?
Assumptions
- The contents of this feature should not be affected by the choice of underlying frameworks, etc.
Customer Considerations
- Customers should be able to use ACM without these agents - as is the case today.
- When customers use agents, they will have the technical option to run the LLM:
  - outside the cluster in a RH/IBM-hosted environment
  - outside the cluster in their own environment (data does not leave the customer periphery)
  - in the ACM hub cluster itself
Documentation Considerations
Questions to be addressed:
- What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc.)?
- Does this feature have a doc impact? New Content, Updates to existing content, Release Note, or No Doc Impact?
- If unsure and no Technical Writer is available, please contact Content Strategy.
- What concepts do customers need to understand to be successful in [action]?
- How do we expect customers will use the feature? For what purpose(s)?
- What reference material might a customer want/need to complete [action]?
- Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
- What is the doc impact (New Content, Updates to existing content, or Release Note)?