RHCLOUD-40894: Agentic nudge to record WebRCA event

      WebRCA currently produces an AI summary of the incident, but this only works if people post regular status updates. In this ticket I propose nudging people to record those updates by drafting summaries from the active conversations in various internal channels.

      Given an incident channel such as #ITN-2025-12345, the agent watches the chat for significant findings and long Slack threads. Once it determines that a "significance threshold" has been reached (ideas below), or 30 minutes have passed with no events, it creates a summary of what was said and proposes that it be added as a WebRCA event. If the proposal is rejected, the user still has a prompt and summary to build on for their own event.
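
      Below is a minimal sketch of the watch-and-nudge loop in Python. It assumes a simple polling model; fetch_new_messages, is_significant, summarize, and propose_webrca_event are placeholder stubs standing in for the real Slack, classifier, and WebRCA integrations, and "30 minutes with no events" is read here as 30 minutes since the last proposed event.

      import time
      from datetime import datetime, timedelta

      QUIET_PERIOD = timedelta(minutes=30)   # "30 minutes have passed with no events"
      SIGNIFICANCE_THRESHOLD = 3             # placeholder value; see the scoring sketch further below

      # --- Placeholder integrations (hypothetical; real Slack/LLM/WebRCA calls go here) ---

      def fetch_new_messages(channel):
          """Return chat messages posted to the channel since the last poll."""
          return []

      def is_significant(message):
          """Classify a message against the significance categories below."""
          return False

      def summarize(messages):
          """Draft a WebRCA-style summary of the buffered conversation."""
          return "\n".join(m.get("text", "") for m in messages)

      def propose_webrca_event(channel, summary):
          """Show the draft in the channel and ask whether to record it as a WebRCA event."""
          return False

      def watch_channel(channel):
          buffer = []                        # messages seen since the last proposal
          significant = 0
          last_proposal = datetime.utcnow()

          while True:
              for message in fetch_new_messages(channel):
                  buffer.append(message)
                  if is_significant(message):
                      significant += 1

              threshold_hit = significant >= SIGNIFICANCE_THRESHOLD
              quiet_too_long = datetime.utcnow() - last_proposal > QUIET_PERIOD

              if buffer and (threshold_hit or quiet_too_long):
                  summary = summarize(buffer)
                  propose_webrca_event(channel, summary)
                  # Even if rejected, the user now has a prompt and summary to build on.
                  buffer.clear()
                  significant = 0
                  last_proposal = datetime.utcnow()

              time.sleep(60)                 # poll once a minute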

      Initial ideas for defining "significance" are below; a sketch of how the categories could be combined into a single score follows Category 4.

      Category 1: Diagnostics & Root Cause Analysis

      This category focuses on discovering the "why" behind the incident. An update is significant if it moves the team closer to a definitive root cause; a sketch of how these questions could feed a classifier prompt follows the list.

      • Has a specific, recurring error message been isolated?
        • Example: "Everyone is seeing the ImagePullBackOff error on nodes in the us-east-1a zone."
      • Has a correlation been established between an event and the incident?
        • Example: "The failures started right after the pull-secret was rotated."
      • Has a specific component been credibly implicated?
        • Example: "We've confirmed the Ingress Controller is not routing traffic to the new pods, that seems to be the bottleneck."
      • Has a diagnostic test (or a user/operator action) revealed critical information?
        • Example: "When we bypassed the network policy, the service could connect to the database. It's definitely a policy issue."
      • Has a hypothesis been confirmed or refuted with evidence?
        • Example: "We thought it was DNS, but nslookup resolves correctly from within the pod. We can rule out CoreDNS."

      Category 2: Actions & Remediation Efforts

      This category tracks the "what": what is being done to fix the problem. An update is significant if it describes a concrete action taken to mitigate or resolve the incident.

      • Has a rollback or reversion been initiated?
        • Example: "We are rolling back the cluster-authentication-operator to the previous stable version."
      • Has a configuration change been applied as a fix?
        • Example: "The resource limits on the billing-service DeploymentConfig have been increased from 256Mi to 1Gi."
      • Has a node, pod, or component been manually restarted, drained, or cordoned?
        • Example: "We've cordoned node-7 and are draining it to reschedule the affected pods."
      • Has a workaround been successfully implemented?
        • Example: "As a temporary fix, we've scaled the replica count to 5, and performance has stabilized."
      • Has a ticket been escalated or an external team engaged?
        • Example: "We've opened a high-priority ticket with the networking hardware vendor."

      Category 3: Change in Impact or Scope

      This category focuses on understanding the "how bad" and "how widespread" the incident is. An update is significant if it redefines the incident's blast radius.

      • Has the number of affected users, services, or nodes changed significantly?
        • Example: "This is no longer isolated to the dev environment; we're seeing the same errors on two production clusters."
      • Has the severity of the impact changed?
        • Example: "The service was just degraded, but now it's completely unavailable. We need to upgrade the severity."
      • Has the incident been contained or has the "bleeding" stopped?
        • Example: "The fix has been applied to the primary cluster and we are no longer seeing new errors. The issue is now contained."
      • Has a new, unexpected symptom appeared?
        • Example: "In addition to the API latency, we're now seeing etcd leadership elections failing."

      Category 4: Key Milestones & Resolution

      This category marks the major turning points in the incident lifecycle.

      • Has the incident been officially declared "resolved"?
        • Example: "We've monitored for 30 minutes and the system is stable. We're marking this incident as resolved."
      • Has a post-mortem or root cause analysis (RCA) been scheduled or started?
        • Example: "Let's schedule the post-mortem for tomorrow to discuss the timeline and preventative actions."
      • Has a follow-up action been identified to prevent recurrence?
        • Example: "Follow-up: We need to add alerting for disk pressure on etcd nodes."

      What to Ignore (Defining Non-Significance)

      To reduce noise, the agent should be instructed to ignore the following (a cheap pre-filter sketch follows this list):

      • General chatter: Greetings, expressions of sympathy ("that sounds bad"), or off-topic discussions.
      • Pure speculation without data: "Maybe it's a solar flare?" or "I wonder if the database is slow."
      • Redundant information: The fifth person confirming that the API is slow when it's already an established fact.
      • Questions without answers: A person asking "Any updates?" is not significant, but the answer to their question might be.
      • Ambiguous or vague statements: "I tried something" or "I think it might be getting better."
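
      Some of these rules are cheap enough to apply before any LLM call. A minimal sketch of such a pre-filter, using simple keyword and shape heuristics; the patterns are illustrative examples, not an exhaustive rule set, and redundancy detection would still need the classifier's context.

      import re

      GREETING_PATTERN = re.compile(r"^(hi|hello|hey|thanks|thank you|good (morning|afternoon))\b", re.I)
      STATUS_QUESTION_PATTERN = re.compile(r"\bany updates?\??$", re.I)
      VAGUE_PHRASES = ("i tried something", "i think it might be getting better", "maybe it's")

      def is_obvious_noise(text):
          """Drop obvious chatter before the message ever reaches the classifier."""
          stripped = text.strip().lower()
          if not stripped:
              return True
          if GREETING_PATTERN.match(stripped):
              return True                  # general chatter / greetings
          if STATUS_QUESTION_PATTERN.search(stripped):
              return True                  # "Any updates?" without an answer
          if any(phrase in stripped for phrase in VAGUE_PHRASES):
              return True                  # vague or purely speculative statements
          return False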

      Assignee: Unassigned
      Reporter: George Adams (geowa4.openshift)