RHCLOUD-40894: Agentic nudge to record WebRCA event

      WebRCA currently produces an AI summary of the incident, but this only works if people post regular status updates. In this ticket I propose nudging people to record those updates by drafting summaries from the active conversations in various internal channels.

      Given an incident channel such as #ITN-2025-12345, the agent watches the chat for significant findings and long Slack threads. Once it determines that a "significance threshold" has been reached (ideas below), or 30 minutes have passed with no events, it creates a summary of what was said and proposes that it be added as a WebRCA event. If the proposal is rejected, the user still has a prompt and summary to build on for their own event.
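
      Below is a minimal sketch of the watch-and-nudge loop in Python. It assumes a simple polling model; fetch_new_messages, is_significant, summarize, and propose_webrca_event are placeholder stubs standing in for the real Slack, classifier, and WebRCA integrations, and "30 minutes with no events" is read here as 30 minutes since the last proposed event.

      import time
      from datetime import datetime, timedelta

      QUIET_PERIOD = timedelta(minutes=30)   # "30 minutes have passed with no events"
      SIGNIFICANCE_THRESHOLD = 3             # placeholder value; see the scoring sketch further below

      # --- Placeholder integrations (hypothetical; real Slack/LLM/WebRCA calls go here) ---

      def fetch_new_messages(channel):
          """Return chat messages posted to the channel since the last poll."""
          return []

      def is_significant(message):
          """Classify a message against the significance categories below."""
          return False

      def summarize(messages):
          """Draft a WebRCA-style summary of the buffered conversation."""
          return "\n".join(m.get("text", "") for m in messages)

      def propose_webrca_event(channel, summary):
          """Show the draft in the channel and ask whether to record it as a WebRCA event."""
          return False

      def watch_channel(channel):
          buffer = []                        # messages seen since the last proposal
          significant = 0
          last_proposal = datetime.utcnow()

          while True:
              for message in fetch_new_messages(channel):
                  buffer.append(message)
                  if is_significant(message):
                      significant += 1

              threshold_hit = significant >= SIGNIFICANCE_THRESHOLD
              quiet_too_long = datetime.utcnow() - last_proposal > QUIET_PERIOD

              if buffer and (threshold_hit or quiet_too_long):
                  summary = summarize(buffer)
                  propose_webrca_event(channel, summary)
                  # Even if rejected, the user now has a prompt and summary to build on.
                  buffer.clear()
                  significant = 0
                  last_proposal = datetime.utcnow()

              time.sleep(60)                 # poll once a minute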

      Initial ideas for defining "significance" are below; a sketch of how the categories could be combined into a single score follows Category 4.

      Category 1: Diagnostics & Root Cause Analysis

      This category focuses on discovering the "why" behind the incident. An update is significant if it moves the team closer to a definitive root cause; a sketch of how these questions could feed a classifier prompt follows the list.

      • Has a specific, recurring error message been isolated?
        • Example: "Everyone is seeing the ImagePullBackOff error on nodes in the us-east-1a zone."
      • Has a correlation been established between an event and the incident?
        • Example: "The failures started right after the pull-secret was rotated."
      • Has a specific component been credibly implicated?
        • Example: "We've confirmed the Ingress Controller is not routing traffic to the new pods, that seems to be the bottleneck."
      • Has a diagnostic test (or a user/operator action) revealed critical information?
        • Example: "When we bypassed the network policy, the service could connect to the database. It's definitely a policy issue."
      • Has a hypothesis been confirmed or refuted with evidence?
        • Example: "We thought it was DNS, but nslookup resolves correctly from within the pod. We can rule out CoreDNS."

      Category 2: Actions & Remediation Efforts

      This category tracks the "what": what is being done to fix the problem. An update is significant if it describes a concrete action taken to mitigate or resolve the incident.

      • Has a rollback or reversion been initiated?
        • Example: "We are rolling back the cluster-authentication-operator to the previous stable version."
      • Has a configuration change been applied as a fix?
        • Example: "The resource limits on the billing-service DeploymentConfig have been increased from 256Mi to 1Gi."
      • Has a node, pod, or component been manually restarted, drained, or cordoned?
        • Example: "We've cordoned node-7 and are draining it to reschedule the affected pods."
      • Has a workaround been successfully implemented?
        • Example: "As a temporary fix, we've scaled the replica count to 5, and performance has stabilized."
      • Has a ticket been escalated or an external team engaged?
        • Example: "We've opened a high-priority ticket with the networking hardware vendor."

      Category 3: Change in Impact or Scope

      This category focuses on understanding the "how bad" and "how widespread" the incident is. An update is significant if it redefines the incident's blast radius.

      • Has the number of affected users, services, or nodes changed significantly?
        • Example: "This is no longer isolated to the dev environment; we're seeing the same errors on two production clusters."
      • Has the severity of the impact changed?
        • Example: "The service was just degraded, but now it's completely unavailable. We need to upgrade the severity."
      • Has the incident been contained or has the "bleeding" stopped?
        • Example: "The fix has been applied to the primary cluster and we are no longer seeing new errors. The issue is now contained."
      • Has a new, unexpected symptom appeared?
        • Example: "In addition to the API latency, we're now seeing etcd leadership elections failing."

      Category 4: Key Milestones & Resolution

      This category marks the major turning points in the incident lifecycle.

      • Has the incident been officially declared "resolved"?
        • Example: "We've monitored for 30 minutes and the system is stable. We're marking this incident as resolved."
      • Has a post-mortem or root cause analysis (RCA) been scheduled or started?
        • Example: "Let's schedule the post-mortem for tomorrow to discuss the timeline and preventative actions."
      • Has a follow-up action been identified to prevent recurrence?
        • Example: "Follow-up: We need to add alerting for disk pressure on etcd nodes."

      What to Ignore (Defining Non-Significance)

      To reduce noise, the agent should be instructed to ignore the following (a cheap pre-filter sketch follows this list):

      • General chatter: Greetings, expressions of sympathy ("that sounds bad"), or off-topic discussions.
      • Pure speculation without data: "Maybe it's a solar flare?" or "I wonder if the database is slow."
      • Redundant information: The fifth person confirming that the API is slow when it's already an established fact.
      • Questions without answers: A person asking "Any updates?" is not significant, but the answer to their question might be.
      • Ambiguous or vague statements: "I tried something" or "I think it might be getting better."
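
      Some of these rules are cheap enough to apply before any LLM call. A minimal sketch of such a pre-filter, using simple keyword and shape heuristics; the patterns are illustrative examples, not an exhaustive rule set, and redundancy detection would still need the classifier's context.

      import re

      GREETING_PATTERN = re.compile(r"^(hi|hello|hey|thanks|thank you|good (morning|afternoon))\b", re.I)
      STATUS_QUESTION_PATTERN = re.compile(r"\bany updates?\??$", re.I)
      VAGUE_PHRASES = ("i tried something", "i think it might be getting better", "maybe it's")

      def is_obvious_noise(text):
          """Drop obvious chatter before the message ever reaches the classifier."""
          stripped = text.strip().lower()
          if not stripped:
              return True
          if GREETING_PATTERN.match(stripped):
              return True                  # general chatter / greetings
          if STATUS_QUESTION_PATTERN.search(stripped):
              return True                  # "Any updates?" without an answer
          if any(phrase in stripped for phrase in VAGUE_PHRASES):
              return True                  # vague or purely speculative statements
          return False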

      Assignee: Unassigned
      Reporter: George Adams (geowa4.openshift)