Type: Feature Request
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: openshift-dedicated
Labels:
None

Target Version:
None
Activity Type:
Product / Portfolio Work
Status Summary:
None
Blocked:
False
Blocked Reason:
None
Products:
None
Hierarchy Progress Bar:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
None
PX Impact Score:
PX Impact Range:
None
PX Priority Data:
None
PX Technical Impact:
None
PX Technical Impact Notes:
None
PX Scheduling Request:
None

1. Proposed title of this feature request:

Enhanced notifications for changes and incidents on OSD-managed clusters

2. What is the nature and description of the request?

Requested evolution
We understand and accept the managed service model and the need for SRE to act quickly to preserve platform stability.
We are not asking Red Hat to delay or block these actions.
We request enhancements in two areas: change notifications and incident / event notifications.

3.1 Enhanced change notifications for SRE-managed operations:
1.	Best-effort advance notification when possible
For changes that are not emergency actions (for example, planned infrastructure scaling, scheduled maintenance windows, non-critical optimizations), we request best-effort advance notification via one or more of:
o	Service Logs
o	Cluster notifications
o	Email
2.	Immediate post-change notification with useful details
When an SRE-managed action is performed on our cluster (for example, node replacement, infrastructure resize, critical component restart), we request an automatic notification that includes at least:
o	What changed (node type, instance group, component)
o	When it happened (timestamp)
o	A high-level reason, such as:
	-	Resource pressure
	-	Unhealthy node
	-	Policy / lifecycle rule
	-	Bug workaround
o	Whether customer action is required:
	-	Yes/No
	-	If yes, a short description of the required action or a link to documentation
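To make the requested fields concrete, a post-change notification could carry a record along the following lines. This is only an illustrative sketch from the customer side: the class `ChangeNotification`, the enum `Reason`, all field names, and the example values are hypothetical and do not correspond to any existing OSD Service Log schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
import json


class Reason(Enum):
    """High-level reasons listed in the request (names are hypothetical)."""
    RESOURCE_PRESSURE = "resource_pressure"
    UNHEALTHY_NODE = "unhealthy_node"
    POLICY_LIFECYCLE_RULE = "policy_lifecycle_rule"
    BUG_WORKAROUND = "bug_workaround"


@dataclass
class ChangeNotification:
    """Hypothetical shape of one post-change notification."""
    what_changed: str                     # node type, instance group, or component
    occurred_at: datetime                 # when the change happened
    reason: Reason                        # high-level reason
    customer_action_required: bool        # Yes/No
    action_details: Optional[str] = None  # short description or documentation link

    def __post_init__(self) -> None:
        # If action is required, the notification must say what to do.
        if self.customer_action_required and not self.action_details:
            raise ValueError("action_details is required when customer action is needed")

    def to_json(self) -> str:
        payload = asdict(self)
        payload["occurred_at"] = self.occurred_at.isoformat()
        payload["reason"] = self.reason.value
        return json.dumps(payload, sort_keys=True)


# Example values are invented for illustration only.
example = ChangeNotification(
    what_changed="infrastructure node replaced",
    occurred_at=datetime(2025, 12, 1, 8, 30, tzinfo=timezone.utc),
    reason=Reason.UNHEALTHY_NODE,
    customer_action_required=False,
)
print(example.to_json())
```

A structured record of this kind would let us route notifications automatically, for example by suppressing our own alerts when `customer_action_required` is false.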

3.2 Enhanced incident / event notifications for SRE-managed components
1.	Notification for "non-SLA" incidents that still impact us operationally
Even if an event is not considered an "incident" by the OSD service definition (no SLA breach, no visible downtime at platform level), it can still:
o	Cause visible disruption for our customers (for example, temporary unavailability of some applications due to external integrations), or
o	Trigger alerts in our monitoring, requiring investigation by our teams.
We request the option to have such events treated as "operational incidents" for notification purposes, so that:
o	A short operational summary is provided
o	We can forward a brief, understandable explanation to our customers
2.	Improved visibility into SRE investigations and actions
When an SRE team is actively responding to alerts on our cluster (for example, high resource usage on infrastructure nodes), we request:
o	A Service Log entry at the start of the investigation
o	A final Service Log entry or case update describing the outcome and remediation
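As an illustration of how we would consume the paired start and final entries, the sketch below lists investigations that have been opened but not yet closed. It assumes each entry carries a correlation identifier and a phase marker; the `LogEntry` type, its fields, and the phase values "started" and "resolved" are hypothetical and are not part of any existing Service Log format.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LogEntry:
    investigation_id: str  # hypothetical correlation key shared by both entries
    phase: str             # "started" or "resolved" (hypothetical values)
    summary: str           # short human-readable description


def open_investigations(entries: Iterable[LogEntry]) -> List[str]:
    """Return IDs that have a "started" entry but no "resolved" entry yet."""
    started: List[str] = []
    resolved = set()
    for entry in entries:
        if entry.phase == "started" and entry.investigation_id not in started:
            started.append(entry.investigation_id)
        elif entry.phase == "resolved":
            resolved.add(entry.investigation_id)
    # Preserve the order in which investigations were opened.
    return [i for i in started if i not in resolved]
```

With entries paired in this way, our on-call team could tell at a glance whether an alert on an infrastructure node is already being handled by SRE.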

3. Why does the customer need this? (List the business requirements here):

1. Context
We run production workloads on OpenShift Dedicated (OSD) on GCP. Our clusters are managed by Red Hat SRE under the managed service model, including control plane and infrastructure nodes.
Based on recent "Return of Experience" discussions and the feedback summarized in:
•	1.1 - No change notifications in advance
•	1.2 - No incident notifications
we would like to formally request an evolution of the service regarding notifications for changes and incidents on SRE-managed components.

2. Current behaviour (as we understand it)
2.1 Change notifications (1.1)
•	Control plane and infrastructure nodes can be replaced or resized at any time by OpenShift automation or SRE actions in order to maintain platform stability and meet the OSD SLA.
•	In practice, we observe events such as:
o	Infrastructure node resizes / replacements
o	Control plane node replacements
•	These actions currently happen without advance notice to the customer. Often, the only trace is a Service Log / Cluster History entry after the fact.
•	For us, these changes can have indirect impact, for example on:
o	GCP instance groups
o	Monitoring and alerting
o	Other external integrations

2.2 Incident notifications (1.2)
•	As per the service definition, Red Hat sends incident notifications and may provide an RCA only for incidents that impact the OSD SLA.
•	Events such as automatic control plane node replacement or infrastructure node replacement, even when triggered by an alert or degraded state, are often classified as "normal lifecycle operations", not as incidents.
•	As a result, we often do not receive:
o	A proactive notification that an SRE action has started
o	A short explanation of what happened and why, which we can reuse with our own customers

4. List any affected packages or components.

OSD

Assignee: Shreyans Mulkutkar
Reporter: Satyam Burhade
Need Info From: None
Votes: 0
Watchers: 2
Created: 2025/12/19 12:00 PM
Updated: 2026/01/16 12:22 PM
Target start: None
Target end: None
