RFE-8629 (OpenShift Request For Enhancement)

Enhanced notifications for changes and incidents on OSD-managed clusters

    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Major
    • openshift-dedicated
    • Product / Portfolio Work

      1. Proposed title of this feature request:

      Enhanced notifications for changes and incidents on OSD-managed clusters 

      2. What is the nature and description of the request?

      Requested evolution
      We understand and accept the managed service model and the need for SRE to act quickly to preserve platform stability.
      We are not asking Red Hat to delay or block these actions.
      We request enhancements in two areas: change notifications and incident / event notifications.
      
      3.1 Enhanced change notifications for SRE-managed operations
      1.	Best-effort advance notification when possible
      For changes that are not emergency actions (for example, planned infrastructure scaling, scheduled maintenance windows, non-critical optimizations), we request best-effort advance notification via one or more of:
      o	Service Logs
      o	Cluster notifications
      o	Email
      2.	Immediate post-change notification with useful details
      When an SRE-managed action is performed on our cluster (for example, node replacement, infrastructure resize, critical component restart), we request an automatic notification that includes at least:
      o	What changed (node type, instance group, component)
      o	When it happened (timestamp)
      o	A high-level reason, such as:
      ▪	Resource pressure
      ▪	Unhealthy node
      ▪	Policy / lifecycle rule
      ▪	Bug workaround
      o	Whether customer action is required:
      ▪	Yes/No
      ▪	If yes, a short description of the required action or a link to documentation
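
      To make the requested post-change notification concrete, the fields listed above could be modelled roughly as follows. This is only an illustrative sketch: the class name, field names, JSON layout and sample values are assumptions made for this example and do not describe any existing Red Hat Service Log or notification format.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Reason(Enum):
    # High-level reasons taken from the list in this request.
    RESOURCE_PRESSURE = "Resource pressure"
    UNHEALTHY_NODE = "Unhealthy node"
    POLICY_LIFECYCLE_RULE = "Policy / lifecycle rule"
    BUG_WORKAROUND = "Bug workaround"


@dataclass
class ChangeNotification:
    """Hypothetical payload for a post-change notification."""
    what_changed: str                      # node type, instance group, component
    occurred_at: datetime                  # when the change happened
    reason: Reason                         # high-level reason
    customer_action_required: bool         # Yes/No
    action_details: Optional[str] = None   # short description or documentation link

    def __post_init__(self) -> None:
        # If customer action is required, the request asks for a description or a link.
        if self.customer_action_required and not self.action_details:
            raise ValueError("action_details must be set when customer action is required")

    def to_json(self) -> str:
        return json.dumps({
            "what_changed": self.what_changed,
            "occurred_at": self.occurred_at.isoformat(),
            "reason": self.reason.value,
            "customer_action_required": self.customer_action_required,
            "action_details": self.action_details,
        })


# Placeholder values for illustration only; not taken from a real event.
example = ChangeNotification(
    what_changed="infrastructure node replaced",
    occurred_at=datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),
    reason=Reason.UNHEALTHY_NODE,
    customer_action_required=False,
)
print(example.to_json())
```

      A structured payload of this kind would let us route notifications automatically, for example by paging our teams only when customer action is required.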
      
      3.2 Enhanced incident / event notifications for SRE-managed components
      1.	Notification for "non-SLA" incidents that still impact us operationally
      Even if an event is not considered an "incident" by the OSD service definition (no SLA breach, no visible downtime at platform level), it can still:
      o	Cause visible disruption for our customers (for example, temporary unavailability of some applications due to external integrations), or
      o	Trigger alerts in our monitoring, requiring investigation by our teams.
      We request the option to have such events treated as "operational incidents" for notification purposes, so that:
      o	A short operational summary is provided
      o	We can forward a brief, understandable explanation to our customers
      2.	Improved visibility into SRE investigations and actions
      When an SRE team is actively responding to alerts on our cluster (for example, high resource usage on infrastructure nodes), we request:
      o	A Service Log entry at the start of the investigation
      o	A final Service Log entry or case update describing the outcome and remediation

      3. Why does the customer need this? (List the business requirements here):

      1. Context
      We run production workloads on OpenShift Dedicated (OSD) on GCP. Our clusters are managed by Red Hat SRE under the managed service model, including control plane and infrastructure nodes.
      Based on recent "Return of Experience" discussions and the feedback summarized in:
      •	1.1 - No change notifications in advance
      •	1.2 - No incident notifications
      we would like to formally request an evolution of the service regarding notifications for changes and incidents on SRE-managed components.
      
      2. Current behaviour (as we understand it)
      2.1 Change notifications (1.1)
      •	Control plane and infrastructure nodes can be replaced or resized at any time by OpenShift automation or SRE actions in order to maintain platform stability and meet the OSD SLA.
      •	In practice, we observe events such as:
      o	Infrastructure node resizes / replacements
      o	Control plane node replacements
      •	These actions currently happen without advance notice to the customer. Often, the only trace is a Service Log / Cluster History entry after the fact.
      •	For us, these changes can have indirect impact, for example on:
      o	GCP instance groups
      o	Monitoring and alerting
      o	Other external integrations
      
      2.2 Incident notifications (1.2)
      •	As per the service definition, Red Hat sends incident notifications and may provide an RCA only for incidents that impact the OSD SLA.
      •	Events such as automatic control plane node replacement or infrastructure node replacement, even when triggered by an alert or degraded state, are often classified as "normal lifecycle operations", not as incidents.
      •	As a result, we often do not receive:
      o	A proactive notification that an SRE action has started
      o	A short explanation of what happened and why, which we can reuse with our own customers

      4. List any affected packages or components.

      OSD

       

              rh-ee-smulkutk Shreyans Mulkutkar
              rhn-support-sburhade Satyam Burhade