[RFE-1542] CVO prevent cluster upgrades in critical alerts - Red Hat Issue Tracker

Type: Feature Request
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: Over the Air
Labels:

Blocked: False
Ready: False
Release Note Text: Undefined

Background: ARO (Azure Red Hat OpenShift) ships almost vanilla OpenShift as a managed service. Customers on ARO have full cluster-admin access and are, in theory, co-admins of the cluster. Customers are responsible for triggering Y-stream cluster upgrades and even Z-stream releases. ARO strives to support and maintain OpenShift as close to vanilla as possible, without modifying it or "taking over the features" from the customers.

That said, customers sometimes misconfigure the cluster with changes that, if still present at upgrade time, break the cluster during the upgrade.

Examples:

The customer adds custom DNS to the vnet. This change is applied only when a VM is rotated, so once the upgrade rotates the worker nodes, they go into a not-ready state. In most cases this causes capacity issues and a rolling failure.
The customer revokes or removes the Cluster Service Principal permissions. This does not cause any issue until the credentials are used, so the problem surfaces during the upgrade and halts it.
The customer configures their own firewall rules for outbound traffic. We have seen cases where the cluster is able to trigger the upgrade, but the upgrade then stalls because some components cannot reach the Red Hat registries.

ARO is looking for an extensible way to let customers know that issues are present in their clusters and to prevent inexperienced administrators from triggering an upgrade while such issues exist. The mechanism also needs to let us deliver new checks quickly when new kinds of issues are discovered.

Proposal:

Extend the ClusterVersionOperator (CVO) with the capability to act on critical alerts in the Prometheus stack and prevent cluster upgrades from starting.

This way, the ARO operator on the cluster would emit the right metrics and trigger a custom alert, preventing customers from breaking the cluster. We agree this is a somewhat intrusive solution, but we have not been able to come up with anything extensible and configurable that would not be a hack on top of the existing platform.

It should still be possible to force the upgrade with the existing force options.

Running upgrades should not be impacted by firing alerts.

The feature could be gated by a CVO configuration gate, `prometheusAlertGates=false` by default, so the existing flow would not change.

Alerts would be visible in the customer UI letting them know about the issue.
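
The rules in this proposal can be sketched as a small decision function. This is only an illustration under stated assumptions, not CVO code: the names `firing_critical_alerts`, `decide_upgrade_start`, and `UpgradeDecision` are hypothetical, the `gate_enabled` parameter stands in for the proposed `prometheusAlertGates` setting, and the alert names in the usage note are made up. The input is assumed to have the JSON shape returned by the Prometheus HTTP API endpoint `/api/v1/alerts`.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class UpgradeDecision:
    """Hypothetical result type: whether an upgrade may start, and why."""
    allowed: bool
    reason: str


def firing_critical_alerts(alerts_payload: str) -> List[str]:
    """Return the sorted names of alerts that are firing with severity=critical.

    Assumes the JSON body shape of the Prometheus /api/v1/alerts endpoint.
    Pending alerts and non-critical severities are ignored.
    """
    body = json.loads(alerts_payload)
    names = []
    for alert in body.get("data", {}).get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("state") == "firing" and labels.get("severity") == "critical":
            names.append(labels.get("alertname", "<unnamed>"))
    return sorted(names)


def decide_upgrade_start(
    critical_alerts: List[str],
    gate_enabled: bool = False,       # mirrors prometheusAlertGates=false by default
    force: bool = False,              # force options must still work
    upgrade_in_progress: bool = False,
) -> UpgradeDecision:
    """Apply the proposal's rules in order and return a decision."""
    if upgrade_in_progress:
        # Running upgrades should not be impacted by firing alerts.
        return UpgradeDecision(True, "upgrade already running; alerts are ignored")
    if not gate_enabled:
        # Gate off: existing flow is unchanged.
        return UpgradeDecision(True, "alert gate disabled")
    if not critical_alerts:
        return UpgradeDecision(True, "no critical alerts firing")
    if force:
        return UpgradeDecision(True, "forced despite: " + ", ".join(critical_alerts))
    return UpgradeDecision(False, "blocked by critical alerts: " + ", ".join(critical_alerts))
```

For example, with a payload in which a hypothetical `AROServicePrincipalInvalid` alert is firing at critical severity, `decide_upgrade_start(firing_critical_alerts(payload), gate_enabled=True)` reports the upgrade as blocked, while the same call with `force=True`, or with the gate left at its default, allows it.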

Assignee: Subin M

Reporter: Mangirdas Judeikis (Inactive)

Votes: 0

Watchers: 5

Created: 2021/01/15 9:18 AM

Updated: 2025/03/04 9:24 PM
