Type: Story
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels:

Blocked: False
Blocked Reason: None
Ready: False
Target Version: openshift-4.11, openshift-4.12

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

As an SREP engineer, I need to know when an individual cluster is failing clustersync. In just the last few weeks we have had several instances where a failure to sync caused issues on the cluster, including alerting rules that were not updated and operator updates that were not applied (which blocked bugfixes).

We need a way to identify these clusters on a cluster-by-cluster basis. Tracking failing selectorSyncSets is not enough: they do not surface which cluster is failing, which complicates alerting and troubleshooting and will not scale. I'd argue we're well past the scale point already.

This is a pretty high priority need for SREP at the moment.

Done Criteria:

SREP has an endpoint or other mechanism to automatically query and alert when an individual cluster is failing clustersync.
If this is an alert provided by Hive, similar to the one for cluster provisioning delays, it should perhaps have a threshold of 1 day to allow for transient sync failures.
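
As a rough illustration of the done criteria above, the per-cluster check with a 1-day threshold could look something like the sketch below. It assumes the failing duration per cluster is already available as a plain mapping of cluster name to seconds (for example, scraped from whatever metric ends up being exposed). The function name, the input shape, and the sample values are all hypothetical and are not taken from Hive.

```python
from typing import Dict, List

# Threshold suggested in the done criteria: 1 day, to ride out transient failures.
ONE_DAY_SECONDS = 24 * 60 * 60


def clusters_failing_sync(
    failing_seconds: Dict[str, float],
    threshold: float = ONE_DAY_SECONDS,
) -> List[str]:
    """Return the clusters whose sync has been failing for longer than threshold.

    failing_seconds maps a cluster name to how long (in seconds) its
    clustersync has been failing. This input shape is an assumption made
    for illustration only.
    """
    return sorted(
        name for name, seconds in failing_seconds.items() if seconds > threshold
    )


# Hypothetical sample data: only cluster-b has been failing for more than a day.
sample = {"cluster-a": 120.0, "cluster-b": 90000.0, "cluster-c": 0.0}
flagged = clusters_failing_sync(sample)
```

The point is only that each failing cluster is named individually, so an alert can be routed per cluster rather than per selectorSyncSet.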

I am happy to discuss, answer questions and/or work with whoever is taking this task on.

is related to

HIVE-2285 Identify clusters that are in limited support on Hive clusters

Closed

HIVE-2286 Identify a reason in the ClusterSyncFailingSeconds metric

Closed

links to

openshift/hive#1809: Add ClusterSyncFailingSeconds metric

openshift/hive#1814: Add clustersyncfailing metric to crd

openshift/hive#1826: Fix clearing of ClusterSync Failing metric

Assignee: Suhani Mehta

Reporter: Chris Collins

QA Contact: Mingxia Huang

Votes: 0

Watchers: 9

Created: 2022/04/26 7:05 PM

Updated: 2023/08/03 9:12 AM

Resolved: 2022/08/16 9:41 AM
