Feature
Resolution: Obsolete
Major
Product / Portfolio Work
Feature Overview (aka. Goal Summary)
The SLO dashboard we used to have in Grafana stopped working in January, when the RHOBS instance that ROSA relied on was sunset.
Although we no longer have the dashboard, the need for it has not changed, nor has it been fully met by other means. Today we have some coverage of our SLOs only because significant breaches would snowball and affect the overall ROSA/HCP SLOs that SREs track. That indirect coverage lacks agility, and the situation should not be allowed to persist.
Goals (aka. expected user outcomes)
Increased HyperShift Operator reliability in ROSA through closer tracking of trends.
Requirements (aka. Acceptance Criteria):
- There is a dashboard in Dynatrace specific to the operation of HyperShift.
- The dashboard needs to cover integration, stage, production canary and the different production sectors.
- The hypershift release duty documentation is expanded to point to it and explain its maintenance and usage.
- Weekly reporting of the team SLOs to SRE switches to using it.
- If possible, it should be defined in either the hypershift GitHub repo or the hypershift team GitLab repo and synced to Dynatrace.
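The last requirement (a dashboard defined in a repository and synced to Dynatrace) could be approached roughly as sketched below. This is only an illustration under stated assumptions: it assumes the dashboard definitions are stored as JSON files that each carry an `id` field, that the sync goes through the Dynatrace Configuration API dashboards endpoint (`PUT /api/config/v1/dashboards/{id}`) with an API token, and that the directory layout and the helper names `build_sync_requests` and `sync` are hypothetical. The team may well prefer a different mechanism, for example an existing configuration-as-code tool or a newer Dynatrace API.

```python
import json
import urllib.request
from pathlib import Path


def build_sync_requests(dashboard_dir, base_url, api_token):
    """Build one PUT request per dashboard JSON file found in dashboard_dir.

    Assumptions (not from the ticket): each file holds a full dashboard
    definition including its "id", and base_url is the Dynatrace
    environment URL.
    """
    requests = []
    for path in sorted(Path(dashboard_dir).glob("*.json")):
        definition = json.loads(path.read_text(encoding="utf-8"))
        dashboard_id = definition["id"]  # assumed to be present in every file
        requests.append(
            urllib.request.Request(
                url=f"{base_url.rstrip('/')}/api/config/v1/dashboards/{dashboard_id}",
                data=json.dumps(definition).encode("utf-8"),
                method="PUT",
                headers={
                    "Authorization": f"Api-Token {api_token}",
                    "Content-Type": "application/json",
                },
            )
        )
    return requests


def sync(requests):
    """Send the prepared requests; urlopen raises HTTPError on a failed upload."""
    for request in requests:
        with urllib.request.urlopen(request) as response:
            print(request.full_url, response.status)
```

Keeping request construction separate from sending means a CI job in the chosen repo could check the definitions on every merge request and only call `sync` after merge, once per environment (integration, stage, production).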
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | managed |
Classic (standalone cluster) | no |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all supported ROSA configurations |
Connected / Restricted Network | all supported ROSA configurations |
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all supported ROSA configurations |
Operator compatibility | all supported ROSA configurations |
Backport needed (list applicable versions) | no |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Only in Dynatrace |
Other (please specify) |
Use Cases (Optional):
- Alerting on HyperShift malfunctions
- Tracking trends in the performance of HyperShift
- Collecting metrics around known incidents and minor deterioration to inform decisions.
Out of Scope
- New metrics and/or panels beyond what we used to have.
Documentation Considerations
Only needs documenting in the team ops repository.
Interoperability Considerations
It would be useful to consider whether this can be done in a way similar to what we will need for ARO/HCP.