-
Story
-
Resolution: Unresolved
-
Major
-
None
-
None
Story
As a user of RHDH, I want to make sure that the Rolling Demo & Our Dev Cluster Models are always available so that users (e.g. RH employees) are able to use this test environment and our feature set for each release.
Background
Deploying new versions of features or upgrading our cluster can be challenging, because those changes may affect the demo environment.
Ideally, we should have a monitoring and alerting mechanism in place that notifies us every time an incident occurs.
Dependencies and Blockers
QE impacted work
Documentation impacted work
Acceptance Criteria
Create a Grafana/Prometheus gitops project that:
- Checks the availability of our models
- Checks the availability of Rolling Demo
We should also have alerting in place that fires if any pod is pending or failing inside the vllm and rolling-demo-ns namespaces.
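As a starting point for discussion, the criteria above could be sketched as a small Python helper that builds the PromQL expressions an alert rule would use. This is a minimal sketch under assumptions, not the agreed design: it assumes kube-state-metrics is scraped (for the kube_pod_status_phase metric) and that a Prometheus blackbox exporter probes the endpoints (for probe_success). The job names "rolling-demo" and "models" are hypothetical placeholders. Note that pod phase alone does not catch crash-looping pods, which stay in the Running phase, so a real rule would likely need an additional signal.

```python
from dataclasses import dataclass

# Namespaces named in the acceptance criteria.
WATCHED_NAMESPACES = ("vllm", "rolling-demo-ns")
# "Pending" and "Failed" are standard Kubernetes pod phases.
UNHEALTHY_PHASES = ("Pending", "Failed")


def pod_alert_expr(namespaces=WATCHED_NAMESPACES, phases=UNHEALTHY_PHASES) -> str:
    """Build a PromQL expression that matches pods in an unhealthy phase.

    Assumes kube-state-metrics exposes kube_pod_status_phase.
    """
    ns = "|".join(namespaces)
    ph = "|".join(phases)
    selector = f'kube_pod_status_phase{{namespace=~"{ns}", phase=~"{ph}"}}'
    return f"sum by (namespace, pod, phase) ({selector}) > 0"


def availability_alert_expr(job: str) -> str:
    """Build a PromQL expression that matches failed availability probes.

    Assumes a blackbox exporter scrape job; the job name is hypothetical.
    """
    return f'probe_success{{job="{job}"}} == 0'


@dataclass(frozen=True)
class Pod:
    namespace: str
    name: str
    phase: str


def unhealthy_pods(pods):
    """Return the pods that the pod alert expression is meant to catch."""
    return [
        p for p in pods
        if p.namespace in WATCHED_NAMESPACES and p.phase in UNHEALTHY_PHASES
    ]


if __name__ == "__main__":
    print(pod_alert_expr())
    print(availability_alert_expr("rolling-demo"))  # hypothetical job name
    print(availability_alert_expr("models"))        # hypothetical job name
```

The generated strings could be pasted into the expr field of a Prometheus alerting rule kept in the gitops project; unhealthy_pods mirrors the same filter locally so the intent can be checked without a cluster.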
Upstream documentation updates (design docs, release notes, etc.)
Technical enablement / Demo