Red Hat Internal Developer Platform / RHIDP-11545

Enhance monitoring & observability for Dev Cluster & AI Rolling Demo

    • RHDH AI Sprint 3288

      Story

      As a user of RHDH, I want to make sure that the Rolling Demo and our Dev Cluster models are always available so that users (e.g. RH employees) can use this test environment and our feature set for each release.

      Background

      Deploying new versions of features or upgrading our cluster is sometimes challenging, as those changes can affect the demo environment.

      Ideally we should have a monitoring and alerting mechanism in place that notifies us every time an incident occurs.

      Dependencies and Blockers

      QE impacted work

      Documentation impacted work

      Acceptance Criteria

      Create a Grafana/Prometheus GitOps project that:

      • Checks the availability of our models
      • Checks the availability of Rolling Demo
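
The two availability checks above could be prototyped with a small HTTP probe before they are wired into Prometheus. The following is a minimal sketch, not the project's implementation: the target names and URLs are hypothetical placeholders, and it assumes each model and the Rolling Demo expose an HTTP endpoint that returns a 2xx status when healthy.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def check_targets(
    targets: Dict[str, str],
    probe: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each target name to True when its probe returns a 2xx status."""
    return {name: 200 <= probe(url) < 300 for name, url in targets.items()}


# Hypothetical targets (assumption): replace with the real model and demo URLs.
EXAMPLE_TARGETS = {
    "model": "https://model.example.invalid/health",
    "rolling-demo": "https://rolling-demo.example.invalid/",
}
```

Calling `check_targets(EXAMPLE_TARGETS)` returns a dictionary of target name to availability. The `probe` parameter is injectable so the logic can be exercised without network access.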

      We should also have alerting in place if any pod is pending or failing inside the vllm and rolling-demo-ns namespaces.
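
The pod alerting condition can be sketched as a filter over pod phases. This is an illustrative assumption about how the rule might be expressed, not the final alerting setup: it takes input shaped like the output of `kubectl get pods -A -o json` and treats the Kubernetes phases `Pending` and `Failed` as alert-worthy. Pods that crash-loop usually still report phase `Running`, so a real rule would likely also inspect container states.

```python
from typing import Any, Dict, List, Tuple

# Namespaces named in the acceptance criteria.
WATCHED_NAMESPACES = ("vllm", "rolling-demo-ns")
# Pod phases treated as alert-worthy in this sketch (assumption).
ALERT_PHASES = ("Pending", "Failed")


def pods_needing_alert(pod_list: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (namespace, pod name, phase) for watched pods in an alert phase."""
    alerts = []
    for pod in pod_list.get("items", []):
        namespace = pod["metadata"]["namespace"]
        phase = pod.get("status", {}).get("phase", "Unknown")
        if namespace in WATCHED_NAMESPACES and phase in ALERT_PHASES:
            alerts.append((namespace, pod["metadata"]["name"], phase))
    return alerts
```

The returned tuples could be forwarded to whatever notification channel the team chooses; that part is left out because the ticket does not specify one.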

      Upstream documentation updates (design docs, release notes, etc.)

      Technical enablement / Demo

              rh-ee-tpetkos Theofanis Petkos
              RHDH AI