-
Story
-
Resolution: Done
From Jeff: It would be good to get a basic metric that provides insight into whether customers are using the feature. For example, the number of deployed models at the cluster level.
We just need to determine what metric would work for us and add it to the rhods rules at https://github.com/red-hat-data-services/odh-manifests/blob/master/monitoring/base/rhods-rules.yaml
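As a rough illustration of the "number of deployed models at the cluster level" idea, here is a minimal Python sketch. The record shape (`name`, `namespace`, `ready`) and the function name `count_deployed_models` are hypothetical and used only for illustration; they are not taken from odh-manifests or from any model-serving API, and the real metric would be expressed as a rule in rhods-rules.yaml.

```python
from collections import Counter


def count_deployed_models(models):
    """Return (cluster-wide total, per-namespace counts) for models marked ready.

    `models` is a list of dicts with hypothetical keys: name, namespace, ready.
    """
    ready = [m for m in models if m.get("ready")]
    per_namespace = Counter(m["namespace"] for m in ready)
    return len(ready), dict(per_namespace)


# Hypothetical inventory of deployed models across the cluster.
example_models = [
    {"name": "fraud", "namespace": "team-a", "ready": True},
    {"name": "churn", "namespace": "team-a", "ready": False},
    {"name": "ocr", "namespace": "team-b", "ready": True},
]

total, by_namespace = count_deployed_models(example_models)
```

A single cluster-level number is probably enough to tell whether the feature is being used at all; the per-namespace breakdown is optional and only shown as one possible refinement.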
Part of the broader R11:
Inference performance metrics. Users must be able to access performance metrics for all deployed models:
- P0: Avg. response time over a period of time (e.g. last 24 hours or last week/month, to gauge trends over time) at the individual model level
- P0: Number of requests over a defined period of time (including an option for all time) at the individual model level
- P0: Ability to view metrics at both the individual model and model server levels
- P0: CPU/GPU/memory utilization
- P0: Configurable alerts based on defined thresholds:
  - Avg. response time
  - CPU/GPU/memory utilization
  - Number of requests (e.g. above or below a certain threshold)
- TBD: number of errors / failures in defined time period
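To make the per-model requirements above more concrete, the following Python sketch computes request count, average response time, and error count for one model over a time window (or all time), then checks the result against thresholds. Everything here is an assumption for illustration: `RequestSample`, `summarize`, `check_thresholds`, and the threshold parameter names are hypothetical, and in practice these values would more likely come from monitoring queries and alerting rules than from application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestSample:
    """One inference request (hypothetical shape)."""
    model: str
    timestamp: float   # seconds since epoch
    latency_ms: float
    ok: bool = True


def summarize(samples, model, now, window_s: Optional[float] = None):
    """Summarize one model's requests in (now - window_s, now].

    window_s=None means "all time" (every sample up to `now`).
    """
    selected = [
        s for s in samples
        if s.model == model
        and s.timestamp <= now
        and (window_s is None or s.timestamp > now - window_s)
    ]
    count = len(selected)
    avg = sum(s.latency_ms for s in selected) / count if count else None
    errors = sum(1 for s in selected if not s.ok)
    return {"requests": count, "avg_latency_ms": avg, "errors": errors}


def check_thresholds(summary, max_avg_latency_ms=None,
                     min_requests=None, max_requests=None):
    """Return the names of thresholds that the summary violates."""
    alerts = []
    avg = summary["avg_latency_ms"]
    if max_avg_latency_ms is not None and avg is not None and avg > max_avg_latency_ms:
        alerts.append("avg_latency_high")
    if min_requests is not None and summary["requests"] < min_requests:
        alerts.append("requests_low")
    if max_requests is not None and summary["requests"] > max_requests:
        alerts.append("requests_high")
    return alerts


samples = [
    RequestSample("fraud", 100.0, 20.0),
    RequestSample("fraud", 200.0, 40.0, ok=False),
    RequestSample("fraud", 10.0, 90.0),
    RequestSample("churn", 150.0, 5.0),
]

recent = summarize(samples, "fraud", now=250.0, window_s=200.0)
all_time = summarize(samples, "fraud", now=250.0)
alerts = check_thresholds(recent, max_avg_latency_ms=25, min_requests=5)
```

The model server level view could reuse the same summary by grouping samples on a server identifier instead of the model name; CPU/GPU/memory utilization is left out because it would come from a different data source.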