Story
Resolution: Done
OCPSTRAT-475 - Enable sharing ConfigMaps and Secrets across namespaces [Tech Preview]
Sprint 207, Sprint 209, Sprint 210, Sprint 211, Sprint 212
User Story
As an OpenShift cluster admin
I want to be able to monitor statistics for the controller and CSI volume driver provided by https://github.com/openshift/csi-driver-projected-resource
So that I can understand the CSI driver's behavior
Acceptance Criteria
- Cluster admins can monitor the number of volume mounts that succeed and fail.
- Cluster admins can see the number of shares on the cluster.
- During development, we do some scale testing to validate, at a minimum, that the metrics are recorded correctly. If we then discover places where the driver can be improved, we either make those improvements under this Jira if they are containable, or we open BZs to track them.
Docs Impact
No impact on product documentation: metrics are not covered in product documentation. The upstream repository should have markdown documentation describing which metrics exist and what they mean.
QE Impact
CI tests will need to verify, through integration and e2e testing, that we are exposing the appropriate metrics.
QE can verify by querying the metrics in the Monitoring/Prometheus console.
PX Impact
None.
Notes
Docs on Prometheus
- Metric types: https://prometheus.io/docs/concepts/metric_types/
- Naming conventions: https://prometheus.io/docs/practices/naming/
Shared resource CSI driver instances will need to expose relevant metrics to Prometheus.
Prometheus has client libraries that make it easy to set up an HTTP endpoint that exposes the Prometheus metrics.
A Service then needs to expose the metrics from the pods - see https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/bindata/v3.11.0/openshift-controller-manager/svc.yaml
The Monitoring team needs to be consulted.
Mount attempts that succeed or fail could be a counter. Probably one counter metric with two labels (can have more, ideally no more than 10 labels).
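A minimal sketch of the single-counter idea, written in stdlib-only Python purely for illustration (the driver itself would use a Prometheus client library rather than hand-rolled code). The metric name and the two label names, `result` and `kind`, are hypothetical; the ticket does not name them. The `render` method emits the Prometheus text exposition format and does not escape label values.

```python
import threading
from collections import defaultdict


class LabeledCounter:
    """A monotonically increasing counter keyed by label values (illustrative only)."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)
        self._lock = threading.Lock()

    def inc(self, **labels):
        # Require exactly the declared labels so every series has the same shape.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        key = tuple(labels[n] for n in self.label_names)
        with self._lock:
            self._values[key] += 1

    def render(self):
        """Return the counter in Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            for key, value in sorted(self._values.items()):
                pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
                lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines) + "\n"


# Hypothetical metric and label names, chosen only for this example.
mount_requests = LabeledCounter(
    "csi_share_mount_requests_total",
    "Mount requests handled by the driver.",
    ["result", "kind"],
)
mount_requests.inc(result="success", kind="secret")
mount_requests.inc(result="failure", kind="configmap")
```

Using one counter with a `result` label, rather than separate success and failure counters, keeps the failure ratio a single query over one metric name.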
The number of shares on the cluster must be a gauge metric that is computed every time the Prometheus endpoint is hit.
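The "computed on every scrape" behavior can be sketched as a gauge backed by a callback, again in stdlib-only Python for illustration. The metric name, the `shares` list, and the `make_metrics_server` helper are all hypothetical stand-ins; a real implementation would rely on a Prometheus client library and the driver's own share cache.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class CallbackGauge:
    """A gauge whose value is recomputed each time it is rendered."""

    def __init__(self, name, help_text, callback):
        self.name = name
        self.help_text = help_text
        self._callback = callback  # invoked on every scrape, never cached

    def render(self):
        value = float(self._callback())
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} gauge\n"
            f"{self.name} {value:g}\n"
        )


def make_metrics_server(gauge, host="127.0.0.1", port=0):
    """Build (but do not start) an HTTP server that serves the gauge at /metrics."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = gauge.render().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return HTTPServer((host, port), Handler)


# Hypothetical stand-in for the driver's in-memory view of shares.
shares = ["share-a", "share-b"]
shares_gauge = CallbackGauge(
    "csi_share_shares",
    "Number of shares currently known to the driver.",
    lambda: len(shares),
)
```

Calling `make_metrics_server(shares_gauge).serve_forever()` would serve the current count at `/metrics`; because the callback runs per request, the value always reflects the list at scrape time.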
Exposing to Telemetry is out of scope - we can take this on in a separate story. See BUILD-345
Open questions:
What are the metrics that a cluster admin would care about?
- Number of mounts that succeed vs fail - in the future we can consider creating an alert.
- Number of shares on the cluster - this impacts memory usage.
- Number of unique Secrets and ConfigMaps that are shared - also impacts memory usage. Not sure if this is feasible with the current implementation.
What metrics do we get "out of the box" with Kubernetes/OpenShift?
The motivations for this story are similar to https://issues.redhat.com/browse/BUILD-246 in that both help us understand how performant, scalable, and healthy the driver is.
Blocks: BUILD-345 Expose CSI driver metrics to Telemetry (Closed)