  1. OpenShift Builds
  2. BUILD-238

Monitoring for Shared Resource CSI Driver

    • Sprint 207, Sprint 209, Sprint 210, Sprint 211, Sprint 212

      User Story

      As an OpenShift cluster admin
      I want to be able to monitor various statistics around the controller and CSI volume driver provided by https://github.com/openshift/csi-driver-projected-resource
      So that I can understand the CSI driver's behavior

      Acceptance Criteria

      • Cluster admins can monitor the number of volume mounts that succeed and fail.
      • Cluster admins can see the number of shares on the cluster.
      • During development, we do some scale testing to validate, at a minimum, that the metrics are recorded correctly. Then, if we discover places where the driver can be improved, we either make those improvements under this Jira if they are containable, or we open BZs to track them.

      Docs Impact

      No impact to product documentation - metrics are not in product documentation. The upstream repository should have markdown documentation on what metrics exist and what they mean.

      QE Impact

      CI tests will need to verify that we are exposing the appropriate metric through integration and e2e testing.
      QE can verify by querying the metric in the Monitoring/Prometheus console.
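
      As a rough illustration of the CI check described above, the sketch below scans a Prometheus text-format scrape body for a metric name. It uses only the Go standard library; the helper hasMetric and the metric names are hypothetical and not taken from the driver.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasMetric reports whether a Prometheus text-format scrape body contains
// at least one sample line for the given metric name.
func hasMetric(body, name string) bool {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		if !strings.HasPrefix(line, name) {
			continue
		}
		// A sample line is the metric name followed by either a label
		// set ("{") or a space before the value. This avoids matching
		// a longer metric name that merely shares the prefix.
		rest := line[len(name):]
		if strings.HasPrefix(rest, "{") || strings.HasPrefix(rest, " ") {
			return true
		}
	}
	return false
}

func main() {
	// Sample scrape body; the metric names are hypothetical.
	scrape := `# TYPE example_mount_attempts_total counter
example_mount_attempts_total{result="success"} 4
# TYPE example_shares gauge
example_shares 2
`
	names := []string{"example_mount_attempts_total", "example_shares", "example_missing"}
	for _, name := range names {
		fmt.Printf("%s exposed: %v\n", name, hasMetric(scrape, name))
	}
}
```

      An e2e test would presumably fetch the body from the driver's metrics endpoint rather than from a string literal; the matching logic would stay the same.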

      PX Impact

      None.

      Notes

      Docs on Prometheus

      Shared resource CSI driver instances will need to expose relevant metrics to Prometheus.
      Prometheus has libraries that make it easy to set up an HTTP process that exposes Prometheus metrics.
      A Service then needs to expose the metrics from the pods - see https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/bindata/v3.11.0/openshift-controller-manager/svc.yaml
      Monitoring team needs to be consulted.

      Mount attempts that succeed or fail could be a counter. Probably one counter metric with two labels (can have more, ideally no more than 10 labels).
      The number of shares on the cluster must be a gauge metric that is computed every time the Prometheus endpoint is hit.

      Exposing to Telemetry is out of scope - we can take this on in a separate story. See BUILD-345

      Open questions:

      What are the metrics that a cluster admin would care about?

      • Number of mounts that succeed vs fail - in the future we can consider creating an alert.
      • Number of shares on the cluster - this impacts memory usage.
      • Number of unique Secrets and ConfigMaps that are shared - also impacts memory usage. Not sure if this is feasible with the current implementation.

      What metrics do we get "out of the box" with Kubernetes/OpenShift?

       

      The motivations for this story are similar to https://issues.redhat.com/browse/BUILD-246 in that both help understand how performant, scalable, and healthy the driver is.

              irum@redhat.com Alice Rum (Inactive)
              gmontero@redhat.com Gabe Montero