OpenShift GitOps / GITOPS-2820

Expose metrics to gauge operator performance


Details

    • Type: Story
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 1.10.0
    • Affects Version/s: None
    • Component/s: Operator
    • Labels: None
    • Sprint: GITOPS Sprint 237, GITOPS Sprint 238, GITOPS Sprint 239, GITOPS Sprint 240, GITOPS Sprint 241, GITOPS Sprint 243

    Description

      Story (Required)

      As a consumer of the GitOps operator in the service, I want insight into the operator's performance through a set of well-defined, exposed metrics, so that I know how much load it can handle efficiently.

      Background (Required)

      GitOps service uses the GitOps operator to deploy managed Argo CD instances. To provide a robust and efficient service, we need to know the operator's current performance limits so that we know where to make improvements in the future.

       

      Because the operator is bootstrapped with operator-sdk, it already runs a metrics server and serves some general, Prometheus-friendly, controller-related metrics out of the box thanks to controller-runtime. The operator just needs to expose additional custom metrics that are specific to Argo CD, as sketched below.
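      Below is a minimal sketch, not the actual argocd-operator code, of how the two custom metrics proposed in the Approach section might be defined and registered with controller-runtime's global registry; the package name, variable names and Help strings are illustrative assumptions, and the histogram uses the client's default buckets.

{code:go}
package argocd

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// Gauge tracking the number of Argo CD instances, partitioned by
	// reconciliation state.
	ArgoCDInstancesReconciled = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "argocd_instances_reconciled",
			Help: "Number of Argo CD instances reconciled, partitioned by state.",
		},
		[]string{"state"},
	)

	// Histogram tracking how long each reconciliation takes, per namespace.
	ArgoCDReconciliationDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "argocd_reconciliation_duration",
			Help: "Duration of Argo CD reconciliations in seconds.",
		},
		[]string{"namespace"},
	)
)

func init() {
	// controller-runtime's global registry backs the metrics server that
	// operator-sdk projects already run, so anything registered here is
	// served on /metrics automatically.
	metrics.Registry.MustRegister(ArgoCDInstancesReconciled, ArgoCDReconciliationDuration)
}
{code}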

      Out of scope

      Creation of metrics dashboards

      Approach (Required)

      This work needs to go into argocd-operator

      1. Generate a ServiceMonitor manifest that is installed out of the box along with the operator, using operator-sdk.
      2. Create the following metrics and register them:
        - argocd_instances_reconciled (type: gauge, selector: state)
        - argocd_reconciliation_duration (type: histogram, selector: namespace)
        (Other metrics such as total reconciliations, CPU/memory usage, and the number of goroutines are already exposed by default.)
      3. Go through the reconciler code and find appropriate places to update the metrics established above; see https://github.com/argoproj-labs/argocd-operator/pull/830 for reference, and the instrumentation sketch after this list.
      4. Once the metrics are updated, they are automatically exposed on the already-running metrics server.
      5. Add unit/e2e tests to verify that the required metrics are exposed properly.
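      The following is a rough sketch of steps 3 and 4 only, not the actual change from the PR referenced above: the reconciler struct, the "reconciled" state label value, and the reuse of the metric variables from the earlier registration sketch are all assumptions made for illustration.

{code:go}
package argocd

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ReconcileArgoCD is a placeholder for the operator's Argo CD reconciler.
type ReconcileArgoCD struct {
	client.Client
}

func (r *ReconcileArgoCD) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	start := time.Now()
	defer func() {
		// Record how long this reconciliation took, labelled by namespace.
		ArgoCDReconciliationDuration.WithLabelValues(req.Namespace).Observe(time.Since(start).Seconds())
	}()

	// ... fetch the ArgoCD CR and reconcile its child resources ...

	// Update the gauge once the outcome of the reconciliation is known;
	// a failure path would use a different "state" label value.
	ArgoCDInstancesReconciled.WithLabelValues("reconciled").Inc()

	return ctrl.Result{}, nil
}
{code}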

      See for more guidance:
      1. Guide to exposing controller-runtime metrics with operator-sdk: https://docs.okd.io/4.9/operators/operator_sdk/osdk-monitoring-prometheus.html
      2. Prometheus metric types and operations: https://prometheus.io/docs/concepts/metric_types/
      3. Default controller-runtime metrics exposed: https://book.kubebuilder.io/reference/metrics-reference.html

      Dependencies

      none 

      Acceptance Criteria (Mandatory)

      • Required metrics are exposed and can be accessed at the /metrics endpoint
      • Unit/e2e tests are added to verify the behavior (a rough test sketch follows below)
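      As an illustration of the second criterion, a unit test along these lines could verify exposure with the client_golang testutil helpers; it reuses the hypothetical metric variable and Help text from the sketches above.

{code:go}
package argocd

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestArgoCDInstancesReconciledIsExposed(t *testing.T) {
	// Simulate a successful reconciliation being recorded.
	ArgoCDInstancesReconciled.WithLabelValues("reconciled").Inc()

	// CollectAndCompare checks the collector's output in Prometheus text format.
	expected := `
# HELP argocd_instances_reconciled Number of Argo CD instances reconciled, partitioned by state.
# TYPE argocd_instances_reconciled gauge
argocd_instances_reconciled{state="reconciled"} 1
`
	if err := testutil.CollectAndCompare(ArgoCDInstancesReconciled, strings.NewReader(expected)); err != nil {
		t.Errorf("unexpected metrics output: %v", err)
	}
}
{code}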

      INVEST Checklist

      Dependencies identified

      Blockers noted and expected delivery timelines set

      Design is implementable

      Acceptance criteria agreed upon

      Story estimated


      Done Checklist

      • Code is completed, reviewed, documented and checked in
      • Unit and integration test automation have been delivered and running cleanly in continuous integration/staging/canary environment
      • Continuous Delivery pipeline(s) is able to proceed with new code included
      • Customer facing documentation, API docs etc. are produced/updated, reviewed and published
      • Acceptance criteria are met
