  1. OpenShift GitOps
  2. GITOPS-3244

[Backport] Expose metrics to gauge operator performance


    • Story
    • Resolution: Unresolved
    • Major
    • None
    • None
    • Operator
    • None
    • 8
    • False
    • None
    • False
    • SECFLOWOTL-109 - GitOps Operator Code Redesign

      Story (Required)

      This story tracks the effort to port the work described below to the operator-refactoring branch.

       

      As a consumer of the GitOps operator in the service, I want insight into the performance of the GitOps operator through a set of well-defined, exposed metrics, so that I know how much load it can handle efficiently.

      Background (Required)

      The GitOps service uses the GitOps operator to deploy managed Argo CD instances. To provide a robust and efficient service, we need to know the operator's current performance limits so that we can identify where to make improvements in the future.

       

      Because the operator is bootstrapped with operator-sdk, it already runs a metrics server and serves some general Prometheus-friendly controller metrics out of the box, thanks to controller-runtime. The operator only needs to expose additional custom metrics that are specific to Argo CD.

      Out of scope

      Creation of metrics dashboards

      Approach (Required)

      This work needs to go into argocd-operator

      1. Generate a ServiceMonitor manifest with operator-sdk so that it is installed out of the box along with the operator
      2. Create the following metrics and register them
        - argocd_instances_reconciled (type gauge, selector: state)
        - argocd_reconciliation_duration (type histogram, selector: namespace)
        (other metrics such as total_reconciliations, CPU/memory usage and number of goroutines are already exposed by default)
      3. Go through the reconciler code and find appropriate places to update the metrics established above
        see https://github.com/argoproj-labs/argocd-operator/pull/830 for reference
      4. Once updated, the metrics are automatically exposed on the already running metrics server
      5. Add unit/e2e tests to verify that the required metrics are exposed

      See for more guidance:
      1. guide to expose controller-runtime metrics with operator-sdk https://docs.okd.io/4.9/operators/operator_sdk/osdk-monitoring-prometheus.html
      2. prometheus metrics types and operations https://prometheus.io/docs/concepts/metric_types/
      3. default controller-runtime metrics exposed https://book.kubebuilder.io/reference/metrics-reference.html

      Dependencies

      none 

      Acceptance Criteria (Mandatory)

      • Required metrics are exposed and can be accessed at the /metrics endpoint
      • Unit/e2e tests are added to verify this behavior

      INVEST Checklist

      Dependencies identified

      Blockers noted and expected delivery timelines set

      Design is implementable

      Acceptance criteria agreed upon

      Story estimated

      Legend

      Unknown

      Verified

      Unsatisfied

      Done Checklist

      • Code is completed, reviewed, documented and checked in
      • Unit and integration test automation has been delivered and is running cleanly in the continuous integration/staging/canary environment
      • Continuous Delivery pipeline(s) are able to proceed with new code included
      • Customer facing documentation, API docs etc. are produced/updated, reviewed and published
      • Acceptance criteria are met

            jrao@redhat.com Jaideep Rao