  1. Red Hat Advanced Cluster Management
  2. ACM-17664

Surface endless reconcile loops from the endpoint-operator to the user


    • Quality / Stability / Reliability
    • 3
    • False
    • False
    • Observability Sprint 39, Observability Sprint 40
    • Important
    • None

      Context

      Many customers manage the cluster-monitoring-config ConfigMap through GitOps, applying updates that conflict with what the endpoint-operator adds. Each side then reverts the other's change, so both end up in competing reconcile loops. Prometheus is restarted on every configuration change, which disrupts monitoring and alerting.

      What

      Define a way to surface these reconcile loops to the user.

      Possible solutions:

      • Add a platform ServiceMonitor for the endpoint-monitor, with an alert rule that fires in this case. This is possibly the best solution, but it might not work if Prometheus is constantly restarting and cannot scrape metrics.
      • Detect these loops from inside the operator and degrade the addon state with a relevant message.

      Acceptance criteria:

      • Implement a solution that detects these loops from inside the operator and degrades the addon state with a relevant message
      • Write troubleshooting steps for what to do when in this situation (ACM docs or KCS)

              rh-ee-tmange Thibault Mange