  1. Machine Config Operator
  2. MCO-1

Observability Infrastructure and Enhanced metrics in MCO

Details

    • Observability in MCO
    • To Do
    • OCPSTRAT-554 - Improving error handling, propagation, collection, and disambiguation for users
    • 92%

    Description

      It became clear over time that we need to enhance most of the existing MCO metrics and add more for the MCC. The MCC is tasked with watching what is going on with pools, so it makes sense to add more metrics and alerting there in particular. We have run into, and continue to run into, various hiccups with metrics. This epic aims to address those and to start adding more useful metrics and alerting to the MCO. Another aim (which we can split out) is to provide more data to help us proactively debug clusters when things go wrong.

      After spiking, the metric enhancement work is split as follows:

      • Expose more pool health metrics, which includes:
        • (1) Expose metrics in the MCD to enable a node watcher
        • (2) Expose metrics in the MCO to enable an MCP watcher
        • (3) Expose metrics in the MCC, especially for the MCC sub-controllers, to enable a comprehensive watcher across nodes, pools, and configs
      • Migrate the metrics backend from oauth-proxy to kube-rbac-proxy
      • Metric infrastructure re-org to be ready for customization and CRD consumption
        • This part of the work was originally prioritized and is under construction, with a design focused on metric centralization. With the introduction of the state controller in MCO-452, the MCO will use the state controller as a centralized hub for registering, listening to, and reporting metrics. All other MCO sub-components will report to the state controller whenever there is an update. This unified infrastructure gives the user a single entry point for configuring all metrics at once. [USE CASE: the user passes in a CRD listing all the metrics they want to turn on; the state controller then interprets and syncs these customer-defined requirements and enables the corresponding metrics.]
        • However, implementation of this work is severely delayed by (1) the redesign of the state controller (see updates for MCO-690) and (2) the redesign of the message bus between the MCO sub-components and the state controller (see updates for MCO-751).
        • This work is no longer in scope for 4.15 and will be tracked in MCO-846 for 4.16.
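
      The centralized design described above can be sketched roughly as follows. This is only an illustration of the idea, not the MCO implementation: the `Registry`, `MetricsConfig`, and `MetricUpdate` types and the metric names `mcp_degraded_machines` and `node_drain_seconds` are all hypothetical, and the real design may differ after the state controller and message bus redesigns. A user-supplied config lists the metrics to enable, sub-components report updates to one hub, and the hub keeps only the enabled metrics.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// MetricUpdate is a hypothetical message that a sub-component
// (for example the MCD or MCC) sends to the central hub.
type MetricUpdate struct {
	Source string // reporting component, e.g. "mcc"
	Name   string // metric name (illustrative only)
	Value  float64
}

// MetricsConfig stands in for the user-supplied resource that
// lists which metrics should be turned on (hypothetical shape).
type MetricsConfig struct {
	Enabled []string
}

// Registry is a stand-in for the centralized metric hub.
type Registry struct {
	mu      sync.Mutex
	enabled map[string]bool
	values  map[string]MetricUpdate // latest update per metric name
}

func NewRegistry() *Registry {
	return &Registry{
		enabled: map[string]bool{},
		values:  map[string]MetricUpdate{},
	}
}

// Sync applies a config: it enables the listed metrics and drops
// stored values for metrics that are no longer enabled.
func (r *Registry) Sync(cfg MetricsConfig) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.enabled = make(map[string]bool, len(cfg.Enabled))
	for _, name := range cfg.Enabled {
		r.enabled[name] = true
	}
	for name := range r.values {
		if !r.enabled[name] {
			delete(r.values, name)
		}
	}
}

// Report records an update if its metric is enabled and returns
// whether the update was accepted.
func (r *Registry) Report(u MetricUpdate) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.enabled[u.Name] {
		return false
	}
	r.values[u.Name] = u
	return true
}

// Snapshot returns the current values as sorted text lines.
func (r *Registry) Snapshot() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	lines := make([]string, 0, len(r.values))
	for _, u := range r.values {
		lines = append(lines, fmt.Sprintf("%s{source=%q} %g", u.Name, u.Source, u.Value))
	}
	sort.Strings(lines)
	return lines
}

func main() {
	reg := NewRegistry()
	reg.Sync(MetricsConfig{Enabled: []string{"mcp_degraded_machines"}})
	reg.Report(MetricUpdate{Source: "mcc", Name: "mcp_degraded_machines", Value: 2})
	// Ignored: this metric is not listed in the config.
	reg.Report(MetricUpdate{Source: "mcd", Name: "node_drain_seconds", Value: 41.5})
	for _, line := range reg.Snapshot() {
		fmt.Println(line)
	}
}
```

      In this sketch, turning a metric on or off is a single `Sync` call on the hub, which is the "one entry point" property the design is after. Sub-components only need to know how to call `Report`.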

            People

              rh-ee-iqian Ines Qian
              amurdaca@redhat.com Antonio Murdaca
              Sergio Regidor de la Rosa