OpenShift Request For Enhancement / RFE-4549

Add a metric to track and count the clusters where UnsafeFailForward has been used with OLM


Details

    • Feature Request
    • Resolution: Unresolved
    • OLM

    Description

      1. Proposed title of this feature request
      OLM now has a feature called `UnsafeFailForward` that lets the cluster admin opt in to failing forward to the next available version when the upgrade of an OLM-managed application gets stuck for any reason:
      https://olm.operatorframework.io/docs/advanced-tasks/unsafe-fail-forward-upgrades/

      This is basically a safety net for when something goes wrong.
      The feature is fairly hidden (it is documented only upstream), the idea being that it is suggested and guided only by our support team.

      On the other hand, operator authors and other engineers on the support team would find it very useful to know:

      • how many clusters in the field had to enable this feature to get past a specific buggy release
      • whether a cluster used this feature in the past, with possible future implications (e.g. leftovers...)

      So the idea is to have a metric that counts when and how `UnsafeFailForward` was used.
      The metric will let us track it with Telemetry and the Insights tool.
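
As a rough illustration of the shape such a metric could take, here is a minimal, standard-library-only sketch of a labeled counter rendered in the Prometheus text exposition format. The metric name `olm_unsafe_fail_forward_total`, its labels, and the `FailForwardCounter` class are all hypothetical: they are not part of OLM and are not taken from the linked design document.

```python
from collections import Counter

# Hypothetical metric name; OLM does not necessarily expose anything like it.
METRIC_NAME = "olm_unsafe_fail_forward_total"


class FailForwardCounter:
    """Counts fail-forward events per (namespace, package, failed_version)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, namespace, package, failed_version):
        # Called once each time an upgrade proceeds past a failed version.
        self._counts[(namespace, package, failed_version)] += 1

    def total_for_release(self, package, failed_version):
        # Aggregate across namespaces: how often a given release was skipped.
        return sum(
            n
            for (_, pkg, ver), n in self._counts.items()
            if pkg == package and ver == failed_version
        )

    def render(self):
        # Prometheus text exposition format: HELP, TYPE, then one line per label set.
        lines = [
            f"# HELP {METRIC_NAME} Upgrades that proceeded via UnsafeFailForward.",
            f"# TYPE {METRIC_NAME} counter",
        ]
        for (ns, pkg, ver), n in sorted(self._counts.items()):
            lines.append(
                f'{METRIC_NAME}{{namespace="{ns}",package="{pkg}",failed_version="{ver}"}} {n}'
            )
        return "\n".join(lines)


# Example usage with made-up namespaces and an invented operator package.
counter = FailForwardCounter()
counter.record("ns-a", "example-operator", "1.2.0")
counter.record("ns-b", "example-operator", "1.2.0")
print(counter.render())
```

Labeling by the failed version is one way to answer the "which buggy release" question, while keeping the counter monotonic preserves the "was it ever used" signal even after the upgrade strategy is switched back.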

      More technical details are tracked here: https://docs.google.com/document/d/1KVEyQqg9Kwq93rfX9dOPwE98M_Hs33uRsNJXboLnxHY/edit#heading=h.of662m97fj1v

      2. What is the nature and description of the request?
      Expose a new metric that lets us detect, through Telemetry and Insights, whether and when `UnsafeFailForward` was used.

      3. Why does the customer need this? (List the business requirements here)
      Customers are not really expected to consume the metric directly, but:

      • operator authors will be able to see (with Telemetry) how many customers had to skip a specific release with a broken upgrade
      • the support team will be able to easily detect (with Insights) that a cluster used `UnsafeFailForward` in the past, with possible implications for its current state.

      4. List any affected packages or components.
      OLM

          People

            DanielMesser Daniel Messer
            stirabos Simone Tiraboschi