- Feature Request
- Resolution: Unresolved
1. Proposed title of this feature request
OLM now has a feature called `UnsafeFailForward` that lets the cluster admin opt in to a fail-forward mechanism, which moves on to the next available version when the upgrade of an OLM-managed application gets stuck for any reason:
https://olm.operatorframework.io/docs/advanced-tasks/unsafe-fail-forward-upgrades/
This is essentially a safety net for when something goes wrong.
The feature is fairly hidden (it is documented only upstream), the idea being that it is suggested and guided only by our support team.
On the other hand, operator authors and other engineers on the support team would find it very useful to know:
- how many clusters in the field had to enable this feature to get past a specific buggy release
- whether a cluster used this mechanism in the past, with possible future implications (e.g. leftovers...)
So the idea is to have a metric that counts when and how `UnsafeFailForward` was used.
The metric will let us track its usage with Telemetry and the Insights tool.
More technical details are tracked here: https://docs.google.com/document/d/1KVEyQqg9Kwq93rfX9dOPwE98M_Hs33uRsNJXboLnxHY/edit#heading=h.of662m97fj1v
2. What is the nature and description of the request?
Expose a new metric that lets us detect, with Telemetry and Insights, if and when `UnsafeFailForward` was used.
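As a rough illustration only, such a metric could take the shape of a counter rendered in the Prometheus text exposition format. The sketch below is not OLM code: the metric name `olm_unsafe_fail_forward_total`, the labels (`namespace`, `package`, `skipped_version`) and the `FailForwardCounter` class are all hypothetical, and the real design is the one tracked in the linked document.

```python
from collections import Counter
from threading import Lock


def _escape(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class FailForwardCounter:
    """Counts fail-forward events per (namespace, package, skipped version)."""

    # Hypothetical metric name, chosen for this sketch; not an existing OLM metric.
    METRIC = "olm_unsafe_fail_forward_total"

    def __init__(self) -> None:
        self._counts: Counter = Counter()
        self._lock = Lock()

    def record(self, namespace: str, package: str, skipped_version: str) -> None:
        """Record that one upgrade failed forward past `skipped_version`."""
        with self._lock:
            self._counts[(namespace, package, skipped_version)] += 1

    def render(self) -> str:
        """Return all series in Prometheus text format, sorted by label values."""
        lines = [
            f"# HELP {self.METRIC} Upgrades that failed forward past a stuck release.",
            f"# TYPE {self.METRIC} counter",
        ]
        with self._lock:
            for (ns, pkg, ver), n in sorted(self._counts.items()):
                lines.append(
                    f'{self.METRIC}{{namespace="{_escape(ns)}",'
                    f'package="{_escape(pkg)}",'
                    f'skipped_version="{_escape(ver)}"}} {n}'
                )
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    counter = FailForwardCounter()
    # Made-up values, purely for illustration.
    counter.record("example-ns", "example-operator", "1.2.3")
    print(counter.render(), end="")
```

A counter with a per-release label would cover both questions raised above: summing across clusters gives the number of clusters that skipped a given release, and a non-zero value on a single cluster shows that it used `UnsafeFailForward` at some point.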
3. Why does the customer need this? (List the business requirements here)
Customers are not really expected to consume the metric directly, but:
- operator authors will be able to see (with Telemetry) how many customers had to skip a specific release whose upgrade was broken
- the support team will be able to easily detect (with Insights) that a cluster used `UnsafeFailForward` in the past, with possible current implications.
4. List any affected packages or components.
OLM