Description of problem:
Several Azure-hosted OpenShift clusters are consistently triggering the CsvAbnormalFailedOver2Min alert after upgrading from 4.14 to 4.16, despite all ClusterServiceVersions and dependent operator resources being healthy. The alert appears to be false-positive, as investigation shows no actual downtime or unavailability. OLM pod logs report ComponentUnhealthy warnings for deployments that are, in fact, fully available.
Version-Release number of selected component (if applicable):
- OpenShift Container Platform: 4.16.x - Azure-hosted clusters
How reproducible:
No
Steps to Reproduce:
1.
2.
3.
Actual results:
CsvAbnormalFailedOver2Min alert fires even though: - CSV and operator instances are in Succeeded phase - All pods are available, running, and stable - No real operator failure detected
Expected results:
- The alert should only fire for actual failed deployments or unavailable operators - No alert should trigger when the CSV is fully available and healthy
Additional info:
- Issue is observed only on Azure-hosted clusters after minor upgrade from 4.14 → 4.16 - Must-gather will be attached for engineering review - No impact noticed on workloads, operator availability, or performance - Appears to be caused by transient conditions incorrectly interpreted as persistent failures by OLM alert logic