-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.19.z
-
Quality / Stability / Reliability
-
False
-
-
None
-
Important
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Description of problem:
On Oct 20th, AWS had a major outage in us-east-1. During this outage, IAM services returned internal errors. During this time, hypershift_cluster_invalid_aws_creds metric was set to "1", meaning the credentials are invalid. This is not accurate. SRE uses this metric to automate notifications to customers and disable alerting for clusters.
Version-Release number of selected component (if applicable):
N/A
How reproducible:
100%
Steps to Reproduce:
1. Set ValidAWSCredentials to Unknown to simulate AWS returning 500s / healthchecks not working
2. Check metric state
3.
Actual results:
Metric is set to 1 (invalid credentials) for unknown states
Expected results:
Metric should distinguish between true/false/unknown
Additional info: