-
Task
-
Resolution: Done
-
Major
-
Platform A&M Sprint 53, Platform A&M Sprint 54, Platform A&M Sprint 55, Platform A&M Sprint 56, Platform A&M Sprint 57, Platform A&M Sprint 58, Platform A&M Sprint 59, Platform A&M Sprint 60, Platform A&M Sprint 61
OpenShift clusters occasionally report a "degraded" status because of errors encountered during CCX data uploads.
See https://issues.redhat.com/browse/SDB-3091 and https://issues.redhat.com/browse/CCXDEV-9209.
The errors seem to be related to authentication issues reported by UHC-proxy:
https://github.com/RedHatInsights/uhc-auth-proxy/blob/master/server/server.go#L141
We need to enhance logging and monitoring so we can better determine the root cause of the errors:
(1) Add CloudWatch logging so we have better log retention. Logging should include detailed error codes/messages.
(2) Add Prometheus metrics so we can get a better view of failure patterns.
(3) Add alerts based on (2)
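A minimal sketch of how items (1) to (3) could fit together, not the actual uhc-auth-proxy implementation. It is written in Python with the standard library only to keep it short; the real service would more likely use a Prometheus client library and a CloudWatch log shipper. The metric name `auth_failures_total`, the `reason` values, the `cluster_id` field, and the 5% threshold are all assumptions made for illustration.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("uhc_auth_proxy_sketch")

# Hypothetical metric name; not taken from the real service.
METRIC = "auth_failures_total"


class FailureMetrics:
    """Counts failures by (status code, reason) and logs each one as JSON."""

    def __init__(self):
        self._counts = Counter()

    def record(self, status_code, reason, cluster_id):
        # (2) Metric: label only by low-cardinality values (code, reason).
        self._counts[(status_code, reason)] += 1
        # (1) Log: one JSON object per failure, so a log shipper can index
        # the fields. The cluster ID goes here rather than into a metric
        # label, to keep the number of time series small.
        logger.error(json.dumps(
            {"event": "auth_failure", "status_code": status_code,
             "reason": reason, "cluster_id": cluster_id},
            sort_keys=True,
        ))

    def render(self):
        # Emit the counter in the Prometheus text exposition format.
        lines = [f"# TYPE {METRIC} counter"]
        for (code, reason), n in sorted(self._counts.items()):
            lines.append(f'{METRIC}{{code="{code}",reason="{reason}"}} {n}')
        return "\n".join(lines) + "\n"


def should_alert(failures, total, threshold=0.05):
    # (3) Alert when the failure ratio exceeds the (assumed) threshold.
    if total == 0:
        return False
    return failures / total > threshold
```

In a real deployment the ratio check in `should_alert` would normally live in a Prometheus alerting rule evaluated over a time window rather than in application code; it is shown here only to make the alert condition concrete.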
- clones
-
RHCLOUD-21184 Enhance Monitoring for 5xx API errors
- Closed