A customer running RH SSO 7.3.6 on an OpenShift 4.3 cluster started receiving 503 errors from the SSO service following an OpenShift upgrade. The upgrade would have restarted every pod in the deployment at some point, but the exact order in which the pods were restarted is not available (the restarts happen automatically).
The RH SSO deployment was unable to recover on its own; restoring service required manually deleting the affected RH SSO pods.
The DeploymentConfig did define readiness and liveness probes, but those probes did not detect that the pods were unhealthy (returning 5XX errors), so the pods were never automatically restarted.
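One possible gap: probes that only exec a script inside the container can report healthy even while the HTTP endpoint is serving 5XX errors. A minimal sketch of HTTP-based probes for the sso container is below; the path, port, and threshold values are assumptions and would need to be checked against the actual deployment:

```yaml
# Hypothetical probe configuration for the sso container in the
# DeploymentConfig. Path and port are assumptions for illustration.
livenessProbe:
  httpGet:
    path: /auth/realms/master   # expected to return 200 when Keycloak is serving
    port: 8080
  initialDelaySeconds: 120      # allow time for SSO startup before probing
  periodSeconds: 10
  failureThreshold: 3           # restart after ~30s of consecutive failures
readinessProbe:
  httpGet:
    path: /auth/realms/master
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```

An `httpGet` probe treats any response code outside 200-399 as a failure, so a pod stuck returning 503s would be restarted by the liveness probe and, via the readiness probe, removed from the service endpoints while unhealthy.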
What can we do to:
A) improve the resilience of RH SSO so that if the application is returning errors, it attempts to restart itself
B) collect more debugging details in the future, to further serve A)
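For B), the following standard OpenShift commands are worth running to capture state before deleting the broken pods next time; this is a sketch, and the project name `sso` and pod name placeholder are assumptions:

```shell
# Capture diagnostics before deleting the affected pods (project "sso" assumed).
oc get pods -n sso -o wide                       # which pods/nodes are affected
oc get events -n sso --sort-by=.lastTimestamp    # probe failures, restarts, scheduling
oc describe pod <sso-pod> -n sso                 # probe results, restart counts, conditions
oc logs <sso-pod> -n sso > sso-pod.log           # current container log
oc logs <sso-pod> -n sso --previous > sso-prev.log  # log from before the last restart
oc adm must-gather                               # full cluster diagnostic bundle
```

The `--previous` logs and the pod events are the most useful for working out whether the probes fired at all and what the application was doing when the 503s started.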