Details
Feature Request
Resolution: Done
Description
1. Proposed title of this feature request
Make the service-ca pod highly available
2. What is the nature and description of the request?
Some background is needed. In OpenShift 4.10, Prometheus relies on the service-ca pod to pull the certificates it needs for metrics, and those metrics are involved in processing finalizers whenever a namespace is terminated.
The service-ca pod is not HA and runs on a single node. If that node is in a bad state and the pod is not rescheduled to a healthy node, namespaces can get stuck terminating because the dependency chain is not HA.
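A quick way to confirm the single-replica situation described above is to compare the desired replica count with where the pod is scheduled. This is only a sketch: the function name `service_ca_status` is hypothetical, and it assumes the deployment is named `service-ca` in the `openshift-service-ca` namespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report how many service-ca replicas are requested
# and which node(s) the pods run on.
# Assumption: the deployment is named "service-ca".
service_ca_status() {
  local ns="openshift-service-ca"
  local replicas

  # Desired replica count from the deployment spec.
  replicas=$(kubectl get deployment service-ca -n "$ns" -o jsonpath='{.spec.replicas}')
  echo "desired replicas: ${replicas}"

  # Fewer than two replicas means one bad node can take service-ca down.
  if [ "${replicas:-0}" -lt 2 ]; then
    echo "service-ca is not HA: a single node failure can take it down"
  fi

  # -o wide adds the NODE column so the hosting node is visible.
  kubectl get pods -n "$ns" -l app=service-ca -o wide
}
```

Calling `service_ca_status` on a cluster where the report applies would be expected to show one desired replica and a single pod on one node.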
The following steps ONLY simulate the problem in order to reproduce the issue that can occur.
- Create a test namespace
kubectl create namespace test
- Grant the privileged and anyuid SCCs to all authenticated users and service accounts
oc adm policy add-scc-to-group privileged system:authenticated system:serviceaccounts
oc adm policy add-scc-to-group anyuid system:authenticated system:serviceaccounts
- Check the current service-ca pod
kubectl get pods -n openshift-service-ca -l app=service-ca
- Delete the single openshift-service-ca pod
kubectl delete pods -n openshift-service-ca -l app=service-ca
- At this point, check the service-ca pod; it should be stuck in a CreateContainerConfigError state, as described in https://access.redhat.com/solutions/5875621
- Try deleting the test namespace; the deletion hangs because of the dependency chain
kubectl delete namespace test
- Fix the cluster by removing the SCCs from the groups, as described in https://access.redhat.com/solutions/5875621
oc adm policy remove-scc-from-group anyuid system:authenticated system:serviceaccounts
oc adm policy remove-scc-from-group privileged system:authenticated system:serviceaccounts
- Delete the service-ca pod so that it comes back with the correct SCC permissions
kubectl delete pods -n openshift-service-ca -l app=service-ca
- Delete the openshift-monitoring pods so that Prometheus metrics can reconnect to the service-ca
kubectl delete pods -n openshift-monitoring --all
- Retry deleting the test namespace
kubectl delete namespace test
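While the namespace deletion in the steps above is hanging, it can help to see what the API server reports as the blocker. The sketch below assumes the hang shows up as a failing namespace condition or an unavailable aggregated API; that is a common cause of stuck namespaces but is not confirmed by this report. The function name `why_terminating` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show what is keeping a namespace in Terminating.
why_terminating() {
  local ns="$1"

  # Namespace status conditions describe discovery or deletion failures.
  kubectl get namespace "$ns" \
    -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'

  # List aggregated APIs whose AVAILABLE column (third field) is not "True".
  kubectl get apiservices --no-headers |
    awk '$3 != "True" {print "unavailable apiservice: " $1}'
}
```

For the reproduction above this would be run as `why_terminating test` while `kubectl delete namespace test` is still waiting.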
If the service-ca were HA, I don't think this would be an issue unless all the nodes were in a bad state (at which point you'd have other problems).
3. Why does the customer need this? (List the business requirements here)
Any user who attempts to delete a namespace, or any other resource that requires metrics, could see the deletion hang, which causes cluster issues.
4. List any affected packages or components.
service-ca-operator