Details

Type: Feature Request
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: API, Service Catalog
Labels:
- sb_candidate

Blocked: False
Blocked Reason: None
Ready: False
Color Status: Not Selected

Description

1. Proposed title of this feature request

Make the service-ca pod highly available

2. What is the nature and description of the request?
Some background is needed. In OpenShift 4.10, Prometheus now relies on the service-ca pod to pull certificates for the metrics used by finalizers whenever a namespace is terminated.
The service-ca pod is not HA and runs on a single node. If that node is in a bad state and the pod is not rescheduled to a healthy node, namespaces can get stuck terminating because the dependency chain is not HA.
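
To confirm the single-replica setup described above, a minimal sketch along these lines may help. It assumes the Deployment is named service-ca in the openshift-service-ca namespace (the name is not stated in this request, so adjust it if your cluster differs). The helper name service_ca_status is hypothetical.

```shell
# Report how many service-ca replicas are requested and which nodes host the pods.
# Assumption: the Deployment is named "service-ca"; this is not stated in the ticket.
service_ca_status() {
  local ns="openshift-service-ca"
  local replicas
  replicas="$(kubectl get deployment service-ca -n "$ns" -o jsonpath='{.spec.replicas}')" || return 1
  echo "requested replicas: ${replicas}"
  # -o wide adds the NODE column, showing where each pod is scheduled.
  kubectl get pods -n "$ns" -l app=service-ca -o wide
  # Treat an empty value as 0 so the comparison never fails on missing output.
  if [ "${replicas:-0}" -le 1 ]; then
    echo "service-ca is not HA: at most one replica is requested"
  fi
}
```

Calling service_ca_status with no arguments prints the requested replica count, the pod list with node placement, and a note when only one replica is requested.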

The steps below are ONLY a simulation that recreates the kind of failure that can occur.

kubectl create namespace test
oc adm policy add-scc-to-group privileged system:authenticated system:serviceaccounts
oc adm policy add-scc-to-group anyuid system:authenticated system:serviceaccounts
kubectl get pods -n openshift-service-ca -l app=service-ca

Delete the single openshift-service-ca pod:
kubectl delete pods -n openshift-service-ca -l app=service-ca

At this point, check the service-ca pod. It should be stuck in a CreateContainerConfigError state, as described in https://access.redhat.com/solutions/5875621
Try deleting the test namespace. The deletion hangs because of the dependency chain:
kubectl delete namespace test
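
Because the delete command above blocks while the namespace is stuck, a bounded check can be easier to work with. The following is a sketch only: the helper name check_namespace_deletion and the 60s default timeout are hypothetical choices, not part of the original report.

```shell
# Request deletion without blocking, then wait a bounded time for it to finish.
# If the namespace is still present, print its status conditions for diagnosis.
check_namespace_deletion() {
  local ns="$1"
  local timeout="${2:-60s}"   # hypothetical default; pick what suits your cluster
  kubectl delete namespace "$ns" --wait=false || return 1
  if kubectl wait --for=delete "namespace/${ns}" --timeout="$timeout"; then
    echo "namespace ${ns} deleted"
  else
    echo "namespace ${ns} is still terminating; status conditions:"
    kubectl get namespace "$ns" -o jsonpath='{.status.conditions}'
    echo
    return 1
  fi
}
```

Running check_namespace_deletion test 60s returns non-zero when the namespace is still terminating, and the printed conditions may indicate which part of the dependency chain is blocking the deletion.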

Fix the cluster by removing the SCCs from the groups, as described in https://access.redhat.com/solutions/5875621
oc adm policy remove-scc-from-group anyuid system:authenticated system:serviceaccounts
oc adm policy remove-scc-from-group privileged system:authenticated system:serviceaccounts

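Delete the service-ca pod so that it comes back with the correct SCC permissions, then confirm that it is running:
kubectl delete pods -n openshift-service-ca -l app=service-ca
kubectl get pods -n openshift-service-ca -l app=service-ca
Delete the openshift-monitoring pods so that Prometheus metrics can reconnect to the service-ca:
kubectl delete pods -n openshift-monitoring --all
Delete the test namespace again; the deletion should now complete:
kubectl delete namespace test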

If the service-ca were HA, I don't think this would be an issue unless all the nodes were in a bad state (at which point you would have other issues).

3. Why does the customer need this? (List the business requirements here)

Any user who attempts to delete a namespace, or any other resource that requires metrics, could cause the deletion to hang, leading to cluster issues.

4. List any affected packages or components.

service-ca-operator

People

Assignee: William Caban

Reporter: Courtney Ruhm

Votes: 0

Watchers: 2

Dates

Created: 2022/04/12 4:38 PM

Updated: 2023/07/31 7:35 PM

Resolved: 2023/03/20 3:43 PM