  1. OpenShift Request For Enhancement
  2. RFE-2775

Make the service-ca pod highly available


Details

    • Feature Request
    • Resolution: Done
    • Normal
    • API, Service Catalog

    Description

      1. Proposed title of this feature request

      Make the service-ca pod highly available

      2. What is the nature and description of the request?
      Some background is needed. In OpenShift 4.10, Prometheus relies on the service-ca pod to pull certificates for the metrics that finalizers use whenever a namespace is terminated.
      The service-ca pod is not HA and runs on a single node. If that node is in a bad state and the pod does not get rescheduled to a healthy node, namespaces can get stuck terminating because the dependency chain is not HA.
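
To see whether a cluster is exposed to the single-replica situation described above, one could read the replica count of the service-ca Deployment. The following is a minimal sketch, not a supported procedure. It assumes the Deployment is named `service-ca` in the `openshift-service-ca` namespace, and the function names `service_ca_replicas` and `service_ca_is_ha` are hypothetical helpers introduced here for illustration.

```shell
# Sketch only. Assumption: the Deployment is named "service-ca" and lives in
# the "openshift-service-ca" namespace. Function names are hypothetical.

# Print the configured replica count of the service-ca Deployment.
service_ca_replicas() {
  kubectl get deployment service-ca -n openshift-service-ca \
    -o jsonpath='{.spec.replicas}'
}

# Return 0 if more than one replica is configured, 1 if not,
# and 2 if the replica count could not be read.
service_ca_is_ha() {
  replicas="$(service_ca_replicas)" || return 2
  case "$replicas" in
    ''|*[!0-9]*) return 2 ;;   # empty or non-numeric output
  esac
  [ "$replicas" -gt 1 ]
}
```

A possible use is `service_ca_is_ha || echo "service-ca is not HA"`, followed by `kubectl get pods -n openshift-service-ca -l app=service-ca -o wide` to see which node the pod is currently scheduled on.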

      The following is ONLY a simulation that recreates the issue that can show up.

      kubectl create namespace test
      oc adm policy add-scc-to-group privileged system:authenticated system:serviceaccounts
      oc adm policy add-scc-to-group anyuid system:authenticated system:serviceaccounts
      kubectl get pods -n openshift-service-ca -l app=service-ca

      1. Delete the single openshift-service-ca pod
        kubectl delete pods -n openshift-service-ca -l app=service-ca
      2. At this point, check the service-ca pod; it should be stuck in a CreateContainerConfigError state, as described in https://access.redhat.com/solutions/5875621
      3. Try deleting the test namespace; you will see that you cannot because of the dependency chain
        kubectl delete namespace test
      4. Fix the cluster by removing the SCCs from the groups, as described in https://access.redhat.com/solutions/5875621
        oc adm policy remove-scc-from-group anyuid system:authenticated system:serviceaccounts
        oc adm policy remove-scc-from-group privileged system:authenticated system:serviceaccounts
      5. Delete the service-ca pod so that it comes back with the correct SCC permissions, then check it
        kubectl delete pods -n openshift-service-ca -l app=service-ca
        kubectl get pods -n openshift-service-ca -l app=service-ca
      6. Delete the openshift-monitoring pods so that Prometheus metrics can reconnect to the service-ca
        kubectl delete pods -n openshift-monitoring --all
      7. Retry deleting the test namespace
        kubectl delete namespace test
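
While walking through the steps above, it can help to check whether the test namespace is actually stuck rather than just slow to delete. The sketch below classifies a namespace by its `.status.phase`; the function name `namespace_state` is a hypothetical helper, and the output strings are chosen here for illustration only.

```shell
# Sketch only. "namespace_state" is a hypothetical helper, not part of kubectl.

# Print "deleted", "terminating", or "present (<phase>)" for namespace $1.
namespace_state() {
  # --ignore-not-found makes kubectl print nothing once the namespace is gone.
  phase="$(kubectl get namespace "$1" --ignore-not-found \
    -o jsonpath='{.status.phase}')"
  case "$phase" in
    '')          echo "deleted" ;;
    Terminating) echo "terminating" ;;
    *)           echo "present ($phase)" ;;
  esac
}
```

For example, after `kubectl delete namespace test --wait=false`, calling `namespace_state test` a minute or so later should show whether the namespace is still terminating. If it is, `kubectl get namespace test -o jsonpath='{.status.conditions}'` may give a hint about what is blocking the deletion.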

      If the service-ca were HA, I don't think this would be an issue unless all the nodes were in a bad state (at which point you'd have other issues).

      3. Why does the customer need this? (List the business requirements here)

      Any user who attempts to delete a namespace, or any other resource that requires metrics, could cause the deletion to hang, causing cluster issues.

      4. List any affected packages or components.

      service-ca-operator


          People

            wcabanba@redhat.com William Caban
            rhn-support-cruhm Courtney Ruhm