OpenShift Bugs / OCPBUGS-7546

Default Router PDB Allows 2 Disruptions with 3 Replicas


    • Important
    • No
    • 3
    • Sprint 234, Sprint 235, Sprint 236, Sprint 237, Sprint 238, Sprint 239
    • 6
    • Rejected
    • False
      Cause: We currently set maxUnavailable to 50% for replica counts below 4, and in the pod disruption budget, Kubernetes rounds up the number of pods that may be disrupted. Therefore, if you have 3 replicas and maxUnavailable is "50%", 2 replicas may be disrupted. Pod disruption budgets can be inconvenient for people who want to run small clusters or drain all workers at once, but for this bug, preserving high availability takes precedence.

      Consequence: When there are only three replicas, the 50% value rounds up to allow two disruptions, leaving only a single replica. When possible, it is better to always leave two replicas.

      Fix: maxUnavailable is currently set to 25% for replica counts >= 4; this change applies 25% starting at replica counts >= 3.

      Result: When there are 3 or more replicas, at least 2 replicas remain up and running whenever possible.
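
      For illustration, here is a minimal Go sketch of the selection rule described in this fix; the function name and structure are assumptions for this example, not the operator's actual code:

      package main

      import "fmt"

      // maxUnavailableFor returns the PDB maxUnavailable value for a given router
      // replica count, following the rule described in this fix: 25% once there
      // are at least 3 replicas, 50% below that. (Illustrative sketch only; the
      // real logic lives in cluster-ingress-operator's poddisruptionbudget.go.)
      func maxUnavailableFor(replicas int) string {
          if replicas >= 3 { // the previous threshold was >= 4
              return "25%"
          }
          return "50%"
      }

      func main() {
          for _, r := range []int{1, 2, 3, 4} {
              fmt.Printf("replicas=%d -> maxUnavailable=%s\n", r, maxUnavailableFor(r))
          }
      }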
    • Done

      Description of problem:

      maxUnavailable defaults to 50% for anything under 4: https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/poddisruptionbudget.go#L71
      
      Based on the PDB rounding logic, the value always rounds up to the next whole integer, so 1.5 becomes 2.
      
      spec:
        maxUnavailable: 50%
        selector:
          matchLabels:
            ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
      status:
        currentHealthy: 3
        desiredHealthy: 1
        disruptionsAllowed: 2
      
      Whereas with 4 router pods, we only allow 1 of 4 to be disrupted at a time.
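
      The arithmetic behind those numbers can be reproduced with a short Go sketch; it uses plain math.Ceil to mirror Kubernetes' round-up of percentage maxUnavailable values rather than the apimachinery helpers:

      package main

      import (
          "fmt"
          "math"
      )

      // allowedDisruptions mirrors how a PDB with a percentage maxUnavailable is
      // evaluated: the percentage is scaled by the replica count and rounded up.
      func allowedDisruptions(replicas int, maxUnavailablePct float64) int {
          return int(math.Ceil(float64(replicas) * maxUnavailablePct / 100))
      }

      func main() {
          // 3 replicas at 50%: ceil(1.5) = 2 disruptions allowed, desiredHealthy = 1.
          // 4 replicas at 25%: ceil(1.0) = 1 disruption allowed, desiredHealthy = 3.
          cases := []struct {
              replicas int
              pct      float64
          }{{3, 50}, {4, 25}, {3, 25}}
          for _, c := range cases {
              d := allowedDisruptions(c.replicas, c.pct)
              fmt.Printf("replicas=%d maxUnavailable=%.0f%% -> disruptionsAllowed=%d desiredHealthy=%d\n",
                  c.replicas, c.pct, d, c.replicas-d)
          }
      }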

      Version-Release number of selected component (if applicable):

      4.x

      How reproducible:

      Always

      Steps to Reproduce:

      1. Set 3 replicas
      2. Look at the disruptionsAllowed on the PDB
      

      Actual results:

      You can take down 2 of 3 routers at once, leaving no HA.

      Expected results:

      With 3+ routers, we should always ensure 2 are up with the PDB.

      Additional info:

      Reduce maxUnavailable to 25% for >= 3 pods instead of >= 4; see the comparison sketch below.
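
      A small Go comparison of the current (>= 4) and proposed (>= 3) thresholds under the rounding behavior described above; the helper names are made up for this sketch:

      package main

      import (
          "fmt"
          "math"
      )

      // survivors returns how many replicas must stay healthy when maxUnavailable
      // is the given percentage, using the same round-up behavior as the PDB.
      func survivors(replicas int, maxUnavailablePct float64) int {
          return replicas - int(math.Ceil(float64(replicas)*maxUnavailablePct/100))
      }

      // pctFor applies a rule that uses 25% at or above the threshold and 50% below it.
      func pctFor(replicas, threshold int) float64 {
          if replicas >= threshold {
              return 25
          }
          return 50
      }

      func main() {
          for r := 2; r <= 5; r++ {
              fmt.Printf("replicas=%d current(>=4): %d must stay up, proposed(>=3): %d must stay up\n",
                  r, survivors(r, pctFor(r, 4)), survivors(r, pctFor(r, 3)))
          }
      }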

              cholman@redhat.com Candace Holman
              rhn-support-mrobson Matt Robson
              Shudi Li Shudi Li