Managed Service - Streams / MGDSTRM-10299

Remediate elevate-priority change

    • Type: Task
    • Resolution: Done
    • Priority: Critical
    • MK - Sprint 230, MK - Sprint 231

      Relates to OHSS-17337

      WHAT

      The change to elevate priority may cause existing broker / zookeeper pods to be evicted.

      WHY

      If the additional reserve / nodes are not immediately available, existing broker / zookeeper pods may be evicted (circumventing the drain cleaner) and must then wait for the additional nodes to spin up before they are available again.
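
      As a rough illustration of the capacity reasoning above, the sketch below estimates how many existing pods could be preempted during a rollout. It is a deliberately simplified model, not the scheduler's actual algorithm: it assumes one pod per node slot, that elevated-priority pods consume free slots and reserve slots first, and that any shortfall falls on existing broker / zookeeper pods. The function and parameter names are hypothetical.

```python
def pods_at_risk(elevated_pods: int, free_slots: int, reserve_slots: int) -> int:
    """Estimate how many existing pods could be preempted (simplified model).

    Assumption: each elevated-priority pod needs exactly one slot, and free
    and reserve slots are consumed before any existing pod is displaced.
    """
    if min(elevated_pods, free_slots, reserve_slots) < 0:
        raise ValueError("counts must be non-negative")
    # "Hot" capacity is what can be used without waiting for new nodes.
    hot_capacity = free_slots + reserve_slots
    # Anything beyond hot capacity has to come from existing pods.
    return max(0, elevated_pods - hot_capacity)


if __name__ == "__main__":
    # Hypothetical numbers: 6 elevated pods, 1 free slot, 3 reserve slots.
    print(pods_at_risk(elevated_pods=6, free_slots=1, reserve_slots=3))
```

      With these made-up numbers the shortfall is two pods; a result of zero would mean the rollout fits within hot capacity under this model.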

      HOW

      The near-term solution is to disable elevate-priority in the strimzi bundle configmap for the affected version.

      Longer term, one of the following needs to be done:
      1. Confirm / update the production reserve, such that there is sufficient hot capacity to complete the rollout without causing eviction of existing brokers / zookeepers.
      2. Replace the priority class with one that is non-preempting. However, this means that the new pods would no longer preempt the reserve: if the cluster reached capacity, the reserve could stay on the cluster ahead of these new pods. We'd have to ensure that production has more max node capacity in its machine pools (effectively capacity + reserve). Then, once the rollout is completed and the priority class is replaced with a preempting one, another change would be required to roll that through, as the pods will not immediately pick up priority class changes.
      3. Switch this feature in the FSO to be more situational: detect when nodes are preventing broker placement and have the FSO delete the offending pods, which should then allow (but not guarantee) subsequent deployments to appropriately consume the node, or the auto-scaler to reclaim the node.
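
      For option 2, a non-preempting priority class in Kubernetes is a `PriorityClass` with `preemptionPolicy: Never`. The sketch below builds such a manifest as JSON, which `kubectl apply -f` accepts. The class name and priority value are placeholders chosen for illustration, not the ones used by the strimzi bundle or the FSO.

```python
import json


def non_preempting_priority_class(name: str, value: int) -> dict:
    """Build a PriorityClass manifest whose pods never preempt others."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        # Higher values are scheduled ahead of lower ones in the queue.
        "value": value,
        # "Never" keeps the queue ordering but disables preemption of
        # lower-priority pods (the default is "PreemptLowerPriority").
        "preemptionPolicy": "Never",
        "globalDefault": False,
        "description": "Elevated priority without preemption (illustrative).",
    }


if __name__ == "__main__":
    # Hypothetical name and value, for illustration only.
    manifest = non_preempting_priority_class("kafka-elevated-non-preempting", 1000)
    print(json.dumps(manifest, indent=2))
```

      Pods only pick up a priority class at creation time, so switching back to a preempting class later would still need its own rollout, as noted in option 2.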

      cc keithbwall rhn-engineering-rareddy mchitimb-1

              rhn-engineering-shawkins Steven Hawkins
              Kafka Fleet Services