Type: Task
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Blocked: False
Blocked Reason: None
Ready: False
Discussed with Team: No
Git Pull Request: https://github.com/bf2fc6cc711aee1a0c2a/kas-sre-sops/pull/429
Sprint: MK - Sprint 230, MK - Sprint 231

Relates to OHSS-17337

WHAT

The change to elevate priority may cause existing broker / zookeeper pods to be evicted.

WHY

If the additional reserve / nodes are not immediately available, existing broker / zookeeper pods may be evicted (circumventing the drain cleaner) and must then wait for the additional nodes to spin up before they can be scheduled again.
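To illustrate the failure mode, here is a toy model of priority-based preemption on a single node that has no reserve pod to absorb the new arrival. It is a simplified sketch, not the kube-scheduler's actual victim-selection algorithm; the pod names, priority values, and CPU figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    priority: int  # hypothetical priority value
    cpu: int       # requested CPU units (illustrative)


def pick_victims(running, incoming, node_cpu):
    """Return the pods to preempt so that `incoming` fits on the node.

    Toy rule: consider only strictly lower-priority pods and take them
    in ascending priority order until enough CPU is free.
    """
    free = node_cpu - sum(p.cpu for p in running)
    victims = []
    candidates = sorted(
        (p for p in running if p.priority < incoming.priority),
        key=lambda p: p.priority,
    )
    for pod in candidates:
        if free >= incoming.cpu:
            break
        victims.append(pod)
        free += pod.cpu
    if free < incoming.cpu:
        return []  # the incoming pod cannot fit even after preemption
    return victims


# The node is full of existing (non-elevated) pods and holds no reserve pod.
running = [
    Pod("broker-old-0", priority=0, cpu=4),
    Pod("zookeeper-old-0", priority=0, cpu=2),
]
incoming = Pod("broker-new-0", priority=100, cpu=4)

victims = pick_victims(running, incoming, node_cpu=6)
print([p.name for p in victims])  # ['broker-old-0']
```

In this model the only lower-priority workload available to preempt is an existing broker, so it is evicted directly rather than being drained gracefully.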

HOW

The near-term solution is to disable elevate-priority in the strimzi bundle configmap for the affected version.

Longer term, one of the following needs to be done:
1. Confirm / update the production reserve so that there is sufficient hot capacity to complete the rollout without evicting existing brokers / zookeepers.
2. Replace the priority class with one that is non-preempting. However, this means the new pods would no longer preempt the reserve: if the cluster reached capacity, the reserve could stay on the cluster ahead of these new pods. We'd have to ensure that production has more max node capacity in its machine pools (effectively capacity + reserve). Then, once the rollout is completed and the priority class is replaced with a preempting one, another change would be required to roll that through, as the pods will not immediately pick up priority class changes.
3. Switch this feature in the FSO to be more situational: detect when nodes are preventing broker placement and have the FSO delete the offending pods, which should then allow (but not guarantee) subsequent deployments to appropriately consume the node, or the auto-scaler to reclaim the node.
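As a rough sketch of option 2, a non-preempting priority class in Kubernetes is a `PriorityClass` whose `preemptionPolicy` is `Never` (the default is `PreemptLowerPriority`). The helper below builds such a manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts. The class name, value, and description are hypothetical and are not taken from the RHOSAK bundle.

```python
import json


def priority_class(name, value, preempting):
    """Build a scheduling.k8s.io/v1 PriorityClass manifest as a dict."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        # "Never" lets pods sort ahead of lower-priority pods in the
        # scheduling queue without evicting anything that is already running.
        "preemptionPolicy": "PreemptLowerPriority" if preempting else "Never",
        "description": "Hypothetical class for illustration only.",
    }


# Hypothetical name and value; substitute whatever the bundle actually uses.
rollout_class = priority_class(
    "kafka-rollout-nonpreempting", 1000, preempting=False
)
print(json.dumps(rollout_class, indent=2))
```

Pods record their resolved priority at admission time, which is why switching classes later would require another rollout, as noted in option 2.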

cc keithbwall rhn-engineering-rareddy mchitimb-1

relates to

MGDSTRM-10300 Create separate blast radius control mechanisms

MGDSTRM-10160 Add "elevate-priority" flag to RHOSAK bundle

Closed

mentioned on

Merge request - MGDSTRM-10299: [RHOSAK] Remediate elevate-priority change

Solved by commit e7c8928a76e9f73d3c681a8a3093f97851ef3e9d.

Assignee: Steven Hawkins

Reporter: Steven Hawkins

Team: Kafka Fleet Services

Votes: 0

Watchers: 6

Created: 2022/12/16 1:44 PM

Updated: 2023/01/26 9:12 AM

Resolved: 2023/01/26 9:11 AM
