Container / Cluster Management (XCM) Strategy
XCMSTRAT-110

Milestone-1: Day 2 pid-limit configurable for customer workloads


Details


       
      Feature is GA: rosa 1.2.31 was released on November 27th.
      OCM UI will be released on Dec 20th. As per Bala's comment from 11/8, UI scope will be delivered in a later milestone with M3 XCMSTRAT-382.
       


      OCM 29 Nov:

      • ROSA CLI 1.2.31 is currently syncing to the mirror
      • Once 1.2.31 is synced, docs will be published
      • OCM UI stories started based on UX for informing mode

      OCM 28 Nov:

      • OCM API for day 2 PID limits enabled in production
        • Customers can use the feature via the OCM CLI or directly via the OCM API (see the sketch after this update)
      • ROSA CLI will now ship on November 30th
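
      For reference, a minimal sketch of how a customer might exercise the day 2 API with the ocm CLI. The /kubelet_config endpoint path, the pod_pids_limit field name, and the assumption that `ocm post` takes the request body on stdin reflect my reading of the clusters_mgmt API model and should be checked against the published API reference; the cluster ID and value are placeholders.

        # Internal cluster ID, e.g. taken from `ocm list clusters`
        CLUSTER_ID="<internal-cluster-id>"

        # Create the day 2 kubelet config with a raised pod PIDs limit
        # (endpoint path and field name are assumptions, not confirmed from this ticket)
        echo '{"pod_pids_limit": 8192}' | ocm post "/api/clusters_mgmt/v1/clusters/${CLUSTER_ID}/kubelet_config"

        # Read the configuration back
        ocm get "/api/clusters_mgmt/v1/clusters/${CLUSTER_ID}/kubelet_config"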

      OCM 27 Nov:

      • Waiting on QE testing of the rosa 1.2.31 release candidate; rosa is currently expected to be released on Nov 29th.
      • Docs are ready and will be published when rosa is published to the mirror.
      • Color status changed to Yellow since the due date was missed by a couple of days.

       

      OCM 21 Nov:

      • Epic https://issues.redhat.com/browse/OCM-2340 is now closed with all testing complete
      • All code is deployed to production and ready to be turned on (currently behind a feature flag)
      • ROSA 1.2.31 will include the changes required to support this feature

      OCM 14 Nov:

      • All required implementation tickets closed except `rosa create kubeletconfig`, which is pending OCM-4812
        • OCM-4812 is a minor fix, with PR up today
      • Most changes already deployed to production
      • Code for push of analytics to Amplitude under review
        • Data structures confirmed with Jake Lucky

      OCM 7 Nov:

      • API Model Changes Released
      • OCM Go SDK Changes Released in 0.1.381
      • Backend implementation deployed to production, hidden behind feature toggle and defaulted to off
      • ROSA CLI (a usage sketch follows this update)
        • Merged to master, ready for review
          • rosa describe kubeletconfig
          • rosa delete kubeletconfig
        • Pending Code Review
          • rosa create kubeletconfig
          • rosa edit kubeletconfig
      • Supporting docs with final updates
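
      For orientation, a minimal sketch of the intended day 2 workflow with the kubeletconfig subcommands listed above. The --cluster and --pod-pids-limit flag names and the example values are assumptions inferred from the command names; confirm them against the released ROSA CLI help output.

        # Create a cluster-wide kubelet config with a raised pod PIDs limit
        rosa create kubeletconfig --cluster my-cluster --pod-pids-limit 8192

        # Inspect the current configuration
        rosa describe kubeletconfig --cluster my-cluster

        # Change the limit later (triggers a rolling reboot of all non-control-plane nodes)
        rosa edit kubeletconfig --cluster my-cluster --pod-pids-limit 12288

        # Remove the override and fall back to the OCP default
        rosa delete kubeletconfig --cluster my-cluster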

      QE 17 Nov:

      • API testing finished:
        • No critical issues. The cluster can enter an unstable status when the operation is performed too frequently or while an upgrade is ongoing; as aligned, operation rate limiting is not in the current scope. Trying to find stable reproduction steps, but this is not a high priority.
        • Covered profiles: rosa-sts, rosa non-sts, OSD-GCP-CCS, OSD-AWS Red Hat fully managed, rosa shared VPC, GCP shared VPC. Covered version: 4.12.x
        • API testing automation finished and code merged
        • OCM-3850 testing finished without issues
      • ROSA CLI testing finished:
        • Remaining issues closed
        • Automation finished
      • Docs not yet ready for review
      • One thing to highlight: there is an option in the API exposed as node_drain_grace_period, which is used by MUO only (details here). MCO does not expose the configuration yet (details here). Flagging this in case customers are confused about it.
      • I will close OCM-2340 after the final card is tc-approved

      OCM  1 Nov:

      • SDK changes released for PID limits

      OCM 27 OCT:

      • Demo for OCM API: https://drive.google.com/file/d/17U2Ik9I2gn9swKGl8Fhsx1R81RFEG2P8/view
      • CLI work starting w/c 30th October

      OCM 24 OCT:

      • Implementation has started in CS to support day 2 operations
        • Target is to complete most of the backend and CLI implementation over the current sprint
          • Work on the epic has been decomposed into small tasks to allow the work to progress in parallel
      • UI will be delivered by December 20th as part of the Q4 deliverables
        • UX/UI working on designs as a priority and hopefully first versions available this week
      • Need to work with BU, SRE and Account Teams to ensure that we have a migration plan for customers that have existing PID limit exceptions in place
        • @rblake to write a proposal in the DDR and circulate for agreement/discussion

      OCM 18 OCT:

      • Agreement with BU, SRE and engineering on the DDR
      • Epic is tasked out with priority to deliver day 2 operations via CLI and API first: https://issues.redhat.com/browse/OCM-2340
      • Confirmed in Weekly Sync that other day 2 Operations will not be blocked by update to PID limits
      • Confirmed in Weekly Sync that a Service Log will be written when a user requests to change the PID limit
      • Confirmed in Weekly Sync that UI is not in scope for the first milestone
      • Confirmed in Weekly Sync with docs that additional wording is needed to make users aware that applying this change will result in all nodes (apart from Control Plane nodes) rebooting

      16th OCT:

      • QE: Researching pid configuration to make a test plan

      OCM 11th OCT:

      • Confirmed a cluster-wide API will be acceptable for customers as a workaround until we can support per-MachinePool limits
      • Today scheduled sync call to agree on implementation, either OCM API changes or SRE support exception
      • Expect to have due date by EOD

      QE 10th OCT:

      • Not ready for testing

      Friday 6th OCT

      • Agreed that using custom MachineConfigPool to support this feature is not doable for ROSA Classic
      • A possible option is to support cluster-wide PID Limits. We are checking that this is acceptable for the customer
      • If yes, we will reconvene to decide how best to apply these requirements

      OCM 3 OCT:

      • Finalized the OCM schema for Machine Configs and reached general alignment on Machine Configs for all machine set configs
      • Working to resolve all open discussions on DDR
      • Next steps to progress an early prototype API this week
      • Align this week with UI, Docs and QE
      • ROSA CLI changes targeted for release 1.2.29 (shipping 8th of Nov)

      Docs updates for 23 Oct

      • Draft of the CLI-related docs changes is here; needs review

    Description

      Feature Overview (aka. Goal Summary)  

      This feature will introduce Process IDs (PIDs) as a node-level resource for application pods that customers can manage and control.

      Process IDs and the number of processes are a fundamental resource on Linux hosts. Even when other resources like CPU, storage, and memory are available, it is possible for some Pods to run out of process IDs and fail.

      This feature will allow customers to increase/set PIDs per Pod as allowed by the node allocatable. The feature will be delivered across multiple milestones to cover all use cases (cluster level, per-machine-pool level) across different topologies (HCP, Classic):

      1. M1 / XCMSTRAT-110 - API and ROSA CLI support on ROSA Classic
      2. M2 / XCMSTRAT-355 - Support for ROSA HCP 
      3. M3 / XCMSTRAT-382 - Support for all clients (UI, TF), per-MachinePool, all allowed PID limit values
      4. Backlog/XCMSTRAT-383 - Support for day-1 (cluster installation)

      This Jira is scoped to the first milestone: providing cluster-wide configuration on ROSA and OSD on AWS clusters.

      Goals (aka. expected user outcomes)

      • Configure podPidsLimit for all worker nodes (i.e., all nodes of machine pools; all cluster nodes that are not control plane nodes)
      • podPidsLimit values from 4096 (default) to 16,384 (soft limit) available to all clusters
      • No impact to the control plane nodes
      • When not set, the default value provided by the OCP version will be applied
      • Support on OCP 4.11 and above
      • Customers can use the ROSA CLI (MVP), OCM UI (follow-up), and Terraform (follow-up) to set this
      • Ability to modify this on an existing cluster - all nodes will be rebooted one at a time, potentially causing workload disruption
      • ROSA CLI and OCM UI to provide a warning that changing this value will require machine pool nodes to reboot and disrupt the applications
      • Ability to set this configuration at the time of cluster creation (follow-up)
      • Support for ROSA clusters and OSD CCS on AWS clusters
      • ROSA and OSD docs updated to describe how to use the feature
      • OCM includes the field in the telemetry for tracking analytics on clusters that override the default values

      Documentation

      • The feature needs to be covered in both the creating and editing machine pools sections, as requested in OSDOCS-6267, i.e., covered in the day-2 workflows.
      • Provide a use case or reasoning for setting this value to something other than the default.
      • Provide a section on considerations including:
        • what happens if the value is not set
        • what happens when the value is updated (rolling reboot of the machines, disruptive to workloads)
        • what happens when the value set gets exhausted (pods restarted/rescheduled?) etc. (see the verification sketch after this list)
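
      To support that considerations section, a hedged sketch of how the applied value could be verified and how exhaustion shows up. The kubelet config path on the node is an assumption (OpenShift normally renders the kubelet configuration at /etc/kubernetes/kubelet.conf), so docs should confirm it before publishing.

        # Check the rendered kubelet configuration on a worker node
        oc debug node/<worker-node> -- chroot /host grep -i podPidsLimit /etc/kubernetes/kubelet.conf

        # Inside a pod that has exhausted its PID limit, new process creation fails with
        # errors such as "fork: retry: Resource temporarily unavailable"; the pod is not
        # evicted or rescheduled by the limit alone, only new processes fail to start.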

      Additional Information:

      • Opportunity: With the standard default kubelet configuration that only allows a fixed 4K limit on PIDs per Pod, workloads that need more PIDs per Pod are unable to run and operate on Managed Services. The prospects and customers who today configure this in self-managed OCP are unable to adopt ROSA because of the missing configurability.
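
      For contrast, a rough sketch of what those self-managed OCP customers do today via the machine config API referenced below. The resource shape follows the OpenShift KubeletConfig CRD; the pool label and the 8192 value are illustrative placeholders. This direct route is not available on ROSA Classic, which is why the managed OCM API/CLI path above is needed.

        # Label the worker MachineConfigPool so the KubeletConfig below selects it
        oc label machineconfigpool worker custom-kubelet=raise-pid-limit

        # kubeletconfig-pid-limit.yaml (illustrative file contents):
        #   apiVersion: machineconfiguration.openshift.io/v1
        #   kind: KubeletConfig
        #   metadata:
        #     name: raise-pid-limit
        #   spec:
        #     machineConfigPoolSelector:
        #       matchLabels:
        #         custom-kubelet: raise-pid-limit
        #     kubeletConfig:
        #       podPidsLimit: 8192
        oc apply -f kubeletconfig-pid-limit.yaml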

      References:

      1. Kubelet configuration spec part of Machine API : https://docs.openshift.com/container-platform/4.13/rest_api/machine_apis/containerruntimeconfig-machineconfiguration-openshift-io-v1.html#spec-containerruntimeconfig
      2. Kubernetes documentation on per-pod PIDs https://kubernetes.io/docs/concepts/policy/pid-limiting/ 

            People

              Balachandran Chandrasekaran (rh-ee-bchandra)
              Balachandran Chandrasekaran (rh-ee-bchandra)
              David Taylor, Dustin Row, Haoran Wang
              Rob Blake
              Xue Li
              Shashank Karanth
              Balachandran Chandrasekaran
              Lisa Lyman
              Votes: 3
              Watchers: 22

              Dates

                Created:
                Updated:
                Resolved: