Container / Cluster Management (XCM) Strategy
XCMSTRAT-110

Milestone-1: Day 2 pid-limit configurable for customer workloads


Details


       
      Feature is GA: rosa 1.2.31 was released on November 27th.
      OCM UI will be released on Dec 20th. As per Bala's comment from 11/8, UI scope will be delivered in a later milestone with M3 XCMSTRAT-382.
       


      OCM 29 Nov:

      • ROSA CLI 1.2.31 is currently syncing to the mirror
      • Once 1.2.31 is synced, docs will be published
      • OCM UI stories started based on UX for informing mode

      OCM 28 Nov:

      • OCM API for day 2 PID limits enabled in production
        • Customers can use the feature via the OCM CLI or directly via the OCM API (see the sketch after this update)
      • ROSA CLI will now ship on November 30th
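
      For reference, a minimal sketch of how a customer might exercise the day 2 API with the ocm CLI. The /kubelet_config endpoint path, the pod_pids_limit field name, and the assumption that `ocm post` takes the request body on stdin reflect my reading of the clusters_mgmt API model and should be checked against the published API reference; the cluster ID and value are placeholders.

        # Internal cluster ID, e.g. taken from `ocm list clusters`
        CLUSTER_ID="<internal-cluster-id>"

        # Create the day 2 kubelet config with a raised pod PIDs limit
        # (endpoint path and field name are assumptions, not confirmed from this ticket)
        echo '{"pod_pids_limit": 8192}' | ocm post "/api/clusters_mgmt/v1/clusters/${CLUSTER_ID}/kubelet_config"

        # Read the configuration back
        ocm get "/api/clusters_mgmt/v1/clusters/${CLUSTER_ID}/kubelet_config"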

      OCM 27 Nov:

      • Waiting on QE testing of the rosa 1.2.31 release candidate; rosa is currently expected to be released on Nov 29th.
      • Docs are ready and will be published when rosa is published to the mirror.
      • Color status changed to Yellow since the due date was missed by a couple of days.

       

      OCM 21 Nov:

      • Epic https://issues.redhat.com/browse/OCM-2340 is now closed with all testing complete
      • All code is deployed to production and ready to be turned on (currently behind a feature flag)
      • ROSA 1.2.31 will include the changes required to support this feature

      OCM 14 Nov:

      • All required implementation tickets closed except `rosa create kubeletconfig`, which is pending OCM-4812
        • OCM-4812 is a minor fix, with PR up today
      • Most changes already deployed to production
      • Code for push of analytics to Amplitude under review
        • Data structures confirmed with Jake Lucky

      OCM 7 Nov:

      • API Model Changes Released
      • OCM Go SDK Changes Released in 0.1.381
      • Backend implementation deployed to production, hidden behind feature toggle and defaulted to off
      • ROSA CLI (a usage sketch follows this update)
        • Merged to master, ready for review
          • rosa describe kubeletconfig
          • rosa delete kubeletconfig
        • Pending Code Review
          • rosa create kubeletconfig
          • rosa edit kubeletconfig
      • Supporting docs with final updates
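
      For orientation, a minimal sketch of the intended day 2 workflow with the kubeletconfig subcommands listed above. The --cluster and --pod-pids-limit flag names and the example values are assumptions inferred from the command names; confirm them against the released ROSA CLI help output.

        # Create a cluster-wide kubelet config with a raised pod PIDs limit
        rosa create kubeletconfig --cluster my-cluster --pod-pids-limit 8192

        # Inspect the current configuration
        rosa describe kubeletconfig --cluster my-cluster

        # Change the limit later (triggers a rolling reboot of all non-control-plane nodes)
        rosa edit kubeletconfig --cluster my-cluster --pod-pids-limit 12288

        # Remove the override and fall back to the OCP default
        rosa delete kubeletconfig --cluster my-cluster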

      QE 17 Nov:

      • API testing finished:
        • No critical issues. The cluster can enter an unstable status when the operation is performed too frequently or while an upgrade is ongoing; as aligned, operation rate limiting is not in the current scope. Trying to find stable reproduction steps, but this is not a high priority.
        • Covered profiles: rosa-sts, rosa non-sts, OSD-GCP-CCS, OSD-AWS Red Hat fully managed, rosa shared VPC, GCP shared VPC. Covered version: 4.12.x
        • API testing automation finished and code merged
        • OCM-3850 testing finished without issues
      • ROSA CLI testing finished:
        • Remaining issues closed
        • Automation finished
      • Docs not yet ready for review
      • One thing to highlight: there is an option in the API exposed as node_drain_grace_period, which is used by MUO only (details here). MCO does not expose the configuration yet (details here). Flagging this in case customers are confused about it.
      • I will close OCM-2340 after the final card is tc-approved

      OCM  1 Nov:

      • SDK changes released for PID limits

      OCM 27 OCT:

      • Demo for OCM API: https://drive.google.com/file/d/17U2Ik9I2gn9swKGl8Fhsx1R81RFEG2P8/view
      • CLI work starting w/c 30th October

      OCM 24 OCT:

      • Implementation has started in CS to support day 2 operations
        • Target is to complete most of the backend and CLI implementation over the current sprint
          • Work on the epic has been decomposed into small tasks to allow the work to progress in parallel
      • UI will be delivered by December 20th as part of the Q4 deliverables
        • UX/UI working on designs as a priority and hopefully first versions available this week
      • Need to work with BU, SRE and Account Teams to ensure that we have a migration plan for customers that have existing PID limit exceptions in place
        • @rblake to write a proposal in the DDR and circulate for agreement/discussion

      OCM 18 OCT:

      • Agreement with BU, SRE and engineering on the DDR
      • Epic is tasked out with priority to deliver day 2 operations via CLI and API first: https://issues.redhat.com/browse/OCM-2340
      • Confirmed in Weekly Sync that other day 2 Operations will not be blocked by update to PID limits
      • Confirmed in Weekly Sync that a Service Log will be written when a user requests to change the PID limit
      • Confirmed in Weekly Sync that UI is not in scope for the first milestone
      • Confirmed in Weekly Sync with docs that additional wording is needed to make users aware that applying this change will result in all nodes (apart from Control Plane nodes) rebooting

      16th OCT:

      • QE: Researching pid configuration to make a test plan

      OCM 11th OCT:

      • Confirmed a cluster-wide API will be acceptable for customers as a workaround until we can support per-MachinePool limits
      • Today scheduled sync call to agree on implementation, either OCM API changes or SRE support exception
      • Expect to have due date by EOD

      QE 10th OCT:

      • Not ready for testing

      Friday 6th OCT

      • Agreed that using custom MachineConfigPool to support this feature is not doable for ROSA Classic
      • A possible option is to support cluster-wide PID Limits. We are checking that this is acceptable for the customer
      • If yes, we will reconvene to decide how best to apply these requirements

      OCM 3 OCT:

      • Finalized the OCM schema for Machine Configs and reached general alignment on Machine Configs for all machine set configs
      • Working to resolve all open discussions on DDR
      • Next steps to progress an early prototype API this week
      • Align this week with UI, Docs and QE
      • ROSA CLI changes targeted for release 1.2.29 (shipping 8th of Nov)

      Docs updates for 23 Oct

      • Draft of the CLI-related docs changes is here; needs review

    Description

      Feature Overview (aka. Goal Summary)  

      This feature will introduce Process IDs (PIDs) as a node-level resource for application pods that customers can manage and control.

      Process IDs and the number of processes are a fundamental resource on Linux hosts. Even when other resources like CPU, storage, and memory are available, it is possible for some Pods to run out of process IDs and fail.

      This feature will allow customers to increase/set PIDs per Pod as allowed by the node allocatable. The feature will be delivered across multiple milestones to cover all use cases (cluster level, per-machine-pool level) across different topologies (HCP, Classic):

      1. M1 / XCMSTRAT-110 - API and ROSA CLI support on ROSA Classic
      2. M2 / XCMSTRAT-355 - Support for ROSA HCP 
      3. M3 / XCMSTRAT-382 - Support for all clients (UI, TF), per-MachinePool, all allowed PID limit values
      4. Backlog/XCMSTRAT-383 - Support for day-1 (cluster installation)

      This Jira is scoped to the first milestone: providing cluster-wide configuration on ROSA and OSD on AWS clusters.

      Goals (aka. expected user outcomes)

      • Configure podPidsLimit for all worker nodes (i.e., all nodes of machine pools; all cluster nodes that are not control plane nodes)
      • podPidsLimit values from 4096 (default) to 16,384 (soft limit) available to all clusters
      • No impact to the control plane nodes
      • When not set, the default value provided by the OCP version will be applied
      • Support on OCP 4.11 and above
      • Customers can use the ROSA CLI (MVP), OCM UI (follow-up), and Terraform (follow-up) to set this
      • Ability to modify this on an existing cluster - all nodes will be rebooted one at a time, potentially causing workload disruption
      • ROSA CLI and OCM UI to provide a warning that changing this value will require machine pool nodes to reboot and disrupt the applications
      • Ability to set this configuration at the time of cluster creation (follow-up)
      • Support for ROSA clusters and OSD CCS on AWS clusters
      • ROSA and OSD docs updated to describe how to use the feature
      • OCM includes the field in the telemetry for tracking analytics on clusters that override the default values

      Documentation

      • The feature needs to be covered in both the creating and editing machine pools sections, as requested in OSDOCS-6267, i.e., covered in the day-2 workflows.
      • Provide a use case or reasoning for setting this value to something other than the default.
      • Provide a section on considerations including:
        • what happens if the value is not set
        • what happens when the value is updated (rolling reboot of the machines, disruptive to workloads)
        • what happens when the value set gets exhausted (pods restarted/rescheduled?) etc. (see the verification sketch after this list)
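
      To support that considerations section, a hedged sketch of how the applied value could be verified and how exhaustion shows up. The kubelet config path on the node is an assumption (OpenShift normally renders the kubelet configuration at /etc/kubernetes/kubelet.conf), so docs should confirm it before publishing.

        # Check the rendered kubelet configuration on a worker node
        oc debug node/<worker-node> -- chroot /host grep -i podPidsLimit /etc/kubernetes/kubelet.conf

        # Inside a pod that has exhausted its PID limit, new process creation fails with
        # errors such as "fork: retry: Resource temporarily unavailable"; the pod is not
        # evicted or rescheduled by the limit alone, only new processes fail to start.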

      Additional Information:

      • Opportunity: With the standard default kubelet configuration that only allows a fixed 4K limit on PIDs per Pod, workloads that need more PIDs per Pod are unable to run and operate on Managed Services. The prospects and customers who today configure this in self-managed OCP are unable to adopt ROSA because of the missing configurability.
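
      For contrast, a rough sketch of what those self-managed OCP customers do today via the machine config API referenced below. The resource shape follows the OpenShift KubeletConfig CRD; the pool label and the 8192 value are illustrative placeholders. This direct route is not available on ROSA Classic, which is why the managed OCM API/CLI path above is needed.

        # Label the worker MachineConfigPool so the KubeletConfig below selects it
        oc label machineconfigpool worker custom-kubelet=raise-pid-limit

        # kubeletconfig-pid-limit.yaml (illustrative file contents):
        #   apiVersion: machineconfiguration.openshift.io/v1
        #   kind: KubeletConfig
        #   metadata:
        #     name: raise-pid-limit
        #   spec:
        #     machineConfigPoolSelector:
        #       matchLabels:
        #         custom-kubelet: raise-pid-limit
        #     kubeletConfig:
        #       podPidsLimit: 8192
        oc apply -f kubeletconfig-pid-limit.yaml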

      References:

      1. Kubelet configuration spec part of Machine API : https://docs.openshift.com/container-platform/4.13/rest_api/machine_apis/containerruntimeconfig-machineconfiguration-openshift-io-v1.html#spec-containerruntimeconfig
      2. Kubernetes documentation on per-pod PIDs https://kubernetes.io/docs/concepts/policy/pid-limiting/ 

            People

              Balachandran Chandrasekaran (rh-ee-bchandra)
              Balachandran Chandrasekaran (rh-ee-bchandra)
              David Taylor, Dustin Row, Haoran Wang
              Rob Blake
              Xue Li
              Shashank Karanth
              Balachandran Chandrasekaran
              Lisa Lyman
              Votes: 3
              Watchers: 22

              Dates

                Created:
                Updated:
                Resolved: