OpenShift Container Platform (OCP) Strategy: OCPSTRAT-1026

Admin-defined node disruption policies: Phase 2 (GA)

    • 0% To Do, 0% In Progress, 100% Done

      Phase 2 Deliverable:

      GA support for a generic interface for administrators to define custom reboot/drain suppression rules. 

      Epic Goal

      • Allow administrators to define which machineconfigs won't cause a drain and/or reboot.
      • Allow administrators to define which ImageContentSourcePolicy/ImageTagMirrorSet/ImageDigestMirrorSet changes won't cause a drain and/or reboot.
      • Allow administrators to define alternate actions (typically restarting a system daemon) to take instead.
      • Possibly (pending discussion) add a switch that allows the administrator to choose a kexec "restart" instead of a full hardware reset via reboot.
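
      The goals above can be pictured as a lookup from changed paths to admin-chosen actions. The following is a minimal sketch only, not the Machine Config Operator's implementation or API: the rule table, the action strings ("None", "Reload:<unit>", "Reboot"), and the function name plan_actions are all hypothetical, and the fallback to a reboot for paths without a rule is an assumption made for illustration.

```python
# Hypothetical rule table: changed file path -> actions the admin allows.
# The action strings are illustrative and are not the real API's values.
EXAMPLE_RULES = {
    "/etc/sudoers.d/admins": ("None",),                  # no disruption at all
    "/etc/ssh/sshd_config": ("Reload:sshd.service",),    # reload a daemon instead
}

# Assumption for this sketch: any path without a rule falls back to a reboot.
DEFAULT_ACTIONS = ("Reboot",)


def plan_actions(changed_paths, rules):
    """Return the ordered, de-duplicated actions for a set of changed paths.

    A single "Reboot" overrides everything else, because a reboot already
    covers any service reload or restart that would otherwise be needed.
    """
    planned = []
    for path in changed_paths:
        for action in rules.get(path, DEFAULT_ACTIONS):
            if action == "Reboot":
                return ["Reboot"]
            if action != "None" and action not in planned:
                planned.append(action)
    return planned


if __name__ == "__main__":
    print(plan_actions(["/etc/ssh/sshd_config"], EXAMPLE_RULES))
    print(plan_actions(["/etc/chrony.conf"], EXAMPLE_RULES))
```

      In this sketch a change touching only the sudo drop-in yields no actions, while a change to any path without a rule still results in a reboot, so existing behavior is preserved unless the administrator opts out.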

      Why is this important?

      • There is a demonstrated need from customer cluster administrators to push configuration settings and restart system services without restarting each node in the cluster. 
      • Customers are modifying ICSP/ITMS/IDMS objects after day 1, or adding new ones.
      • (kexec - we are not committed on this point yet) Server-class hardware with various add-in cards can take 10 minutes or longer in BIOS/POST. Skipping this step would dramatically speed up bare metal rollouts, to the point that upgrades would proceed about as fast as cloud deployments. The downside is potential problems with hardware and driver support, in-flight DMA operations, and other unexpected behavior. OEMs and ODMs may or may not support their customers with this path.

      Scenarios

      1. As a cluster admin, I want to reconfigure sudo without disrupting workloads.
      2. As a cluster admin, I want to update or reconfigure sshd and reload the service without disrupting workloads.
      3. As a cluster admin, I want to remove mirroring rules from an ICSP, ITMS, or IDMS object without disrupting workloads, because the scenario in which this might lead to non-pullable images at an undefined later point in time doesn't apply to me.

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • ...

      Dependencies (internal and external)

      1. ...

      Previous Work (Optional):

      Open questions:

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

              rhn-support-mrussell Mark Russell
              rhn-support-mrussell Mark Russell
              Matthew Werner Matthew Werner
              Derrick Ornelas Derrick Ornelas