OCPSTRAT-2716

[TP or GA] Performant and Highly Reliable Bare Metal Firmware Upgrades

      Feature Overview

      This feature delivers significant enhancements to the OpenShift Container Platform (OCP) Bare Metal firmware upgrade capability, focusing on performance, repeatability, and reliability. By optimizing the upgrade workflow executed by the Bare Metal Operator (BMO), the goal is to reduce overall upgrade downtime and achieve an industry-leading success rate of 95% or higher. This improvement is critical for Cluster Administrators managing production environments, particularly those with strict Service Level Agreements (SLAs), such as in Telco and O-RAN (Open Radio Access Network) deployments.

      Goals

      The primary goal is to enhance the Day-2 operational experience for Bare Metal infrastructure running OpenShift (server and NIC firmware upgrade/configuration). This feature will provide the Cluster Administrator persona with the ability to:

      • Execute Performant Firmware Upgrades: Significantly reduce the time required to complete a firmware upgrade on supported hardware, thus minimizing the maintenance window and impact on commercial traffic.
        • KPI (Key Performance Indicator): Upgrade time for Dell servers must not exceed 30 minutes (expected range: 15-30 minutes).
        • KPI (Key Performance Indicator): Upgrade time for HPE servers must not exceed 60 minutes (expected range: 30-60 minutes).
      • Achieve High Upgrade Success Rate: Ensure the firmware upgrade process is robust, repeatable, and resilient to transient errors, meeting a success rate of 95% or higher across supported hardware models.
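
To make the 95% target concrete, the sketch below shows how a success rate could be checked and how many failed attempts a given number of upgrade attempts can tolerate. It is an illustration only: the function names are hypothetical and are not part of the Bare Metal Operator or any OpenShift API. Exact fractions are used so the 95% boundary is not affected by floating-point rounding.

```python
import math
from fractions import Fraction

# Target taken from the goal above; everything else is illustrative.
TARGET_SUCCESS_RATE = Fraction(95, 100)


def meets_target(succeeded: int, attempted: int,
                 target: Fraction = TARGET_SUCCESS_RATE) -> bool:
    """Return True if succeeded/attempted is at least the target rate."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    if not 0 <= succeeded <= attempted:
        raise ValueError("succeeded must be between 0 and attempted")
    return Fraction(succeeded, attempted) >= target


def max_tolerated_failures(attempted: int,
                           target: Fraction = TARGET_SUCCESS_RATE) -> int:
    """Largest number of failed attempts that still meets the target."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    min_successes = math.ceil(attempted * target)
    return attempted - min_successes


if __name__ == "__main__":
    print(meets_target(95, 100))        # 95 of 100 attempts succeeded
    print(max_tolerated_failures(100))  # failures allowed in 100 attempts
    print(max_tolerated_failures(19))   # small samples tolerate no failure
```

One consequence worth noting: with fewer than 20 attempts, a single failure already drops the rate below 95%, so validation runs need a sufficiently large sample per hardware model.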

      Requirements

      A list of specific needs or objectives that this feature must deliver in order to be considered complete.

      Functional Requirements

      • The underlying mechanism must implement performance optimizations (e.g., parallelization, improved transfer efficiency, reduced reboots) for the firmware upgrade process on supported Dell and HPE hardware.
      • The firmware upgrade process must complete successfully for at least 95% of attempts when operating on supported hardware and using validated firmware packages.
      • The total time to complete the firmware upgrade cycle (measured from initiation to the node being back in service) must meet the defined performance KPIs:

        | KPI/Metric | Definition | Goal | Max acceptable deviation (threshold) |
        |---|---|---|---|
        | Average component upgrade time | Mean time taken to apply a single firmware update file (excluding server reboot/power cycle time). | < 4 min per component (e.g., BIOS, iDRAC, NIC) | < 6 min |
        | Average Server Outage Time (OS Down) | The total time a target server is unavailable to the operating system/hypervisor (from reboot initiation to OS boot completion) during the entire firmware installation cycle. | < 20 min (for typical update bundles requiring a single reboot) | < 30 min (maximum acceptable outage) |
        | Deployment Time (Full Cycle) | The time taken from user initiation to final report generation for an entire fleet deployment (e.g., 100 servers). | < 4 hours (including staging, pre-check, and staggered reboots) | < 7-8 hours |
        | Rollback Success Rate | Percentage of rollback attempts that successfully revert the server to the pre-update state. A rollback is defined as a downgrade to the previous firmware versions and/or settings. | 100% (rollback is a mission-critical safety function) | < 99% |
             
        • With Dell iDRAC + Lifecycle Controller (LC): the average server outage time for a full upgrade is around 15-30 minutes. This is typically faster than HPE because of optimized component sequencing under LC/Redfish, which often requires one fewer reboot.
        • With HPE iLO + SUM/SPP: the average server outage time is around 30-60 minutes. Community-reported times often land in the 45-minute to 1-hour range for a full compliance update using SUM or OneView integration, driven by inventory/compliance checks.
      • The whole procedure must require at most one reboot of the server.
      • The enhancement must integrate seamlessly with the current firmware upgrade procedure defined in OCPSTRAT-1794, maintaining the existing user flow for initiation and monitoring.
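
As an illustration of how the time-based KPIs above could be checked during validation, the sketch below classifies a measured duration against the Goal and threshold columns. It is a minimal sketch, not an existing tool: the class, the dictionary keys, and the result labels are hypothetical. Both bounds are treated as strict ("<"), as written in the table, and the fleet threshold assumes the upper end (8 hours) of the stated "7-8 hours" range. The rollback KPI is left out because it is a percentage, not a duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeKpi:
    """A time-based KPI where lower measured values are better."""
    name: str
    goal_minutes: float       # "Goal" column, strict upper bound
    threshold_minutes: float  # "Max acceptable deviation" column, strict upper bound

    def classify(self, measured_minutes: float) -> str:
        """Return 'meets goal', 'within threshold', or 'violation'."""
        if measured_minutes < 0:
            raise ValueError("measured_minutes must be non-negative")
        if measured_minutes < self.goal_minutes:
            return "meets goal"
        if measured_minutes < self.threshold_minutes:
            return "within threshold"
        return "violation"


# Values copied from the KPI table; the 8-hour fleet threshold is an
# assumption (the table states "< 7-8 hours").
KPIS = {
    "component_upgrade": TimeKpi("Average component upgrade time", 4, 6),
    "server_outage": TimeKpi("Average Server Outage Time (OS Down)", 20, 30),
    "fleet_deployment": TimeKpi("Deployment Time (Full Cycle)", 4 * 60, 8 * 60),
}

if __name__ == "__main__":
    for key, minutes in [("component_upgrade", 3.5),
                         ("server_outage", 25),
                         ("fleet_deployment", 500)]:
        kpi = KPIS[key]
        print(f"{kpi.name}: {minutes} min -> {kpi.classify(minutes)}")
```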

      Non-Functional Requirements

      • Reliability: The enhancement must maintain the existing reliability and fallback procedures established in the current implementation (OCPSTRAT-1794), ensuring a defined recovery path in the event of an unsuccessful upgrade.
      • Performance: All changes must be verified against the specified time-based KPIs for both Dell and HPE platforms.
      • Usability: No changes to the existing Cluster Administrator workflow for firmware upgrades will be introduced, adhering to the procedures defined in OCPSTRAT-1794. The focus is on internal process optimization.
      • Security: The enhancement must adhere to all existing security procedures for firmware image sourcing, integrity checks, and BMC (Baseboard Management Controller) interaction established in OCPSTRAT-1794.
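
When verifying the fleet-level "Deployment Time (Full Cycle)" KPI, it helps to see how much parallelism the 4-hour goal implies. At 20 minutes per server, upgrading 100 servers strictly one after another would take 100 × 20 = 2000 minutes (about 33 hours), so servers have to be processed in concurrent waves. The sketch below estimates the minimum number of servers per wave. It rests on simplifying assumptions that are not stated in this feature: every server takes the same time, waves run back to back, the window bound is treated as inclusive, and `overhead_minutes` (staging, pre-check, reporting) is a hypothetical parameter.

```python
import math


def min_concurrency(fleet_size: int, window_minutes: float,
                    per_server_minutes: float,
                    overhead_minutes: float = 0) -> int:
    """Minimum servers per wave so the whole fleet fits in the window.

    Assumes identical per-server duration and sequential waves; the
    window bound is treated as inclusive for planning purposes.
    """
    if fleet_size <= 0 or per_server_minutes <= 0:
        raise ValueError("fleet_size and per_server_minutes must be positive")
    usable = window_minutes - overhead_minutes
    waves = int(usable // per_server_minutes)  # full waves that fit
    if waves < 1:
        raise ValueError("window too short for even one wave")
    return math.ceil(fleet_size / waves)


if __name__ == "__main__":
    # 100 servers, 4-hour goal, 20 min per server (outage-time goal)
    print(min_concurrency(100, 240, 20))
    # Same fleet at the 30 min maximum acceptable outage
    print(min_concurrency(100, 240, 30))
    # 30 min per server plus 30 min of assumed fixed overhead
    print(min_concurrency(100, 240, 30, overhead_minutes=30))
```

Under these assumptions, roughly 9 to 15 servers would need to be upgraded at the same time, which in practice is limited by how many nodes the cluster can have out of service at once.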

      Hardware Bill of Materials (BoM)

      The performance and reliability targets are specifically focused on the following server models:

      | Vendor | Priority | Models |
      |---|---|---|
      | Dell Technologies | P1 | Dell XR8620t, Dell XR8720t |
      | Hewlett Packard Enterprise (HPE) | P2 | DL110 Gen 11, DL110 Gen 12 |

      User Scenario

      • As a Cluster Administrator, I want to upgrade the firmware on production Dell and HPE servers in my OCP cluster with a success rate of 95% or higher and in the shortest possible time so that I can minimize cluster downtime, ensure regulatory compliance, and limit the impact on commercial traffic during the scheduled maintenance window.

      Open Questions

      • Can the KPIs defined in the Requirements section be achieved?
      • What is the starting point (current value) for each KPI, per hardware vendor?
      • Are there any dependencies that can impact the KPIs?
      • What is the strategy for achieving the KPIs defined in this feature?

      Out of Scope

      •  

      Links

              mzasepa Michal Zasepa
              Avani Bhatt Avani Bhatt
              Derrick Ornelas Derrick Ornelas