Feature Request
Resolution: Unresolved
Product / Portfolio Work
Proposed title of this feature request
OCM Dynamic Scoring Framework Add-on for RHACM
What is the nature and description of the request?
This request is to productize the upstream "Dynamic Scoring Framework" add-on for Open Cluster Management (OCM) into RHACM, as described at https://github.com/open-cluster-management-io/enhancements/tree/5c92d681b9c3180e5f18c901a996ad8a89a93cad/enhancements/sig-architecture/166-dynamic-scoring-framework-addon .
NOTE: The information below summarizes the proposal; please refer to the upstream issues and enhancement details for any changes.
This framework introduces a mechanism to drive Placement decisions with custom external logic (a scoring API) based on real-time time-series data, in addition to the existing methods that use static labels or resource claims.
The architecture consists of:
- Distributed Agents: A DynamicScoringAgent deployed on each managed cluster gathers real-time metrics (via Prometheus, OpenTelemetry, or another source). The agents run on the managed clusters by design, so any excessive resource consumption is offloaded to the spokes and does not create issues for the hub.
- Pluggable Scoring Logic: The agent sends these metrics to a user-defined Scoring API (internal or external) to calculate a score. This lets users bring their own scoring engine, which can be tuned and tailored to their specific needs and industry requirements.
- Native Integration: The agent updates the existing AddonPlacementScore resource, allowing the standard RHACM Placement API to consume these scores immediately for scheduling decisions.
This allows for a "Hybrid Architecture" where data collection is local (ensuring freshness) but scoring logic can be centralized (saving resources).
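To make the "Native Integration" point concrete, the following is a sketch of how scores flow into scheduling, using the existing OCM `AddOnPlacementScore` and `Placement` APIs. The resource name `dynamic-scoring` and score name `customScore` are illustrative assumptions; the actual names would be defined by the add-on.

```yaml
# Hypothetical score published by the agent into the managed cluster's
# namespace on the hub (resource and score names are assumptions).
apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: AddOnPlacementScore
metadata:
  name: dynamic-scoring
  namespace: cluster1
status:
  scores:
  - name: customScore   # value range is -100 to 100
    value: 82
---
# A standard Placement consuming that score via the AddOn score coordinate.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement-dynamic
  namespace: default
spec:
  numberOfClusters: 1
  prioritizerPolicy:
    mode: Exact
    configurations:
    - scoreCoordinate:
        type: AddOn
        addOn:
          resourceName: dynamic-scoring
          scoreName: customScore
      weight: 1
```

Because the score lands in the existing `AddOnPlacementScore` resource, no changes to the Placement API itself are needed for RHACM to act on it.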
Why does the customer need this? (List the business requirements here)
Customers managing complex, high-density, or specialized workloads have requirements that this feature can assist with:
- Resource Efficiency based on Real-Time Load: Customers need to place workloads on clusters that are currently underutilized, not just clusters that nominally have capacity. For example, dynamically relocating AI workloads based on real-time GPU/CPU power load predictions.
- Cost Optimization: Customers need to calculate placement scores based on variable factors like current electricity rates, hardware efficiency, or operational costs, which requires custom logic outside of Kubernetes.
- Integration with Red Hat data sources: Customers want to use "Smart Scheduling" where the placement decision is informed by trusted external systems within the Red Hat ecosystem. For example, a "Risk Score" or "Compliance Score" generated by Red Hat Lightspeed (Insights) or OpenShift AI could be fetched by the Scoring API and used to inform placement on the best clusters. Tying in with Red Hat ecosystem tooling also allows for easy integration with Red Hat's OpenShift Cluster Manager console (console.redhat.com), taking advantage of and enhancing the functionality already provided in the customer's subscription.
- Integration with 3rd party or Custom AI & LLM-Driven Optimization: Customers want to leverage external Large Language Models (LLMs) or predictive agents to drive "Smart Scheduling." For example, a Scoring API could fetch a "Risk Score" generated by an LLM analyzing complex system logs, or use a predictive model to identify clusters forecasted to have low utilization in the near future. This allows the Placement API to target clusters based on predictive analysis rather than just reactive metrics.
- Integration with MCP servers: Customers are increasingly using natural language and "chatbots" to interact with their multicluster management requirements. The ability to integrate an MCP server of their choice allows a human to interact with the Dynamic Scoring add-on more easily.
- Data Freshness vs. Hub Performance: Customers need accurate, up-to-the-minute scoring without overwhelming the Hub cluster with raw metric data streams. This framework solves this by processing data on the spoke (managed cluster) and only sending the calculated score to the Hub.
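The Scoring API itself is a bring-your-own component, so its logic is entirely up to the customer. Purely as a hedged illustration (the metric names, payload shape, and weights below are assumptions, not the upstream contract), a minimal scoring function might map spoke-local utilization metrics into the -100 to 100 range used by `AddOnPlacementScore`:

```python
# Hypothetical BYO scoring logic: metric keys and weights are illustrative
# assumptions, not part of the upstream Dynamic Scoring Framework contract.

def compute_score(metrics: dict) -> int:
    """Map raw utilization metrics (0.0 = idle, 1.0 = saturated) to the
    -100..100 range used by AddOnPlacementScore; higher = better target."""
    cpu = metrics.get("cpu_utilization", 1.0)
    mem = metrics.get("memory_utilization", 1.0)
    # Weight CPU more heavily than memory (an illustrative choice).
    load = 0.7 * cpu + 0.3 * mem
    # Invert: an idle cluster scores 100, a saturated one scores -100.
    return round(200 * (1.0 - load) - 100)

if __name__ == "__main__":
    # A lightly loaded cluster scores high and attracts new workloads.
    print(compute_score({"cpu_utilization": 0.2, "memory_utilization": 0.4}))
```

In the proposed architecture this calculation runs behind the user-defined Scoring API; only the resulting integer reaches the hub, which is what keeps the raw metric streams off the hub cluster.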
List any affected packages or components.
TBC