- Epic
- Resolution: Unresolved
- Normal
- OpenShift AI as a workload running in OpenShift on OpenStack (dev-preview)
- BU Product Work
- To Do
- OCPSTRAT-1383 - Feature - OpenShift AI on OpenStack (dev-preview)
- 33% To Do, 0% In Progress, 67% Done
- L
Goal
Running OpenShift AI as a workload in OpenShift on OpenStack combines the best of both worlds: the robust, scalable infrastructure of OpenStack with the powerful container orchestration and AI/ML capabilities of OpenShift. This setup offers a flexible, secure, and efficient environment for deploying and managing AI workloads, enabling organizations to innovate rapidly while maintaining control over their infrastructure.
It would also provide a serious alternative for customers interested in migrating their infrastructure from VMware to OpenStack to lower licensing costs (see this article for background).
Why is this important?
Scalability and Flexibility
OpenStack provides a highly scalable and flexible infrastructure as a service (IaaS) platform, which can dynamically allocate resources as needed. OpenShift AI, running on OpenStack, can take advantage of this scalability to handle varying AI workloads efficiently. This flexibility allows for easy scaling of computational resources, storage, and networking to meet the demands of AI models, which often require significant resources for training and inference.
Resource Optimization
OpenShift, as a Kubernetes-based platform, excels in container orchestration, enabling efficient resource utilization and management. When combined with OpenStack, OpenShift AI can optimize the usage of underlying infrastructure resources through containerization and orchestration, ensuring that workloads are running efficiently and cost-effectively.
Enhanced Security
Both OpenStack and OpenShift offer robust security features. OpenStack provides secure multi-tenancy, network isolation, and fine-grained access controls. OpenShift adds to this with features like role-based access control (RBAC), security contexts, and integrated CI/CD pipelines that can include security checks. Running AI workloads on this combined platform can thus benefit from enhanced security, which is crucial for sensitive data and applications.
Hybrid Capabilities
Running OpenShift on OpenStack gives organizations a hybrid capability: they can leverage the best of both IaaS and PaaS cloud resources, optimizing costs, performance, and compliance.
AI workloads can run on both VMs and containers within the same infrastructure.
Infrastructure as Code and Automation
OpenStack and OpenShift both support infrastructure as code (IaC) and automation, which are critical for managing AI/ML workflows that require repeatable and consistent environments. Automation tools can provision infrastructure, deploy applications, and manage lifecycles, thereby reducing operational complexity and improving efficiency.
Support for Diverse Workloads
AI and ML workloads often involve diverse computing needs, from CPU-intensive tasks to GPU-accelerated processing. OpenStack's support for heterogeneous hardware, including GPUs, combined with OpenShift's capability to manage and schedule these resources efficiently, ensures that AI workloads can be optimized for performance and cost.
Community and Ecosystem
Both OpenStack and OpenShift benefit from strong open-source communities and ecosystems. This means access to a wealth of resources, plugins, integrations, and community support, which can accelerate development and deployment of AI solutions.
Vendor Independence
Using OpenStack and OpenShift allows organizations to avoid vendor lock-in, providing the freedom to choose from a variety of hardware, software, and service providers. This independence is particularly valuable for organizations looking to maintain flexibility and control over their AI infrastructure.
Compliance and Governance
Many industries have strict compliance and governance requirements. OpenStack's private cloud capabilities combined with OpenShift's operational controls can help organizations meet regulatory requirements by providing data sovereignty, control over data residency, and comprehensive auditing capabilities.
Scenarios
Retail Analytics and Personalized Marketing
RetailMart, a large international retail chain, wants to leverage artificial intelligence and machine learning to enhance its operations. The company aims to improve customer experience through personalized marketing, inventory management, and sales forecasting. RetailMart used to be a VMware customer and wants to deploy a new infrastructure on OpenStack to reduce licensing costs.
Challenges:
- Data Volume and Variety: Handling and processing vast amounts of heterogeneous data.
- Scalability: Need to scale resources up and down based on AI/ML training and inference workloads.
- Resource Optimization: Efficient use of compute and storage resources to minimize costs.
- Security and Compliance: Ensuring data privacy and meeting regulatory requirements.
- Hybrid Cloud Strategy: Utilizing both on-premises and cloud resources for optimal performance and cost.
Solution: Running OpenShift AI on OpenStack
Infrastructure Setup:
- OpenStack Deployment: RetailMart deploys OpenStack in its on-premises data centers to create a private cloud environment. OpenStack provides the necessary IaaS capabilities, allowing dynamic allocation of compute, storage, and networking resources, and replaces the VMware products previously used to manage the IaaS.
- OpenShift Installation: On top of the OpenStack infrastructure, RetailMart installs OpenShift to manage containerized applications, such as AI/ML workloads.
Data Ingestion and Preprocessing:
- Data Collection: Data from various sources (e.g., sales transactions, customer interactions) is collected and stored in Ceph RADOS Gateway, the object storage service.
- Data Processing: Using OpenShift, RetailMart deploys containerized data processing applications (e.g., Apache Spark) to clean and preprocess the data. These applications can scale out to process large datasets efficiently.
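The cleaning step above can be illustrated with a minimal, pure-Python sketch. It stands in for the containerized Spark jobs the epic describes; the record fields (`customer_id`, `amount`) are illustrative assumptions, not RetailMart's actual schema:

```python
from statistics import median

def clean_transactions(rows):
    """Drop malformed records and impute missing amounts with the median.

    Each row is a dict like {"customer_id": "...", "amount": float | None}.
    Field names are hypothetical, chosen only for this sketch.
    """
    # Discard records with no customer identifier at all.
    valid = [r for r in rows if r.get("customer_id")]
    # Impute missing amounts with the median of the known ones.
    known = [r["amount"] for r in valid if r.get("amount") is not None]
    fill = median(known) if known else 0.0
    return [
        {**r, "amount": r["amount"] if r.get("amount") is not None else fill}
        for r in valid
    ]
```

At production scale this logic would run as a distributed Spark job scheduled by OpenShift; the sketch only shows the shape of the transformation.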
Model Training:
- Resource Allocation: OpenShift AI dynamically allocates resources from OpenStack, including GPU instances for computationally intensive model training tasks.
- Training Jobs: Machine learning models (e.g., recommendation systems, demand forecasting models) are developed and trained using frameworks like TensorFlow and PyTorch. These models are containerized and run on OpenShift, which manages the scheduling and execution of these jobs.
- Autoscaling: During peak training periods, OpenShift can scale up additional compute resources from OpenStack to ensure timely completion of training jobs.
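The autoscaling decision in the bullets above can be sketched as a simple capacity calculation. This is a toy stand-in for the cluster autoscaler's logic, assuming pending training jobs are blocked on GPU capacity; every name and parameter here is illustrative:

```python
import math

def gpu_nodes_needed(pending_gpu_jobs, gpus_per_job,
                     gpus_per_node, current_nodes, max_nodes):
    """Return how many extra GPU worker nodes to request from OpenStack.

    A simplified version of the scale-up decision made when training
    jobs sit Pending for lack of GPUs.
    """
    gpus_required = pending_gpu_jobs * gpus_per_job
    nodes_required = math.ceil(gpus_required / gpus_per_node)
    # Never exceed the quota granted by the OpenStack project.
    target = min(current_nodes + nodes_required, max_nodes)
    return max(target - current_nodes, 0)
```

The cap against `max_nodes` mirrors the OpenStack-side quota: bursting stops when the project's compute quota is exhausted, which is exactly the situation hybrid bursting (below in the scenario) is meant to relieve.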
Monitoring and Management:
- Observability: OpenShift provides monitoring tools (e.g., Prometheus, Grafana) to track the performance of AI workloads and underlying infrastructure.
- Security: OpenShift and OpenStack's security features (e.g., RBAC, network policies, encryption) ensure that sensitive customer data is protected and compliance requirements are met.
Hybrid Cloud Integration:
- Bursting to other clouds: During seasonal peaks (e.g., Black Friday sales), RetailMart leverages OpenShift and OpenStack hybrid cloud capabilities to burst workloads to another OpenShift cluster integrated with OpenStack. This ensures that there are enough resources to handle the increased demand without over-provisioning the on-premises infrastructure.
5G Network Optimization and Predictive Maintenance
5GConnect, a leading telecommunications operator, is rolling out its 5G network across multiple regions. The company aims to provide high-speed, reliable connectivity to millions of customers, including individual consumers and enterprise clients. To ensure optimal performance and minimize downtime, 5GConnect plans to leverage AI and machine learning for network optimization and predictive maintenance.
Challenges:
- Network Complexity: Managing the complex and highly dynamic 5G network with numerous base stations and a diverse range of connected devices.
- Performance Optimization: Continuously optimizing network parameters to ensure high performance and low latency.
- Predictive Maintenance: Identifying potential issues and performing maintenance before failures occur to minimize downtime.
- Data Management: Handling and analyzing large volumes of real-time data from network devices and sensors.
- Scalability: Scaling AI/ML workloads to handle the dynamic demands of the network.
Solution: Running OpenShift AI on OpenStack
Infrastructure setup:
- OpenStack Deployment: 5GConnect deploys OpenStack in its data centers to create a flexible, scalable private cloud infrastructure. OpenStack provides the foundational IaaS capabilities needed for dynamic resource allocation. The platform already supports Fast Datapath features such as DPDK, SR-IOV, CPU partitioning, and hugepages to provide low latency to the workloads.
- OpenShift Installation: On top of the OpenStack infrastructure, 5GConnect installs OpenShift to manage containerized AI/ML applications.
Data Collection and Ingestion:
- Sensor and Device Data: Data from various network devices, including base stations, routers, and IoT sensors, is collected in real-time. This data includes metrics like signal strength, bandwidth usage, latency, and hardware status.
- Storage Solutions: The data is stored in Ceph, ensuring scalable and reliable storage that can handle the massive influx of real-time data.
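A minimal sketch of the kind of pre-aggregation such an ingestion pipeline might apply before the data lands in Ceph. The sample layout (`device_id`, epoch-second timestamp, latency in ms) and the 60-second window are assumptions for illustration, not part of 5GConnect's actual pipeline:

```python
from collections import defaultdict

def rollup_latency(samples, window_s=60):
    """Aggregate raw (device_id, timestamp, latency_ms) samples into
    per-device, per-window average latencies.

    Bucketing by integer division of the timestamp keeps the rollup
    cheap enough to run inline on the ingestion path.
    """
    buckets = defaultdict(list)
    for device_id, ts, latency_ms in samples:
        buckets[(device_id, ts // window_s)].append(latency_ms)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}
```

Rolling raw samples up into windows like this reduces the write volume hitting the object store while preserving the signal the AI models train on.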
AI Model Training:
- Resource Allocation: OpenShift AI allocates resources from OpenStack, including GPU-accelerated instances, for training complex AI models.
- Training Workloads: Machine learning models for network optimization and predictive maintenance are developed and trained using frameworks like TensorFlow and PyTorch. These models are trained on historical network performance data and failure logs.
Network Optimization:
- Dynamic Parameter Adjustment: AI models analyze network performance data and recommend adjustments to parameters like power levels, frequency bands, and antenna configurations to optimize performance.
- Traffic Management: AI-driven traffic management algorithms are deployed to dynamically allocate network resources, ensuring high quality of service (QoS) for different types of traffic.
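The "dynamic parameter adjustment" idea above can be made concrete with a toy heuristic: compare one cell's load with its neighbors' and nudge transmit power accordingly. This is only a sketch under assumed thresholds; a real deployment would use a model trained on network performance data:

```python
def recommend_power_adjustment(cell_load, neighbor_loads, step_db=1.0):
    """Recommend a transmit-power change (in dB) for one base station.

    Loads are fractions in [0, 1]. If this cell is much busier than its
    neighbors, shrink its coverage slightly so traffic shifts to them,
    and vice versa. The 0.2 threshold and 1 dB step are illustrative.
    """
    if not neighbor_loads:
        return 0.0
    avg_neighbor = sum(neighbor_loads) / len(neighbor_loads)
    if cell_load > avg_neighbor + 0.2:
        return -step_db   # overloaded: reduce power, offload users
    if cell_load < avg_neighbor - 0.2:
        return +step_db   # underused: grow coverage
    return 0.0
```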
Predictive Maintenance:
- Anomaly Detection: AI models continuously monitor network data to detect anomalies that might indicate potential failures. When an anomaly is detected, alerts are generated for the maintenance team.
- Maintenance Scheduling: Predictive maintenance models recommend optimal times for maintenance activities, balancing the need to minimize downtime with operational efficiency.
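The anomaly-detection step can be illustrated with the simplest possible approach, a z-score threshold over a metric series. This is a sketch only; the trained models described above would be far richer:

```python
from statistics import mean, stdev

def detect_anomalies(values, threshold=3.0):
    """Return indices of values more than `threshold` standard
    deviations from the mean of the series.

    A flagged index would trigger an alert for the maintenance team.
    """
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # perfectly flat series: nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]
```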
Acceptance Criteria
- This is a dev-preview, so we'll provide documentation, with possible workarounds.
- It might not be directly consumable by customers because of potentially unsupported workarounds, but it will help define a roadmap for supporting the use cases.
- Out of this EPIC, we might create other EPICs for our team or feed other teams' backlogs if something turns out to be missing.
- We don't expect to deliver code, but that could be a stretch goal if time permits and we actually need to patch something.
Dependencies (internal and external)
- none (so far).
Previous Work (Optional):
- relates to OSPRH-10884: As an administrator, I want to deploy AI/ML model in OCP AI running on top of my OpenStack cloud
- Refinement