  1. Red Hat Advanced Cluster Management
  2. ACM-19546

Enable Observability in OCM with Federated Learning PoC


      Value Statement

      Through collaborative innovation with APAC CTO and Edge Tech CoP, we have built a prototype that onboards Federated Learning (FL) into ACM. Leveraging ACM's existing architecture and APIs, we can support deploying FL runtimes to a multi-cluster environment, dispatching training workloads to remote clusters for local training, and aggregating the trained parameters back.
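
      The aggregation step can be illustrated with a minimal sketch. It assumes federated averaging (FedAvg), where each cluster's parameters are weighted by its local sample count; the ticket does not state which aggregation rule the prototype uses, and the function name and data layout below are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Params, int]]) -> Params:
    """Average per-cluster parameters, weighted by local sample count.

    Assumption: every update carries the same parameter names and shapes.
    """
    if not updates:
        raise ValueError("no updates to aggregate")
    total_samples = sum(count for _, count in updates)
    if total_samples <= 0:
        raise ValueError("total sample count must be positive")

    first_params, _ = updates[0]
    aggregated: Params = {
        name: [0.0] * len(values) for name, values in first_params.items()
    }
    for params, count in updates:
        weight = count / total_samples
        for name, values in params.items():
            for i, value in enumerate(values):
                aggregated[name][i] += weight * value
    return aggregated


if __name__ == "__main__":
    # Two remote clusters report parameters trained on 1 and 3 samples.
    cluster_updates = [
        ({"w": [1.0, 2.0]}, 1),
        ({"w": [3.0, 4.0]}, 3),
    ]
    print(federated_average(cluster_updates))
```

      Weighting by sample count keeps a cluster with little local data from dominating the global model; other aggregation rules would replace only this function.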

      As a next step, one of our goals is to find potential users who will collaborate with us to put the solution into real use and continuously enhance its features based on their feedback, accelerating the solution's evolution to production level.

      The first potential user is Professor Bahman from Western Sydney University. In the two use cases his lab is currently working on, FL is used to improve AI efficiency in applications for satellite Space Situational Awareness and Natural Disaster Management. The FL-related technical challenges, which center on energy awareness and low latency, include:

      1. Communication overhead: both the satellites and the drones that collect disaster data have limited time to transport the trained parameters back to the central training side for aggregation.
      2. Footprint and energy consumption.
      3. Support for reporting metrics back for accuracy and performance evaluation, e.g. training-related metrics such as computation time, accuracy, and training round, and resource usage such as power consumption.
      4. The demo must run locally and come with clear setup guidelines; this requires support for NodePort communication between the server and the clients.
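
      As a rough illustration of the metrics-reporting item, the sketch below models one report per client per training round and a server-side summary. All field and function names are hypothetical; the ticket only names the kinds of metrics (computation time, accuracy, training round, power consumption), not a schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class RoundMetrics:
    """One client's report for one training round (hypothetical schema)."""
    cluster: str
    training_round: int
    computation_seconds: float
    accuracy: float
    # None when the client cannot measure its energy use.
    energy_joules: Optional[float] = None


def to_json(metrics: RoundMetrics) -> str:
    """Serialize a report so it can be sent back to the server side."""
    return json.dumps(asdict(metrics), sort_keys=True)


def summarize(reports: List[RoundMetrics]) -> Dict[str, Optional[float]]:
    """Aggregate reports into totals for accuracy/performance evaluation."""
    if not reports:
        raise ValueError("no reports to summarize")
    measured = [r.energy_joules for r in reports if r.energy_joules is not None]
    return {
        "last_round": max(r.training_round for r in reports),
        "total_computation_seconds": sum(r.computation_seconds for r in reports),
        "best_accuracy": max(r.accuracy for r in reports),
        "total_energy_joules": sum(measured) if measured else None,
    }


if __name__ == "__main__":
    reports = [
        RoundMetrics("cluster1", 1, 2.0, 0.5, energy_joules=10.0),
        RoundMetrics("cluster2", 2, 3.0, 0.75),
    ]
    print(to_json(reports[0]))
    print(summarize(reports))
```

      Keeping energy optional lets clients without power instrumentation still report training metrics.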

      Another use case is 'retail', based on a franchise/dealer store management topology. Requirements for the FL platform include:

      • be edge-ready and edge-friendly
      • be easy for dealers and franchises to set up
      • be independent of the FL framework (Flower, OpenFL, FLARE)
      • support many different segmentations
      • support different FL participants per segmentation, updating different models
      • support full-scale experimentation across segmentations to achieve the best model accuracy and performance
      • test different segmentations and provide metrics on model accuracy
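
      One way to picture the segmentation requirements is as a mapping from each segment to its own model and participant set. The sketch below is only an assumption about how that could be represented; the class, function, store, and model names are all hypothetical and not taken from the prototype.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    """A hypothetical segment: one model trained by one set of participants."""
    name: str
    model: str
    participants: Tuple[str, ...]


def models_per_participant(segments: List[Segment]) -> Dict[str, List[str]]:
    """List, for each participant, the models it helps update."""
    assignments: Dict[str, List[str]] = {}
    for segment in segments:
        for participant in segment.participants:
            assignments.setdefault(participant, []).append(segment.model)
    return assignments


if __name__ == "__main__":
    segments = [
        Segment("urban", "demand-urban", ("store-a", "store-b")),
        Segment("flagship", "demand-flagship", ("store-b",)),
    ]
    # store-b belongs to two segments, so it updates two different models.
    print(models_per_participant(segments))
```

      Because a segment is only data, trying a different segmentation means swapping the list of segments, independent of the FL framework that runs the training.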

       

      Definition of Done for Engineering Story Owner (Checklist)


      Development Complete

      • The code is complete.
      • Functionality is working.
      • Any required downstream Docker file changes are made.

      Tests Automated

      • [ ] Unit/function tests have been automated and incorporated into the
        build.
      • [ ] 100% automated unit/function test coverage for new or changed APIs.

      Secure Design

      • [ ] Security has been assessed and incorporated into your threat model.

      Multidisciplinary Teams Readiness

      Support Readiness

      • [ ] The must-gather script has been updated.

              rh-ee-myan Meng Yan
              yuhe@redhat.com Yuanyuan He
              Hui Chen Hui Chen