-
Spike
-
Resolution: Done
-
Major
-
Product / Portfolio Work
-
3
Value Statement
Through collaborative innovation with the APAC CTO and the Edge Tech CoP, we have built a prototype that onboards Federated Learning (FL) into ACM. By leveraging ACM's existing architecture and APIs, we can easily deploy FL runtimes to a multi-cluster environment, dispatch training workloads to remote clusters for local training, and aggregate the trained parameters back.
As a next step, one of our goals is to find potential users who can collaborate with us to put the solution into real use and continuously enhance its features based on feedback, in order to accelerate the solution's evolution to production level.
The first potential user is Professor Bahman from Western Sydney University. In the two use cases his lab is currently working on, FL is used to improve AI efficiency in applications for Satellite Space Situational Awareness and Natural Disaster Management. The FL-related technical challenges and requirements, with a focus on energy awareness and low latency, include:
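As a rough illustration of the "aggregating trained parameters back" step, the sketch below assumes sample-weighted federated averaging (FedAvg). The prototype's actual aggregation strategy is not stated in this issue, and the function name and parameter layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each model is a mapping from layer name to a flat
# list of weights. Real FL runtimes use their own tensor types.
Params = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Params, int]]) -> Params:
    """Average per-cluster parameters, weighted by local sample count.

    `updates` holds one (parameters, num_samples) pair per remote cluster.
    """
    if not updates:
        raise ValueError("no updates to aggregate")
    total_samples = sum(num_samples for _, num_samples in updates)
    if total_samples <= 0:
        raise ValueError("total sample count must be positive")

    # Start from zeros shaped like the first cluster's parameters.
    first_params, _ = updates[0]
    aggregated: Params = {
        name: [0.0] * len(values) for name, values in first_params.items()
    }

    # Accumulate each cluster's contribution, scaled by its data share.
    for params, num_samples in updates:
        weight = num_samples / total_samples
        for name, values in params.items():
            for i, value in enumerate(values):
                aggregated[name][i] += weight * value
    return aggregated


# Example with two clusters; the second has three times as much data.
cluster_updates = [
    ({"w": [1.0, 2.0]}, 1),
    ({"w": [3.0, 4.0]}, 3),
]
global_params = federated_average(cluster_updates)
```

Weighting by sample count keeps a cluster with little local data from dominating the global model. Other strategies could be swapped in without changing the dispatch/aggregate flow.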
- Communication overhead (both the satellites and the drones used for disaster data collection have limited time to transmit the trained parameters back to the central training side for aggregation)
- Footprint and energy consumption
- Should support reporting back metrics for accuracy and performance evaluation, e.g. training-related metrics such as computation time, accuracy, and training round, and resource usage such as power consumption
- Ensure the demo can run locally and provide clear setup guidelines
- Need to support NodePort communication between the server and clients
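To make the metrics-reporting requirement more concrete, here is a minimal sketch of a per-round report that a client could send back along with its trained parameters. All field names, the JSON encoding, and the helper functions are assumptions for illustration, not part of the prototype.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class RoundMetrics:
    """Hypothetical per-round report from one FL client."""
    client_id: str
    train_round: int
    computation_time_s: float
    accuracy: float
    # Resource usage is optional because not every device can measure it.
    power_consumption_w: Optional[float] = None


def encode_metrics(metrics: RoundMetrics) -> str:
    """Serialize a report as compact JSON to keep the payload small."""
    return json.dumps(asdict(metrics), separators=(",", ":"), sort_keys=True)


def mean_accuracy(reports: List[RoundMetrics], train_round: int) -> float:
    """Average accuracy over all clients that reported for a given round."""
    values = [r.accuracy for r in reports if r.train_round == train_round]
    if not values:
        raise ValueError(f"no reports for round {train_round}")
    return sum(values) / len(values)


reports = [
    RoundMetrics("satellite-1", 1, 12.5, 0.80, power_consumption_w=4.2),
    RoundMetrics("drone-7", 1, 8.0, 0.90),
]
payload = encode_metrics(reports[0])
round_accuracy = mean_accuracy(reports, train_round=1)
```

Keeping the report small matters here because the satellites and drones have limited time to transmit data back for aggregation.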
Another use case is 'retail', with respect to a franchise/dealer store management topology. Requirements for the FL platform include:
- be edge ready/friendly
- easy for dealers & franchise setups
- FL framework (Flower, OpenFL, FLARE) independent
- support many different segmentations
- support different FL participants per segmentation, updating different models
- full-scale experimentation necessary across segmentations to achieve best model accuracy/performance
- test different segmentations and provide metrics on model accuracy.
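The segmentation requirements above can be sketched as follows: stores are grouped into FL participant sets under a chosen segmentation, and segmentations are then compared by a reported accuracy metric. The store attributes, segmentation keys, and accuracy values below are made-up placeholders, and the helpers are hypothetical rather than an existing API.

```python
from collections import defaultdict
from typing import Dict, List


def group_participants(
    stores: List[Dict[str, str]], segmentation: str
) -> Dict[str, List[str]]:
    """Group store IDs by the value of one segmentation attribute.

    Each resulting group would train and update its own model.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for store in stores:
        groups[store[segmentation]].append(store["id"])
    return dict(groups)


def best_segmentation(accuracy_by_segmentation: Dict[str, float]) -> str:
    """Return the segmentation whose experiment reported the best accuracy."""
    if not accuracy_by_segmentation:
        raise ValueError("no segmentation results to compare")
    return max(accuracy_by_segmentation, key=accuracy_by_segmentation.get)


# Placeholder stores with two candidate segmentation attributes.
stores = [
    {"id": "s1", "region": "north", "store_type": "dealer"},
    {"id": "s2", "region": "north", "store_type": "franchise"},
    {"id": "s3", "region": "south", "store_type": "dealer"},
]
by_region = group_participants(stores, "region")
by_type = group_participants(stores, "store_type")

# Placeholder accuracies, as if reported after one experiment per segmentation.
winner = best_segmentation({"region": 0.81, "store_type": 0.86})
```

Because the grouping only deals with participant IDs and reported metrics, it stays independent of the underlying FL framework.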
Definition of Done for Engineering Story Owner (Checklist)
- ...
Development Complete
- The code is complete.
- Functionality is working.
- Any required downstream Docker file changes are made.
Tests Automated
[ ] Unit/function tests have been automated and incorporated into the build.
[ ] 100% automated unit/function test coverage for new or changed APIs.
Secure Design
[ ] Security has been assessed and incorporated into your threat model.
Multidisciplinary Teams Readiness
[ ] Create an informative documentation issue using the [Customer Portal_doc_issue template](https://github.com/stolostron/backlog/issues/new?assignees=&labels=squad%3Adoc&template=doc_issue.md&title=), and ensure doc acceptance criteria is met. Link the development issue to the doc issue.
[ ] Provide input to the QE team, and ensure QE acceptance criteria (established between story owner and QE focal) are met.
Support Readiness
[ ] The must-gather script has been updated.