- Story
- Resolution: Unresolved
- Normal
OSDFM is a service team whose top priorities are service reliability and availability.
Currently, OSDFM team members are responsible for monitoring two main categories of information on a daily basis:
- Production Environment Monitoring: We track the status of every region in the production environment using several metrics. This includes:
  - The status of all Service Clusters (SCs) in each region
  - The status of all Management Clusters (MCs) under each SC
  - The state of HCPs on each MC/SC/Region, including metrics such as HCP count and installation time SLO
- Pipeline Alert Monitoring (Integration, Stage, and Production): OSDFM manages multiple pipelines used for deployment in the Stage and Production environments and for testing in the Integration and Stage environments. Any failure in these pipelines triggers an alert that requires manual investigation by the OSDFM team.
To reduce the operational effort on the team, we plan to develop an AI-powered bot with the following functionality:
- Slack Integration: All metric anomalies and pipeline alerts will be forwarded to the #osd-fm-alerts Slack channel.
- AI Monitoring & Summarization: The bot will continuously monitor alerts in #osd-fm-alerts, intelligently analyze and summarize the information, and then post a single consolidated message in the #wg-osd-fleet-manager channel.
This will help the team stay informed without being overwhelmed by alert noise, while maintaining visibility and rapid response for critical issues.
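As a rough illustration of the consolidation step, the sketch below collapses a batch of alerts into one message. It is not the planned implementation: the `Alert` fields, the `summarize` function, and the sample alert names are all hypothetical, and simple rule-based grouping stands in for the AI summarization, which this story does not specify. Reading from #osd-fm-alerts and posting to #wg-osd-fleet-manager are deliberately left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All field names are assumptions made for this sketch.
    source: str        # e.g. "pipeline" or "metric"
    environment: str   # e.g. "integration", "stage", "production"
    name: str          # alert or pipeline name
    critical: bool = False


def summarize(alerts):
    """Collapse a batch of alerts into one consolidated message string."""
    if not alerts:
        return "No new alerts."

    # Count repeats of the same (environment, source, name) combination.
    counts = defaultdict(int)
    for alert in alerts:
        counts[(alert.environment, alert.source, alert.name)] += 1

    # Surface critical alerts first so they are not buried in the list.
    critical = sorted({a.name for a in alerts if a.critical})

    lines = [f"{len(alerts)} alert(s) across {len(counts)} distinct issue(s)"]
    if critical:
        lines.append("Critical: " + ", ".join(critical))
    for (env, source, name), n in sorted(counts.items()):
        lines.append(f"- [{env}] {source} {name} x{n}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        Alert("pipeline", "stage", "deploy-a"),
        Alert("pipeline", "stage", "deploy-a"),
        Alert("metric", "production", "hcp-install-slo", critical=True),
    ]
    print(summarize(sample))
```

Keeping the summarizer a pure function of its input makes it easy to test separately from Slack, and lets a model-based summarizer replace it later without changing the surrounding bot.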