Story
Resolution: Unresolved
The syncMachineSets function in the MachinePool controller is going to be somewhat inefficient at scale, as its matching loop iterates #remoteMS * #generatedMS times. #generatedMS should always be fairly small given we're reconciling a single MachinePool – usually at most equal to the number of AZs (aka failure domains) in the spoke region. However, as currently written, #remoteMS is all the MachineSets on the spoke, which is generally O(#mpools * #msPerPool). In "normal" circumstances, #mpools is single-digit. However, the use case we're seeing that led to #ITN-2024-00101 boosts this to tens or hundreds. At that scale, the number of iterations of this loop can get into the thousands, which can start to matter on a busy hive.
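For context on where the cost comes from, here's a rough sketch of the loop shape described above; the function name, types, and structure are illustrative only, not the actual Hive implementation:

```go
// Illustrative sketch of the per-reconcile matching cost; not the actual Hive code.
// matchGeneratedMachineSets pairs each MachineSet generated for this MachinePool
// with its counterpart on the spoke by name. The nested loop runs
// len(remote) * len(generated) times.
package sketch

import machinev1 "github.com/openshift/api/machine/v1beta1"

func matchGeneratedMachineSets(remote, generated []machinev1.MachineSet) map[string]*machinev1.MachineSet {
	matched := map[string]*machinev1.MachineSet{}
	for i := range remote { // today: every MachineSet on the spoke, O(#mpools * #msPerPool)
		for j := range generated { // MachineSets for this pool, usually <= #failure domains
			if remote[i].Name == generated[j].Name {
				matched[generated[j].Name] = &remote[i]
			}
		}
	}
	return matched
}
```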
To mitigate the cost of this func, we can try a couple of things:
- Make the func more efficient algorithmically (HIVE-2538)
- Minimize the number of objects being processed (this card).
See thread for background. Compare and contrast with HIVE-2540, which aims to reduce the total amount of network traffic via caching, whereas this card aims to reduce the number of objects retrieved for use by syncMachineSets.
The solution could be as simple as adding a filter here that matches our machine-pool name label to the MachinePool being processed (see the sketch below). This should work because we subsequently match to generated MachineSets based on that label anyway. However, we'll also have to figure out another way to discover the network in GCP, which we currently do by introspecting a random remote MachineSet. That works today because the spoke will always have at least one MachineSet for workers, but a filtered list could come back empty for a new or scaled-to-zero pool. I wonder if, instead, we can use the master machine like we do in AWS to default the AMI.
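A hedged sketch of what that filter might look like, assuming the remote MachineSets are listed with a controller-runtime client and that the generated MachineSets carry a machine-pool name label. The label key, helper name, and signature below are assumptions for illustration, not confirmed against the Hive codebase:

```go
package sketch

import (
	"context"

	machinev1 "github.com/openshift/api/machine/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Assumed label key carried by the generated MachineSets; the real key used by
// the MachinePool controller may differ.
const machinePoolNameLabel = "hive.openshift.io/machine-pool"

// listRemoteMachineSetsForPool retrieves only the spoke MachineSets generated
// for the given pool, rather than every MachineSet in openshift-machine-api.
func listRemoteMachineSetsForPool(ctx context.Context, remoteClient client.Client, poolName string) ([]machinev1.MachineSet, error) {
	msList := &machinev1.MachineSetList{}
	if err := remoteClient.List(ctx, msList,
		client.InNamespace("openshift-machine-api"),
		client.MatchingLabels{machinePoolNameLabel: poolName},
	); err != nil {
		return nil, err
	}
	return msList.Items, nil
}
```

With a filter like this, #remoteMS shrinks from all MachineSets on the spoke to just the handful belonging to the pool being reconciled, so the nested loop above stays small regardless of how many MachinePools the cluster has. The remaining open question is the GCP network discovery noted above, since it can no longer rely on the filtered list being non-empty.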