Overview:
Currently the timeout for image prefetching is purely heuristic and is the same for all images. We follow a very simple exponential pattern: the first download attempt has a 30-second timeout, the next one 1 minute, then 2 minutes, and so on.
The problem is that some images are tiny (so we waste time waiting too long on stalled downloads), while others are large and cannot possibly be fetched in less than 30 seconds (so we waste the first one or two attempts).
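For reference, the schedule described above can be written down as a minimal sketch. The function name `backoff_timeout` and its parameters are hypothetical and are not taken from the prefetcher code; only the 30s / 1 min / 2 min progression comes from this ticket.

```python
def backoff_timeout(attempt: int, base_seconds: float = 30.0) -> float:
    """Return the timeout in seconds for a 0-based download attempt.

    Hypothetical helper: the timeout doubles on every attempt,
    giving 30s, 60s, 120s, ... regardless of the image being fetched.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return base_seconds * (2 ** attempt)


# The first four timeouts, identical for every image.
schedule = [backoff_timeout(a) for a in range(4)]
print(schedule)
```

Note that the image size never enters the calculation, which is the limitation this ticket is about.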
One thing that would improve the robustness of prefetching would be to add peer-to-peer communication between prefetcher pods, so that they can exchange information on how long a successful prefetch of a given image took.
This way each prefetcher pod would be able to fine-tune the timeout for each image separately, and thus make more download attempts rather than sit idle while a stalled download waits for an unrealistically high deadline calculated by exponential backoff.
In the failed CI job for the linked ticket, this could have even doubled the number of attempts, increasing the probability of success.
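The per-image tuning proposed above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the names `tuned_timeout` and `attempts_within`, the safety factor of 1.5, the 450-second budget, and the peer-reported durations are all hypothetical, and the peer-to-peer exchange itself is not modelled.

```python
import itertools
from typing import Iterable, Sequence


def tuned_timeout(peer_durations: Sequence[float],
                  safety_factor: float = 1.5,
                  fallback: float = 30.0) -> float:
    """Derive a per-image timeout from durations reported by peers.

    Hypothetical policy: take the slowest successful prefetch seen by
    any peer and add a safety margin. With no reports, fall back to
    the current 30-second starting timeout.
    """
    if not peer_durations:
        return fallback
    return max(peer_durations) * safety_factor


def attempts_within(budget: float, timeouts: Iterable[float]) -> int:
    """Count how many attempts fit in the budget if every one stalls."""
    used = 0.0
    count = 0
    for timeout in timeouts:
        if used + timeout > budget:
            break
        used += timeout
        count += 1
    return count


budget = 450.0  # assumed total time available for prefetching one image

# Current behaviour: 30s, 60s, 120s, 240s, ...
exponential = (30.0 * 2 ** n for n in itertools.count())
# Proposed behaviour: peers reported 20s and 30s for this image.
tuned = itertools.repeat(tuned_timeout([20.0, 30.0]))

print(attempts_within(budget, exponential))
print(attempts_within(budget, tuned))
```

With these made-up numbers, a small image gets more stalled-download retries within the same budget. A large image would instead start with a timeout long enough to succeed on the first attempt.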
Acceptance Criteria:
A list of specific needs or objectives that this task must deliver in order to be considered complete. Complete during Refinement status.