Bug
Resolution: Done
Problem
The wheels collection task (backend/collector/tasks/wheels.py) is experiencing two related issues:
- High RAM consumption: The collector accumulates all artifacts and drops in memory before committing to the database
- Late database commits: Database writes only happen at the very end of the task, meaning partial progress is lost if the collector crashes
Root Cause
The WheelsCollector.collect_wheels() method (in backend/collector/core/wheels_collector.py:491-573) accumulates ALL artifacts and drops in memory: it initializes all_artifacts = [] and all_drops = [], then loops through ALL releases, appending to these lists.
For products with hundreds of releases, each with multiple architectures, this creates thousands of objects in memory. Each artifact contains large JSON fields:
- dependency_graph (full package dependency graph)
- constraints_file (all packages with versions)
- build_sequence_summary (build metadata)
These fields can be 50-200KB+ per artifact, leading to significant memory consumption when processing all releases at once.
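A back-of-envelope estimate shows how quickly this adds up. All figures below are illustrative assumptions drawn from the ranges above, not measurements:

```python
# Rough peak-memory estimate for the accumulate-everything approach.
# Every number here is an assumption for illustration, not a measurement.
releases = 300             # "hundreds of releases"
artifacts_per_release = 6  # typically 4-8, one per architecture
avg_json_kb = 125          # dependency_graph + constraints_file + summary, 50-200KB+

peak_mib = releases * artifacts_per_release * avg_json_kb / 1024
print(f"~{peak_mib:.0f} MiB of artifact JSON held before the final commit")
# prints "~220 MiB of artifact JSON held before the final commit"
```

Even with conservative inputs, the accumulated JSON alone approaches the current 512Mi container limit before Python object overhead is counted.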
Impact
- OOM (Out of Memory) errors in the wheels-collector Job
- Increased resource limits needed (currently 512Mi/1000m for wheels collectors)
- Lost work if collector crashes before final commit
- No partial progress saved during long-running collections
Proposed Solution
Implement batch processing by release instead of accumulating everything:
- Add a batch_callback parameter to collect_wheels() that is invoked with each release's artifacts/drops as soon as they are built, so the caller can commit them immediately
- Update sync_wheels_collections_task() to provide a callback that performs bulk writes per-release
- This reduces peak memory usage from "all releases" to "single release" (typically 4-8 artifacts per release for different architectures)
- Enables incremental commits so partial progress is saved
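The proposed flow could be sketched roughly as follows. This is a hedged sketch only: the real collect_wheels() signature in wheels_collector.py differs, and build_release is a hypothetical stand-in for the per-release build logic:

```python
from typing import Callable, List, Optional, Tuple

Artifact = dict  # stand-in types for illustration
Drop = dict

def collect_wheels(
    releases: List[str],
    build_release: Callable[[str], Tuple[List[Artifact], List[Drop]]],
    batch_callback: Optional[Callable[[List[Artifact], List[Drop]], None]] = None,
) -> Tuple[List[Artifact], List[Drop]]:
    """Sketch of the batched flow; real signatures in wheels_collector.py differ."""
    all_artifacts: List[Artifact] = []
    all_drops: List[Drop] = []
    for release in releases:
        artifacts, drops = build_release(release)
        if batch_callback is not None:
            # Hand this release's objects to the caller immediately;
            # nothing is retained across loop iterations.
            batch_callback(artifacts, drops)
        else:
            # Legacy path: accumulate everything for one final commit.
            all_artifacts.extend(artifacts)
            all_drops.extend(drops)
    return all_artifacts, all_drops
```

Keeping the accumulate path behind the optional parameter preserves backward compatibility for any caller that does not pass a callback.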
Files to Modify
- backend/collector/core/wheels_collector.py: Add batch_callback support
- backend/collector/tasks/wheels.py: Implement per-release bulk write callback
- backend/collector/tests/test_tasks.py: Update tests for new batch behavior
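On the task side, the per-release callback could look like the sketch below. make_release_committer, session, and bulk_insert are hypothetical stand-ins for whatever DB layer tasks/wheels.py actually uses, shown only to illustrate the commit-per-release shape:

```python
def make_release_committer(session):
    """Build a batch_callback that durably commits one release's rows.

    `session` is a hypothetical DB handle with bulk_insert()/commit();
    substitute the project's real persistence API.
    """
    def commit_release(artifacts, drops):
        # Bulk-insert this release's rows, then commit so progress
        # survives a crash on a later release.
        session.bulk_insert(artifacts)
        session.bulk_insert(drops)
        session.commit()
    return commit_release
```

The task would construct this once and pass it as batch_callback, giving one transaction per release instead of one giant transaction at the end.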
Benefits
- Reduced peak memory usage (a 10-50x reduction, depending on the number of releases)
- Partial progress saved (database updated after each release)
- Lower resource limits needed in dispatcher.py
- More resilient to crashes and timeouts