-
Epic
-
Resolution: Unresolved
-
Normal
-
None
-
Support dumping incidents from DW
-
5
-
False
-
-
False
-
To Do
-
rhel-arr-cki
Support dumping DW incidents ("issue occurrences" in DW parlance) and their requisites as a KCIDB dataset, detailed enough to support training the Kwai model.
- [ ] Decide on a way to represent the required data in KCIDB I/O
- Support dumping a timestamp range of data, so we can (re)dump in chunks.
- Download the requisite log files and cache them in an LZMA-compressed ZIP file (as has been done so far for Kwai). Optionally, also look up already-downloaded files in an arbitrary list of similar ZIP files (as has been done already, too).
- We'll need the following objects in the dump: `incidents` (duh!), `issues` (to represent culprits/labels), `tests` (to represent output files and "test paths"), and `builds` (to represent architectures). The `checkouts` might not be strictly necessary, but could be useful for debugging and, in theory, in the future.
- Add an `evidence` array attribute to `incidents` (in `misc` to start with), listing all the places in the linked object's log files that point to the issue occurring. To match DW logic, each item in the array signifies the occurrence on its own (in DW any one regex match is sufficient). For now, we're only describing matches in output file contents.
- DW has more precise (even if a bit haphazard) culprit identification than KCIDB. Put the extra classification under `misc` to augment e.g. `harness` when it's specified, at least to start with.
- [x] Implement querying all the necessary data from DW
- A temporary table of filtered incidents (and perhaps something else), plus separate queries for separate object types, seems to be in order.
- [ ] Implement issue matching logic, producing KCIDB data
- Suffer this for now, but eventually get rid of it in favor of simply dumping the data generated by https://gitlab.com/cki-project/datawarehouse/-/issues/587
- [ ] Implement generating and dumping the KCIDB dataset representation
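
To make the `evidence` idea above more concrete, here is a minimal sketch of how one incident could be produced from a regex and a set of output files. The layout of each evidence item (`file`, `start`, `end`) and the helper names `find_evidence` and `make_incident` are assumptions made for illustration only; nothing here is decided, and the actual DW matching logic may differ.

```python
import re


def find_evidence(pattern, output_files):
    """Collect every match of `pattern` in the given output files.

    `output_files` maps a file name to its decoded text contents.
    Each returned item stands on its own as proof of the occurrence,
    mirroring the "any one regex match is sufficient" rule.
    The item layout is hypothetical.
    """
    regex = re.compile(pattern)
    evidence = []
    for name, text in output_files.items():
        for match in regex.finditer(text):
            evidence.append({
                "file": name,
                # Character offsets into the decoded text
                "start": match.start(),
                "end": match.end(),
            })
    return evidence


def make_incident(origin, incident_id, issue_id, issue_version,
                  test_id, evidence):
    """Build a KCIDB-style incident dict with evidence under `misc`."""
    return {
        "id": f"{origin}:{incident_id}",
        "origin": origin,
        "issue_id": issue_id,
        "issue_version": issue_version,
        "test_id": test_id,
        # Assumption: the issue is considered present if anything matched
        "present": bool(evidence),
        "misc": {"evidence": evidence},
    }
```

Keeping `evidence` under `misc` means the dump stays valid against the existing schema while the format is still being worked out.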
Jira: CKI-6405
Jira: CKI-7136