Story
Resolution: Unresolved
Product / Portfolio Work
Ref. Symptoms / observations / job analysis feature
Context: This step focuses on automating the application of symptom labels to job runs. TRT-2381 updated our cloud function to do this as files are written to the bucket; this story follows up on the TODOs deferred from that work.
Action Items:
Within the job intake cloud function (ci-data-loader):
- // TODO: distinguish retryable errors and don't swallow this one if it's a reader problem we can retry
For the symptom label loader, error handling is a little haphazard, erring on the side of logging rather than returning an error so as not to trigger cloud function retries. What we really want is to distinguish errors that are worth retrying (e.g. a failed read from the bucket) from those that aren't (file content didn't parse). Create an error wrapper with this distinction, return the error up the stack, and decide at the top level whether to retry: if other loaders claimed the file, don't ask for a retry, just log the error and live with the missing symptom analysis; otherwise, retry what is retryable (a sketch of such a wrapper follows this list).
- pkg/common/event.go: // TODO: add metadata from the variant registry and anything else we might filter on
Import the variant registry from sippy (this may involve caching something), apply it to the job name, and record the variants in DerivedJobRunData. Derive release status information from this and include it. This is probably the last metadata that will make sense to add here until we implement compound symptoms (see the DerivedJobRunData sketch below).
- pkg/loader/symptoms/symptom_matcher.go: // TODO: check event characteristics against symptom applicability filters
Having filled out DerivedJobRunData, use its contents to filter which symptoms could apply before actually reading files. Ideally we would extend the symptom definition to allow specifying variant filters, but that may be complicated; it would be good to start with at least the existing time- and release-related filters (see the applicability-filter sketch below).
- Cache DerivedJobRunData per job run in an LRU cache (pkg.go.dev/github.com/db47h/cache/lru; the cache is per-instance, so this optimization is probably not high-priority, but it seems prudent as symptoms grow and more files per job are inspected).
- Add a mutex around cache read/refresh so that the cache still works if we ever enable concurrency within a cloud function instance (see the cache sketch below).
- Enable using sippy-auth as the backend instead of plain sippy (see this comment for tips).
As much as possible, logic for determining labels and writing them should be located in the sippy codebase (and imported by the cloud function), to facilitate reuse by other tools doing the same things in different contexts.
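A minimal sketch of the retryable vs. terminal error split, assuming hypothetical names (loadError, retryableErr, terminalErr, isRetryable, decideRetry) rather than the actual ci-data-loader symbols:

```go
package symptoms

import "errors"

// loadError wraps an underlying failure and records whether a cloud function
// retry could plausibly succeed (e.g. a flaky bucket read) or not (e.g. file
// content that didn't parse).
type loadError struct {
	err       error
	retryable bool
}

func (e *loadError) Error() string { return e.err.Error() }
func (e *loadError) Unwrap() error { return e.err }

// retryableErr marks a failure worth retrying, such as a failed bucket read.
func retryableErr(err error) error { return &loadError{err: err, retryable: true} }

// terminalErr marks a failure a retry would only repeat, such as a parse error.
func terminalErr(err error) error { return &loadError{err: err, retryable: false} }

// isRetryable reports whether any error in the chain was marked retryable.
func isRetryable(err error) bool {
	var le *loadError
	return errors.As(err, &le) && le.retryable
}

// decideRetry shows the top-level policy: only propagate the error (which
// triggers a cloud function retry) when it is retryable and no other loader
// has claimed the file; otherwise log and accept the missing symptom labels.
func decideRetry(err error, claimedByOthers bool, logf func(string, ...any)) error {
	if err == nil {
		return nil
	}
	if isRetryable(err) && !claimedByOthers {
		return err
	}
	logf("skipping symptom labels: %v", err)
	return nil
}
```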
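A sketch of how variant-registry metadata might be recorded in DerivedJobRunData. The field names and both function parameters are assumptions standing in for whatever sippy actually exposes; the real type lives in pkg/common/event.go:

```go
package common

import "time"

// DerivedJobRunData as it might look once variant metadata is added. The
// existing fields are assumed; Variants and ReleaseStatus are the additions.
type DerivedJobRunData struct {
	JobName       string
	Release       string
	StartTime     time.Time
	Variants      map[string]string // e.g. {"Platform": "aws", "Network": "ovn"}
	ReleaseStatus string            // derived from the release, e.g. "development" or "ga"
}

// enrichWithVariants applies a (possibly cached) variant registry lookup to
// the job name and records the result. variantsForJob stands in for whatever
// sippy exposes for this; releaseStatus stands in for the release-status
// derivation.
func enrichWithVariants(
	d *DerivedJobRunData,
	variantsForJob func(jobName string) map[string]string,
	releaseStatus func(release string) string,
) {
	d.Variants = variantsForJob(d.JobName)
	d.ReleaseStatus = releaseStatus(d.Release)
}
```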
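Building on the DerivedJobRunData sketch above (add the standard-library "slices" package to its imports), a sketch of pre-filtering symptoms before any files are read. The SymptomDefinition fields shown are assumptions; time and release filters already exist, and the variant filter is the proposed extension:

```go
// SymptomDefinition applicability filters (hypothetical shape). Zero values
// mean "no constraint".
type SymptomDefinition struct {
	Name     string
	From, To time.Time         // only consider job runs started in this window
	Releases []string          // only consider these releases
	Variants map[string]string // proposed extension: required variant values
}

// applies reports whether a symptom could apply to a job run, using only the
// already-derived metadata, so files are read only for relevant symptoms.
func (s SymptomDefinition) applies(run DerivedJobRunData) bool {
	if !s.From.IsZero() && run.StartTime.Before(s.From) {
		return false
	}
	if !s.To.IsZero() && run.StartTime.After(s.To) {
		return false
	}
	if len(s.Releases) > 0 && !slices.Contains(s.Releases, run.Release) {
		return false
	}
	for k, v := range s.Variants {
		if run.Variants[k] != v {
			return false
		}
	}
	return true
}
```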
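Finally, a sketch of the per-instance, mutex-guarded cache keyed by job run (add "sync" to the imports above). The story suggests pkg.go.dev/github.com/db47h/cache/lru for eviction; this sketch uses a plain map to stay self-contained, since the point is the locking pattern that keeps the cache correct if concurrency within an instance is ever enabled:

```go
// runDataCache holds derived metadata per job run for the lifetime of the
// cloud function instance.
type runDataCache struct {
	mu   sync.Mutex
	data map[string]DerivedJobRunData // keyed by job run ID; an LRU would bound growth
}

// get returns the cached metadata for a job run, computing and caching it on
// a miss. Holding the mutex across derive serializes misses, which is fine
// while instances handle one event at a time, and stays safe if concurrency
// is ever enabled.
func (c *runDataCache) get(runID string, derive func() (DerivedJobRunData, error)) (DerivedJobRunData, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.data == nil {
		c.data = map[string]DerivedJobRunData{}
	}
	if d, ok := c.data[runID]; ok {
		return d, nil
	}
	d, err := derive()
	if err != nil {
		return DerivedJobRunData{}, err
	}
	c.data[runID] = d
	return d, nil
}
```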
split from: TRT-2381 Automate applying symptom labels in BQ job_labels (Closed)