Story
Resolution: Unresolved
Product / Portfolio Work
Ref. Symptoms / observations / job analysis feature
Context: This step focuses on automating the application of symptom labels to job runs. TRT-2381 updated our cloud function to do this as files are written to the bucket; this story follows up on the TODOs deferred from that work.
Action Items:
Within the job intake cloud function (ci-data-loader):
- // TODO: distinguish retryable errors and don't swallow this one if it's a reader problem we can retry
For the symptom label loader, error handling is a little haphazard, erring on the side of logging rather than returning an error so as not to trigger cloud function retries. What we really want is to distinguish errors that are worth retrying (e.g. a failed read from the bucket) from those that aren't (file content didn't parse). Create an error wrapper with this distinction, return the error up the stack, and decide at the top level whether to retry: if other loaders claimed the file, don't ask for a retry, just log the error and live with the missing symptom analysis; otherwise, retry what is retryable (a sketch of such a wrapper follows this list).
- pkg/common/event.go: // TODO: add metadata from the variant registry and anything else we might filter on
Import the variant registry from sippy (this may involve caching something), apply it to the job name, and record the variants in DerivedJobRunData. Derive release status information from this and include it. This is probably the last metadata that will make sense to add here until we implement compound symptoms (see the DerivedJobRunData sketch below).
- pkg/loader/symptoms/symptom_matcher.go: // TODO: check event characteristics against symptom applicability filters
Having filled out DerivedJobRunData, use its contents to filter which symptoms could apply before actually reading files. Ideally we would extend the symptom definition to allow specifying variant filters, but that may be complicated; it would be good to start with at least the existing time- and release-related filters (see the applicability-filter sketch below).
- Cache DerivedJobRunData per job run in an LRU cache (pkg.go.dev/github.com/db47h/cache/lru; the cache is per-instance, so this optimization is probably not high-priority, but it seems prudent as symptoms grow and more files per job are inspected).
- Add a mutex around cache read/refresh so that the cache still works if we ever enable concurrency within a cloud function instance (see the cache sketch below).
- Enable using sippy-auth as the backend instead of plain sippy (see this comment for tips).
As much as possible, logic for determining labels and writing them should be located in the sippy codebase (and imported by the cloud function), to facilitate reuse by other tools doing the same things in different contexts.
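A minimal sketch of the retryable vs. terminal error split, assuming hypothetical names (loadError, retryableErr, terminalErr, isRetryable, decideRetry) rather than the actual ci-data-loader symbols:

```go
package symptoms

import "errors"

// loadError wraps an underlying failure and records whether a cloud function
// retry could plausibly succeed (e.g. a flaky bucket read) or not (e.g. file
// content that didn't parse).
type loadError struct {
	err       error
	retryable bool
}

func (e *loadError) Error() string { return e.err.Error() }
func (e *loadError) Unwrap() error { return e.err }

// retryableErr marks a failure worth retrying, such as a failed bucket read.
func retryableErr(err error) error { return &loadError{err: err, retryable: true} }

// terminalErr marks a failure a retry would only repeat, such as a parse error.
func terminalErr(err error) error { return &loadError{err: err, retryable: false} }

// isRetryable reports whether any error in the chain was marked retryable.
func isRetryable(err error) bool {
	var le *loadError
	return errors.As(err, &le) && le.retryable
}

// decideRetry shows the top-level policy: only propagate the error (which
// triggers a cloud function retry) when it is retryable and no other loader
// has claimed the file; otherwise log and accept the missing symptom labels.
func decideRetry(err error, claimedByOthers bool, logf func(string, ...any)) error {
	if err == nil {
		return nil
	}
	if isRetryable(err) && !claimedByOthers {
		return err
	}
	logf("skipping symptom labels: %v", err)
	return nil
}
```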
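A sketch of how variant-registry metadata might be recorded in DerivedJobRunData. The field names and both function parameters are assumptions standing in for whatever sippy actually exposes; the real type lives in pkg/common/event.go:

```go
package common

import "time"

// DerivedJobRunData as it might look once variant metadata is added. The
// existing fields are assumed; Variants and ReleaseStatus are the additions.
type DerivedJobRunData struct {
	JobName       string
	Release       string
	StartTime     time.Time
	Variants      map[string]string // e.g. {"Platform": "aws", "Network": "ovn"}
	ReleaseStatus string            // derived from the release, e.g. "development" or "ga"
}

// enrichWithVariants applies a (possibly cached) variant registry lookup to
// the job name and records the result. variantsForJob stands in for whatever
// sippy exposes for this; releaseStatus stands in for the release-status
// derivation.
func enrichWithVariants(
	d *DerivedJobRunData,
	variantsForJob func(jobName string) map[string]string,
	releaseStatus func(release string) string,
) {
	d.Variants = variantsForJob(d.JobName)
	d.ReleaseStatus = releaseStatus(d.Release)
}
```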
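Building on the DerivedJobRunData sketch above (add the standard-library "slices" package to its imports), a sketch of pre-filtering symptoms before any files are read. The SymptomDefinition fields shown are assumptions; time and release filters already exist, and the variant filter is the proposed extension:

```go
// SymptomDefinition applicability filters (hypothetical shape). Zero values
// mean "no constraint".
type SymptomDefinition struct {
	Name     string
	From, To time.Time         // only consider job runs started in this window
	Releases []string          // only consider these releases
	Variants map[string]string // proposed extension: required variant values
}

// applies reports whether a symptom could apply to a job run, using only the
// already-derived metadata, so files are read only for relevant symptoms.
func (s SymptomDefinition) applies(run DerivedJobRunData) bool {
	if !s.From.IsZero() && run.StartTime.Before(s.From) {
		return false
	}
	if !s.To.IsZero() && run.StartTime.After(s.To) {
		return false
	}
	if len(s.Releases) > 0 && !slices.Contains(s.Releases, run.Release) {
		return false
	}
	for k, v := range s.Variants {
		if run.Variants[k] != v {
			return false
		}
	}
	return true
}
```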
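Finally, a sketch of the per-instance, mutex-guarded cache keyed by job run (add "sync" to the imports above). The story suggests pkg.go.dev/github.com/db47h/cache/lru for eviction; this sketch uses a plain map to stay self-contained, since the point is the locking pattern that keeps the cache correct if concurrency within an instance is ever enabled:

```go
// runDataCache holds derived metadata per job run for the lifetime of the
// cloud function instance.
type runDataCache struct {
	mu   sync.Mutex
	data map[string]DerivedJobRunData // keyed by job run ID; an LRU would bound growth
}

// get returns the cached metadata for a job run, computing and caching it on
// a miss. Holding the mutex across derive serializes misses, which is fine
// while instances handle one event at a time, and stays safe if concurrency
// is ever enabled.
func (c *runDataCache) get(runID string, derive func() (DerivedJobRunData, error)) (DerivedJobRunData, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.data == nil {
		c.data = map[string]DerivedJobRunData{}
	}
	if d, ok := c.data[runID]; ok {
		return d, nil
	}
	d, err := derive()
	if err != nil {
		return DerivedJobRunData{}, err
	}
	c.data[runID] = d
	return d, nil
}
```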
split from: TRT-2381 Automate applying symptom labels in BQ job_labels (Closed)