Type: Feature Request
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.17
Component/s: OLM
Labels:
- olmv0

Target Version:
None
Activity Type:
Quality / Stability / Reliability
Status Summary:
None
Blocked:
False
Blocked Reason:

None
Products:
None
Hierarchy Progress Bar:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
None
PX Impact Score:
PX Impact Range:
None
PX Priority Data:
None
PX Technical Impact:
None
PX Technical Impact Notes:
None
PX Scheduling Request:
None

Description of problem:

Subscription processing issues are hard to debug

Version-Release number of selected component (if applicable):

How reproducible:

Always. Whenever an issue of this type occurs, it is hard to debug.

Steps to Reproduce:

    1. Have a slow cluster.
    2. Create a CatalogSource, an OperatorGroup and a Subscription to install an operator.
    3. Wait for the OLM Job to time out

Actual results:

1. The hex-string-named pod is gone, so it's not possible at this point to figure out what it was stuck on for 10 minutes.
2. The Conditions in the kubectl describe subscription output are a mess. They are all lumped together, so it's hard to see which field applies to which condition, and it's not possible to tell which ones constitute the current state and which ones are stale.
3. It is not possible to extend the deadline of the job nor have it retain the pod for inspection.

Expected results:

1. Clear, actionable status information about the cause for failure, down to the root cause.
2. Ability to tweak things to extend deadline, retain pods, etc.

Additional info:

In the case I'm facing, the operator index pod looks healthy as early as 2 minutes after start, and the CatalogSource status agrees.

Conditions on the subscription resource are... unclear. Stale? How can the sources be all healthy and one of them unreachable at the same time?

  Conditions:
    Last Transition Time:  2025-07-24T09:29:28Z
    Message:               all available catalogsources are healthy
    Reason:                AllCatalogSourcesHealthy
    Status:                False
    Type:                  CatalogSourcesUnhealthy
    Message:               error using catalogsource stackrox-operator/stackrox-operator-test-index: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 172.30.4.208:50051: connect: connection refused"
    Reason:                ErrorPreventedResolution
    Status:                True
    Type:                  ResolutionFailed
    Reason:                UnpackingInProgress
    Status:                True
    Type:                  BundleUnpacking
    Message:               bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline
    Reason:                BundleUnpackFailed
    Status:                True
    Type:                  BundleUnpackFailed
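
To illustrate what a clearer presentation could look like, here is a minimal sketch that prints one row per condition, so every field is visibly tied to its Type and a missing transition time stands out. It assumes the Subscription has been dumped as JSON (for example with kubectl get subscription -o json) and that conditions use the usual type/status/reason/lastTransitionTime keys. The function name summarize_conditions and the SAMPLE data are illustrative only and not part of OLM; the sample mirrors the conditions quoted above with messages left out.

```python
def summarize_conditions(subscription: dict) -> list[str]:
    """Return one line per condition so each field is tied to its Type."""
    rows = []
    for cond in subscription.get("status", {}).get("conditions", []):
        rows.append(
            "{type:<24} status={status:<6} reason={reason:<26} since={since}".format(
                type=cond.get("type", "?"),
                status=cond.get("status", "?"),
                reason=cond.get("reason", "-"),
                # A condition without a timestamp cannot be judged fresh or stale.
                since=cond.get("lastTransitionTime", "unknown"),
            )
        )
    return rows


# Hypothetical sample mirroring the conditions quoted in this issue.
SAMPLE = {
    "status": {
        "conditions": [
            {
                "type": "CatalogSourcesUnhealthy",
                "status": "False",
                "reason": "AllCatalogSourcesHealthy",
                "lastTransitionTime": "2025-07-24T09:29:28Z",
            },
            {"type": "ResolutionFailed", "status": "True",
             "reason": "ErrorPreventedResolution"},
            {"type": "BundleUnpacking", "status": "True",
             "reason": "UnpackingInProgress"},
            {"type": "BundleUnpackFailed", "status": "True",
             "reason": "BundleUnpackFailed"},
        ]
    }
}

if __name__ == "__main__":
    for line in summarize_conditions(SAMPLE):
        print(line)
```

With this layout it is immediately visible that only CatalogSourcesUnhealthy carries a transition time, which is part of why the other three conditions cannot be classified as current or stale.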

The job decided to remove the pod after 10 minutes:

  Type     Reason            Age   From            Message
  ----     ------            ----  ----            -------
  Normal   SuccessfulCreate  14m   job-controller  Created pod: 9ac57f1c7f9b5705fa3ec9f16aed4e3b7ed23d28d5729f0bea6aeb146fqhtmd
  Normal   SuccessfulDelete  4m7s  job-controller  Deleted pod: 9ac57f1c7f9b5705fa3ec9f16aed4e3b7ed23d28d5729f0bea6aeb146fqhtmd
  Warning  DeadlineExceeded  4m7s  job-controller  Job was active longer than specified deadline
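
As a quick check of the "10 minutes" figure, the pod lifetime can be estimated from the Age column of the events above. This is a rough sketch only: the helper age_to_seconds is a hypothetical name, and the result is approximate because the Age values shown by kubectl are rounded.

```python
import re

# Seconds per unit for the suffixes used in kubectl-style ages.
_UNITS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_to_seconds(age: str) -> int:
    """Convert an age such as '4m7s' or '14m' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", age)
    # Reject input that is not made up entirely of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != age:
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


created = age_to_seconds("14m")    # SuccessfulCreate event
deleted = age_to_seconds("4m7s")   # SuccessfulDelete / DeadlineExceeded events
lifetime = created - deleted       # seconds between creation and deletion

print(f"pod lived for roughly {lifetime // 60}m{lifetime % 60}s")
```

The estimate comes out just under ten minutes, which is consistent with the pod being deleted when the job deadline was exceeded.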

Actual logs.

Slack thread.

Assignee:: Marina Kalinin

Reporter:: Marcin Owsiany

Need Info From:: None

Votes:: 0

Watchers:: 4

Created:: 2025/08/06 4:44 AM

Updated:: 2025/10/21 4:36 PM

Target start:: None

Target end:: None
