OpenShift Bugs / OCPBUGS-31440

Crashed catalog source pods cannot be removed


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Normal
    • Affects Versions: 4.13.z, 4.14.z, 4.15.z
    • Component: OLM
    • Severity: Critical
    • Sprint: Rasputin OLM Sprint 252

      Description of problem:

      The certified-operators catalog source pod crashed, but OLM failed to remove it automatically because its `status.phase` still reports `Running`.

      jiazha-mac:~ jiazha$ oc get pods 
      NAME                                                              READY   STATUS             RESTARTS       AGE
      ...
      certified-operators-jxnpp                                         0/1     CrashLoopBackOff   35 (47s ago)   5d14h
      community-operators-n55vz                                         1/1     Running            1              5d14h
      marketplace-operator-fc999f7db-p8wgs                              1/1     Running            2 (154m ago)   5d15h
      redhat-marketplace-45mcm                                          1/1     Running            1              5d14h
      redhat-operators-mpvzm                                            1/1     Running            1              5d14h
       
      jiazha-mac:~ jiazha$ omg get pods certified-operators-jxnpp  -o yaml |grep phase
              f:phase: {}
        phase: Running
      
      

      So we should NOT use `pod.Status.Phase` to judge the pod status: a pod whose container is stuck in `CrashLoopBackOff` still reports the `Running` phase, so the check below never fires.

      https://github.com/openshift/operator-framework-olm/blob/master/staging/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go#L613 

      // podFailed checks whether the pod status is in a failed or unknown state, and deletes the pod if so.
      func (c *GrpcRegistryReconciler) podFailed(pod *corev1.Pod) (bool, error) {
          if pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodUnknown {
              logrus.WithField("UpdatePod", pod.GetName()).Infof("catalog polling result: update pod %s failed to start", pod.GetName())
              err := c.removePods([]*corev1.Pod{pod}, pod.GetNamespace())
              if err != nil {
                  return true, errors.Wrapf(err, "error deleting failed catalog polling pod: %s", pod.GetName())
              }
              return true, nil
          }
          return false, nil
      } 
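One possible direction is to inspect the per-container waiting state for a `CrashLoopBackOff` reason instead of relying on the pod-level phase. The sketch below uses simplified local stand-ins for the `corev1` types (hypothetical names, for illustration only; an actual fix would operate on `*corev1.Pod` and its `ContainerStatuses`):

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types involved (illustration only;
// the real types live in k8s.io/api/core/v1).
type ContainerStateWaiting struct{ Reason string }

type ContainerState struct{ Waiting *ContainerStateWaiting }

type ContainerStatus struct{ State ContainerState }

type PodStatus struct {
	Phase             string
	ContainerStatuses []ContainerStatus
}

// podCrashed reports whether any container is stuck in CrashLoopBackOff,
// regardless of the pod-level Phase (which stays "Running" in that case).
func podCrashed(status PodStatus) bool {
	for _, cs := range status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil && w.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}

func main() {
	crashed := PodStatus{
		Phase: "Running", // matches the observed `phase: Running` above
		ContainerStatuses: []ContainerStatus{
			{State: ContainerState{Waiting: &ContainerStateWaiting{Reason: "CrashLoopBackOff"}}},
		},
	}
	healthy := PodStatus{Phase: "Running"}
	fmt.Println(podCrashed(crashed), podCrashed(healthy)) // true false
}
```

A check like this would catch the `certified-operators-jxnpp` pod shown above, whose container is in `CrashLoopBackOff` while the pod phase remains `Running`.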

      Version-Release number of selected component (if applicable):

          4.15.2

      How reproducible:

          always

      Steps to Reproduce:

          1. Install an OCP cluster.
          2. Make a catalog source pod crash.
          3. Check whether the crashed pod is removed.
          

      Actual results:

      The crashed pod is still there.

          

      Expected results:

      The crashed pod should be removed automatically. 

          

      Additional info:

      The must-gather log: https://drive.google.com/file/d/16_tFq5QuJyc_n8xkDFyK83TdTkrsVFQe/view?usp=drive_link  

          

              Assignee: Bryce Palmer (rh-ee-bpalmer)
              Reporter: Jian Zhang (rhn-support-jiazha)