
Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: OpenStack 18.0 Observability Services 1.0
Component/s: PM Tracing
Labels: None

Blocked: False
Blocked Reason: None
Ready: False


Red Hat OpenShift distributed tracing data collection keeps reinstalling in OCP 4.10 cluster

Every couple of minutes, messages like the ones below appear in the OLM operator pod logs and a reinstall of the operator starts. The reinstall completes successfully within a few seconds, but another one starts shortly afterwards and the cycle keeps repeating. Uninstalling and reinstalling the operator did not help.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
time="2022-09-15T08:47:05Z" level=warning msg="unhealthy component: waiting for deployment opentelemetry-operator-controller-manager to become ready: deployment \"opentelemetry-operator-controller-manager\" not available: Deployment does not have minimum availability." csv=opentelemetry-operator.v0.56.0-1 id=vFOjy namespace=openshift-operators phase=Succeeded strategy=deployment
I0915 08:47:05.826465 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"opentelemetry-operator.v0.56.0-1", UID:"cae3195d-7d0a-4aa4-ba79-d8869e5f5888", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"875049442", FieldPath:""}): type: 'Warning' reason: 'ComponentUnhealthy' installing: waiting for deployment opentelemetry-operator-controller-manager to become ready: deployment "opentelemetry-operator-controller-manager" not available: Deployment does not have minimum availability.
time="2022-09-15T08:47:06Z" level=warning msg="needs reinstall: waiting for deployment opentelemetry-operator-controller-manager to become ready: deployment \"opentelemetry-operator-controller-manager\" not available: Deployment does not have minimum availability." csv=opentelemetry-operator.v0.56.0-1 id=SEDcP namespace=openshift-operators phase=Failed strategy=deployment
I0915 08:47:06.156419 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"opentelemetry-operator.v0.56.0-1", UID:"cae3195d-7d0a-4aa4-ba79-d8869e5f5888", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"875061474", FieldPath:""}): type: 'Normal' reason: 'NeedsReinstall' installing: waiting for deployment opentelemetry-operator-controller-manager to become ready: deployment "opentelemetry-operator-controller-manager" not available: Deployment does not have minimum availability.
time="2022-09-15T08:47:06Z" level=info msg="scheduling ClusterServiceVersion for install" csv=opentelemetry-operator.v0.56.0-1 id=rgB1a namespace=openshift-operators phase=Pending
I0915 08:47:06.367057 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"opentelemetry-operator.v0.56.0-1", UID:"cae3195d-7d0a-4aa4-ba79-d8869e5f5888", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"875061502", FieldPath:""}): type: 'Normal' reason: 'AllRequirementsMet' all requirements found, attempting install
time="2022-09-15T08:47:06Z" level=warning msg="reusing existing cert opentelemetry-operator-controller-manager-service-cert"
I0915 08:47:06.731470 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"opentelemetry-operator.v0.56.0-1", UID:"cae3195d-7d0a-4aa4-ba79-d8869e5f5888", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"875061522", FieldPath:""}): type: 'Normal' reason: 'InstallSucceeded' waiting for install components to report healthy
time="2022-09-15T08:47:07Z" level=info msg="install strategy successful" csv=opentelemetry-operator.v0.56.0-1 id=YETBt namespace=openshift-operators phase=Installing strategy=deployment
I0915 08:47:07.355798 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"opentelemetry-operator.v0.56.0-1", UID:"cae3195d-7d0a-4aa4-ba79-d8869e5f5888", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"875061568", FieldPath:""}): type: 'Normal' reason: 'InstallWaiting' installing: waiting for deployment opentelemetry-operator-controller-manager to become ready: deployment "opentelemetry-operator-controller-manager" not available: Deployment does not have minimum availability.
time="2022-09-15T08:47:08Z" level=info msg="install strategy successful" csv=opentelemetry-operator.v0.56.0-1 id=jYIF+ namespace=openshift-operators phase=Installing strategy=deployment
time="2022-09-15T08:47:09Z" level=info msg="install strategy successful" csv=opentelemetry-operator.v0.56.0-1 id=yc9DF namespace=openshift-operators phase=Installing strategy=deployment

waiting for deployment opentelemetry-operator-controller-manager to become ready: deployment "opentelemetry-operator-controller-manager" not available: Deployment does not have minimum availability.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
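
To make the repeating cycle easier to spot in a longer log, the event reasons can be pulled out in order and the `NeedsReinstall` events counted. The following is a minimal sketch, assuming the log keeps the `type: '...' reason: '...'` wording seen in the excerpt above; the function names and the abbreviated `SAMPLE` text are illustrative and not part of any OLM tooling.

```python
import re
from collections import Counter

# Matches the event suffix seen in the excerpt, e.g.
#   ): type: 'Normal' reason: 'NeedsReinstall' installing: ...
EVENT_RE = re.compile(r"type: '(?P<type>\w+)' reason: '(?P<reason>\w+)'")


def event_reasons(log_text):
    """Return the event reasons in the order they appear in the log text."""
    return [match.group("reason") for match in EVENT_RE.finditer(log_text)]


def count_reinstalls(log_text):
    """Count how many NeedsReinstall events the log text contains."""
    return Counter(event_reasons(log_text))["NeedsReinstall"]


# Abbreviated copy of the event lines from the excerpt (messages shortened).
SAMPLE = """\
): type: 'Warning' reason: 'ComponentUnhealthy' installing: waiting for deployment
): type: 'Normal' reason: 'NeedsReinstall' installing: waiting for deployment
): type: 'Normal' reason: 'AllRequirementsMet' all requirements found, attempting install
): type: 'Normal' reason: 'InstallSucceeded' waiting for install components to report healthy
): type: 'Normal' reason: 'InstallWaiting' installing: waiting for deployment
"""

if __name__ == "__main__":
    print(" -> ".join(event_reasons(SAMPLE)))
    print("NeedsReinstall events:", count_reinstalls(SAMPLE))
```

Run against a saved copy of the full OLM operator pod log, the `NeedsReinstall` count gives a rough measure of how often the loop repeats over the captured period.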

The OLM team investigated this under https://issues.redhat.com/browse/OCPBUGS-1407; their findings are below.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I investigated the 30 Sep must-gather and it looks like it's not an OLM issue, but a problem with the specific operator. For example, registry-redhat-io-openshift4-ose-must-gather-sha256-1902d6dc97c9f6b385e118786cf9c1100af5c26a549deda70ff7e99d1719d80e/namespaces/openshift-logging/operators.coreos.com/clusterserviceversions/opentelemetry-operator.v0.56.0-1.yaml shows that the pod never passes its health check, so OLM keeps trying to reinstall it in a never-ending loop: AllRequirementsMet --> InstallSucceeded --> InstallWaiting --> InstallCheckFailed (minimum avail) --> NeedsReinstall

There are Prometheus logs showing active alerts for both:

- opentelemetry-operator-controller-manager failing to reach expected replica count in 15 minutes
- OOM errors exceeding 5 in 15 minutes

Since this isn't a full must-gather and only has OLM resources in it, I'm left to speculate that the opentelemetry-operator-controller-manager pod is exceeding its 64MiB memory limit and being killed; or that the pod is failing to respond to its healthcheck and being killed/reinstalled.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
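
The OOM hypothesis above is not confirmed anywhere in this ticket, but it can be checked from the pod status: Kubernetes records `OOMKilled` as the termination reason under `status.containerStatuses[].state` or `lastState`. The following is a minimal sketch, assuming the pod has been exported as JSON (for example with `oc get pod <name> -o json`); the helper name and the `EXAMPLE_POD` data are hypothetical and hand-written for illustration, not taken from the affected cluster.

```python
def oom_killed_containers(pod):
    """Return the names of containers whose current or previous
    termination reason is OOMKilled, given a pod object as a dict."""
    names = []
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for status in statuses:
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break  # report each container at most once
    return names


# Hypothetical pod status, hand-written for illustration only.
EXAMPLE_POD = {
    "metadata": {"name": "example-controller-manager"},
    "status": {
        "containerStatuses": [
            {
                "name": "manager",
                "restartCount": 7,
                "state": {"running": {}},
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            },
            {
                "name": "kube-rbac-proxy",
                "restartCount": 0,
                "state": {"running": {}},
                "lastState": {},
            },
        ]
    },
}

if __name__ == "__main__":
    print(oom_killed_containers(EXAMPLE_POD))
```

An empty result for the real controller-manager pod would point away from the memory limit and towards the failing health check instead.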

Operator version - 0.56.0-1
OCP version - 4.10.21

Logs are uploaded here: https://drive.google.com/drive/folders/19UFtC0FvZm6vwzRKUbBHrFMkqh5JKH_r?usp=sharing
