  1. Observability and Data Analysis Program
  2. OBSDA-208

Red Hat OpenShift distributed tracing data collection keeps reinstalling

Details

    • Bug
    • Status: New
    • Critical
    • Resolution: Unresolved
    • OpenShift 4.10z
    • None
    • PM Tracing
    • None
    • False
    • None
    • False

    Description

      Red Hat OpenShift distributed tracing data collection keeps reinstalling in an OCP 4.10 cluster.

      Every couple of minutes, messages like the ones below appear in the OLM operator pod logs and a reinstall starts. The reinstall completes successfully within a few seconds, but it starts again a short time later and the cycle keeps repeating. Uninstalling and reinstalling the operator did not help.
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      time="2022-09-15T08:47:05Z" level=warning msg="unhealthy component: waiting for deployment opentelemetry-operator-controller-manager to become ready: deployment \"opentelemetry-operator-controller-manager\" not available: Deployment does not have minimum availability." csv=opentelemetry-operator.v0.56.0-1 id=vFOjy namespace=openshift-operators phase=Succeeded strategy=deployment
      I0915 08:47:05.826465 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"opentelemetry-operator.v0.56.0-1", UID:"cae3195d-7d0a-4aa4-ba79-d8869e5f5888", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"875049442", FieldPath:""}): type: 'Warning' reason: 'ComponentUnhealthy' installing: waiting for deployment opentelemetry-operator-controller-manager to become ready: deployment "opentelemetry-operator-controller-manager" not available: Deployment does not have minimum availability.
      time="2022-09-15T08:47:06Z" level=warning msg="needs reinstall: waiting for deployment opentelemetry-operator-controller-manager to become ready: deployment \"opentelemetry-operator-controller-manager\" not available: Deployment does not have minimum availability." csv=opentelemetry-operator.v0.56.0-1 id=SEDcP namespace=openshift-operators phase=Failed strategy=deployment
      I0915 08:47:06.156419 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"opentelemetry-operator.v0.56.0-1", UID:"cae3195d-7d0a-4aa4-ba79-d8869e5f5888", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"875061474", FieldPath:""}): type: 'Normal' reason: 'NeedsReinstall' installing: waiting for deployment opentelemetry-operator-controller-manager to become ready: deployment "opentelemetry-operator-controller-manager" not available: Deployment does not have minimum availability.
      time="2022-09-15T08:47:06Z" level=info msg="scheduling ClusterServiceVersion for install" csv=opentelemetry-operator.v0.56.0-1 id=rgB1a namespace=openshift-operators phase=Pending
      I0915 08:47:06.367057 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"opentelemetry-operator.v0.56.0-1", UID:"cae3195d-7d0a-4aa4-ba79-d8869e5f5888", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"875061502", FieldPath:""}): type: 'Normal' reason: 'AllRequirementsMet' all requirements found, attempting install
      time="2022-09-15T08:47:06Z" level=warning msg="reusing existing cert opentelemetry-operator-controller-manager-service-cert"
      I0915 08:47:06.731470 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"opentelemetry-operator.v0.56.0-1", UID:"cae3195d-7d0a-4aa4-ba79-d8869e5f5888", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"875061522", FieldPath:""}): type: 'Normal' reason: 'InstallSucceeded' waiting for install components to report healthy
      time="2022-09-15T08:47:07Z" level=info msg="install strategy successful" csv=opentelemetry-operator.v0.56.0-1 id=YETBt namespace=openshift-operators phase=Installing strategy=deployment
      I0915 08:47:07.355798 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-operators", Name:"opentelemetry-operator.v0.56.0-1", UID:"cae3195d-7d0a-4aa4-ba79-d8869e5f5888", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"875061568", FieldPath:""}): type: 'Normal' reason: 'InstallWaiting' installing: waiting for deployment opentelemetry-operator-controller-manager to become ready: deployment "opentelemetry-operator-controller-manager" not available: Deployment does not have minimum availability.
      time="2022-09-15T08:47:08Z" level=info msg="install strategy successful" csv=opentelemetry-operator.v0.56.0-1 id=jYIF+ namespace=openshift-operators phase=Installing strategy=deployment
      time="2022-09-15T08:47:09Z" level=info msg="install strategy successful" csv=opentelemetry-operator.v0.56.0-1 id=yc9DF namespace=openshift-operators phase=Installing strategy=deployment

      waiting for deployment opentelemetry-operator-controller-manager to become ready: deployment "opentelemetry-operator-controller-manager" not available: Deployment does not have minimum availability.
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
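To make the repeating cycle easier to see in a longer log capture, the `key=value` lines in the excerpt above can be reduced to an ordered list of phase changes for the CSV. The following is only a sketch: the helper names (`parse_logfmt`, `phase_transitions`) are made up for illustration, and the sample lines are taken from the excerpt above with the `msg` text shortened.

```python
import shlex

# Sample lines from the OLM operator log above; msg values are shortened here.
SAMPLE_LOG = r"""
time="2022-09-15T08:47:05Z" level=warning msg="unhealthy component: deployment \"opentelemetry-operator-controller-manager\" not available" csv=opentelemetry-operator.v0.56.0-1 id=vFOjy namespace=openshift-operators phase=Succeeded strategy=deployment
time="2022-09-15T08:47:06Z" level=warning msg="needs reinstall: deployment \"opentelemetry-operator-controller-manager\" not available" csv=opentelemetry-operator.v0.56.0-1 id=SEDcP namespace=openshift-operators phase=Failed strategy=deployment
time="2022-09-15T08:47:06Z" level=info msg="scheduling ClusterServiceVersion for install" csv=opentelemetry-operator.v0.56.0-1 id=rgB1a namespace=openshift-operators phase=Pending
time="2022-09-15T08:47:06Z" level=warning msg="reusing existing cert opentelemetry-operator-controller-manager-service-cert"
time="2022-09-15T08:47:07Z" level=info msg="install strategy successful" csv=opentelemetry-operator.v0.56.0-1 id=YETBt namespace=openshift-operators phase=Installing strategy=deployment
time="2022-09-15T08:47:08Z" level=info msg="install strategy successful" csv=opentelemetry-operator.v0.56.0-1 id=jYIF+ namespace=openshift-operators phase=Installing strategy=deployment
"""

CSV_NAME = "opentelemetry-operator.v0.56.0-1"


def parse_logfmt(line):
    """Split one key=value log line into a dict; return {} if it cannot be tokenized."""
    try:
        tokens = shlex.split(line)  # handles quoted values and \" escapes
    except ValueError:
        return {}
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def phase_transitions(lines, csv_name):
    """Return (time, phase) pairs each time the logged phase of csv_name changes."""
    transitions = []
    for line in lines:
        fields = parse_logfmt(line.strip())
        if fields.get("csv") != csv_name or "phase" not in fields:
            continue  # klog event lines and lines for other CSVs are skipped
        if not transitions or transitions[-1][1] != fields["phase"]:
            transitions.append((fields.get("time"), fields["phase"]))
    return transitions


for when, phase in phase_transitions(SAMPLE_LOG.splitlines(), CSV_NAME):
    print(when, phase)
```

On the sample lines this lists Succeeded, Failed, Pending and Installing in that order. Run against a full log, a repeating run of the same phases would confirm the loop and show how often it occurs.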

      The OLM team investigated this under https://issues.redhat.com/browse/OCPBUGS-1407; their findings are below.

      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      I investigated the 30Sep m-g and it looks like it's not an OLM issue, but a problem with the specific operator. For example, registry-redhat-io-openshift4-ose-must-gather-sha256-1902d6dc97c9f6b385e118786cf9c1100af5c26a549deda70ff7e99d1719d80e/namespaces/openshift-logging/operators.coreos.com/clusterserviceversions/opentelemetry-operator.v0.56.0-1.yaml shows that the pod never passes its healthcheck, so OLM keeps trying to reinstall it in a never-ending loop: AllRequirementsMet --> InstallSucceeded --> InstallWaiting --> InstallCheckFailed (minimum avail) --> NeedsReinstall

      There are Prometheus logs showing active alerts for both:

      • opentelemetry-operator-controller-manager failing to reach expected replica count in 15 minutes
      • OOM errors exceeding 5 in 15 minutes

      Since this isn't a full must-gather and only has OLM resources in it, I'm left to speculate that the opentelemetry-operator-controller-manager pod is exceeding its 64MiB memory limit and being killed; or that the pod is failing to respond to its healthcheck and being killed/reinstalled.
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
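One way to tell the two hypotheses above apart is to look at the container statuses of the operator pod: Kubernetes records `OOMKilled` as the termination reason in `status.containerStatuses[].state.terminated` or `lastState.terminated` when a container is killed for exceeding its memory limit. The sketch below checks a pod object in the JSON shape returned by `oc get pod <name> -o json`. The function name is made up for illustration, and the sample pod (container names, restart count) is hypothetical, not data from the affected cluster.

```python
def oom_killed_containers(pod):
    """Return names of containers whose current or previous termination reason is OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for state_key in ("state", "lastState"):
            terminated = status.get(state_key, {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break  # report each container once
    return names


# Hypothetical pod status for illustration only; not taken from the must-gather.
SAMPLE_POD = {
    "metadata": {"name": "opentelemetry-operator-controller-manager-example"},
    "status": {
        "containerStatuses": [
            {
                "name": "manager",
                "restartCount": 12,
                "state": {"running": {}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {
                "name": "kube-rbac-proxy",
                "restartCount": 0,
                "state": {"running": {}},
                "lastState": {},
            },
        ]
    },
}

print(oom_killed_containers(SAMPLE_POD))
```

If this returns an empty list for the real pod while the restart count keeps climbing, the failing healthcheck would be the more likely explanation and the pod events and probe configuration would be the next thing to check.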

      Operator version - 0.56.0-1
      OCP version - 4.10.21

      Logs are uploaded here: https://drive.google.com/drive/folders/19UFtC0FvZm6vwzRKUbBHrFMkqh5JKH_r?usp=sharing

          People

            scohen1@redhat.com Sean Cohen
            rhn-support-alosingh Alok Singh