Red Hat OpenShift Data Science / RHODS-12986

Potential reconciliation failure for rhods-prometheus-operator during 1.33-2.4 upgrade in self-managed

    • Type: Bug
    • Resolution: Obsolete
    • Priority: Normal

      Description of problem:

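      A reconciliation error was encountered during the RHODS 1.33 to 2.4 upgrade on two clusters (a disconnected cluster and the PSI QE cluster). The operator attempts to update spec.selector on the rhods-prometheus-operator Deployment, and the API server rejects the update because that field is immutable: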

      2023-11-23T09:45:37Z    ERROR    Reconciler error    {"controller": "datasciencecluster", "controllerGroup": "datasciencecluster.opendatahub.io", "controllerKind": "DataScienceCluster", "DataScienceCluster": {"name":"default-dsc"}, "namespace": "", "name": "default-dsc", "reconcileID": "0c1a32ca-7ffd-4310-8259-f6baabf3c868", "error": "1 error occurred:\n\t* Deployment.apps \"rhods-prometheus-operator\" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{\"app.kubernetes.io/part-of\":\"model-mesh\", \"app.opendatahub.io/model-mesh\":\"true\", \"k8s-app\":\"rhods-prometheus-operator\"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable\n\n"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
          /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
          /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
          /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
      2023-11-23T09:45:37Z    DEBUG    events    DataScienceCluster instance default-dsc created, but have some failures in component 1 error occurred:
          * Deployment.apps "rhods-prometheus-operator" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/part-of":"model-mesh", "app.opendatahub.io/model-mesh":"true", "k8s-app":"rhods-prometheus-operator"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable 
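      The rejection in the log above follows from the apps/v1 rule that a Deployment's spec.selector cannot change after creation. The sketch below is a simplified stand-in for that check, not the API server's actual code. DESIRED is copied from the error message, on the assumption that the logged value is the selector the operator tried to apply. EXISTING_HYPOTHETICAL is invented for illustration, since the pre-upgrade selector was not recorded in this issue.

```python
from typing import Dict, Tuple

Labels = Dict[str, str]


def selector_update_allowed(existing: Labels, desired: Labels) -> bool:
    """Simplified immutability rule: an update passes only if the selector is unchanged."""
    return existing == desired


def selector_diff(existing: Labels, desired: Labels) -> Tuple[Labels, Labels]:
    """Return (labels added or changed by the update, labels dropped or changed)."""
    added = {k: v for k, v in desired.items() if existing.get(k) != v}
    removed = {k: v for k, v in existing.items() if desired.get(k) != v}
    return added, removed


# Selector taken from the error message in this issue.
DESIRED: Labels = {
    "app.kubernetes.io/part-of": "model-mesh",
    "app.opendatahub.io/model-mesh": "true",
    "k8s-app": "rhods-prometheus-operator",
}

# ASSUMPTION: hypothetical pre-upgrade selector, not taken from the affected clusters.
EXISTING_HYPOTHETICAL: Labels = {"k8s-app": "rhods-prometheus-operator"}

if __name__ == "__main__":
    allowed = selector_update_allowed(EXISTING_HYPOTHETICAL, DESIRED)
    added, removed = selector_diff(EXISTING_HYPOTHETICAL, DESIRED)
    print("update allowed:", allowed)
    print("labels added or changed:", sorted(added))
    print("labels removed or changed:", sorted(removed))
```

      Under these assumptions, any difference between the live selector and the one the operator renders is enough to make every reconcile fail until the Deployment is recreated or the rendered selector matches again.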

      Attempts to reproduce the issue on three other clusters were unsuccessful, and we do not currently know what triggered it in the two affected environments.

      Prerequisites (if any, like setup, operators/versions):

      Upgrading from RHODS 1.33 to RHODS 2.x

      Steps to Reproduce

      Unknown

      Actual results:

      The reconciliation error appears in the operator pod logs and in the DSC conditions, and the rhods-prometheus-operator deployment is not upgraded correctly.
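      To check whether a cluster is affected, the operator log can be scanned for this specific failure. The following is a minimal sketch that assumes the log text has already been saved to a string; the function name and the regular expression are illustrative and are derived only from the two log lines quoted in this issue, so other log formats may not match.

```python
import re
from typing import List

# Matches both the JSON-escaped form (\"name\") and the plain form ("name")
# of the error as it appears in the log excerpt in this issue.
IMMUTABLE_SELECTOR = re.compile(
    r'Deployment\.apps \\?"(?P<name>[^"\\]+)\\?" is invalid: '
    r"spec\.selector:.*field is immutable"
)


def find_immutable_selector_errors(log_text: str) -> List[str]:
    """Return the distinct Deployment names rejected for a selector change."""
    names: List[str] = []
    for line in log_text.splitlines():
        match = IMMUTABLE_SELECTOR.search(line)
        if match and match.group("name") not in names:
            names.append(match.group("name"))
    return names


if __name__ == "__main__":
    sample = (
        'DEBUG events ... * Deployment.apps "rhods-prometheus-operator" is invalid: '
        "spec.selector: Invalid value: v1.LabelSelector{}: field is immutable"
    )
    print(find_immutable_selector_errors(sample))
```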

      Expected results:

      No reconciliation error, deployment is upgraded correctly.

      Reproducibility (Always/Intermittent/Only Once):

      Twice (specific clusters). Further attempts to reproduce have been unsuccessful.

      Build Details:

      RHODS 1.33 / RHODS 2.4 RC3

      Workaround:

      If the issue is encountered, either of the following has resolved it:
      In the disconnected cluster, disabling and then re-enabling the modelmesh component fixed it. It is not clear why this works.
      In the PSI QE cluster, restarting the rhods operator pod fixed it.
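      The first workaround can be sketched as two merge patches against the DataScienceCluster. This example only builds and prints the commands and does not contact a cluster. It assumes that the modelmesh component is exposed at spec.components.modelmeshserving.managementState with the values Removed and Managed; verify that path against the DataScienceCluster definition on the cluster before running anything. The DSC name default-dsc comes from the log in this issue, and the helper names are hypothetical.

```python
import json
from typing import List


def modelmesh_patch(state: str) -> str:
    """Build a JSON merge patch that sets the modelmesh component state."""
    if state not in ("Managed", "Removed"):
        raise ValueError(f"unexpected management state: {state}")
    # ASSUMPTION: field path of the modelmesh component in the DSC spec.
    patch = {"spec": {"components": {"modelmeshserving": {"managementState": state}}}}
    return json.dumps(patch)


def workaround_commands(dsc_name: str = "default-dsc") -> List[str]:
    """Return the commands that disable and then re-enable the component."""
    return [
        f"oc patch datasciencecluster {dsc_name} --type=merge -p '{modelmesh_patch(state)}'"
        for state in ("Removed", "Managed")
    ]


if __name__ == "__main__":
    for command in workaround_commands():
        print(command)
```

      Run the first command, wait for the operator to reconcile, then run the second. Since the root cause is unknown, this reflects only what was observed on one cluster.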

      Additional info:

      Because we could not reproduce the issue again, I'm setting this Jira to normal priority. If it were reproducible, it would be a blocker.

      Assignee: Unassigned
      Reporter: Luca Giorgi (rhn-support-lgiorgi)