OCPBUGS-6691 (OpenShift Bugs)

TALM - image precaching fails at PreparingToStart when 1 spoke in a batch is unreachable - openshift-talo-pre-cache namespace is never created on the healthy spoke

    • Bug
    • Resolution: Won't Do
    • 4.10.z
    • TALM Operator
    • Quality / Stability / Reliability
    • Moderate

      Description of problem:

      When TALM 4.10 is used to precache images on multiple spokes in one batch and one of the spokes is unreachable, precaching for the healthy spoke also gets stuck at "PreparingToStart", and the openshift-talo-pre-cache namespace is never created on the healthy spoke.
      
      
      # In the TALM log below, note that there are two spokes, "worker-0" and "worker-1". Once the error occurs for worker-1 (powered off), the controller does not continue with worker-0.
      
      2023-01-26T17:02:41.786Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"PrecacheSpecFromPolicies": {"platformImage":"registry.kni-qe-16.lab.eng.rdu2.redhat.com:5000/openshift-release-dev/ocp-release@sha256:bda60323bbca48e3d1dae1154e074cc2cbda64889c7850d78dea28516c5675fa"}}
      2023-01-26T17:02:41.786Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "NotStarted", "cluster": "worker-1"}
      2023-01-26T17:02:41.786Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "NotStarted", "condition": "entry", "cluster": "worker-1", "nextState": "PreparingToStart"}
      2023-01-26T17:02:41.817Z    INFO    controllers.ClusterGroupUpgrade    [createResourcesFromTemplates]    {"cluster": "worker-1", "template": "precache-ns-delete"}
      I0126 17:02:42.871121       1 request.go:668] Waited for 1.042567232s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/apps.open-cluster-management.io/v1?timeout=32s
      2023-01-26T17:02:46.485Z    INFO    controllers.ClusterGroupUpgrade    [createResourcesFromTemplates]    {"cluster": "worker-1", "template": "view-precache-namespace"}
      2023-01-26T17:02:51.149Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"previousState": "NotStarted", "nextState": "PreparingToStart", "cluster": "worker-1"}
      2023-01-26T17:02:51.149Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "NotStarted", "cluster": "worker-0"}
      2023-01-26T17:02:51.149Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "NotStarted", "condition": "entry", "cluster": "worker-0", "nextState": "PreparingToStart"}
      2023-01-26T17:02:51.170Z    INFO    controllers.ClusterGroupUpgrade    [createResourcesFromTemplates]    {"cluster": "worker-0", "template": "precache-ns-delete"}
      I0126 17:02:52.871687       1 request.go:668] Waited for 1.694950567s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/build.openshift.io/v1?timeout=32s
      2023-01-26T17:02:55.834Z    INFO    controllers.ClusterGroupUpgrade    [createResourcesFromTemplates]    {"cluster": "worker-0", "template": "view-precache-namespace"}
      2023-01-26T17:03:00.496Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"previousState": "NotStarted", "nextState": "PreparingToStart", "cluster": "worker-0"}
      2023-01-26T17:03:00.505Z    INFO    controllers.ClusterGroupUpgrade    Finish reconciling CGU    {"name": "talm-test/generated-cgu-multi-spokes-one-unavailable", "requeueAfter": 30}
      2023-01-26T17:03:30.505Z    INFO    controllers.ClusterGroupUpgrade    Start reconciling CGU    {"name": "talm-test/generated-cgu-multi-spokes-one-unavailable"}
      2023-01-26T17:03:30.605Z    INFO    controllers.ClusterGroupUpgrade    Loaded CGU    {"name": "talm-test/generated-cgu-multi-spokes-one-unavailable", "version": "34775179"}
      2023-01-26T17:03:30.606Z    INFO    controllers.ClusterGroupUpgrade    [reconcilePrecaching]    {"FindStatusCondition  PrecachingDone": "&Condition{Type:PrecachingDone,Status:False,ObservedGeneration:0,LastTransitionTime:2023-01-26 17:02:41 +0000 UTC,Reason:PrecachingNotDone,Message:Precaching is required and not done,}"}
      2023-01-26T17:03:30.606Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "PreparingToStart", "cluster": "worker-1"}
      2023-01-26T17:03:30.609Z    ERROR    controllers.ClusterGroupUpgrade    reconcilePrecaching error    {"error": "[getPreparingConditions] no ManagedClusterView conditions found"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
      2023-01-26T17:03:30.609Z    INFO    controllers.ClusterGroupUpgrade    Finish reconciling CGU    {"name": "talm-test/generated-cgu-multi-spokes-one-unavailable", "requeueRightAway": false}
      2023-01-26T17:03:30.609Z    ERROR    controller-runtime.manager.controller.clustergroupupgrade    Reconciler error    {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "generated-cgu-multi-spokes-one-unavailable", "namespace": "talm-test", "error": "[getPreparingConditions] no ManagedClusterView conditions found"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
      2023-01-26T17:03:30.615Z    INFO    controllers.ClusterGroupUpgrade    Start reconciling CGU    {"name": "talm-test/generated-cgu-multi-spokes-one-unavailable"}
      2023-01-26T17:03:30.716Z    INFO    controllers.ClusterGroupUpgrade    Loaded CGU    {"name": "talm-test/generated-cgu-multi-spokes-one-unavailable", "version": "34775179"}
      2023-01-26T17:03:30.716Z    INFO    controllers.ClusterGroupUpgrade    [reconcilePrecaching]    {"FindStatusCondition  PrecachingDone": "&Condition{Type:PrecachingDone,Status:False,ObservedGeneration:0,LastTransitionTime:2023-01-26 17:02:41 +0000 UTC,Reason:PrecachingNotDone,Message:Precaching is required and not done,}"}
      2023-01-26T17:03:30.716Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "PreparingToStart", "cluster": "worker-1"}
      2023-01-26T17:03:30.720Z    ERROR    controllers.ClusterGroupUpgrade    reconcilePrecaching error    {"error": "[getPreparingConditions] no ManagedClusterView conditions found"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
      2023-01-26T17:03:30.720Z    INFO    controllers.ClusterGroupUpgrade    Finish reconciling CGU    {"name": "talm-test/generated-cgu-multi-spokes-one-unavailable", "requeueRightAway": false}
      2023-01-26T17:03:30.720Z    ERROR    controller-runtime.manager.controller.clustergroupupgrade    Reconciler error    {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "generated-cgu-multi-spokes-one-unavailable", "namespace": "talm-test", "error": "[getPreparingConditions] no ManagedClusterView conditions found"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
      2023-01-26T17:03:30.731Z    INFO    controllers.ClusterGroupUpgrade    Start reconciling CGU    {"name": "talm-test/generated-cgu-multi-spokes-one-unavailable"}
      2023-01-26T17:03:30.831Z    INFO    controllers.ClusterGroupUpgrade    Loaded CGU    {"name": "talm-test/generated-cgu-multi-spokes-one-unavailable", "version": "34775179"}
      2023-01-26T17:03:30.831Z    INFO    controllers.ClusterGroupUpgrade    [reconcilePrecaching]    {"FindStatusCondition  PrecachingDone": "&Condition{Type:PrecachingDone,Status:False,ObservedGeneration:0,LastTransitionTime:2023-01-26 17:02:41 +0000 UTC,Reason:PrecachingNotDone,Message:Precaching is required and not done,}"}
      2023-01-26T17:03:30.831Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "PreparingToStart", "cluster": "worker-1"}
      2023-01-26T17:03:30.835Z    ERROR    controllers.ClusterGroupUpgrade    reconcilePrecaching error    {"error": "[getPreparingConditions] no ManagedClusterView conditions found"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
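
To make the pattern in the log above easier to see, here is a minimal Go sketch that counts how often the precaching FSM evaluates each cluster in a given state. It is a diagnostic helper written for this report, not part of TALM; the names `fsmEntry`, `countFsmVisits`, and `sampleLog` are hypothetical, and the sample contains only a few `[precachingFsm]` lines copied from the log above.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// sampleLog holds a few [precachingFsm] lines copied from the TALM log in this report.
const sampleLog = `2023-01-26T17:02:41.786Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "NotStarted", "cluster": "worker-1"}
2023-01-26T17:02:41.786Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "NotStarted", "condition": "entry", "cluster": "worker-1", "nextState": "PreparingToStart"}
2023-01-26T17:02:51.149Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"previousState": "NotStarted", "nextState": "PreparingToStart", "cluster": "worker-1"}
2023-01-26T17:02:51.149Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "NotStarted", "cluster": "worker-0"}
2023-01-26T17:02:51.149Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "NotStarted", "condition": "entry", "cluster": "worker-0", "nextState": "PreparingToStart"}
2023-01-26T17:03:00.496Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"previousState": "NotStarted", "nextState": "PreparingToStart", "cluster": "worker-0"}
2023-01-26T17:03:30.606Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "PreparingToStart", "cluster": "worker-1"}
2023-01-26T17:03:30.716Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "PreparingToStart", "cluster": "worker-1"}
2023-01-26T17:03:30.831Z    INFO    controllers.ClusterGroupUpgrade    [precachingFsm]    {"currentState": "PreparingToStart", "cluster": "worker-1"}`

// fsmEntry holds the fields of interest from a [precachingFsm] JSON payload.
type fsmEntry struct {
	CurrentState string `json:"currentState"`
	NextState    string `json:"nextState"`
	Cluster      string `json:"cluster"`
}

// countFsmVisits returns, per cluster, how many times the FSM logged a plain
// "currentState" evaluation (transition lines that carry "nextState" are skipped).
func countFsmVisits(log string) map[string]map[string]int {
	visits := map[string]map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "[precachingFsm]") {
			continue
		}
		start := strings.Index(line, "{")
		if start < 0 {
			continue
		}
		var e fsmEntry
		if err := json.Unmarshal([]byte(line[start:]), &e); err != nil {
			continue // not a JSON payload we understand
		}
		if e.Cluster == "" || e.CurrentState == "" || e.NextState != "" {
			continue
		}
		if visits[e.Cluster] == nil {
			visits[e.Cluster] = map[string]int{}
		}
		visits[e.Cluster][e.CurrentState]++
	}
	return visits
}

func main() {
	visits := countFsmVisits(sampleLog)
	for _, cluster := range []string{"worker-1", "worker-0"} {
		fmt.Printf("%s: NotStarted=%d PreparingToStart=%d\n",
			cluster, visits[cluster]["NotStarted"], visits[cluster]["PreparingToStart"])
	}
}
```

On the sample lines, worker-1 is evaluated in PreparingToStart on every reconcile while worker-0 is never evaluated in that state, which matches the observation that the controller does not reach the healthy spoke after the error.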
       

      Version-Release number of selected component (if applicable):

      TALM 4.10.z

      How reproducible:

      100%

      Steps to Reproduce:

      1. Create a CGU with precaching enabled that targets multiple spokes in one batch, with one of the spokes unreachable (for example, powered off)
      

      Actual results:

      Precaching is stuck at PreparingToStart for both spokes, including the healthy one.

      Expected results:

      Precaching succeeds on the healthy spoke.
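
The expected behavior can be illustrated with a minimal Go sketch of a per-cluster loop that records a failure and moves on to the next cluster instead of returning on the first error. This is an assumption about one possible shape of the logic, not the TALM source: `stepFunc`, `advanceAll`, and `errNoConditions` are hypothetical names, and the error text is borrowed from the log in this report.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoConditions mimics the error reported for the unreachable spoke.
var errNoConditions = errors.New("no ManagedClusterView conditions found")

// stepFunc advances precaching for a single cluster (hypothetical stand-in
// for the per-cluster state machine step).
type stepFunc func(cluster string) error

// advanceAll calls step for every cluster in the batch. A failure on one
// cluster is recorded and the loop continues, so healthy clusters still progress.
func advanceAll(clusters []string, step stepFunc) map[string]error {
	failed := map[string]error{}
	for _, cluster := range clusters {
		if err := step(cluster); err != nil {
			failed[cluster] = err
			continue
		}
	}
	return failed
}

func main() {
	var visited []string
	step := func(cluster string) error {
		visited = append(visited, cluster)
		if cluster == "worker-1" { // the powered-off spoke
			return errNoConditions
		}
		return nil
	}

	failed := advanceAll([]string{"worker-1", "worker-0"}, step)
	fmt.Println("visited:", visited)
	fmt.Println("failed clusters:", len(failed))
}
```

With this shape, worker-0 is still visited even though worker-1 fails first; the caller can then decide whether to requeue for the failed clusters only.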

      Additional info:

       

              rhn-support-yliu1 Yang Liu