Red Hat Advanced Cluster Management / ACM-13042

Upgrading OCP stuck because of observability pods


    • MCO Sprint 30, Observability Sprint 31, Observability Sprint 32, Observability Sprint 33, Observability Sprint 34
    • Critical
    • Customer Escalated, Customer Facing, Customer Reported

      Description of problem:

      When doing an upgrade, or even just a node drain, the observability pods can end up not being scheduled correctly:

      I0719 09:33:05.825659       1 drain_controller.go:152] evicting pod open-cluster-management-observability/observability-thanos-receive-default-0
      E0719 09:33:05.837467       1 drain_controller.go:152] error when evicting pods/"observability-thanos-receive-default-0" -n "open-cluster-management-observability" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0719 09:33:10.837663       1 drain_controller.go:152] evicting pod open-cluster-management-observability/observability-thanos-receive-default-0
      E0719 09:33:10.842430       1 drain_controller.go:152] error when evicting pods/"observability-thanos-receive-default-0" -n "open-cluster-management-observability" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0719 09:33:15.842715       1 drain_controller.go:152] evicting pod open-cluster-management-observability/observability-thanos-receive-default-0
      I0719 09:33:15.842757       1 drain_controller.go:182] node master-2.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"observability-thanos-receive-default-0" -n "open-cluster-management-observability": global timeout reached: 1m30s 
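
      The eviction is blocked by the PodDisruptionBudget that covers the Thanos receive StatefulSet. As a quick check (the PDB name below is an assumption, taken to match the StatefulSet name; verify it with the first command):

      > oc -n open-cluster-management-observability get pdb
      > oc -n open-cluster-management-observability describe pdb observability-thanos-receive-default

      If ALLOWED DISRUPTIONS is 0 while one replica is already unavailable, the drain cannot evict the remaining pod and keeps retrying, as seen in the log above.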

      During upgrades, where nodes are drained and rebooted one by one, this happens more frequently.

      > oc -n open-cluster-management-observability get pods -o wide | grep receive-default
      observability-thanos-receive-default-0                     1/1     Running             0          112m    10.129.0.63    master-2.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com   <none>           <none>
      observability-thanos-receive-default-1                     1/1     Running             0          104m    10.128.0.89    master-1.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com   <none>           <none>
      observability-thanos-receive-default-2                     0/1     ContainerCreating   0          5m52s   <none>         master-0.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com   <none>           <none> 

      In this state, one pod has already been scheduled, but onto the wrong node, so it cannot attach its PV (see the events below):

      Events:
        Type     Reason              Age    From                     Message
        ----     ------              ----   ----                     -------
        Normal   Scheduled           7m30s  default-scheduler        Successfully assigned open-cluster-management-observability/observability-thanos-receive-default-2 to master-0.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com
        Warning  FailedAttachVolume  7m30s  attachdetach-controller  Multi-Attach error for volume "pvc-aeceef9e-66ba-4ffa-a5c7-4e23c1ddf50e" Volume is already exclusively attached to one node and can't be attached to another 
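
      To confirm where the volume is actually attached (a minimal check, assuming the storage driver exposes VolumeAttachment objects; the volume name is the one reported in the event above):

      > oc -n open-cluster-management-observability get pvc | grep receive-default
      > oc get volumeattachment | grep pvc-aeceef9e-66ba-4ffa-a5c7-4e23c1ddf50e

      The NODE column of the VolumeAttachment shows which node still holds the attachment, and it will not match the node the pod was scheduled to.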

      As a workaround, you can cordon nodes and delete the stuck pod so that it gets rescheduled onto the correct node; a sketch is shown below.
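
      A minimal sketch of that workaround, using the names from this report (adjust the node and pod to whatever is stuck; the idea is to keep the scheduler away from the node that cannot attach the volume, delete the pod so it is recreated elsewhere, then uncordon):

      > oc adm cordon master-0.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com
      > oc -n open-cluster-management-observability delete pod observability-thanos-receive-default-2
      > oc adm uncordon master-0.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com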

       

      Version-Release number of selected component (if applicable):

      This happened to me on several different upgrades along the path from 4.14 to 4.16.

      How reproducible:

      Steps to Reproduce:

      1.  
      2.  
      3. ...

      Actual results:

      Expected results:

      Additional info:

              mzardab@redhat.com Moad Zardab
              jgato@redhat.com Jose Gato Luis
              Xiang Yin