Type: Bug
Resolution: Unresolved
Priority: Normal
Affects Versions: ACM 2.10.0, ACM 2.11.Z
Sprints: MCO Sprint 30, Observability Sprint 31, Observability Sprint 32, Observability Sprint 33, Observability Sprint 34
Severity: Critical
Labels: Customer Escalated, Customer Facing, Customer Reported
Description of problem:
When doing an upgrade, or just draining a node, observability pods may not get rescheduled correctly:
I0719 09:33:05.825659 1 drain_controller.go:152] evicting pod open-cluster-management-observability/observability-thanos-receive-default-0
E0719 09:33:05.837467 1 drain_controller.go:152] error when evicting pods/"observability-thanos-receive-default-0" -n "open-cluster-management-observability" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0719 09:33:10.837663 1 drain_controller.go:152] evicting pod open-cluster-management-observability/observability-thanos-receive-default-0
E0719 09:33:10.842430 1 drain_controller.go:152] error when evicting pods/"observability-thanos-receive-default-0" -n "open-cluster-management-observability" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0719 09:33:15.842715 1 drain_controller.go:152] evicting pod open-cluster-management-observability/observability-thanos-receive-default-0
I0719 09:33:15.842757 1 drain_controller.go:182] node master-2.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"observability-thanos-receive-default-0" -n "open-cluster-management-observability": global timeout reached: 1m30s
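The evictions are blocked by a PodDisruptionBudget in the observability namespace. To confirm which PDB is involved and how many disruptions it currently allows (the PDB name in the second command is illustrative; take the real one from the list output):

# list the PDBs in the namespace and check ALLOWED DISRUPTIONS
oc -n open-cluster-management-observability get pdb
# inspect the one guarding the receive StatefulSet (name is an assumption)
oc -n open-cluster-management-observability describe pdb observability-thanos-receive-default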
During upgrades, where nodes are drained and rebooted one by one, this happens more frequently.
> oc -n open-cluster-management-observability get pods -o wide | grep receive-default
observability-thanos-receive-default-0   1/1   Running             0   112m    10.129.0.63   master-2.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com   <none>   <none>
observability-thanos-receive-default-1   1/1   Running             0   104m    10.128.0.89   master-1.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com   <none>   <none>
observability-thanos-receive-default-2   0/1   ContainerCreating   0   5m52s   <none>        master-0.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com   <none>   <none>
In this deployment there is one pod that was already scheduled, but onto the wrong node, and it cannot attach its PV:
Events:
  Type     Reason              Age    From                     Message
  ----     ------              ----   ----                     -------
  Normal   Scheduled           7m30s  default-scheduler        Successfully assigned open-cluster-management-observability/observability-thanos-receive-default-2 to master-0.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com
  Warning  FailedAttachVolume  7m30s  attachdetach-controller  Multi-Attach error for volume "pvc-aeceef9e-66ba-4ffa-a5c7-4e23c1ddf50e" Volume is already exclusively attached to one node and can't be attached to another
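The Multi-Attach error means the RWO volume is still attached to the node the pod ran on before the drain. You can check which node still holds the attachment through the cluster-scoped VolumeAttachment objects (the PVC ID comes from the event above):

# the PV column in the output carries the pvc-... volume name
oc get volumeattachment | grep pvc-aeceef9e-66ba-4ffa-a5c7-4e23c1ddf50e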
As a workaround, you can play with cordoning nodes and deleting pods until the stuck pod is rescheduled correctly; see the sketch below.
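A sketch of that workaround, using the node and pod names from the outputs above (adjust them to the actual stuck pod, and to whichever node the PV is still attached to):

# keep the scheduler away from the node the pod was mis-scheduled onto
oc adm cordon master-0.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com
# delete the stuck pod so the StatefulSet controller recreates it elsewhere
oc -n open-cluster-management-observability delete pod observability-thanos-receive-default-2
# once the pod is Running and its PV is attached, allow scheduling again
oc adm uncordon master-0.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com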
Version-Release number of selected component (if applicable):
This happened to me on several different upgrades along the path from OCP 4.14 to 4.16.
How reproducible:
Steps to Reproduce:
- ...