Red Hat Advanced Cluster Management / ACM-13042

Upgrading OCP stuck because of observability pods


    • MCO Sprint 30, Observability Sprint 31, Observability Sprint 32, Observability Sprint 33, Observability Sprint 34
    • Critical
    • Customer Escalated, Customer Facing, Customer Reported

      Description of problem:

      When doing an upgrade, or even just a node drain, the observability pods can end up not being scheduled correctly:

      I0719 09:33:05.825659       1 drain_controller.go:152] evicting pod open-cluster-management-observability/observability-thanos-receive-default-0
      E0719 09:33:05.837467       1 drain_controller.go:152] error when evicting pods/"observability-thanos-receive-default-0" -n "open-cluster-management-observability" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0719 09:33:10.837663       1 drain_controller.go:152] evicting pod open-cluster-management-observability/observability-thanos-receive-default-0
      E0719 09:33:10.842430       1 drain_controller.go:152] error when evicting pods/"observability-thanos-receive-default-0" -n "open-cluster-management-observability" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0719 09:33:15.842715       1 drain_controller.go:152] evicting pod open-cluster-management-observability/observability-thanos-receive-default-0
      I0719 09:33:15.842757       1 drain_controller.go:182] node master-2.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"observability-thanos-receive-default-0" -n "open-cluster-management-observability": global timeout reached: 1m30s 
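
      The eviction is blocked by the PodDisruptionBudget that covers the Thanos receive StatefulSet. As a quick check (the PDB name below is an assumption, taken to match the StatefulSet name; verify it with the first command):

      > oc -n open-cluster-management-observability get pdb
      > oc -n open-cluster-management-observability describe pdb observability-thanos-receive-default

      If ALLOWED DISRUPTIONS is 0 while one replica is already unavailable, the drain cannot evict the remaining pod and keeps retrying, as seen in the log above.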

      During upgrades, where nodes are drained and rebooted one by one, this happens more frequently.

      > oc -n open-cluster-management-observability get pods -o wide | grep receive-default
      observability-thanos-receive-default-0                     1/1     Running             0          112m    10.129.0.63    master-2.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com   <none>           <none>
      observability-thanos-receive-default-1                     1/1     Running             0          104m    10.128.0.89    master-1.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com   <none>           <none>
      observability-thanos-receive-default-2                     0/1     ContainerCreating   0          5m52s   <none>         master-0.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com   <none>           <none> 

      In this state, one pod has already been scheduled, but onto the wrong node, so it cannot attach its PV (see the events below):

      Events:
        Type     Reason              Age    From                     Message
        ----     ------              ----   ----                     -------
        Normal   Scheduled           7m30s  default-scheduler        Successfully assigned open-cluster-management-observability/observability-thanos-receive-default-2 to master-0.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com
        Warning  FailedAttachVolume  7m30s  attachdetach-controller  Multi-Attach error for volume "pvc-aeceef9e-66ba-4ffa-a5c7-4e23c1ddf50e" Volume is already exclusively attached to one node and can't be attached to another 
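
      To confirm where the volume is actually attached (a minimal check, assuming the storage driver exposes VolumeAttachment objects; the volume name is the one reported in the event above):

      > oc -n open-cluster-management-observability get pvc | grep receive-default
      > oc get volumeattachment | grep pvc-aeceef9e-66ba-4ffa-a5c7-4e23c1ddf50e

      The NODE column of the VolumeAttachment shows which node still holds the attachment, and it will not match the node the pod was scheduled to.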

      As a workaround, you can cordon nodes and delete the stuck pod so that it gets rescheduled onto the correct node; a sketch is shown below.
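
      A minimal sketch of that workaround, using the names from this report (adjust the node and pod to whatever is stuck; the idea is to keep the scheduler away from the node that cannot attach the volume, delete the pod so it is recreated elsewhere, then uncordon):

      > oc adm cordon master-0.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com
      > oc -n open-cluster-management-observability delete pod observability-thanos-receive-default-2
      > oc adm uncordon master-0.hub-2.el8k.se-lab.eng.rdu2.dc.redhat.com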

       

      Version-Release number of selected component (if applicable):

      This happened to me on several different upgrades along the path from 4.14 to 4.16.

      How reproducible:

      Steps to Reproduce:

      1.  
      2.  
      3. ...

      Actual results:

      Expected results:

      Additional info:

              mzardab@redhat.com Moad Zardab
              jgato@redhat.com Jose Gato Luis
              Xiang Yin