- Bug
- Resolution: Unresolved
- Undefined
- None
- odf-4.17, odf-4.16
Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:
Upon storage deployment, the StorageCluster typically utilizes nodes available in a single machine pool. During new deployments and day-two node operations, the StorageCluster can operate across multiple machine pools, but the OSD pod fails to relocate to a new node and gets stuck in Pending.
OSD description
topology.rook.io/rack:DoNotSchedule when max skew 1 is exceeded for selector app in (rook-ceph-osd)
Events:
  Type     Reason            Age  From               Message
  ----     ------            ---- ----               -------
  Warning  FailedScheduling  40s  default-scheduler  0/4 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
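The message above comes from the pod topology spread constraint Rook sets on OSD pods. A quick way to look at the pending pod's constraints and scheduling events on a live cluster (namespace is the default openshift-storage; the pod selector is the one from the message above, the pod name is a placeholder):

# Show the topology spread constraints on the OSD pods
oc -n openshift-storage get pod -l app=rook-ceph-osd -o yaml | grep -A 6 topologySpreadConstraints
# Full description (constraints + scheduling events) of the pending OSD pod
oc -n openshift-storage describe pod <pending-osd-pod>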
StorageCluster nodeTopology:
status:
  ...
  nodeTopologies:
    labels:
      failure-domain.beta.kubernetes.io/region:
      - us-west-2
      failure-domain.beta.kubernetes.io/zone:
      - us-west-2a
      kubernetes.io/hostname:
      - ip-10-0-0-235.us-west-2.compute.internal
      - ip-10-0-0-60.us-west-2.compute.internal
      - ip-10-0-0-71.us-west-2.compute.internal
      - ip-10-0-0-94.us-west-2.compute.internal
      topology.rook.io/rack:
      - rack0
      - rack1
      - rack2
  phase: Ready
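The nodeTopologies block above can be read straight from the StorageCluster status; a minimal check, assuming the default StorageCluster name ocs-storagecluster:

# Print the topology labels the operator has recorded for the cluster
oc -n openshift-storage get storagecluster ocs-storagecluster -o jsonpath='{.status.nodeTopologies.labels}'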
Investigation from kmajumder@redhat.com
If I am not wrong, the new node is ip-10-0-0-145.us-west-2.compute.internal. It should have been labeled as rack1. I will check in the code where the new nodes get their labels from.
oc get nodes -o custom-columns='NAME:.metadata.name,STATUS:.status.conditions[-1].type,TAINTS:.spec.taints[*].key,RACK:.metadata.labels.topology\.rook\.io/rack,INSTANCE_TYPE:.metadata.labels.beta\.kubernetes\.io/instance-type'
NAME                                       STATUS  TAINTS                            RACK   INSTANCE_TYPE
ip-10-0-0-109.us-west-2.compute.internal   Ready   <none>                            rack0  m5.12xlarge
ip-10-0-0-145.us-west-2.compute.internal   Ready   <none>                            rack0  m5.xlarge
ip-10-0-0-149.us-west-2.compute.internal   Ready   node.kubernetes.io/unschedulable  rack1  m5.12xlarge
ip-10-0-0-40.us-west-2.compute.internal    Ready   <none>                            rack2  m5.12xlarge
The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):
ROSA HCP
The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):
Internal
The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):
Does this issue impact your ability to continue to work with the product?
Yes
Is there any workaround available to the best of your knowledge?
Yes; label the node with the rack label manually (see the example below).
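For example, a sketch of the manual workaround, using the new node and the rack value from the investigation above (adjust both to your cluster):

# Relabel the new node into the rack of the cordoned node so the OSD can schedule there
oc label node ip-10-0-0-145.us-west-2.compute.internal topology.rook.io/rack=rack1 --overwrite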
Can this issue be reproduced? If so, please provide the hit rate
Yes, 5/5.
Can this issue be reproduced from the UI?
yes
If this is a regression, please provide more details to justify this:
new deployment type
Steps to Reproduce:
1. Create a machine pool with a node and label it with the "openshift-storage" tag (a command sketch follows this list)
2. Select any node hosting an OSD and cordon it
3. Delete the OSD pod on the cordoned node
4. Verify that all OSD pods are running
5. Verify that rebalancing completes in a reasonable time
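A command-level sketch of these steps; the machine pool name, node and pod names are placeholders, the storage label key is assumed, and the exact rosa flags may vary by CLI version:

# 1. Create a machine pool whose nodes carry the ODF storage label (label key assumed)
rosa create machinepool --cluster <cluster-name> --name odf-extra --replicas 1 \
  --labels 'cluster.ocs.openshift.io/openshift-storage='
# 2. Cordon a node that currently hosts an OSD
oc adm cordon <node-with-osd>
# 3. Delete the OSD pod that was running on the cordoned node
oc -n openshift-storage delete pod <rook-ceph-osd-pod>
# 4. Verify all OSD pods return to Running
oc -n openshift-storage get pods -l app=rook-ceph-osd
# 5. Watch Ceph health while rebalancing completes (requires the toolbox; deployment name assumed)
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status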
The exact date and time when the issue was observed, including timezone details:
Actual results:
The OSD pod is stuck in Pending and the Ceph cluster is not healthy.
Expected results:
The StorageCluster can operate across multiple machine pools, and rebalancing completes in a reasonable time.
Logs collected and log location:
ocs and ocp must-gather: https://url.corp.redhat.com/e7a31de
Logs were collected after the issue was fixed by @Kaustav Majumder.
Additional info:
Should be addressed on ROSA HCP 4.16 and 4.17.
We need to ensure that nodes across different machine pools can be used in a new StorageSystem installation.
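A possible spot check after installation, reusing the custom-columns query from the investigation above (the storage label key is assumed): every ODF-labeled node should carry a rack label regardless of its machine pool.

# List ODF-labeled nodes with their rack assignment
oc get nodes -l cluster.ocs.openshift.io/openshift-storage -o custom-columns='NAME:.metadata.name,RACK:.metadata.labels.topology\.rook\.io/rack'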