OpenShift Bugs / OCPBUGS-49771

Scaling MachineSet from zero with CSI storage requests is failing in OpenShift Container Platform 4


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: 4.16
    • Component/s: Cluster Autoscaler
    • Quality / Stability / Reliability
    • Severity: Low

      Description of problem:

      The use case is autoscaling nodes with local disks, where a CSI driver can provide a StorageClass with capacity once the node is scaled up. An example of such a CSI driver is the LVM Storage Operator: when properly configured, a new node automatically adds its local disk to the StorageClass capacity.
      However, when a MachineSet is configured with a MachineAutoscaler that scales from 0, a new Pod that requests hyperconverged storage from such a CSI driver stays stuck in "Pending".
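
      For illustration, the sketch below shows roughly what a CSIStorageCapacity object published by such a driver looks like once a node with a local disk exists. While the MachineSet is still at zero replicas, no such object covers the future node's topology, so capacity-aware scheduling sees no node with enough free storage. The namespace, StorageClass name and topology key below are assumptions for the example, not values taken from the affected cluster.

      # Hypothetical example: a CSIStorageCapacity object as published by a
      # local-storage CSI driver (e.g. TopoLVM/LVMS) once the node exists.
      # Before scale-up, no such object exists for the future node, so the
      # scheduler and the autoscaler simulation find no node with enough storage.
      apiVersion: storage.k8s.io/v1
      kind: CSIStorageCapacity
      metadata:
        name: csisc-example                        # generated name in practice
        namespace: openshift-storage               # assumed driver namespace
      storageClassName: lvms-vg1                   # assumed LVMS StorageClass name
      nodeTopology:
        matchLabels:
          topology.topolvm.io/node: worker-lvm-0   # assumed topology key and node
      capacity: 100Gi                              # only populated after the node and its disk exist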

      Version-Release number of selected component (if applicable):

          4.16

      How reproducible:

          Most of the time, but not always.

      Steps to Reproduce:

          1. Create an LVM Storage StorageClass by way of an LVMCluster resource. The LVMCluster resource has a nodeSelector targeting nodes created by a MachineSet. The LVM Storage operator creates a StorageClass and the CSI-related objects (CSIStorageCapacity, CSIDriver), although CSIStorageCapacity.capacity is empty. (Example manifests for steps 1-3 are sketched after this list.)
          2. Create a MachineSet whose template.spec.metadata contains a label matching the nodeSelector defined in step 1. Also set the capacity.cluster-autoscaler.kubernetes.io/ephemeral-disk annotation with a disk size on the MachineSet, as explained in the resolution of https://access.redhat.com/solutions/7041119
          3. Create a Pod with a PersistentVolumeClaim pointing to the StorageClass defined in step 1.
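
      The manifests below are a minimal sketch of the resources used in steps 1-3, assuming the LVM Storage operator runs in the openshift-storage namespace, a device class named vg1 (yielding a StorageClass named lvms-vg1), and a placeholder node label lvms-node=true. All names, the disk size and the container image are illustrative; only the PVC name and request size (my-lvm-claim, 1Gi) match the events in the Actual results section below. Cloud-specific MachineSet fields (providerSpec) are omitted.

      # Step 1: LVMCluster with a nodeSelector targeting the autoscaled nodes (sketch).
      apiVersion: lvm.topolvm.io/v1alpha1
      kind: LVMCluster
      metadata:
        name: my-lvmcluster
        namespace: openshift-storage
      spec:
        storage:
          deviceClasses:
          - name: vg1
            default: true
            fstype: xfs
            thinPoolConfig:
              name: thin-pool-1
              sizePercent: 90
              overprovisionRatio: 10
            nodeSelector:
              nodeSelectorTerms:
              - matchExpressions:
                - key: lvms-node
                  operator: In
                  values: ["true"]
      ---
      # Step 2: MachineSet labelling its nodes and carrying the scale-from-zero
      # disk-capacity annotation (see https://access.redhat.com/solutions/7041119).
      apiVersion: machine.openshift.io/v1beta1
      kind: MachineSet
      metadata:
        name: worker-lvm
        namespace: openshift-machine-api
        annotations:
          capacity.cluster-autoscaler.kubernetes.io/ephemeral-disk: "100Gi"
      spec:
        replicas: 0
        selector:
          matchLabels:
            machine.openshift.io/cluster-api-machineset: worker-lvm
        template:
          metadata:
            labels:
              machine.openshift.io/cluster-api-machineset: worker-lvm
          spec:
            metadata:
              labels:
                lvms-node: "true"   # matches the LVMCluster nodeSelector above
            # cloud-specific providerSpec omitted
      ---
      # MachineAutoscaler allowing the MachineSet to scale from zero.
      apiVersion: autoscaling.openshift.io/v1beta1
      kind: MachineAutoscaler
      metadata:
        name: worker-lvm
        namespace: openshift-machine-api
      spec:
        minReplicas: 0
        maxReplicas: 3
        scaleTargetRef:
          apiVersion: machine.openshift.io/v1beta1
          kind: MachineSet
          name: worker-lvm
      ---
      # Step 3: PVC against the LVMS StorageClass and a consumer Pod
      # (the report used an nginx Deployment; a bare Pod is shown for brevity).
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: my-lvm-claim
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: lvms-vg1
        resources:
          requests:
            storage: 1Gi
      ---
      apiVersion: v1
      kind: Pod
      metadata:
        name: nginx
      spec:
        nodeSelector:
          lvms-node: "true"
        containers:
        - name: nginx
          image: nginxinc/nginx-unprivileged   # placeholder image
          volumeMounts:
          - name: data
            mountPath: /data
        volumes:
        - name: data
          persistentVolumeClaim:
            claimName: my-lvm-claim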
          

      Actual results:

      Pod scheduling fails, and the cluster autoscaler does not create Machines to satisfy it.
      
      $ oc get pod nginx-cb7659cfd-2dlqj
      NAME                    READY   STATUS    RESTARTS   AGE
      nginx-cb7659cfd-2dlqj   0/1     Pending   0          42s
      
      $ oc get events
      21s         Normal    NotTriggerScaleUp      pod/nginx-cb7659cfd-2dlqj            pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {kubernetes.io/arch: arm64}, 1 node(s) did not have enough free storage
      35s         Normal    WaitForPodScheduled    persistentvolumeclaim/my-lvm-claim   waiting for pod nginx-cb7659cfd-2dlqj to be scheduled
      44s         Normal    ScalingReplicaSet      deployment/nginx                     Scaled up replica set nginx-cb7659cfd to 1 from 0
      44s         Normal    SuccessfulCreate       replicaset/nginx-cb7659cfd           Created pod: nginx-cb7659cfd-2dlqj
      44s         Warning   FailedScheduling       pod/nginx-cb7659cfd-2dlqj            0/8 nodes are available: 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
      1m33s       Warning   NotEnoughCapacity      persistentvolumeclaim/my-lvm-claim   Requested storage (1Gi) is greater than available capacity on any node ().
      5m          Normal    WaitForFirstConsumer   persistentvolumeclaim/my-lvm-claim   waiting for first consumer to be created before binding
      
      See cluster autoscaler logs below.

       

      Expected results:

      Machine creation is triggered, and the Pod is scheduled and running.

      Additional info:

      workaround
      ==========
      
      It is possible to trigger the autoscaling and correct Pod scheduling by setting CSIDriver spec.storageCapacity=false. However, doing this risks scheduling Pods with storage requests that are actually impossible to satisfy.
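
      For reference, a sketch of applying the workaround, assuming the LVM Storage CSIDriver object is named topolvm.io (verify with oc get csidriver). The operator managing the CSIDriver object may reconcile the field back, in which case the change would need to be re-applied or handled at the operator level.

      # Hypothetical illustration of the workaround; the CSIDriver name is an assumption.
      $ oc get csidriver
      # spec.storageCapacity is mutable in current Kubernetes versions.
      $ oc patch csidriver topolvm.io --type=merge -p '{"spec":{"storageCapacity":false}}'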

        Assignee: mimccune@redhat.com Michael McCune
        Reporter: rhn-support-ekasprzy Emmanuel Kasprzyk (Inactive)
        QA Contact: Paul Rozehnal