Bug
Resolution: Unresolved
Major
None
4.17, 4.16.z
Important
None
Rejected
False
Release Note Not Required
In Progress
Description of problem:
[AWS-EBS-CSI-Driver] allocatable volumes count incorrect in csinode for AWS vt1*/g4* instance types
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-16-033047
How reproducible:
Always
Steps to Reproduce:
1. Install an OpenShift cluster on AWS using instance type "vt1.3xlarge"/"g4ad.xlarge"/"g4dn.xlarge".
2. Check the csinode allocatable volumes count:

$ oc get csinode ip-10-0-53-225.ec2.internal -ojsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
26
# g4ad.xlarge -> 25
# g4dn.xlarge -> 25
# vt1.3xlarge -> 26

$ oc get no/ip-10-0-53-225.ec2.internal -oyaml | grep 'instance-type'
    beta.kubernetes.io/instance-type: vt1.3xlarge
    node.kubernetes.io/instance-type: vt1.3xlarge

3. Create a StatefulSet with PVCs (which use the EBS CSI storageclass), with nodeAffinity to the same node, and set the replicas to the max allocatable volumes count to verify that the csinode allocatable volumes count is correct and all the pods become Running (a verification sketch follows these steps).

# Test data
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: statefulset-vol-limit
spec:
  serviceName: "my-svc"
  replicas: 26
  selector:
    matchLabels:
      app: my-svc
  template:
    metadata:
      labels:
        app: my-svc
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - ip-10-0-53-225.ec2.internal # Make all volumes attach to the same node
      containers:
      - name: openshifttest
        image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
        volumeMounts:
        - name: data
          mountPath: /mnt/storage
      tolerations:
      - key: "node-role.kubernetes.io/master"
        effect: "NoSchedule"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      #storageClassName: gp3-csi
      resources:
        requests:
          storage: 1Gi
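A minimal verification sketch, assuming the node name from step 2: it compares the allocatable count that the EBS CSI driver advertises in the CSINode object with the number of VolumeAttachment objects actually attached to that node.

NODE=ip-10-0-53-225.ec2.internal    # node under test (from step 2)

# Attach capacity advertised by the EBS CSI driver for this node
oc get csinode "$NODE" \
  -ojsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'

# Number of EBS volumes the attach/detach controller has actually attached to it
oc get volumeattachment \
  -ojsonpath='{range .items[*]}{.spec.attacher}{" "}{.spec.nodeName}{"\n"}{end}' \
  | grep -cF "ebs.csi.aws.com $NODE"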
Actual results:
In step 3, some pods are stuck in "ContainerCreating" status because their volumes are stuck in the attaching state and cannot be attached to the node.
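A rough sketch of how the stuck pods and attachments can be inspected (the pod label comes from step 3; substitute a real pod name where indicated):

# Pods from the StatefulSet that never reached Running (ContainerCreating pods report phase Pending)
oc get pods -l app=my-svc --field-selector status.phase=Pending -o wide

# VolumeAttachment objects that never reached attached=true
oc get volumeattachment \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,ATTACHED:.status.attached \
  | grep false

# FailedAttachVolume / FailedMount events for one stuck pod
oc describe pod <stuck-pod-name>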
Expected results:
In step 3 all the pods with PVCs should become "Running", and in step 2 the csinode allocatable volumes count should be correct:
-> g4ad.xlarge allocatable count should be 24
-> g4dn.xlarge allocatable count should be 24
-> vt1.3xlarge allocatable count should be 24
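For reference, one plausible way to arrive at the expected value of 24, assuming the shared budget of 28 attachment slots on Nitro instances and assumed per-instance device counts (the breakdown below is an illustration, not taken from this bug or from the driver source):

TOTAL=28    # assumed shared attachment slots on Nitro instance types

# g4dn.xlarge: 1 primary ENI + 1 root EBS volume + 1 instance-store NVMe + 1 GPU (assumed)
echo "g4dn.xlarge: $((TOTAL - 1 - 1 - 1 - 1))"    # 24

# g4ad.xlarge: 1 primary ENI + 1 root EBS volume + 1 instance-store NVMe + 1 GPU (assumed)
echo "g4ad.xlarge: $((TOTAL - 1 - 1 - 1 - 1))"    # 24

# vt1.3xlarge: 1 primary ENI + 1 root EBS volume + 2 media accelerators (assumed)
echo "vt1.3xlarge: $((TOTAL - 1 - 1 - 2))"        # 24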
Additional info:
... attach or mount volumes: unmounted volumes=[data12 data6], unattached volumes=[data12 data6], failed to process volumes=[]: timed out waiting for the condition

06-25 17:51:23.680 Warning FailedAttachVolume 4m1s (x13 over 14m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-d08d4133-f589-4aa3-bbef-f988058c419a" : rpc error: code = Internal desc = Could not attach volume "vol-0aa138f453d414ec3" to node "i-09d532f5155b3c05d": attachment of disk "vol-0aa138f453d414ec3" failed, expected device to be attached but was attaching

06-25 17:51:23.681 Warning FailedMount 3m40s (x3 over 10m) kubelet Unable to attach or mount volumes: unmounted volumes=[data6 data12], unattached volumes=[data12 data6], failed to process volumes=[]: timed out waiting for the condition
...
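The events above can be gathered again with something like the following; the AWS-side check uses the volume ID from the FailedAttachVolume event (AWS CLI credentials and region are assumed to be configured):

# Recent attach/mount failures in the test namespace
oc get events --sort-by=.lastTimestamp | grep -E 'FailedAttachVolume|FailedMount'

# Attachment state of the volume on the AWS side
aws ec2 describe-volumes --volume-ids vol-0aa138f453d414ec3 \
  --query 'Volumes[0].Attachments'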
links to: RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update