-
Bug
-
Resolution: Done-Errata
-
Major
-
4.15
-
Important
-
No
-
Rejected
-
False
-
Description of problem:
[AWS-EBS-CSI-Driver] allocatable volumes count incorrect in csinode for AWS arm instance types "c7gd.2xlarge , m7gd.xlarge"
Version-Release number of selected component (if applicable):
4.15.3
How reproducible:
Always
Steps to Reproduce:
1. Create an OpenShift cluster on AWS with instance types "c7gd.2xlarge, m7gd.xlarge"
2. Check the csinode allocatable volumes count (see the sketch after these steps)
3. Create a statefulset with 1 pvc mounted per pod and replicas equal to the max allocatable volumes count, with nodeAffinity:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: statefulset-vol-limit
spec:
  serviceName: "my-svc"
  replicas: $VOL_COUNT_LIMIT
  selector:
    matchLabels:
      app: my-svc
  template:
    metadata:
      labels:
        app: my-svc
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - $NODE_NAME
      containers:
      - name: openshifttest
        image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
        volumeMounts:
        - name: data
          mountPath: /mnt/storage
      tolerations:
      - key: "node-role.kubernetes.io/master"
        effect: "NoSchedule"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: gp3-csi
      resources:
        requests:
          storage: 1Gi

4. Check that all replicas of the statefulset become Ready.
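For step 2, the allocatable count can also be read programmatically. Below is a minimal Go sketch using client-go; the kubeconfig path is a placeholder and the node name is taken from the output further down, both used purely for illustration. The same value can be read with oc get csinode <node> -o yaml, as shown in the actual results.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed kubeconfig path, for illustration only.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	csiNode, err := clientset.StorageV1().CSINodes().Get(context.TODO(), "ip-10-0-22-114.ec2.internal", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Print the allocatable volume count reported by the EBS CSI driver on this node.
	for _, d := range csiNode.Spec.Drivers {
		if d.Name == "ebs.csi.aws.com" && d.Allocatable != nil && d.Allocatable.Count != nil {
			fmt.Printf("%s allocatable volumes: %d\n", d.Name, *d.Allocatable.Count)
		}
	}
}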
Actual results:
In step 4, the statefulset's 26th replica (pod) is stuck in ContainerCreating because the volume could not be attached to the node (the csinode allocatable volumes count is incorrect):

$ oc get no/ip-10-0-22-114.ec2.internal -oyaml|grep 'instance'
    beta.kubernetes.io/instance-type: m7gd.xlarge
    node.kubernetes.io/instance-type: m7gd.xlarge

$ oc get csinode/ip-10-0-22-114.ec2.internal -oyaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  annotations:
    storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/azure-file,kubernetes.io/cinder,kubernetes.io/gce-pd,kubernetes.io/vsphere-volume
  creationTimestamp: "2024-03-20T02:16:34Z"
  name: ip-10-0-22-114.ec2.internal
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: ip-10-0-22-114.ec2.internal
    uid: acb9a153-bb9b-4c4a-90c1-f3e095173ce2
  resourceVersion: "19281"
  uid: 12507a73-898d-441a-a844-41c7de290b5b
spec:
  drivers:
  - allocatable:
      count: 26
    name: ebs.csi.aws.com
    nodeID: i-00ec014c5676a99d2
    topologyKeys:
    - topology.ebs.csi.aws.com/zone

$ export VOL_COUNT_LIMIT="26"
$ export NODE_NAME="ip-10-0-22-114.ec2.internal"
$ envsubst < sts-vol-limit.yaml | oc apply -f -
statefulset.apps/statefulset-vol-limit created

$ oc get sts
NAME                    READY   AGE
statefulset-vol-limit   25/26   169m

$ oc describe po/statefulset-vol-limit-25
Name:             statefulset-vol-limit-25
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-10-0-22-114.ec2.internal/10.0.22.114
Start Time:       Wed, 20 Mar 2024 18:56:08 +0800
Labels:           app=my-svc
                  apps.kubernetes.io/pod-index=25
                  controller-revision-hash=statefulset-vol-limit-7db55989f7
                  statefulset.kubernetes.io/pod-name=statefulset-vol-limit-25
Annotations:      k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.128.2.53/23"],"mac_address":"0a:58:0a:80:02:35","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.128.0.0...
Status:           Pending
IP:
IPs:              <none>
Controlled By:    StatefulSet/statefulset-vol-limit
Containers:
  openshifttest:
    Container ID:
    Image:          quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /mnt/storage from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zkwqx (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-statefulset-vol-limit-25
    ReadOnly:   false
  kube-api-access-zkwqx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason              Age                  From                     Message
  ----     ------              ----                 ----                     -------
  Normal   Scheduled           167m                 default-scheduler        Successfully assigned default/statefulset-vol-limit-25 to ip-10-0-22-114.ec2.internal
  Warning  FailedAttachVolume  166m (x2 over 166m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-b43ec1d0-4fa3-4e87-a80b-6ad912160273" : rpc error: code = Internal desc = Could not attach volume "vol-0a7cb8c5859cf3f96" to node "i-00ec014c5676a99d2": context deadline exceeded
  Warning  FailedAttachVolume  30s (x87 over 166m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-b43ec1d0-4fa3-4e87-a80b-6ad912160273" : rpc error: code = Internal desc = Could not attach volume "vol-0a7cb8c5859cf3f96" to node "i-00ec014c5676a99d2": attachment of disk "vol-0a7cb8c5859cf3f96" failed, expected device to be attached but was attaching
Expected results:
In step 4, all replicas of the statefulset should become Ready.
Additional info:
For the AWS ARM instance types "c7gd.2xlarge, m7gd.xlarge", the csinode allocatable volumes count should be "25", not "26" (a sketch of the expected arithmetic follows).
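The mismatch is consistent with the attachment-slot arithmetic on Nitro instances, where ENIs, the root EBS volume and NVMe instance-store disks share the same pool of attachment slots as data volumes, and c7gd.2xlarge / m7gd.xlarge each carry one local NVMe instance-store disk. Below is a minimal Go sketch of that arithmetic; the slot total, the ENI/root reservations and the per-type disk counts are assumptions for illustration, not the driver's actual tables or code.

package main

import "fmt"

// Assumed shared attachment-slot total for Nitro instances (ENIs, EBS
// volumes and NVMe instance-store disks all consume slots from this pool).
const nitroAttachmentSlots = 28

// Assumed NVMe instance-store disk counts for the affected instance types.
var instanceStoreDisks = map[string]int{
	"c7gd.2xlarge": 1,
	"m7gd.xlarge":  1,
}

// allocatableEBSVolumes returns how many additional EBS volumes fit once
// ENIs, reserved volumes (e.g. the root volume) and instance-store disks
// are subtracted from the shared slot pool.
func allocatableEBSVolumes(instanceType string, enis, reservedVolumes int) int {
	return nitroAttachmentSlots - enis - reservedVolumes - instanceStoreDisks[instanceType]
}

func main() {
	// With one ENI and the root volume reserved, m7gd.xlarge works out to 25.
	// Ignoring the instance-store disk (a missing map entry) would yield the
	// 26 reported in the CSINode object above.
	fmt.Println(allocatableEBSVolumes("m7gd.xlarge", 1, 1))
}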
- links to
-
RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update