Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.16.0
Affects Version/s: 4.15
Component/s: Storage / Operators
Labels:
- qe-premerge-tested

Severity:
Important
Regression:
No
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:
None
Target Version:

4.16.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

[AWS-EBS-CSI-Driver] The allocatable volume count in csinode is incorrect for the AWS arm instance types "c7gd.2xlarge" and "m7gd.xlarge"

Version-Release number of selected component (if applicable):

    4.15.3

How reproducible:

    Always

Steps to Reproduce:

    1. Create an OpenShift cluster on AWS with the instance types "c7gd.2xlarge" and "m7gd.xlarge"
    2. Check the allocatable volume count in csinode
    3. Create a statefulset (sts-vol-limit.yaml) with one PVC mounted per replica, replicas set to the maximum allocatable volume count, and nodeAffinity pinning the pods to the node under test
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: statefulset-vol-limit
spec:
  serviceName: "my-svc"
  replicas: $VOL_COUNT_LIMIT
  selector:
    matchLabels:
      app: my-svc
  template:
    metadata:
      labels:
        app: my-svc
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - $NODE_NAME
      containers:
      - name: openshifttest
        image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
        volumeMounts:
        - name: data
          mountPath: /mnt/storage
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: "NoSchedule"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: gp3-csi
      resources:
        requests:
          storage: 1Gi
    4. All statefulset replicas should become ready.
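Step 2 can be scripted so that the replica count in step 3 is taken directly from the node. The following is a minimal sketch, assuming `oc` is logged in to the cluster; the function name `csinode_allocatable` is hypothetical and not part of the original report.

```shell
# Hypothetical helper: print the allocatable volume count that the
# EBS CSI driver (ebs.csi.aws.com) reported in the CSINode object.
# Usage: csinode_allocatable <node-name>
csinode_allocatable() {
  oc get csinode "$1" \
    -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
}
```

The result can then feed the manifest, for example `export VOL_COUNT_LIMIT="$(csinode_allocatable "$NODE_NAME")"` before running `envsubst`.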

Actual results:

In step 4, the 26th replica (pod) of the statefulset is stuck at ContainerCreating because its volume cannot be attached to the node; the allocatable volume count reported in csinode is incorrect.
$ oc get no/ip-10-0-22-114.ec2.internal -oyaml|grep 'instance'
    beta.kubernetes.io/instance-type: m7gd.xlarge
    node.kubernetes.io/instance-type: m7gd.xlarge
$ oc get csinode/ip-10-0-22-114.ec2.internal -oyaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  annotations:
    storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/azure-file,kubernetes.io/cinder,kubernetes.io/gce-pd,kubernetes.io/vsphere-volume
  creationTimestamp: "2024-03-20T02:16:34Z"
  name: ip-10-0-22-114.ec2.internal
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: ip-10-0-22-114.ec2.internal
    uid: acb9a153-bb9b-4c4a-90c1-f3e095173ce2
  resourceVersion: "19281"
  uid: 12507a73-898d-441a-a844-41c7de290b5b
spec:
  drivers:
  - allocatable:
      count: 26
    name: ebs.csi.aws.com
    nodeID: i-00ec014c5676a99d2
    topologyKeys:
    - topology.ebs.csi.aws.com/zone
$ export VOL_COUNT_LIMIT="26"
$ export NODE_NAME="ip-10-0-22-114.ec2.internal"
$ envsubst < sts-vol-limit.yaml| oc apply -f -
statefulset.apps/statefulset-vol-limit created
$ oc get sts
NAME                    READY   AGE
statefulset-vol-limit   25/26   169m

$ oc describe po/statefulset-vol-limit-25
Name:             statefulset-vol-limit-25
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-10-0-22-114.ec2.internal/10.0.22.114
Start Time:       Wed, 20 Mar 2024 18:56:08 +0800
Labels:           app=my-svc
                  apps.kubernetes.io/pod-index=25
                  controller-revision-hash=statefulset-vol-limit-7db55989f7
                  statefulset.kubernetes.io/pod-name=statefulset-vol-limit-25
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.128.2.53/23"],"mac_address":"0a:58:0a:80:02:35","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.128.0.0...
Status:           Pending
IP:
IPs:              <none>
Controlled By:    StatefulSet/statefulset-vol-limit
Containers:
  openshifttest:
    Container ID:
    Image:          quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /mnt/storage from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zkwqx (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-statefulset-vol-limit-25
    ReadOnly:   false
  kube-api-access-zkwqx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason              Age                  From                     Message
  ----     ------              ----                 ----                     -------
  Normal   Scheduled           167m                 default-scheduler        Successfully assigned default/statefulset-vol-limit-25 to ip-10-0-22-114.ec2.internal
  Warning  FailedAttachVolume  166m (x2 over 166m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-b43ec1d0-4fa3-4e87-a80b-6ad912160273" : rpc error: code = Internal desc = Could not attach volume "vol-0a7cb8c5859cf3f96" to node "i-00ec014c5676a99d2": context deadline exceeded
  Warning  FailedAttachVolume  30s (x87 over 166m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-b43ec1d0-4fa3-4e87-a80b-6ad912160273" : rpc error: code = Internal desc = Could not attach volume "vol-0a7cb8c5859cf3f96" to node "i-00ec014c5676a99d2": attachment of disk "vol-0a7cb8c5859cf3f96" failed, expected device to be attached but was attaching
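To compare the advertised limit with what is actually attached, the VolumeAttachment objects for the node can be counted. This is a rough sketch, assuming `oc` access to the cluster; the function name `attached_volume_count` is hypothetical, and the count includes every CSI attachment on the node, not only those created by this statefulset.

```shell
# Hypothetical helper: count VolumeAttachment objects on a node whose
# status.attached field is true.
# Usage: attached_volume_count <node-name>
attached_volume_count() {
  oc get volumeattachment --no-headers \
    -o custom-columns=NODE:.spec.nodeName,ATTACHED:.status.attached |
    awk -v node="$1" '$1 == node && $2 == "true" { n++ } END { print n + 0 }'
}
```

If this number stays below the csinode allocatable count while a pod remains in ContainerCreating, the advertised limit is likely higher than what the instance can really accept.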

Expected results:

    In step 4, all statefulset replicas should become ready.

Additional info:

    For the AWS arm instance types "c7gd.2xlarge" and "m7gd.xlarge", the csinode allocatable volume count should be "25", not "26".
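The difference between 25 and 26 is consistent with one NVMe instance store volume not being subtracted from the attachment budget. The arithmetic below is a sketch under stated assumptions, not the driver's code: it assumes 28 shared attachment slots on these instance types, one attached network interface, one root volume, and one instance store volume on m7gd.xlarge. The helper name `expected_limit` is hypothetical.

```shell
# Hypothetical helper illustrating the assumed attachment budget.
# Usage: expected_limit <total_slots> <enis> <instance_store_volumes> <root_volumes>
expected_limit() {
  echo $(( $1 - $2 - $3 - $4 ))
}

# Assumed values: 28 slots, 1 ENI, 1 root volume.
expected_limit 28 1 1 1   # instance store volume counted     -> 25
expected_limit 28 1 0 1   # instance store volume not counted -> 26
```

Under these assumptions, an instance type missing from the driver's instance store volumes table would be reported with one slot too many, which matches the linked fix.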

links to

openshift/aws-ebs-csi-driver#261: OCPBUGS-31101: UPSTREAM: 1966: Add missing instances to instance store volumes table

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update

Assignee: Roman Bednar

Reporter: Penghao Wang

QA Contact: Wei Duan

Votes: 0

Watchers: 5

Created: 2024/03/20 1:46 PM

Updated: 2024/06/27 11:40 AM

Resolved: 2024/06/27 11:40 AM
