-
Bug
-
Resolution: Not a Bug
-
Critical
-
None
-
4.16.z
-
None
-
Quality / Stability / Reliability
-
False
-
-
None
-
Important
-
None
-
None
-
None
-
None
-
MON Sprint 265
-
1
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Description of problem:
The prometheus-k8s-0 pod failed to be created because its volume attachment is being deleted for an unknown reason. The PVC and PV still exist, but the volume attachment keeps being reported as deleting.
oc -n openshift-monitoring describe pod prometheus-k8s-0
Name: prometheus-k8s-0
Namespace: openshift-monitoring
Priority: 2000000000
Priority Class Name: system-cluster-critical
Service Account: prometheus-k8s
Node: ip-10-0-4-67.us-east-2.compute.internal/10.0.4.67
Start Time: Thu, 05 Sep 2024 08:55:24 +0000
Labels: app.kubernetes.io/component=prometheus
app.kubernetes.io/instance=k8s
app.kubernetes.io/managed-by=prometheus-operator
app.kubernetes.io/name=prometheus
app.kubernetes.io/part-of=openshift-monitoring
app.kubernetes.io/version=2.52.0
apps.kubernetes.io/pod-index=0
controller-revision-hash=prometheus-k8s-856d7759cc
operator.prometheus.io/name=k8s
operator.prometheus.io/shard=0
prometheus=k8s
statefulset.kubernetes.io/pod-name=prometheus-k8s-0
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.130.2.8/23"],"mac_address":"0a:58:0a:82:02:08","gateway_ips":["10.130.2.1"],"routes":[{"dest":"10.128.0.0/...
kubectl.kubernetes.io/default-container: prometheus
openshift.io/required-scc: nonroot
openshift.io/scc: nonroot
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/prometheus-k8s
Init Containers:
init-config-reloader:
Container ID:
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:356d4ce991042a2affc27988c328a2ce686a52132c3ca1b630bce6b7965e8f90
Image ID:
Port: 8080/TCP
Host Port: 0/TCP
Command:
/bin/prometheus-config-reloader
Args:
--watch-interval=0
--listen-address=:8080
--config-file=/etc/prometheus/config/prometheus.yaml.gz
--config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
--watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 1m
memory: 10Mi
Environment:
POD_NAME: prometheus-k8s-0 (v1:metadata.name)
SHARD: 0
Mounts:
/etc/prometheus/config from config (rw)
/etc/prometheus/config_out from config-out (rw)
/etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
/etc/prometheus/web_config/web-config.yaml from web-config (ro,path="web-config.yaml")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fqmjd (ro)
Containers:
prometheus:
Container ID:
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:87b350932e17e0b93bf337c1e6923b39b92ba21df119a9de8c3c8bd603d00e44
Image ID:
Port: <none>
Host Port: <none>
Args:
--web.console.templates=/etc/prometheus/consoles
--web.console.libraries=/etc/prometheus/console_libraries
--config.file=/etc/prometheus/config_out/prometheus.env.yaml
--web.enable-lifecycle
--web.external-url=https://console-openshift-console.apps.liqcui-sdn2ovn.qe.devcluster.openshift.com/monitoring
--web.route-prefix=/
--web.listen-address=127.0.0.1:9090
--storage.tsdb.retention.time=15d
--storage.tsdb.path=/prometheus
--web.config.file=/etc/prometheus/web_config/web-config.yaml
--scrape.timestamp-tolerance=15ms
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 70m
memory: 1Gi
Liveness: exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/healthy; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/healthy; else exit 1; fi] delay=0s timeout=3s period=5s #success=1 #failure=6
Readiness: exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=5s #success=1 #failure=3
Startup: exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=60s #success=1 #failure=60
Environment: <none>
Mounts:
/etc/pki/ca-trust/extracted/pem/ from prometheus-trusted-ca-bundle (rw)
/etc/prometheus/certs from tls-assets (ro)
/etc/prometheus/config_out from config-out (ro)
/etc/prometheus/configmaps/kubelet-serving-ca-bundle from configmap-kubelet-serving-ca-bundle (ro)
/etc/prometheus/configmaps/metrics-client-ca from configmap-metrics-client-ca (ro)
/etc/prometheus/configmaps/serving-certs-ca-bundle from configmap-serving-certs-ca-bundle (ro)
/etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
/etc/prometheus/secrets/kube-rbac-proxy from secret-kube-rbac-proxy (ro)
/etc/prometheus/secrets/metrics-client-certs from secret-metrics-client-certs (ro)
/etc/prometheus/secrets/prometheus-k8s-kube-rbac-proxy-web from secret-prometheus-k8s-kube-rbac-proxy-web (ro)
/etc/prometheus/secrets/prometheus-k8s-thanos-sidecar-tls from secret-prometheus-k8s-thanos-sidecar-tls (ro)
/etc/prometheus/secrets/prometheus-k8s-tls from secret-prometheus-k8s-tls (ro)
/etc/prometheus/web_config/web-config.yaml from web-config (ro,path="web-config.yaml")
/prometheus from prometheus-k8s-db (rw,path="prometheus-db")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fqmjd (ro)
config-reloader:
Container ID:
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:356d4ce991042a2affc27988c328a2ce686a52132c3ca1b630bce6b7965e8f90
Image ID:
Port: <none>
Host Port: <none>
Command:
/bin/prometheus-config-reloader
Args:
--listen-address=localhost:8080
--web-config-file=/etc/prometheus/web_config/web-config.yaml
--reload-url=http://localhost:9090/-/reload
--config-file=/etc/prometheus/config/prometheus.yaml.gz
--config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
--watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 1m
memory: 10Mi
Environment:
POD_NAME: prometheus-k8s-0 (v1:metadata.name)
SHARD: 0
Mounts:
/etc/prometheus/config from config (rw)
/etc/prometheus/config_out from config-out (rw)
/etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
/etc/prometheus/web_config/web-config.yaml from web-config (ro,path="web-config.yaml")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fqmjd (ro)
thanos-sidecar:
Container ID:
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:993e54a6864d7fe7fa61d3faf5a98a4438dec0a447b0d1e837cc92ea1a0ce16e
Image ID:
Ports: 10902/TCP, 10901/TCP
Host Ports: 0/TCP, 0/TCP
Args:
sidecar
--prometheus.url=http://localhost:9090/
--tsdb.path=/prometheus
--http-address=127.0.0.1:10902
--grpc-server-tls-cert=/etc/tls/grpc/server.crt
--grpc-server-tls-key=/etc/tls/grpc/server.key
--grpc-server-tls-client-ca=/etc/tls/grpc/ca.crt
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 1m
memory: 25Mi
Environment: <none>
Mounts:
/etc/tls/grpc from secret-grpc-tls (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fqmjd (ro)
kube-rbac-proxy-web:
Container ID:
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70458d010bd9f4e9c43b6452fe79e529af926deab2714e10ba1366789ec15d9f
Image ID:
Port: 9091/TCP
Host Port: 0/TCP
Args:
--secure-listen-address=0.0.0.0:9091
--upstream=http://127.0.0.1:9090
--config-file=/etc/kube-rbac-proxy/config.yaml
--tls-cert-file=/etc/tls/private/tls.crt
--tls-private-key-file=/etc/tls/private/tls.key
--tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
--ignore-paths=/-/healthy,/-/ready
--tls-min-version=VersionTLS12
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 1m
memory: 15Mi
Environment: <none>
Mounts:
/etc/kube-rbac-proxy from secret-prometheus-k8s-kube-rbac-proxy-web (rw)
/etc/tls/private from secret-prometheus-k8s-tls (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fqmjd (ro)
kube-rbac-proxy:
Container ID:
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70458d010bd9f4e9c43b6452fe79e529af926deab2714e10ba1366789ec15d9f
Image ID:
Port: 9092/TCP
Host Port: 0/TCP
Args:
--secure-listen-address=0.0.0.0:9092
--upstream=http://127.0.0.1:9090
--allow-paths=/metrics,/federate
--config-file=/etc/kube-rbac-proxy/config.yaml
--tls-cert-file=/etc/tls/private/tls.crt
--tls-private-key-file=/etc/tls/private/tls.key
--client-ca-file=/etc/tls/client/client-ca.crt
--tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
--tls-min-version=VersionTLS12
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 1m
memory: 15Mi
Environment: <none>
Mounts:
/etc/kube-rbac-proxy from secret-kube-rbac-proxy (rw)
/etc/tls/client from configmap-metrics-client-ca (ro)
/etc/tls/private from secret-prometheus-k8s-tls (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fqmjd (ro)
kube-rbac-proxy-thanos:
Container ID:
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70458d010bd9f4e9c43b6452fe79e529af926deab2714e10ba1366789ec15d9f
Image ID:
Port: 10903/TCP
Host Port: 0/TCP
Args:
--secure-listen-address=[$(POD_IP)]:10903
--upstream=http://127.0.0.1:10902
--tls-cert-file=/etc/tls/private/tls.crt
--tls-private-key-file=/etc/tls/private/tls.key
--client-ca-file=/etc/tls/client/client-ca.crt
--config-file=/etc/kube-rbac-proxy/config.yaml
--tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
--allow-paths=/metrics
--tls-min-version=VersionTLS12
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 1m
memory: 10Mi
Environment:
POD_IP: (v1:status.podIP)
Mounts:
/etc/kube-rbac-proxy from secret-kube-rbac-proxy (ro)
/etc/tls/client from configmap-metrics-client-ca (ro)
/etc/tls/private from secret-prometheus-k8s-thanos-sidecar-tls (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fqmjd (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
prometheus-k8s-db:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: prometheus-k8s-db-prometheus-k8s-0
ReadOnly: false
config:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-k8s
Optional: false
tls-assets:
Type: Projected (a volume that contains injected data from multiple sources)
SecretName: prometheus-k8s-tls-assets-0
SecretOptionalName: <nil>
config-out:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
secret-prometheus-k8s-tls:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-k8s-tls
Optional: false
secret-prometheus-k8s-thanos-sidecar-tls:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-k8s-thanos-sidecar-tls
Optional: false
secret-kube-rbac-proxy:
Type: Secret (a volume populated by a Secret)
SecretName: kube-rbac-proxy
Optional: false
secret-prometheus-k8s-kube-rbac-proxy-web:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-k8s-kube-rbac-proxy-web
Optional: false
secret-metrics-client-certs:
Type: Secret (a volume populated by a Secret)
SecretName: metrics-client-certs
Optional: false
configmap-serving-certs-ca-bundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: serving-certs-ca-bundle
Optional: false
configmap-kubelet-serving-ca-bundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kubelet-serving-ca-bundle
Optional: false
configmap-metrics-client-ca:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: metrics-client-ca
Optional: false
prometheus-k8s-rulefiles-0:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-k8s-rulefiles-0
Optional: false
web-config:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-k8s-web-config
Optional: false
prometheus-trusted-ca-bundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-trusted-ca-bundle
Optional: false
secret-grpc-tls:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-k8s-grpc-tls-9sg4kpkjnt4o0
Optional: false
kube-api-access-fqmjd:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Burstable
Node-Selectors: node-role.kubernetes.io/infra=
Tolerations: node-role.kubernetes.io/infra=reserved:NoSchedule
node-role.kubernetes.io/infra=reserved:NoExecute
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 149m default-scheduler 0/31 nodes are available: 2 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 2 node(s) had volume node affinity conflict, 22 node(s) didn't match Pod's node affinity/selector, 5 node(s) were unschedulable. preemption: 0/31 nodes are available: 31 Preemption is not helpful for scheduling.
Warning FailedScheduling 144m default-scheduler 0/31 nodes are available: 2 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 2 node(s) had volume node affinity conflict, 21 node(s) didn't match Pod's node affinity/selector, 6 node(s) were unschedulable. preemption: 0/31 nodes are available: 31 Preemption is not helpful for scheduling.
Warning FailedScheduling 138m default-scheduler 0/31 nodes are available: 2 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 2 node(s) had volume node affinity conflict, 21 node(s) didn't match Pod's node affinity/selector, 6 node(s) were unschedulable. preemption: 0/31 nodes are available: 31 Preemption is not helpful for scheduling.
Normal Scheduled 133m default-scheduler Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-4-67.us-east-2.compute.internal
Warning FailedAttachVolume 111m (x19 over 133m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-622bca2b-5053-4d05-ac8c-95c820ace8f3" : volume attachment is being deleted
Warning FailedAttachVolume 3m58s (x60 over 110m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-622bca2b-5053-4d05-ac8c-95c820ace8f3" : volume attachment is being deleted
[ocpadmin@ip-10-0-0-179 ~]$ oc -n openshift-monitoring get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
alertmanager-main-db-alertmanager-main-0 Bound pvc-cc54ebf1-3460-42ad-9b5f-91e350adfb83 2Gi RWO gp3-csi <unset> 5h48m
alertmanager-main-db-alertmanager-main-1 Bound pvc-6a8e3453-bd96-4c80-95c7-d1202c1f49e3 2Gi RWO gp3-csi <unset> 5h48m
prometheus-k8s-db-prometheus-k8s-0 Bound pvc-622bca2b-5053-4d05-ac8c-95c820ace8f3 100Gi RWO gp3-csi <unset> 5h48m
prometheus-k8s-db-prometheus-k8s-1 Bound pvc-5185de21-13da-4b6e-8dff-b9a6d39db1c0 100Gi RWO gp3-csi <unset> 5h48m
[ocpadmin@ip-10-0-0-179 ~]$ oc -n openshift-monitoring get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
pvc-5185de21-13da-4b6e-8dff-b9a6d39db1c0 100Gi RWO Delete Bound openshift-monitoring/prometheus-k8s-db-prometheus-k8s-1 gp3-csi <unset> 5h48m
pvc-622bca2b-5053-4d05-ac8c-95c820ace8f3 100Gi RWO Delete Bound openshift-monitoring/prometheus-k8s-db-prometheus-k8s-0 gp3-csi <unset> 5h48m
pvc-6a8e3453-bd96-4c80-95c7-d1202c1f49e3 2Gi RWO Delete Bound openshift-monitoring/alertmanager-main-db-alertmanager-main-1 gp3-csi <unset> 5h48m
pvc-cc54ebf1-3460-42ad-9b5f-91e350adfb83 2Gi RWO Delete Bound openshift-monitoring/alertmanager-main-db-alertmanager-main-0 gp3-csi <unset> 5h48m
[ocpadmin@ip-10-0-0-179 ~]$
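As an additional diagnostic (not captured in the output above), the stuck attachment can be inspected directly through the VolumeAttachment objects. The commands are standard oc calls; the attachment name is a placeholder and the exact fields shown will vary per cluster:
# List VolumeAttachments referencing the affected PV
oc get volumeattachment | grep pvc-622bca2b-5053-4d05-ac8c-95c820ace8f3
# Show whether the old attachment carries a deletionTimestamp and which node it still points to
oc get volumeattachment <attachment-name> -o yaml | grep -E 'deletionTimestamp|nodeName|attached'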
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create an OCP cluster with SDN on AWS (instance type c5n.metal), then scale out to 24 worker nodes, 3 infra nodes, and 1 workload node.
2. Migrate from SDN to OVN using oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}' (migration progress can be monitored with the commands below).
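To follow the migration, the reported network type and the relevant cluster operators can be checked with standard commands (a sketch; exact operator conditions may vary by release):
# Current network type reported by the cluster network config
oc get network.config.openshift.io cluster -o jsonpath='{.status.networkType}{"\n"}'
# Progress of the network and machine-config cluster operators during the migration
oc get co network machine-config
# Watch nodes being cordoned and drained as part of the rollout
oc get nodes -w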
Actual results:
The pod prometheus-k8s-0 fails to start with repeated FailedAttachVolume events (x60 over 110m) from the attachdetach-controller: AttachVolume.Attach failed for volume "pvc-622bca2b-5053-4d05-ac8c-95c820ace8f3" : volume attachment is being deleted. This blocks draining the infra node and therefore blocks the SDN-to-OVN migration.
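Because the error suggests the previous attachment is still being torn down, it may also help to confirm on the AWS side whether the underlying EBS volume is still attached to the old node. This is only a hedged suggestion using standard oc/aws CLI calls, taking the volume ID from the PV's CSI volumeHandle:
# EBS volume ID backing the affected PV
VOLUME_ID=$(oc get pv pvc-622bca2b-5053-4d05-ac8c-95c820ace8f3 -o jsonpath='{.spec.csi.volumeHandle}')
# Check which instance (if any) the volume is still attached to
aws ec2 describe-volumes --volume-ids "$VOLUME_ID" --query 'Volumes[0].Attachments'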
Expected results:
The pod prometheus-k8s-0 starts up properly.
Additional info:
is duplicated by: OCPBUGS-55472 Stray Volume Attachment prevents pod starts (Closed)