Type: Bug
Resolution: Not a Bug
Priority: Normal
Affects Version: 4.12
Impact: Quality / Stability / Reliability
Sprint: OCPNODE Sprint 240 (Blue)
Description of problem:
Pods in an ImagePullBackOff state cannot be terminated properly.
A customer opened a support case because several pods in the openshift-monitoring namespace were not running:
[tnierman@tnierman] >> oc get po -n openshift-monitoring -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
alertmanager-main-0 6/6 Running 1 (3d9h ago) 3d9h 10.130.2.7 ip-10-201-58-4.ec2.internal <none> <none>
alertmanager-main-1 6/6 Running 1 (3d9h ago) 3d10h 10.131.2.7 ip-10-201-57-179.ec2.internal <none> <none>
cluster-monitoring-operator-bf9f46df6-cpjrg 2/2 Running 0 3d10h 10.129.0.44 ip-10-201-59-204.ec2.internal <none> <none>
configure-alertmanager-operator-cd4cfd54b-t55nl 1/1 Running 0 3d10h 10.129.4.16 ip-10-201-59-153.ec2.internal <none> <none>
configure-alertmanager-operator-registry-gj8p6 1/1 Running 0 3d10h 10.129.4.28 ip-10-201-59-153.ec2.internal <none> <none>
kube-state-metrics-54c69d9547-8n4q7 3/3 Running 0 3d9h 10.131.2.37 ip-10-201-57-179.ec2.internal <none> <none>
node-exporter-5b84s 2/2 Running 2 3d11h 10.201.59.204 ip-10-201-59-204.ec2.internal <none> <none>
node-exporter-6cc65 2/2 Running 2 3d11h 10.201.57.197 ip-10-201-57-197.ec2.internal <none> <none>
node-exporter-7ft6p 2/2 Running 2 3d11h 10.201.57.179 ip-10-201-57-179.ec2.internal <none> <none>
node-exporter-7rtqn 2/2 Running 2 3d11h 10.201.58.4 ip-10-201-58-4.ec2.internal <none> <none>
node-exporter-7tqqf 2/2 Running 2 3d11h 10.201.59.142 ip-10-201-59-142.ec2.internal <none> <none>
node-exporter-fdtlt 2/2 Running 2 3d11h 10.201.59.153 ip-10-201-59-153.ec2.internal <none> <none>
node-exporter-pn4v5 2/2 Running 2 3d11h 10.201.57.199 ip-10-201-57-199.ec2.internal <none> <none>
node-exporter-sxhqr 2/2 Running 2 3d11h 10.201.58.141 ip-10-201-58-141.ec2.internal <none> <none>
node-exporter-tdkc5 2/2 Running 2 3d11h 10.201.58.237 ip-10-201-58-237.ec2.internal <none> <none>
openshift-state-metrics-86776c5fbf-td6t7 3/3 Running 0 3d9h 10.131.2.42 ip-10-201-57-179.ec2.internal <none> <none>
osd-rebalance-infra-nodes-28190505-mljk4 0/1 Completed 0 36m 10.130.3.100 ip-10-201-58-4.ec2.internal <none> <none>
osd-rebalance-infra-nodes-28190520-slprt 0/1 Completed 0 21m 10.130.3.108 ip-10-201-58-4.ec2.internal <none> <none>
osd-rebalance-infra-nodes-28190535-t7n6s 0/1 Completed 0 6m31s 10.130.3.114 ip-10-201-58-4.ec2.internal <none> <none>
prometheus-adapter-6f5f68dd7b-fq4fs 1/1 Running 0 3d9h 10.131.2.43 ip-10-201-57-179.ec2.internal <none> <none>
prometheus-adapter-6f5f68dd7b-zqv5p 1/1 Running 0 3d10h 10.129.4.22 ip-10-201-59-153.ec2.internal <none> <none>
prometheus-k8s-0 6/6 Running 0 3d11h 10.129.4.7 ip-10-201-59-153.ec2.internal <none> <none>
prometheus-k8s-1 6/6 Running 0 3d10h 10.131.2.8 ip-10-201-57-179.ec2.internal <none> <none>
prometheus-operator-54f8f55f65-rj487 2/2 Running 0 3d9h 10.131.2.47 ip-10-201-57-179.ec2.internal <none> <none>
prometheus-operator-admission-webhook-6d6dc5795c-86mk7 1/1 Running 0 3d9h 10.131.2.48 ip-10-201-57-179.ec2.internal <none> <none>
prometheus-operator-admission-webhook-6d6dc5795c-ltzvg 1/1 Running 0 3d10h 10.129.4.26 ip-10-201-59-153.ec2.internal <none> <none>
sre-dns-latency-exporter-bf9tf 1/1 Running 3 199d 10.128.2.2 ip-10-201-59-142.ec2.internal <none> <none>
sre-dns-latency-exporter-kjgmr 0/1 Terminating 15 601d 10.129.4.3 ip-10-201-59-153.ec2.internal <none> <none>
sre-dns-latency-exporter-kpzc9 1/1 Running 13 601d 10.129.0.2 ip-10-201-59-204.ec2.internal <none> <none>
sre-dns-latency-exporter-p6l2q 1/1 Running 13 601d 10.128.0.4 ip-10-201-57-199.ec2.internal <none> <none>
sre-dns-latency-exporter-q2f2w 1/1 Running 13 601d 10.130.0.5 ip-10-201-58-237.ec2.internal <none> <none>
sre-dns-latency-exporter-r5tdm 1/1 Running 14 601d 10.130.2.6 ip-10-201-58-4.ec2.internal <none> <none>
sre-dns-latency-exporter-sk9ms 1/1 Running 14 601d 10.131.2.2 ip-10-201-57-179.ec2.internal <none> <none>
sre-dns-latency-exporter-x62x8 1/1 Running 3 199d 10.128.4.4 ip-10-201-57-197.ec2.internal <none> <none>
sre-dns-latency-exporter-xrnqd 0/1 ImagePullBackOff 2 199d 10.131.4.7 ip-10-201-58-141.ec2.internal <none> <none>
sre-ebs-iops-reporter-3-mwf64 0/1 ImagePullBackOff 0 3d10h 10.129.4.27 ip-10-201-59-153.ec2.internal <none> <none>
sre-stuck-ebs-vols-3-8g9bm 1/1 Running 0 3d10h 10.129.4.17 ip-10-201-59-153.ec2.internal <none> <none>
telemeter-client-6dbcbb9dbf-pm64k 3/3 Running 0 3d9h 10.129.4.55 ip-10-201-59-153.ec2.internal <none> <none>
thanos-querier-8dcc7884-d4rdv 6/6 Running 0 3d10h 10.129.4.29 ip-10-201-59-153.ec2.internal <none> <none>
thanos-querier-8dcc7884-d8gsx 6/6 Running 0 3d9h 10.131.2.45 ip-10-201-57-179.ec2.internal <none> <none>
token-refresher-c7dfd99f8-rbv9j 1/1 Running 0 3d9h 10.131.2.44 ip-10-201-57-179.ec2.internal <none> <none>
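As an aside, output like the above can be narrowed to the pods of interest by filtering on the STATUS column. A minimal sketch (not the customer's exact command):

# Show only pods currently in ImagePullBackOff or Terminating
oc get pods -n openshift-monitoring -o wide | grep -E 'ImagePullBackOff|Terminating'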
After running oc delete on the pods in the ImagePullBackOff state, they entered a Terminating state (as expected):
[tnierman@tnierman] >> oc get po -n openshift-monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 6/6 Running 1 (3d9h ago) 3d9h
alertmanager-main-1 6/6 Running 1 (3d9h ago) 3d10h
cluster-monitoring-operator-bf9f46df6-cpjrg 2/2 Running 0 3d10h
configure-alertmanager-operator-cd4cfd54b-t55nl 1/1 Running 0 3d10h
configure-alertmanager-operator-registry-gj8p6 1/1 Running 0 3d10h
kube-state-metrics-54c69d9547-8n4q7 3/3 Running 0 3d9h
node-exporter-5b84s 2/2 Running 2 3d11h
node-exporter-6cc65 2/2 Running 2 3d11h
node-exporter-7ft6p 2/2 Running 2 3d11h
node-exporter-7rtqn 2/2 Running 2 3d11h
node-exporter-7tqqf 2/2 Running 2 3d11h
node-exporter-fdtlt 2/2 Running 2 3d11h
node-exporter-pn4v5 2/2 Running 2 3d11h
node-exporter-sxhqr 2/2 Running 2 3d11h
node-exporter-tdkc5 2/2 Running 2 3d11h
openshift-state-metrics-86776c5fbf-td6t7 3/3 Running 0 3d9h
osd-rebalance-infra-nodes-28190505-mljk4 0/1 Completed 0 37m
osd-rebalance-infra-nodes-28190520-slprt 0/1 Completed 0 22m
osd-rebalance-infra-nodes-28190535-t7n6s 0/1 Completed 0 7m20s
prometheus-adapter-6f5f68dd7b-fq4fs 1/1 Running 0 3d9h
prometheus-adapter-6f5f68dd7b-zqv5p 1/1 Running 0 3d10h
prometheus-k8s-0 6/6 Running 0 3d11h
prometheus-k8s-1 6/6 Running 0 3d10h
prometheus-operator-54f8f55f65-rj487 2/2 Running 0 3d9h
prometheus-operator-admission-webhook-6d6dc5795c-86mk7 1/1 Running 0 3d9h
prometheus-operator-admission-webhook-6d6dc5795c-ltzvg 1/1 Running 0 3d10h
sre-dns-latency-exporter-bf9tf 1/1 Running 3 199d
sre-dns-latency-exporter-kjgmr 0/1 Terminating 15 601d
sre-dns-latency-exporter-kpzc9 1/1 Running 13 601d
sre-dns-latency-exporter-p6l2q 1/1 Running 13 601d
sre-dns-latency-exporter-q2f2w 1/1 Running 13 601d
sre-dns-latency-exporter-r5tdm 1/1 Running 14 601d
sre-dns-latency-exporter-sk9ms 1/1 Running 14 601d
sre-dns-latency-exporter-x62x8 1/1 Running 3 199d
sre-dns-latency-exporter-xrnqd 0/1 Terminating 2 199d
sre-ebs-iops-reporter-3-mwf64 0/1 Terminating 0 3d10h
sre-ebs-iops-reporter-3-xrltz 0/1 Init:0/1 0 2s
sre-stuck-ebs-vols-3-8g9bm 1/1 Running 0 3d10h
telemeter-client-6dbcbb9dbf-pm64k 3/3 Running 0 3d9h
thanos-querier-8dcc7884-d4rdv 6/6 Running 0 3d10h
thanos-querier-8dcc7884-d8gsx 6/6 Running 0 3d9h
token-refresher-c7dfd99f8-rbv9j 1/1 Running 0 3d9h
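For reference, the deletion above corresponds to a command of roughly this form (a sketch; the exact invocation is not recorded in the case notes):

# Delete the pods that were stuck in ImagePullBackOff
oc delete pod sre-ebs-iops-reporter-3-mwf64 sre-dns-latency-exporter-xrnqd -n openshift-monitoring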
However, even after roughly an hour, these pods had not terminated:
[tnierman@tnierman] >> oc get po -n openshift-monitoring -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
alertmanager-main-0 6/6 Running 1 (3d10h ago) 3d10h 10.130.2.7 ip-10-201-58-4.ec2.internal <none> <none>
alertmanager-main-1 6/6 Running 1 (3d10h ago) 3d11h 10.131.2.7 ip-10-201-57-179.ec2.internal <none> <none>
cluster-monitoring-operator-bf9f46df6-cpjrg 2/2 Running 0 3d12h 10.129.0.44 ip-10-201-59-204.ec2.internal <none> <none>
configure-alertmanager-operator-cd4cfd54b-bx2ll 1/1 Running 0 12m 10.130.3.142 ip-10-201-58-4.ec2.internal <none> <none>
configure-alertmanager-operator-registry-ws6zs 1/1 Running 0 12m 10.130.3.154 ip-10-201-58-4.ec2.internal <none> <none>
kube-state-metrics-54c69d9547-8n4q7 3/3 Running 0 3d10h 10.131.2.37 ip-10-201-57-179.ec2.internal <none> <none>
node-exporter-5b84s 2/2 Running 2 3d12h 10.201.59.204 ip-10-201-59-204.ec2.internal <none> <none>
node-exporter-6cc65 2/2 Running 2 3d12h 10.201.57.197 ip-10-201-57-197.ec2.internal <none> <none>
node-exporter-7ft6p 2/2 Running 2 3d12h 10.201.57.179 ip-10-201-57-179.ec2.internal <none> <none>
node-exporter-7rtqn 2/2 Running 2 3d12h 10.201.58.4 ip-10-201-58-4.ec2.internal <none> <none>
node-exporter-7tqqf 2/2 Running 2 3d12h 10.201.59.142 ip-10-201-59-142.ec2.internal <none> <none>
node-exporter-fdtlt 2/2 Running 2 3d12h 10.201.59.153 ip-10-201-59-153.ec2.internal <none> <none>
node-exporter-pn4v5 2/2 Running 2 3d12h 10.201.57.199 ip-10-201-57-199.ec2.internal <none> <none>
node-exporter-rs4tw 2/2 Running 0 6m18s 10.201.59.26 ip-10-201-59-26.ec2.internal <none> <none>
node-exporter-sxhqr 2/2 Running 2 3d12h 10.201.58.141 ip-10-201-58-141.ec2.internal <none> <none>
node-exporter-tdkc5 2/2 Running 2 3d12h 10.201.58.237 ip-10-201-58-237.ec2.internal <none> <none>
openshift-state-metrics-86776c5fbf-td6t7 3/3 Running 0 3d10h 10.131.2.42 ip-10-201-57-179.ec2.internal <none> <none>
osd-rebalance-infra-nodes-28190565-9bcwk 0/1 Completed 0 39m 10.130.3.124 ip-10-201-58-4.ec2.internal <none> <none>
osd-rebalance-infra-nodes-28190580-l98qr 0/1 Completed 0 24m 10.130.3.127 ip-10-201-58-4.ec2.internal <none> <none>
osd-rebalance-infra-nodes-28190595-rtgqp 0/1 Completed 0 9m24s 10.131.2.55 ip-10-201-57-179.ec2.internal <none> <none>
prometheus-adapter-6f5f68dd7b-fq4fs 1/1 Running 0 3d10h 10.131.2.43 ip-10-201-57-179.ec2.internal <none> <none>
prometheus-adapter-6f5f68dd7b-thf84 1/1 Running 0 12m 10.130.3.145 ip-10-201-58-4.ec2.internal <none> <none>
prometheus-k8s-0 6/6 Running 0 12m 10.131.0.7 ip-10-201-59-26.ec2.internal <none> <none>
prometheus-k8s-1 6/6 Running 0 3d11h 10.131.2.8 ip-10-201-57-179.ec2.internal <none> <none>
prometheus-operator-54f8f55f65-rj487 2/2 Running 0 3d10h 10.131.2.47 ip-10-201-57-179.ec2.internal <none> <none>
prometheus-operator-admission-webhook-6d6dc5795c-86mk7 1/1 Running 0 3d10h 10.131.2.48 ip-10-201-57-179.ec2.internal <none> <none>
prometheus-operator-admission-webhook-6d6dc5795c-v9p45 1/1 Running 0 12m 10.130.3.148 ip-10-201-58-4.ec2.internal <none> <none>
sre-dns-latency-exporter-bf9tf 1/1 Running 3 199d 10.128.2.2 ip-10-201-59-142.ec2.internal <none> <none>
sre-dns-latency-exporter-kjgmr 0/1 Terminating 15 601d 10.129.4.3 ip-10-201-59-153.ec2.internal <none> <none>
sre-dns-latency-exporter-kpzc9 1/1 Running 13 601d 10.129.0.2 ip-10-201-59-204.ec2.internal <none> <none>
sre-dns-latency-exporter-p6l2q 1/1 Running 13 601d 10.128.0.4 ip-10-201-57-199.ec2.internal <none> <none>
sre-dns-latency-exporter-q2f2w 1/1 Running 13 601d 10.130.0.5 ip-10-201-58-237.ec2.internal <none> <none>
sre-dns-latency-exporter-r5tdm 1/1 Running 14 601d 10.130.2.6 ip-10-201-58-4.ec2.internal <none> <none>
sre-dns-latency-exporter-sk9ms 1/1 Running 14 601d 10.131.2.2 ip-10-201-57-179.ec2.internal <none> <none>
sre-dns-latency-exporter-x62x8 1/1 Running 3 199d 10.128.4.4 ip-10-201-57-197.ec2.internal <none> <none>
sre-dns-latency-exporter-xrnqd 0/1 Terminating 2 199d 10.131.4.7 ip-10-201-58-141.ec2.internal <none> <none>
sre-dns-latency-exporter-xxzjw 1/1 Running 0 6m18s 10.131.0.4 ip-10-201-59-26.ec2.internal <none> <none>
sre-ebs-iops-reporter-3-mwf64 0/1 Terminating 0 3d11h 10.129.4.27 ip-10-201-59-153.ec2.internal <none> <none>
sre-ebs-iops-reporter-3-xrltz 1/1 Running 0 62m 10.130.3.116 ip-10-201-58-4.ec2.internal <none> <none>
sre-stuck-ebs-vols-3-j7btr 1/1 Running 0 12m 10.130.3.138 ip-10-201-58-4.ec2.internal <none> <none>
telemeter-client-6dbcbb9dbf-tfhl8 3/3 Running 0 12m 10.130.3.152 ip-10-201-58-4.ec2.internal <none> <none>
thanos-querier-8dcc7884-5w5sd 6/6 Running 0 12m 10.130.3.153 ip-10-201-58-4.ec2.internal <none> <none>
thanos-querier-8dcc7884-d8gsx 6/6 Running 0 3d10h 10.131.2.45 ip-10-201-57-179.ec2.internal <none> <none>
token-refresher-c7dfd99f8-rbv9j 1/1 Running 0 3d10h 10.131.2.44 ip-10-201-57-179.ec2.internal <none> <none>
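At this point it is worth confirming that the API server has actually marked the pods for deletion and that no finalizers are holding them. A hedged sketch of that check, using one of the stuck pods from the output above:

# deletionTimestamp should be set once 'oc delete' has been issued
oc get pod sre-ebs-iops-reporter-3-mwf64 -n openshift-monitoring -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'

# an empty result here means no finalizers are blocking removal of the object
oc get pod sre-ebs-iops-reporter-3-mwf64 -n openshift-monitoring -o jsonpath='{.metadata.finalizers}{"\n"}'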
Full description of one of the pods stuck in Terminating:
[tnierman@tnierman] >> oc describe pod sre-ebs-iops-reporter-3-mwf64 -n openshift-monitoring
Name: sre-ebs-iops-reporter-3-mwf64
Namespace: openshift-monitoring
Priority: 0
Node: ip-10-201-59-153.ec2.internal/10.201.59.153
Start Time: Fri, 04 Aug 2023 02:46:54 -0500
Labels: deployment=sre-ebs-iops-reporter-3
deploymentconfig=sre-ebs-iops-reporter
name=sre-ebs-iops-reporter
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.129.4.27"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.129.4.27"
],
"default": true,
"dns": {}
}]
openshift.io/deployment-config.latest-version: 3
openshift.io/deployment-config.name: sre-ebs-iops-reporter
openshift.io/deployment.name: sre-ebs-iops-reporter-3
openshift.io/scc: restricted-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Terminating (lasts 70m)
Termination Grace Period: 30s
IP: 10.129.4.27
IPs:
IP: 10.129.4.27
Controlled By: ReplicationController/sre-ebs-iops-reporter-3
Init Containers:
setupcreds:
Container ID: cri-o://0e2e1e711e8ce23933e0163815390f75e2faf9f47eaa525b00ae2e52b8508769
Image: quay.io/app-sre/managed-prometheus-exporter-initcontainer:latest
Image ID: quay.io/app-sre/managed-prometheus-exporter-initcontainer@sha256:f859874cf8ef92e8e806ff615f33472992917545ec94d461caa8e6e13b8a1983
Port: <none>
Host Port: <none>
Command:
/usr/local/bin/init.py
-r
/secrets/aws/config.ini
-a
/rawsecrets/aws_access_key_id
-A
/rawsecrets/aws_secret_access_key
-o
/secrets/aws/credentials.ini
-c
/config/env/CLUSTERID
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 04 Aug 2023 02:47:00 -0500
Finished: Fri, 04 Aug 2023 02:47:07 -0500
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/config from envfiles (rw)
/etc/pki/ca-trust/extracted/pem from trusted-ca-bundle (ro)
/rawsecrets from awsrawcreds (ro)
/secrets from secrets (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v6kld (ro)
Containers:
main:
Container ID:
Image: quay.io/app-sre/managed-prometheus-exporter-base:latest
Image ID:
Port: 8080/TCP
Host Port: 0/TCP
Command:
/bin/sh
/monitor/start.sh
State: Waiting
Reason: ImagePullBackOff
Ready: False
Restart Count: 0
Liveness: http-get http://:8080/ delay=420s timeout=240s period=360s #success=1 #failure=2
Readiness: http-get http://:8080/ delay=3s timeout=240s period=10s #success=1 #failure=3
Environment:
AWS_SHARED_CREDENTIALS_FILE: /secrets/aws/credentials.ini
AWS_CONFIG_FILE: /secrets/aws/config.ini
PYTHONPATH: /openshift-python/packages:/support/packages
Mounts:
/config from envfiles (ro)
/etc/pki/ca-trust/extracted/pem from trusted-ca-bundle (ro)
/monitor from monitor-volume (ro)
/secrets from secrets (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v6kld (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
awsrawcreds:
Type: Secret (a volume populated by a Secret)
SecretName: sre-ebs-iops-reporter-aws-credentials
Optional: false
secrets:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
envfiles:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
monitor-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: sre-ebs-iops-reporter-code
Optional: false
trusted-ca-bundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: sre-ebs-iops-reporter-trusted-ca-bundle
Optional: false
kube-api-access-v6kld:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/infra:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Attempting to remove the pod sandbox manually on the node with crictl causes CRI-O to return an error:
sh-4.4# crictl rmp sre-ebs-iops-reporter-3-mwf64
getting sandbox status of pod "sre-ebs-iops-reporter-3-mwf64": rpc error: code = NotFound desc = could not find pod "sre-ebs-iops-reporter-3-mwf64": PodSandbox with ID starting with sre-ebs-iops-reporter-3-mwf64 not found: ID does not exist
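For completeness, the node-side and API-side follow-ups one would normally try next are sketched below (hypothetical commands, not output captured from this case):

# On the node: list any sandboxes or containers CRI-O still knows about for this pod
crictl pods --name sre-ebs-iops-reporter-3-mwf64
crictl ps -a | grep sre-ebs-iops-reporter-3-mwf64

# From the API: force removal of the pod object (this does not clean up any node-side state)
oc delete pod sre-ebs-iops-reporter-3-mwf64 -n openshift-monitoring --grace-period=0 --force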
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Unknown
Steps to Reproduce:
1. Put a pod into an ImagePullBackOff state
2. Attempt to delete it
3. It will get stuck terminating (a sketch of these steps follows below)
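A minimal reproduction sketch of those steps (the image reference is deliberately bogus and the pod name is illustrative):

# 1. Create a pod that can never pull its image, so it ends up in ImagePullBackOff
oc run ipbo-repro --image=quay.io/example/does-not-exist:latest -n default
oc get pod ipbo-repro -n default -w    # wait for STATUS to show ImagePullBackOff

# 2. Attempt to delete it
oc delete pod ipbo-repro -n default

# 3. Per this report, the pod may remain in Terminating instead of being removed
oc get pod ipbo-repro -n default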
Actual results:
Pod does not terminate
Expected results:
Pod terminates
Additional info: