OpenShift Bugs · OCPBUGS-52451

Disaster Recovery Test: Pods stuck in CreateContainerError because requests are timing out in the CRI-O due to system load

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: 4.15.z
    • Component: Node / CRI-O

      Description of problem:

      After performing a Disaster Recovery Test, the customer's application pods are stuck in CreateContainerError because requests to CRI-O are timing out due to system load.

      Version-Release number of selected component (if applicable):

      v4.15.39

      How reproducible:

      The Disaster Recovery Test has been performed twice, and the issue was observed in both runs.

      Steps to Reproduce:

      Disaster Recovery Test steps:
      1. Run the 'aws:elasticache:interrupt-cluster-az-power' and 'aws:network:disrupt-connectivity' FIS actions on the node in AZ-1a to make the AZ-1a network unavailable.
      2. Confirm that the primary node has switched to the node in AZ-1d.
      3. Stop the FIS experiment started in step 1.
      4. Make the node in AZ-1a primary again following: https://docs.aws.amazon.com/ja_jp/AmazonElastiCache/latest/dg/Replication.PromoteReplica.html
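      For reference, a network-disruption experiment of the kind used in step 1 is defined by an AWS FIS experiment template. The following is a minimal sketch only; the ARN, tags, and duration are placeholders and are not taken from this report:

```json
{
  "description": "Disrupt network connectivity for AZ-1a subnets (sketch)",
  "targets": {
    "az-1a-subnets": {
      "resourceType": "aws:ec2:subnet",
      "resourceTags": { "AZ": "ap-northeast-1a" },
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "disrupt-connectivity": {
      "actionId": "aws:network:disrupt-connectivity",
      "parameters": { "duration": "PT10M", "scope": "all" },
      "targets": { "Subnets": "az-1a-subnets" }
    }
  },
  "stopConditions": [{ "source": "none" }],
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role"
}
```

      Stopping the experiment (step 3) restores connectivity, after which the stuck pods reportedly recovered on their own.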

      Actual results:

      There were pods stuck in CreateContainerError.
      
      First time 3 pods:
      -----------------------
      kubelet Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_a-jikosvc_a-jikosvc-58c796d8d-65fzk_sp1-app_7b03c25e-cab8-4a79-9002-f0076d6036da_2 for id 96b6ed6f763c73ef32631f09b1aa7fa1b3715544244aba6da6444f4fa5b33374: name is reserved
      
      (on node ip-10-181-145-245.ap-northeast-1.compute.internal AZ-1c)
      -----------------------
      
      -----------------------
       kubelet  Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_i-resche_i-resche-7bf8fcd4bf-x242h_sp1-app_eaea7490-60ec-418b-a413-341c48df4369_1 for id 438732d8122854cffa1f69c43c4ecdf85361ddb91ccad133f9ffb3efec268a5f: name is reserved
      
      (on node ip-10-181-145-52.ap-northeast-1.compute.internal AZ-1c)
      -----------------------
      
      -----------------------
      kubelet Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_r-newwebsvc_r-newwebsvc-7659fff7d8-hdwh4_sp1-app_9a00ba93-0808-4495-9007-2b3f27853df0_1 for id 8a69a8305811daf260cf936bc2d0a4a55291465650bc1cc3a33ee96e614cbeca: name is reserved
      
      (on node ip-10-181-146-5.ap-northeast-1.compute.internal AZ-1d)
      -----------------------
      
      Second time 1 pod:
      -----------------------
      Feb 28 03:06:41.522659 ip-10-181-146-5 kubenswrapper[2629]: E0228 03:06:41.522623    2629 remote_runtime.go:319] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_i-shussaisvc_i-shussaisvc-6f5499df7f-svd9j_sp1-app_9bf8aa26-7829-4dd1-8cfe-f242de70ea09_1 for id d6984b1ec49aa23e72fac71d02a4d8a6ba0dd46a848ad759b8ac8a4a27fb8e27: name is reserved" podSandboxID="02827a1691c85dde2a71cde222eae78dba2fd079df458b145d3ec3019e780983"
      
      (on node ip-10-181-146-5.ap-northeast-1.compute.internal AZ-1d)
      -----------------------
      
      Pods with the above issue were observed several minutes after starting the test against AZ-1a.
      The issue occurred after the affected pods were rescheduled to the unimpacted AZs (AZ-1c, AZ-1d); once FIS was stopped, they went directly from CreateContainerError to Running without the nodes being rebooted.
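      The "name is reserved" failure mode in the logs above can be illustrated with a minimal sketch (hypothetical code, not CRI-O's actual implementation): the first CreateContainer request reserves the container name, and when that request exceeds the client-side deadline while still running server-side, the kubelet's retry with a new container ID finds the name already taken.

```python
class NameRegistry:
    """Minimal sketch (hypothetical, not CRI-O's actual code) of
    create-request name reservation.

    The first create reserves the container name before the slow
    volume-configuration stage. If that request times out on the
    kubelet side but is still in flight on the runtime side, the
    kubelet's retry (with a fresh container ID) fails because the
    name is still reserved by the original request."""

    def __init__(self):
        self._reserved = {}  # container name -> container id

    def create_container(self, name, ctr_id):
        if name in self._reserved:
            # Mirrors the error text seen in the kubelet logs above.
            raise RuntimeError(
                f"error reserving ctr name {name} for id {ctr_id}: "
                "name is reserved"
            )
        self._reserved[name] = ctr_id
        # ...slow container volume configuration happens here under load...


reg = NameRegistry()
# First attempt: reserves the name, then stalls past the kubelet deadline.
reg.create_container("k8s_app_pod_ns_uid_0", "aaa111")
try:
    # Kubelet retry with a new container ID after the timeout.
    reg.create_container("k8s_app_pod_ns_uid_0", "bbb222")
except RuntimeError as err:
    print(err)
```

      This is consistent with the pods recovering once FIS was stopped: when the load subsides and the original request completes or is cleaned up, the reservation is released and a retry can succeed.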
      
      Expected results:
      No CRI-O issues after the Disaster Recovery Test.

      Additional info:

      Cameron Meadors (cmeadors@redhat.com)
      Miao Qiao (rhn-support-hqiao)
      Cameron Meadors