- Bug
- Resolution: Unresolved
- Normal
- None
- 4.15.z
- None
- None
- False
Description of problem:
After performing a Disaster Recovery Test, the customer's application pods are stuck in CreateContainerError because requests to CRI-O are timing out due to system load.
Version-Release number of selected component (if applicable):
v4.15.39
How reproducible:
The Disaster Recovery Test has been run twice, and the issue was observed in both runs.
Steps to Reproduce:
Disaster Recovery Test steps:
1. Run 'aws:elasticache:interrupt-cluster-az-power' and 'aws:network:disrupt-connectivity' on the node in AZ-1a using FIS to make the AZ-1a network unavailable.
2. Confirm that the primary node has switched to the node in AZ-1d.
3. Stop the FIS experiment started in step 1.
4. Make the node in AZ-1a primary again following: https://docs.aws.amazon.com/ja_jp/AmazonElastiCache/latest/dg/Replication.PromoteReplica.html
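After step 4, affected pods can be spotted by filtering `oc get pods -A --no-headers` output for the CreateContainerError state. A minimal sketch (the sample text below stands in for real cluster output, which depends on the environment):

```python
# Sketch: filter `oc get pods -A --no-headers` output for pods stuck in
# CreateContainerError. The sample string is illustrative; on a real
# cluster, feed in the actual command output instead.
sample = """\
sp1-app  a-jikosvc-58c796d8d-65fzk   0/1  CreateContainerError  2  10m
sp1-app  i-resche-7bf8fcd4bf-x242h   1/1  Running               1  10m
"""

# Columns: NAMESPACE NAME READY STATUS RESTARTS AGE
stuck = [
    f"{ns}/{pod}"
    for ns, pod, _, status, *_ in (line.split() for line in sample.splitlines())
    if status == "CreateContainerError"
]
print(stuck)  # ['sp1-app/a-jikosvc-58c796d8d-65fzk']
```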
Actual results:
There were pods stuck in CreateContainerError.

First time, 3 pods:
-----------------------
kubelet Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_a-jikosvc_a-jikosvc-58c796d8d-65fzk_sp1-app_7b03c25e-cab8-4a79-9002-f0076d6036da_2 for id 96b6ed6f763c73ef32631f09b1aa7fa1b3715544244aba6da6444f4fa5b33374: name is reserved (on node ip-10-181-145-245.ap-northeast-1.compute.internal, AZ-1c)
-----------------------
-----------------------
kubelet Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_i-resche_i-resche-7bf8fcd4bf-x242h_sp1-app_eaea7490-60ec-418b-a413-341c48df4369_1 for id 438732d8122854cffa1f69c43c4ecdf85361ddb91ccad133f9ffb3efec268a5f: name is reserved (on node ip-10-181-145-52.ap-northeast-1.compute.internal, AZ-1c)
-----------------------
-----------------------
kubelet Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_r-newwebsvc_r-newwebsvc-7659fff7d8-hdwh4_sp1-app_9a00ba93-0808-4495-9007-2b3f27853df0_1 for id 8a69a8305811daf260cf936bc2d0a4a55291465650bc1cc3a33ee96e614cbeca: name is reserved (on node ip-10-181-146-5.ap-northeast-1.compute.internal, AZ-1d)
-----------------------

Second time, 1 pod:
-----------------------
Feb 28 03:06:41.522659 ip-10-181-146-5 kubenswrapper[2629]: E0228 03:06:41.522623 2629 remote_runtime.go:319] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_i-shussaisvc_i-shussaisvc-6f5499df7f-svd9j_sp1-app_9bf8aa26-7829-4dd1-8cfe-f242de70ea09_1 for id d6984b1ec49aa23e72fac71d02a4d8a6ba0dd46a848ad759b8ac8a4a27fb8e27: name is reserved" podSandboxID="02827a1691c85dde2a71cde222eae78dba2fd079df458b145d3ec3019e780983" (on node ip-10-181-146-5.ap-northeast-1.compute.internal, AZ-1d)
-----------------------

The affected pods were observed several minutes after starting the test against AZ-1a. The issue occurred after the pods were rescheduled to the unaffected AZs (AZ-1c, AZ-1d), and they went directly from CreateContainerError to Running after FIS was stopped, without rebooting the nodes.
Expected results:
No CRI-O issue after Disaster Recovery Test.
Additional info:
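The "name is reserved" symptom can be illustrated with a small simulation. This is not CRI-O source code, just a hedged sketch of the observed sequence: the runtime reserves the container name before doing the slow work, the client-side deadline expires while the original request is still in flight, and the kubelet's retry then collides with the still-held reservation.

```python
# Illustrative simulation (not CRI-O code) of why a timed-out CreateContainer
# followed by a kubelet retry reports "name is reserved".
reserved = {}

def create_container(name, ctr_id, times_out=False):
    if name in reserved:
        # A previous (possibly still in-flight) request holds the name.
        return f"error reserving ctr name {name} for id {ctr_id}: name is reserved"
    reserved[name] = ctr_id  # the name is reserved before the slow work starts
    if times_out:
        # The runtime keeps working past the client deadline, so the
        # reservation is not released and the retry below collides with it.
        return "context deadline exceeded"
    return "ok"

first = create_container("k8s_app_pod_ns_uid_1", "aaa", times_out=True)
retry = create_container("k8s_app_pod_ns_uid_1", "bbb")
print(first)  # context deadline exceeded
print(retry)  # error reserving ctr name ... : name is reserved
```

This matches the observation that the pods recovered on their own once the FIS-induced load stopped: when the stale reservation is eventually released, a later retry succeeds without a node reboot.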