OpenShift Bugs · OCPBUGS-52451

Disaster Recovery Test: Pods stuck in CreateContainerError because requests are timing out in the CRI-O due to system load

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: 4.15.z
    • Component: Node / CRI-O

      Description of problem:

      After performing a Disaster Recovery Test, the customer's application pods are stuck in CreateContainerError because requests to CRI-O are timing out due to system load.

      Version-Release number of selected component (if applicable):

      v4.15.39

      How reproducible:

      The Disaster Recovery Test has been performed twice, and the issue was observed in both runs.

      Steps to Reproduce:

      Disaster Recovery Test steps:
      1. Run the 'aws:elasticache:interrupt-cluster-az-power' and 'aws:network:disrupt-connectivity' FIS actions on the node in AZ-1a to make the AZ-1a network unavailable.
      2. Confirm that the primary node has switched to the node in AZ-1d.
      3. Stop the FIS experiment started in step 1.
      4. Make the node in AZ-1a primary again following: https://docs.aws.amazon.com/ja_jp/AmazonElastiCache/latest/dg/Replication.PromoteReplica.html
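      For reference, a network-disruption experiment of the kind used in step 1 is defined by an AWS FIS experiment template. The following is a minimal sketch only; the ARN, tags, and duration are placeholders and are not taken from this report:

```json
{
  "description": "Disrupt network connectivity for AZ-1a subnets (sketch)",
  "targets": {
    "az-1a-subnets": {
      "resourceType": "aws:ec2:subnet",
      "resourceTags": { "AZ": "ap-northeast-1a" },
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "disrupt-connectivity": {
      "actionId": "aws:network:disrupt-connectivity",
      "parameters": { "duration": "PT10M", "scope": "all" },
      "targets": { "Subnets": "az-1a-subnets" }
    }
  },
  "stopConditions": [{ "source": "none" }],
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role"
}
```

      Stopping the experiment (step 3) restores connectivity, after which the stuck pods reportedly recovered on their own.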

      Actual results:

      There were pods stuck in CreateContainerError.
      
      First time 3 pods:
      -----------------------
      kubelet Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_a-jikosvc_a-jikosvc-58c796d8d-65fzk_sp1-app_7b03c25e-cab8-4a79-9002-f0076d6036da_2 for id 96b6ed6f763c73ef32631f09b1aa7fa1b3715544244aba6da6444f4fa5b33374: name is reserved
      
      (on node ip-10-181-145-245.ap-northeast-1.compute.internal AZ-1c)
      -----------------------
      
      -----------------------
       kubelet  Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_i-resche_i-resche-7bf8fcd4bf-x242h_sp1-app_eaea7490-60ec-418b-a413-341c48df4369_1 for id 438732d8122854cffa1f69c43c4ecdf85361ddb91ccad133f9ffb3efec268a5f: name is reserved
      
      (on node ip-10-181-145-52.ap-northeast-1.compute.internal AZ-1c)
      -----------------------
      
      -----------------------
      kubelet Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_r-newwebsvc_r-newwebsvc-7659fff7d8-hdwh4_sp1-app_9a00ba93-0808-4495-9007-2b3f27853df0_1 for id 8a69a8305811daf260cf936bc2d0a4a55291465650bc1cc3a33ee96e614cbeca: name is reserved
      
      (on node ip-10-181-146-5.ap-northeast-1.compute.internal AZ-1d)
      -----------------------
      
      Second time 1 pod:
      -----------------------
      Feb 28 03:06:41.522659 ip-10-181-146-5 kubenswrapper[2629]: E0228 03:06:41.522623    2629 remote_runtime.go:319] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_i-shussaisvc_i-shussaisvc-6f5499df7f-svd9j_sp1-app_9bf8aa26-7829-4dd1-8cfe-f242de70ea09_1 for id d6984b1ec49aa23e72fac71d02a4d8a6ba0dd46a848ad759b8ac8a4a27fb8e27: name is reserved" podSandboxID="02827a1691c85dde2a71cde222eae78dba2fd079df458b145d3ec3019e780983"
      
      (on node ip-10-181-146-5.ap-northeast-1.compute.internal AZ-1d)
      -----------------------
      
      Pods with the above issue were observed several minutes after starting the test against AZ-1a.
      The issue occurred after the affected pods were rescheduled to the unimpacted AZs (AZ-1c, AZ-1d); once FIS was stopped, they went directly from CreateContainerError to Running without the nodes being rebooted.
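      The "name is reserved" failure mode in the logs above can be illustrated with a minimal sketch (hypothetical code, not CRI-O's actual implementation): the first CreateContainer request reserves the container name, and when that request exceeds the client-side deadline while still running server-side, the kubelet's retry with a new container ID finds the name already taken.

```python
class NameRegistry:
    """Minimal sketch (hypothetical, not CRI-O's actual code) of
    create-request name reservation.

    The first create reserves the container name before the slow
    volume-configuration stage. If that request times out on the
    kubelet side but is still in flight on the runtime side, the
    kubelet's retry (with a fresh container ID) fails because the
    name is still reserved by the original request."""

    def __init__(self):
        self._reserved = {}  # container name -> container id

    def create_container(self, name, ctr_id):
        if name in self._reserved:
            # Mirrors the error text seen in the kubelet logs above.
            raise RuntimeError(
                f"error reserving ctr name {name} for id {ctr_id}: "
                "name is reserved"
            )
        self._reserved[name] = ctr_id
        # ...slow container volume configuration happens here under load...


reg = NameRegistry()
# First attempt: reserves the name, then stalls past the kubelet deadline.
reg.create_container("k8s_app_pod_ns_uid_0", "aaa111")
try:
    # Kubelet retry with a new container ID after the timeout.
    reg.create_container("k8s_app_pod_ns_uid_0", "bbb222")
except RuntimeError as err:
    print(err)
```

      This is consistent with the pods recovering once FIS was stopped: when the load subsides and the original request completes or is cleaned up, the reservation is released and a retry can succeed.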
      
      Expected results:
      No CRI-O issues after the Disaster Recovery Test.

      Additional info:

      Cameron Meadors (cmeadors@redhat.com)
      Miao Qiao (rhn-support-hqiao)
      Cameron Meadors