OpenShift Bugs: OCPBUGS-28377

OCP 4.12: CNI request failed with status 400: failed to get pod annotation: timed out waiting for annotations: context deadline exceeded


    • Important
    • No
    • SDN Sprint 250
    • 1
    • False
    • Large cluster launching 128 pods all at once; the workaround is to set the PyTorch jobs to restart

      ENV:
      Cluster Version: 4.12.30

      Infrastructure
      --------------
      Platform: IBMCloud
      Install Type: UPI

      Network
      -------
      Network Type: OVNKubernetes

      Description of problem:
      The OpenShift cluster is large, with more than 200 workers. When we launch PyTorchJobs at a scale of 128 pods, a single pod often fails with a CNI error: the CNI request fails with status 400 after timing out while waiting for pod annotations (context deadline exceeded).

      It's not 100 percent reproducible. After multiple PyTorchJob launch attempts, the full 128-pod job often does launch successfully, but only after many attempts, which costs cluster users time.

      ~~~
      Warning FailedCreatePodSandBox 85s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_rp-granite-code-8b-4k-l1-r1-worker-48_granite-prod_9e1ea9da-42aa-4bd4-a34a-6381a96e6590_0(47fe6bb8e447ef3175c9f00043a0f76478028cc23a26fd7f72864235f524ec3f): error adding pod granite-prod_rp-granite-code-8b-4k-l1-r1-worker-48 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [granite-prod/rp-granite-code-8b-4k-l1-r1-worker-48/9e1ea9da-42aa-4bd4-a34a-6381a96e6590:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[granite-prod/rp-granite-code-8b-4k-l1-r1-worker-48 47fe6bb8e447ef3175c9f00043a0f76478028cc23a26fd7f72864235f524ec3f] [granite-prod/rp-granite-code-8b-4k-l1-r1-worker-48 47fe6bb8e447ef3175c9f00043a0f76478028cc23a26fd7f72864235f524ec3f] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
      ~~~
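
To find the one failing pod among 128 more quickly, the event text above can be matched programmatically. The following is a minimal sketch, not part of the original report: the function name `find_annotation_timeout` is hypothetical, and the pattern assumes the event keeps the same wording as the message quoted above (`error adding pod <namespace>_<pod> to CNI network` followed by the annotation timeout).

```python
import re
from typing import Optional, Tuple

# Failure signature copied from the event quoted in this report.
ANNOTATION_TIMEOUT = "failed to get pod annotation: timed out waiting for annotations"

# Assumed shape of the pod reference in the event: "<namespace>_<pod>".
# Namespaces cannot contain underscores, so the first underscore is the separator.
POD_RE = re.compile(r"error adding pod (?P<ns>[^_\s]+)_(?P<pod>\S+) to CNI network")


def find_annotation_timeout(message: str) -> Optional[Tuple[str, str]]:
    """Return (namespace, pod) if the message shows the annotation timeout, else None."""
    if ANNOTATION_TIMEOUT not in message:
        return None
    match = POD_RE.search(message)
    if match is None:
        return None
    return match.group("ns"), match.group("pod")
```

Feeding each `FailedCreatePodSandBox` event message through this function would return the namespace and pod name for events that show this timeout, and `None` for everything else.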

              jtanenba@redhat.com Jacob Tanenbaum
              rhn-support-dseals Daniel Seals
              Anurag Saxena Anurag Saxena