Type: Bug
Resolution: Unresolved
Affects Version: 4.18
Severity: Important
Description of problem:
Simultaneously creating a large number of pods (e.g., 700-800) on an HCP (Hosted Control Plane) cluster leads to multiple pods remaining in the ContainerCreating state for an extended period. The events for these pods show repeated FailedCreatePodSandBox warnings. This behavior is not reproducible on the equivalent management cluster, suggesting a performance or scaling issue specific to the hosted cluster architecture or its networking stack (OVN-Kubernetes) under stress.
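For reference, a minimal sketch of how the FailedCreatePodSandBox warning events can be enumerated, assuming the official kubernetes Python client, a kubeconfig pointing at the hosted cluster, and an illustrative namespace name (not taken from this report):

# Sketch: list FailedCreatePodSandBox warning events in the test namespace.
# Assumes the `kubernetes` Python client and a kubeconfig for the hosted cluster;
# the namespace name below is illustrative.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

events = core.list_namespaced_event(
    namespace="test",
    field_selector="reason=FailedCreatePodSandBox,type=Warning",
)
for ev in events.items:
    print(ev.involved_object.name, ev.reason, (ev.message or "")[:120])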
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Create an HCP cluster with a node capable of running 800+ pods.
2. Create any deployment.
3. Make sure there are no image pull issues (e.g., QPS throttling or slow pulls from the OpenShift image registry); if possible, prefer an image hosted on quay.io.
4. Scale the deployment to 800-1000 replicas (see the sketch after this list).
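A minimal reproduction sketch using the official kubernetes Python client, under the assumption of a kubeconfig pointing at the hosted cluster and an existing "test" namespace; the deployment name and image reference are illustrative placeholders, not values from the customer environment:

# Reproduction sketch (assumptions: kubeconfig for the hosted cluster, an
# existing "test" namespace; deployment name and image are illustrative).
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()
core = client.CoreV1Api()

ns = "test"
name = "scale-test"

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name=name),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": name}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": name}),
            spec=client.V1PodSpec(containers=[
                # Illustrative image reference; substitute any small image hosted on quay.io.
                client.V1Container(name="app", image="quay.io/example/app:latest"),
            ]),
        ),
    ),
)
apps.create_namespaced_deployment(namespace=ns, body=deployment)

# Scale to 800 replicas in one step to stress pod sandbox creation on the node.
apps.patch_namespaced_deployment_scale(
    name=name, namespace=ns, body={"spec": {"replicas": 800}}
)

# Count pods that are stuck in ContainerCreating.
pods = core.list_namespaced_pod(namespace=ns, label_selector=f"app={name}")
stuck = [
    p.metadata.name
    for p in pods.items
    if any(
        cs.state and cs.state.waiting and cs.state.waiting.reason == "ContainerCreating"
        for cs in (p.status.container_statuses or [])
    )
]
print(f"{len(stuck)} pods stuck in ContainerCreating")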
Actual results:
Pods remain stuck in the ContainerCreating state for a long time, with events such as:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-685b4f4bcc-24khb_test-vedant_44b2ed44-52c4-499e-a6c0-4ffb3b3972e6_0(947dab718187b327075cc5cb1f7859a65832f357f286f1294e872646562d9124): error adding pod test-vedant_test-685b4f4bcc-24khb to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"947dab718187b327075cc5cb1f7859a65832f357f286f1294e872646562d9124" Netns:"/var/run/netns/c2d80893-9998-41f9-96b6-0a7d4231c158" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=test-vedant;K8S_POD_NAME=test-685b4f4bcc-24khb;K8S_POD_INFRA_CONTAINER_ID=947dab718187b327075cc5cb1f7859a65832f357f286f1294e872646562d9124;K8S_POD_UID=44b2ed44-52c4-499e-a6c0-4ffb3b3972e6" Path:"" ERRORED: error configuring pod [test-vedant/test-685b4f4bcc-24khb] networking: [test-vedant/test-685b4f4bcc-24khb/44b2ed44-52c4-499e-a6c0-4ffb3b3972e6:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[test-vedant/test-685b4f4bcc-24khb 947dab718187b327075cc5cb1f7859a65832f357f286f1294e872646562d9124 network default NAD default] [test-vedant/test-685b4f4bcc-24khb 947dab718187b327075cc5cb1f7859a65832f357f286f1294e872646562d9124 network default NAD default] failed to configure pod interface: failed to run 'ovs-vsctl --timeout=30 --if-exists clear port 947dab718187b32 qos': exit status 1 "ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)\n" ' ': StdinData: {"binDir":"/var/lib/cni/bin","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","cniVersion":"0.3.1","daemonSocketDir":"/run/multus/socket","globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator,openshift-cnv","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","namespaceIsolation":true,"type":"multus-shim"}
Expected results:
Pods should start running promptly; pod sandbox creation should succeed without repeated FailedCreatePodSandBox errors.
Additional info:
The customer's HCP cluster is on the agent platform; I also tested on an HCP KubeVirt platform and observed similar behaviour.
Affected Platforms:
Is it a:
- customer issue / SD
If it is a customer / SD issue:
- Cluster details and logs are being added to the drive.