-
Bug
-
Resolution: Won't Do
-
Normal
-
None
-
4.13
-
Important
-
No
-
SDN Sprint 247, SDN Sprint 248, SDN Sprint 249, SDN Sprint 250
-
4
-
False
-
Description of problem:
When running cluster-density-v2 with 2268 iterations on a 252-node 4.13 ARO cluster, the overall podLatency times are extremely high.
The OCP cluster was created on ARO with the following instance types:
Master Type: Standard_D32s_v5
Worker Type: Standard_D8s_v5
Infra Type: Standard_E16s_v5
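As a quick sanity check (not part of the original report), the node count per instance type can be confirmed before the run; node.kubernetes.io/instance-type is the standard well-known node label, and the pipeline below just summarizes it:
$ oc get nodes -L node.kubernetes.io/instance-type --no-headers | awk '{print $NF}' | sort | uniq -c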
Version-Release number of selected component (if applicable):
4.13.22
How reproducible:
This is reproducible, and is only seen at scale.
Steps to Reproduce:
1. Run cluster-density-v2 with 2268 namespaces on 252 nodes
2. git clone https://github.com/cloud-bulldozer/e2e-benchmarking; cd e2e-benchmarking/workloads/kube-burner-ocp-wrapper
3. ITERATIONS=2268 WORKLOAD=cluster-density-v2 ./run.sh
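After the run, a quick way to see the symptom is to count pods that are Running but not Ready; this is a sketch (not part of the workload itself), assuming the kube-burner-job label shown in the pod description further below:
$ oc get pods -A -l kube-burner-job=cluster-density-v2 --no-headers | awk '$4 == "Running" && $3 != "1/1"' | wc -l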
Actual results:
The pods' readiness probes fail (connection failures, 503s and timeouts against the service and route endpoints), so the pods never reach the Ready state.
Expected results:
All the pods should be in the Running state and pass their readiness probes.
Additional info:
$ oc describe po client-1-c7d4c6df6-rth25 -n cluster-density-v2-596
Name:             client-1-c7d4c6df6-rth25
Namespace:        cluster-density-v2-596
Priority:         0
Service Account:  default
Node:             krishvoor-scale-2hfcr-worker-eastus1-dhpmh/10.0.2.168
Start Time:       Wed, 27 Dec 2023 12:28:13 +0530
Labels:           app=client
                  kube-burner-index=3
                  kube-burner-job=cluster-density-v2
                  kube-burner-runid=a5fa9b9b-8a3d-4986-9025-e4b03efcdb85
                  kube-burner-uuid=a24d848b-ba4e-4ee7-8b41-472f6ff881a2
                  name=client-1
                  pod-template-hash=c7d4c6df6
Annotations:      k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.130.34.110/23"],"mac_address":"0a:58:0a:82:22:6e","gateway_ips":["10.130.34.1"],"ip_address":"10.130.34.11...
                  k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.130.34.110" ], "mac": "0a:58:0a:82:22:6e", "default": true, "dns": {} }]
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
IP:               10.130.34.110
IPs:
  IP:             10.130.34.110
Controlled By:    ReplicaSet/client-1-c7d4c6df6
Containers:
  client-app:
    Container ID:   cri-o://4241e52f372ebc1ed9eeed4e865fa9b6a995079d0ee4b9138d65b16773a5273d
    Image:          quay.io/cloud-bulldozer/curl:latest
    Image ID:       quay.io/cloud-bulldozer/curl@sha256:4311823d3576c0b7330beccbe09896ff0378c9c1c6f6974ff9064af803fed766
    Port:           <none>
    Host Port:      <none>
    Command:
      sleep
      inf
    State:          Running
      Started:      Wed, 27 Dec 2023 12:28:21 +0530
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     10Mi
    Readiness:    exec [/bin/sh -c curl --fail -sS ${SERVICE_ENDPOINT} -o /dev/null && curl --fail -sSk ${ROUTE_ENDPOINT} -o /dev/null] delay=0s timeout=5s period=10s #success=1 #failure=3
    Environment:
      ENVVAR1:           kc6ykh5F7XP7fVbKHSQJJnSA3lgfsXR0ue0AskN8id0W4JkMF8iozCwPvIHamgaHUwnq4xgKKdjc9Xdw4M2AEaHCQUVwwyxtxI5yopzUsQ91iF5DH6qM5sCLWSG0qqjatzh42AjGAUR1qxojcse9X927umXkCO1pwIUOQIBvDINtSrPAcSDKrVJIEUA6tzpFCfxgYW3kKv2CPaXsMtcAeDB5ZGgjyrx5CZ0jRLPXuwP4HCUWP3srfPPfQK
      ENVVAR2:           xqX8MhvWT3Acp2WYUHgwsK2N9fgialvYDbWjghDQcVVinlz32l5ygQI2d9PjhCLDHWiwrYiNaikConbuicwRhDv4hhjF4YTQbqNg2y0Rlt6EoGc0AUE1PPzkd3mJJe5X9IHnc3gdFk7hIrA7p1aS8fhoOchzk7oxnBOJ0iAPtKkVupWAeC9zzmDYjEpPMfK49Ll0E9CTfc7cz5uEvN5cqwYQvE4NpAJLrQjhz7JmOBeF1bYNXtOnvsZSRx
      ENVVAR3:           1iBE9wYFnMwCWeg9CcSHCrPtL5CJ2WcJyS6jPrrXHWf860Gr5jpyaEk0OuctkVkym3KtncUNMgjfC8iLls49x6DOxktqxDCuqc6Mea5p9gzRcRXOlTfnm4Yd3ILy9paYKsxCs8Kl2ipYAfpxWVRvjKic0hcPWrQWjXY4jEE8cg4PVJYjFrXskjBgptlV1B9W2gmGaB3GFuzDwwBtHpEW8EXjeEEKyJStKzzdeLkyAUoRS2YeEYEcTO6y6S
      ENVVAR4:           UFVArwbf7cfLH1CPtcKlNWKaoVTwW0ZG2Q78sVKTz75VpG4oBxItbnIwEKkbhwbxweWVxF2qfIwcyYTqyg2FefBRcxWwPs4Yxrheqs0uUAeewo2dGoOSW6iQTaMTKXmGDpFB17p2hWWXymwtxwedhLR56XBSW3Uyaqb3p7vnWCnYu6UVdF81ztBIyq4zA4hnWwIBcQ5HL43WxGmKr4iUF4U1Wj7OyTqD5YyzByhervnyMHOR5myFThtSPD
      ROUTE_ENDPOINT:    https://cluster-density-1-cluster-density-v2-596.apps.w1i1fsrv.eastus.aroapp.io/256.html
      SERVICE_ENDPOINT:  http://cluster-density-3/256.html
    Mounts:
      /configmap1 from configmap-1 (rw)
      /configmap2 from configmap-2 (rw)
      /configmap3 from configmap-3 (rw)
      /configmap4 from configmap-4 (rw)
      /etc/podlabels from podinfo (rw)
      /secret1 from secret-1 (rw)
      /secret2 from secret-2 (rw)
      /secret3 from secret-3 (rw)
      /secret4 from secret-4 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6q7tk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  secret-1:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-density-v2-1
    Optional:    false
  secret-2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-density-v2-2
    Optional:    false
  secret-3:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-density-v2-3
    Optional:    false
  secret-4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-density-v2-4
    Optional:    false
  configmap-1:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cluster-density-v2-1
    Optional:  false
  configmap-2:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cluster-density-v2-2
    Optional:  false
  configmap-3:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cluster-density-v2-3
    Optional:  false
  configmap-4:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cluster-density-v2-4
    Optional:  false
  podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
  kube-api-access-6q7tk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints: kubernetes.io/hostname:ScheduleAnyway when max skew 1 is exceeded for selector app=client
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       10m                   default-scheduler  Successfully assigned cluster-density-v2-596/client-1-c7d4c6df6-rth25 to krishvoor-scale-2hfcr-worker-eastus1-dhpmh
  Normal   AddedInterface  10m                   multus             Add eth0 [10.130.34.110/23] from ovn-kubernetes
  Normal   Pulled          10m                   kubelet            Container image "quay.io/cloud-bulldozer/curl:latest" already present on machine
  Normal   Created         10m                   kubelet            Created container client-app
  Normal   Started         10m                   kubelet            Started container client-app
  Warning  Unhealthy       10m                   kubelet            Readiness probe failed: curl: (7) Failed to connect to cluster-density-3 port 80 after 2 ms: Couldn't connect to server
  Warning  Unhealthy       10m                   kubelet            Readiness probe failed: curl: (7) Failed to connect to cluster-density-3 port 80 after 0 ms: Couldn't connect to server
  Warning  Unhealthy       10m                   kubelet            Readiness probe failed: curl: (7) Failed to connect to cluster-density-3 port 80 after 1 ms: Couldn't connect to server
  Warning  Unhealthy       10m                   kubelet            Readiness probe failed: curl: (22) The requested URL returned error: 503
  Warning  Unhealthy       13s (x36 over 9m53s)  kubelet            Readiness probe failed: command timed out
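To narrow down which half of the probe fails (service path vs. route path), the same probe command can be replayed by hand from inside an affected pod. This is only a sketch of a possible follow-up check, not output captured in the must-gather; the -w format strings are plain curl write-out variables:
$ oc exec -n cluster-density-v2-596 client-1-c7d4c6df6-rth25 -- /bin/sh -c 'curl --fail -sS -w "service: %{http_code} %{time_total}s\n" ${SERVICE_ENDPOINT} -o /dev/null; curl --fail -sSk -w "route: %{http_code} %{time_total}s\n" ${ROUTE_ENDPOINT} -o /dev/null'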
Ping test to an Infra Node:
$ oc debug node/krishvoor-scale-2hfcr-infra-aro-machinesets-eastus-2-tbb7b
Starting pod/krishvoor-scale-2hfcr-infra-aro-machinesets-eastus-2-tbb7b-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.2.11
If you don't see a command prompt, try pressing enter.
sh-4.4# ping 10.0.2.149
PING 10.0.2.149 (10.0.2.149) 56(84) bytes of data.
64 bytes from 10.0.2.149: icmp_seq=3 ttl=64 time=5.77 ms
64 bytes from 10.0.2.149: icmp_seq=4 ttl=64 time=8.90 ms
64 bytes from 10.0.2.149: icmp_seq=9 ttl=64 time=4.38 ms
64 bytes from 10.0.2.149: icmp_seq=12 ttl=64 time=2.78 ms
^C
--- 10.0.2.149 ping statistics ---
12 packets transmitted, 4 received, 66.6667% packet loss, time 11218ms
rtt min/avg/max/mdev = 2.783/5.458/8.896/2.250 ms
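A larger, fixed-size sample would make the loss rate easier to trust than an interrupted ping; a possible follow-up (not run as part of this report) from the same debug shell:
sh-4.4# ping -c 100 -i 0.2 10.0.2.149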
==================
Between workers the ping test was successful:
$ oc debug node/krishvoor-scale-2hfcr-worker-eastus3-xj2tj
Starting pod/krishvoor-scale-2hfcr-worker-eastus3-xj2tj-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.2.112
If you don't see a command prompt, try pressing enter.
sh-4.4# ping 10.0.2.159
PING 10.0.2.159 (10.0.2.159) 56(84) bytes of data.
64 bytes from 10.0.2.159: icmp_seq=1 ttl=64 time=2.82 ms
64 bytes from 10.0.2.159: icmp_seq=2 ttl=64 time=7.95 ms
64 bytes from 10.0.2.159: icmp_seq=3 ttl=64 time=9.74 ms
64 bytes from 10.0.2.159: icmp_seq=4 ttl=64 time=7.65 ms
64 bytes from 10.0.2.159: icmp_seq=5 ttl=64 time=0.450 ms
64 bytes from 10.0.2.159: icmp_seq=6 ttl=64 time=5.92 ms
64 bytes from 10.0.2.159: icmp_seq=7 ttl=64 time=0.538 ms
64 bytes from 10.0.2.159: icmp_seq=8 ttl=64 time=0.692 ms
64 bytes from 10.0.2.159: icmp_seq=9 ttl=64 time=0.875 ms
^C
--- 10.0.2.159 ping statistics ---
9 packets transmitted, 9 received, 0% packet loss, time 8097ms
rtt min/avg/max/mdev = 0.450/4.069/9.736/3.529 ms
$
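If the default ingress router pods were moved onto the infra nodes (a common setup when infra machinesets exist, but an assumption here, not something shown in this report), the packet loss toward the infra node would also be consistent with the 503s and timeouts seen on the route endpoint. Where the routers actually run can be confirmed with:
$ oc -n openshift-ingress get pods -o wide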
must-gather: http://perf1.perf.lab.eng.bos.redhat.com/pub/mukrishn/OCPBUGS-26530/
is duplicated by: OCPBUGS-25876 [ARO] Pod Latency is very high at 252 Nodes (Closed)