-
Bug
-
Resolution: Not a Bug
-
Major
-
None
-
4.8.z
-
Important
-
None
-
Rejected
-
False
-
Description of problem:
Pods are sometimes slow to start.
Version-Release number of selected component (if applicable):
4.8
How reproducible:
Often
Steps to Reproduce:
1. Continuously provision GitLab runner build pods via the operator.
2. Observe, from GitLab, the time it takes for the build pods to come up (a rough measurement sketch follows these steps).
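Not part of the case notes: a minimal sketch of one way to quantify step 2, assuming the Python kubernetes client, a kubeconfig with access to the cluster, and that the build pods land in the "gitlab" namespace seen in the logs under Additional info; adjust the namespace as needed.

#!/usr/bin/env python3
# Rough sketch (assumptions noted above): print how long each build pod in the
# "gitlab" namespace took to become Ready, measured from its creation timestamp
# to the Ready condition's last transition time.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("gitlab").items:
    created = pod.metadata.creation_timestamp
    ready_at = None
    for cond in (pod.status.conditions or []):
        if cond.type == "Ready" and cond.status == "True":
            ready_at = cond.last_transition_time
    if ready_at:
        print(f"{pod.metadata.name}: {(ready_at - created).total_seconds():.0f}s to Ready")

Running this periodically while the runner is busy gives a distribution of startup times to compare against the under-30-second "good" case described under Additional info.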
Actual results:
Pods are sometimes slow to start.
Expected results:
Pods should consistently start in roughly the same amount of time.
Additional info:
1. The customer's cluster runs the GitLab operator, which constantly creates and destroys build pods. A master pod (ops master) polls the GitLab server (deployed outside of OCP) for new jobs and, as it sees them, deploys build pods to run them.
2. Out of a total of ~164,000 jobs, roughly 7,300 (~4.5%) have taken longer than a minute; in some cases a job took as long as 8 minutes. A minute is considered good, since in a large number of cases deploying a build pod takes only seconds (under 30 seconds).
3. The cluster has 3 dedicated nodes for running the builds. On average, 7,000 pods are created and destroyed every hour on these 3 nodes (the nighttime average is around 2,000).
4. We notice two types of delays in the CRI-O logs (a log-parsing sketch for measuring them follows this list):
   a. In some cases sandbox creation times out and we see the "name is reserved" error below, because the kubelet retries the operation. In cases like this the pod may finally start only after 8 minutes.
   Sep 30 14:23:40 ocpdev-be-build101 crio[2565]: time="2022-09-30 14:23:40.627320889Z" level=warning msg="Error reserving pod name k8s_runner-yxfs4sne-project-8053-concurrent-78gkkj_gitlab_32d17277-e626-499e-8754-70b8756498e3_0 for id de6e5867d9932d1b0730763b88195dbde3f36b20ae677a17bb11cf5c38f87001: name is reserved" file="lib/container_server.go:508"
   b. In other cases there is simply a delay between the "Running pod sandbox" step and the "Got pod network" step. Notice the ~2 minute gap between the two log lines below.
   Sep 30 14:53:02 ocpdev-be-build101 crio[2565]: time="2022-09-30 14:53:02.810610153Z" level=info msg="Running pod sandbox: gitlab/runner-yxfs4sne-project-69-concurrent-2bwtbp/POD" file="server/sandbox_run_linux.go:283" id=edf8393f-0df5-4038-873e-96bc4d852d97 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
   Sep 30 14:55:26 ocpdev-be-build101 crio[2565]: time="2022-09-30 14:55:26.294163986Z" level=info msg="Got pod network &{Name:runner-yxfs4sne-project-69-concurrent-2bwtbp Namespace:gitlab ID:9760923edd7cc18ae94f21cca074044934ca1b824e266645180067941d891f1f UID:448f0969-1d16-4554-b781-4102d9b81293 NetNS:/var/run/netns/ac36c589-60d8-4496-be44-09c8b438e0c8 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}" file="ocicni/ocicni.go:797"
5. Defragmenting etcd, moving to SSDs, etc. did not have any impact on the issue. The customer used to restart nodes early on, which provided some relief, but the issue always comes back.
6. The net effect for the customer is that, from a GitLab perspective, the end users (developers) see jobs waiting a long time (sometimes up to 8 minutes) for the pod to become ready, and the timing is inconsistent.
7. The customer rebooted the worker nodes, which made the problem go away, but it comes back again. From this it looks like CRI-O may be getting into some congested state under the high rate of create/delete operations, but the logs don't show much for the timing gaps.
8. The customer also tried upgrading from 4.8.42 to 4.8.47, hoping that a related fix in https://bugzilla.redhat.com/show_bug.cgi?id=2082344 would address the issue, but it did not help and it caused additional issues with image pulls, so the customer moved back to 4.8.42.
9. The case has CRI-O, mg (must-gather), and sosreport logs, plus other observations/analysis.
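Not part of the case data: a sketch of how the two delay types in item 4 could be measured from a node's CRI-O journal, assuming the log format shown in the excerpts above; the 60-second threshold and the script name are arbitrary choices.

#!/usr/bin/env python3
# Usage sketch: journalctl -u crio --no-pager | python3 crio_sandbox_gaps.py
# Pairs each "Running pod sandbox" line with its "Got pod network" line per pod,
# prints gaps over a minute, and counts "name is reserved" warnings (the
# kubelet-retry case from item 4a).
import re
import sys
from datetime import datetime

TS_RE = re.compile(r'time="([0-9-]+ [0-9:.]+)Z"')
RUN_RE = re.compile(r'Running pod sandbox: [^/]+/([^/]+)/POD')
NET_RE = re.compile(r'Got pod network &\{Name:(\S+)')

def parse_ts(line):
    m = TS_RE.search(line)
    if not m:
        return None
    ts = m.group(1)
    if '.' in ts:
        head, frac = ts.split('.', 1)
        # CRI-O prints nanoseconds; trim to microseconds for strptime
        return datetime.strptime(head + '.' + frac[:6], '%Y-%m-%d %H:%M:%S.%f')
    return datetime.strptime(ts, '%Y-%m-%d %H:%M:%S')

running = {}   # pod name -> timestamp of its "Running pod sandbox" line
reserved = 0

for line in sys.stdin:
    if 'name is reserved' in line:
        reserved += 1
    ts = parse_ts(line)
    if ts is None:
        continue
    m = RUN_RE.search(line)
    if m:
        running[m.group(1)] = ts
        continue
    m = NET_RE.search(line)
    if m and m.group(1) in running:
        gap = (ts - running.pop(m.group(1))).total_seconds()
        if gap > 60:
            print(f"{m.group(1)}: {gap:.0f}s from RunPodSandbox to pod network")

print(f"'name is reserved' warnings: {reserved}")

Against the two excerpts in item 4b above, this would report roughly a 143-second gap for runner-yxfs4sne-project-69-concurrent-2bwtbp.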