OpenShift Bugs / OCPBUGS-1970

pods sometimes slow to start

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Affects Version/s: 4.8.z
    • Component/s: Node / CRI-O
    • Severity: Important

      Description of problem:

      pods sometimes slow to start

      Version-Release number of selected component (if applicable):

      4.8

      How reproducible:

      Often

      Steps to Reproduce:

      1. Continuously provision build pods through the GitLab operator.
      2. Observe, from GitLab, how long the build pods take to come up (a minimal pod-churn sketch follows below).
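
      To approximate the reproduction outside of the customer's GitLab setup, the sketch below (a minimal, hypothetical script using the Python kubernetes client, not the customer's actual runner workload) continuously creates short-lived pods, records how long each one takes to reach Running, and deletes it again. The namespace, image, and iteration count are placeholders.

      # Minimal pod-churn sketch (hypothetical reproduction aid, not the customer's workload).
      # Assumes a kubeconfig with access to an existing namespace; the namespace, image,
      # and iteration count below are placeholders.
      import time
      from kubernetes import client, config

      config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
      v1 = client.CoreV1Api()
      NAMESPACE = "pod-churn-test"  # placeholder namespace

      def make_pod(name):
          return client.V1Pod(
              metadata=client.V1ObjectMeta(name=name),
              spec=client.V1PodSpec(
                  restart_policy="Never",
                  containers=[client.V1Container(
                      name="sleep",
                      image="registry.access.redhat.com/ubi8/ubi-minimal",  # placeholder image
                      command=["sleep", "30"],
                  )],
              ),
          )

      for i in range(200):  # placeholder iteration count
          name = f"churn-{i}"
          start = time.time()
          v1.create_namespaced_pod(NAMESPACE, make_pod(name))
          while True:
              pod = v1.read_namespaced_pod(name, NAMESPACE)
              if pod.status.phase in ("Running", "Succeeded", "Failed"):
                  break
              time.sleep(1)
          print(f"{name}: {time.time() - start:.1f}s to reach {pod.status.phase}")
          v1.delete_namespaced_pod(name, NAMESPACE)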
      

      Actual results:

      Pods are sometimes slow to start.

      Expected results:

      Pods should consistently start within roughly the same amount of time.

      Additional info:

      1. The customer's cluster runs the GitLab operator, which constantly creates and destroys build pods. A master pod (ops master) polls the GitLab server (deployed outside of OCP) for new jobs and, as it sees them, deploys build pods and runs the jobs in those pods.
      
      2. Out of a total of ~164,000 jobs, roughly 7,300 (about 4.5%) have taken longer than a minute, and in some cases as long as 8 minutes. A minute is considered acceptable, since in a large number of cases deploying a build pod takes only seconds (under 30 seconds).
      
      3. The cluster has 3 dedicated nodes for running the builds. On average, 7,000 pods are created and destroyed every hour across these 3 nodes (the average drops to around 2,000 per hour during the night).
      
      4. We notice two types of delays in the CRI-O logs. In the first, the create-sandbox request times out and we then see the "name is reserved" warning below, because the kubelet retries the operation while the original request still holds the pod name. In cases like this the pod may finally start only after 8 minutes (a log-scan sketch for this pattern is included at the end of this section).
      
      
      Sep 30 14:23:40 ocpdev-be-build101 crio[2565]: time="2022-09-30 14:23:40.627320889Z" level=warning msg="Error reserving pod name k8s_runner-yxfs4sne-project-8053-concurrent-78gkkj_gitlab_32d17277-e626-499e-8754-70b8756498e3_0 for id de6e5867d9932d1b0730763b88195dbde3f36b20ae677a17bb11cf5c38f87001: name is reserved" file="lib/container_server.go:508"
      
      5. Defragmenting etcd, moving to SSDs, etc. did not have any impact on the issue. The customer used to restart the nodes early on, which provided some relief, but the issue always comes back.
      
      In other cases, there is simply a delay between the "Running pod sandbox" step and the "Got pod network" step. Notice the gap of nearly two and a half minutes below; a small script for measuring these gaps follows the log excerpts.
      
      
      Sep 30 14:53:02 ocpdev-be-build101 crio[2565]: time="2022-09-30 14:53:02.810610153Z" level=info msg="Running pod sandbox: gitlab/runner-yxfs4sne-project-69-concurrent-2bwtbp/POD" file="server/sandbox_run_linux.go:283" id=edf8393f-0df5-4038-873e-96bc4d852d97 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
       
       
      Sep 30 14:55:26 ocpdev-be-build101 crio[2565]: time="2022-09-30 14:55:26.294163986Z" level=info msg="Got pod network &{Name:runner-yxfs4sne-project-69-concurrent-2bwtbp Namespace:gitlab ID:9760923edd7cc18ae94f21cca074044934ca1b824e266645180067941d891f1f UID:448f0969-1d16-4554-b781-4102d9b81293 NetNS:/var/run/netns/ac36c589-60d8-4496-be44-09c8b438e0c8 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}" file="ocicni/ocicni.go:797"
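
      As a rough way to quantify this second type of delay across the journal, the sketch below (an unofficial helper, not part of the case data) scans saved CRI-O journal output (for example, journalctl -u crio --no-pager > crio.log) and pairs each "Running pod sandbox" message with the matching "Got pod network" message by pod name, printing gaps above a threshold. The regexes are derived from the excerpts above and may need adjusting for other CRI-O versions; the 30-second threshold is an arbitrary placeholder.

      # Rough sketch: report "Running pod sandbox" -> "Got pod network" gaps per pod
      # from saved CRI-O journal output (usage: python3 sandbox_gaps.py crio.log).
      # Regexes follow the log excerpts above; adjust for other CRI-O versions.
      import re
      import sys
      from datetime import datetime

      TS_RE = re.compile(r'time="(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.(\d+)Z"')
      RUN_RE = re.compile(r'msg="Running pod sandbox: [^/"]+/([^/"]+)/POD"')
      NET_RE = re.compile(r'msg="Got pod network &\{Name:([^ ]+) ')

      def parse_ts(line):
          m = TS_RE.search(line)
          if not m:
              return None
          frac = (m.group(2) + "000000")[:6]  # truncate nanoseconds so strptime accepts it
          return datetime.strptime(f"{m.group(1)}.{frac}", "%Y-%m-%d %H:%M:%S.%f")

      started = {}  # pod name -> timestamp of its "Running pod sandbox" message
      with open(sys.argv[1]) as f:
          for line in f:
              ts = parse_ts(line)
              if ts is None:
                  continue
              m = RUN_RE.search(line)
              if m:
                  started[m.group(1)] = ts
                  continue
              m = NET_RE.search(line)
              if m and m.group(1) in started:
                  gap = (ts - started.pop(m.group(1))).total_seconds()
                  if gap > 30:  # arbitrary reporting threshold
                      print(f"{m.group(1)}: {gap:.0f}s between sandbox start and pod network")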
      
      
      6. The net effect for the customer is that, from a GitLab perspective, end users (developers) see jobs waiting a long time (sometimes up to 8 minutes) for the pod to become ready, and the timing is inconsistent.
      
      
      7. The customer rebooted the worker nodes, which made the problem go away, but it comes back again. From this it looks like CRI-O may be getting into some congested state under the high rate of create/delete operations, but the logs do not reveal much about the timing gaps.
      
      
      8. The customer also tried upgrading from 4.8.42 to 4.8.47, hoping the related fix in https://bugzilla.redhat.com/show_bug.cgi?id=2082344 would address the issue, but it did not help and it caused additional issues with image pulls, so the customer moved back to 4.8.42.
      
      9. The case has CRI-O logs, must-gather, and sosreport data, along with other observations and analysis.
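
      To gauge how widespread the first type of delay (item 4) is, the following sketch (again an unofficial helper, not part of the case data) counts the "name is reserved" warnings per pod in saved CRI-O journal output and reports how long the retries went on. The regex is derived from the warning shown in item 4 and may need adjusting for other CRI-O versions.

      # Rough sketch: summarize "name is reserved" retries per pod from saved CRI-O
      # journal output (usage: python3 reserved_names.py crio.log).
      import re
      import sys
      from collections import defaultdict
      from datetime import datetime

      TS_RE = re.compile(r'time="(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.(\d+)Z"')
      RESERVED_RE = re.compile(
          r'msg="Error reserving pod name k8s_([^_"]+)_([^_"]+)_[^"]*: name is reserved"')

      def parse_ts(line):
          m = TS_RE.search(line)
          if not m:
              return None
          frac = (m.group(2) + "000000")[:6]  # truncate nanoseconds so strptime accepts it
          return datetime.strptime(f"{m.group(1)}.{frac}", "%Y-%m-%d %H:%M:%S.%f")

      hits = defaultdict(list)  # (namespace, pod) -> timestamps of the warning
      with open(sys.argv[1]) as f:
          for line in f:
              m = RESERVED_RE.search(line)
              if not m:
                  continue
              ts = parse_ts(line)
              if ts is not None:
                  hits[(m.group(2), m.group(1))].append(ts)

      for (ns, pod), times in sorted(hits.items(), key=lambda kv: -len(kv[1])):
          span = (max(times) - min(times)).total_seconds()
          print(f"{ns}/{pod}: {len(times)} 'name is reserved' warnings over {span:.0f}s")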
      
      
      
      
      
      

            Peter Hunt (pehunt@redhat.com)
            Anand Paladugu (anand.paladugu)
            Sunil Choudhary

            Votes: 1
            Watchers: 4