Type: Task
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: SaaS
Component/s: System
Labels:
None

Blocked:
False
Blocked Reason:

None
Ready:
False
3Scale PT Tested upstream:
Not Started
3scale PT Docs:
Not Started
3scale PT Product Specs:
Not Started
3scale PT Product Update Ready:
Not Started
3scale PT Released In Saas:
Not Started
3scale PT Verified Product:
Not Started
With Sidekiq running on OCP, there is a risk that the pod where the enqueuer jobs run gets killed. OCP may kill a pod for several reasons:

The HPA can create or kill a pod at any time
kube-scheduler might decide to evict a pod and move it to another node at any time
If memory usage spikes above the container memory limit, the OOM killer terminates the container and it is restarted
If a pod fails its health check for whatever reason, k8s will restart the container
During worker node maintenance, node replacement, or an OCP upgrade, pods are moved to other nodes, and therefore killed
etc...

This is especially critical for the Billing enqueuer job, which can take about 20 minutes.

We need to ensure that, even when OCP wants to kill the pod:

it does so gracefully, allowing a reasonable termination grace period
Sidekiq receives the signal properly, and its timeout is configured to be less than the OCP grace period
if the job can't be completed within the grace period, it is returned to Redis correctly
re-running the enqueuer job has no undesired side effects (e.g. it is safe to run Billing twice, or there is a mechanism to prevent running daily billing twice on the same day for the same account)
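
The last requirement (not billing the same account twice on the same day) could be approached with a "claim before work" guard. The following is only a minimal sketch under stated assumptions: `BillingRunLedger` and `bill_once` are hypothetical names, and the in-memory set stands in for a shared store such as Redis, where the claim would have to be atomic across processes.

```ruby
require "date"
require "set"

# Hypothetical in-memory stand-in for a shared store (e.g. Redis).
# It only protects against duplicates within a single process.
class BillingRunLedger
  def initialize
    @claimed = Set.new
    @lock = Mutex.new
  end

  # Returns true only for the first caller that claims (account_id, day).
  def claim(account_id, day)
    key = "billing:#{account_id}:#{day.iso8601}"
    # Set#add? returns nil when the key is already present.
    @lock.synchronize { !@claimed.add?(key).nil? }
  end
end

# Hypothetical wrapper: runs the block at most once per account per day.
def bill_once(ledger, account_id, day)
  return :skipped unless ledger.claim(account_id, day)

  yield
  :billed
end

ledger = BillingRunLedger.new
day = Date.new(2023, 2, 3)
charges = []

first  = bill_once(ledger, 42, day) { charges << 42 }
second = bill_once(ledger, 42, day) { charges << 42 } # simulated re-run after a pod kill

puts "first=#{first} second=#{second} charges=#{charges.inspect}"
```

Claiming before doing the work gives at-most-once behaviour: if the pod is killed between the claim and the end of the block, a re-run will skip that account. Whether that trade-off is acceptable, or whether the claim should instead be recorded together with the billing result, is a design decision this sketch does not settle.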

Currently, to minimize the risk of the enqueuer job getting killed, the HPA for the sidekiqDefault pod (the one where the enqueuer job runs) is fixed at 3 pods. So, of the multiple reasons why a pod may be killed, we are at least removing the HPA.

As an extra measure, the termination grace period can be increased on the Sidekiq pods.
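
To get a feel for how large the grace period would need to be, the shutdown phases can be added up. This is a rough sketch, not a recommendation: `shutdown_budget` and its `margin` parameter are hypothetical, the 30-second and 25-second figures are the Kubernetes and Sidekiq defaults, and the 20-minute wait is the approximate Billing enqueuer duration mentioned above. It assumes that time spent in a pre-stop hook counts against the same grace period as the time after SIGTERM.

```ruby
# Hypothetical helper: checks whether the shutdown phases fit into the
# pod's termination grace period. All values are in seconds.
def shutdown_budget(grace_period:, pre_stop_wait:, sidekiq_timeout:, margin: 5)
  needed = pre_stop_wait + sidekiq_timeout + margin
  { needed: needed, slack: grace_period - needed, fits: needed <= grace_period }
end

# Defaults only: 30 s grace period, 25 s Sidekiq timeout, no pre-stop wait.
defaults = shutdown_budget(grace_period: 30, pre_stop_wait: 0, sidekiq_timeout: 25)

# Waiting up to ~20 minutes for a running job with the default grace period.
long_job = shutdown_budget(grace_period: 30, pre_stop_wait: 20 * 60, sidekiq_timeout: 25)

# Same wait with a grace period raised to 21 minutes.
raised = shutdown_budget(grace_period: 21 * 60, pre_stop_wait: 20 * 60, sidekiq_timeout: 25)

[defaults, long_job, raised].each { |budget| puts budget.inspect }
```

With these numbers, the defaults fit with no slack, the 20-minute wait overshoots the default grace period by 1200 seconds, and a 21-minute grace period leaves 30 seconds to spare.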

Another idea, suggested by rvazquez@redhat.com:

Regarding this, there's the option of adding a pre-stop script in k8s/ocp pods. You could write a script that waits until all running jobs are completed. We use that to wait for all connections to backend to be properly drained before stopping the pod. That could be a way to ensure long running jobs are not killed.
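
The waiting part of such a pre-stop script could look roughly like the sketch below. It is an illustration only: `wait_until_idle` is a hypothetical helper, and `busy_count` is an injected callable so that the loop can be shown without a running Sidekiq; a real script would ask Sidekiq how many jobs the process is still running, and would typically first tell it to stop fetching new jobs (Sidekiq documents the TSTP signal for that). The deadline has to stay below the pod's termination grace period.

```ruby
# Polls `busy_count` (a callable returning the number of jobs still running)
# until it reports zero or `deadline` seconds have elapsed.
# Returns true if the workers went idle, false on timeout.
# `clock` and `sleeper` are injectable so the loop can be tested without waiting.
def wait_until_idle(busy_count:, deadline:, poll_interval: 1.0,
                    clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                    sleeper: ->(seconds) { sleep(seconds) })
  give_up_at = clock.call + deadline
  loop do
    return true if busy_count.call.zero?
    return false if clock.call >= give_up_at

    sleeper.call(poll_interval)
  end
end

# Simulated worker whose running jobs drain after a few polls.
remaining = [2, 1, 1, 0]
idle = wait_until_idle(
  busy_count: -> { remaining.shift || 0 },
  deadline: 10,
  poll_interval: 0.01
)
puts(idle ? "workers idle, safe to stop" : "deadline reached, jobs still running")
```

On timeout the script should still return, so that Sidekiq receives SIGTERM and can push unfinished jobs back to Redis within its own shutdown timeout.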

Some references with more info:

https://github.com/mperham/sidekiq/wiki/Deployment#overview

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination

https://dev.betterdoc.org/docker/linux/container/signals/pid1/2021/06/18/how-docker-forced-me-to-learn-more-about-linux.html

https://www.bigbinary.com/blog/increase-reliability-of-background-job-processing-using-super_fetch-of-sidekiq-pro

is related to

THREESCALE-3686 System SaaS migration to OCP

Closed

relates to

THREESCALE-8893 Split billing cron job in chunks

To Develop

Assignee: Unassigned

Reporter: Daria Mayorova

Votes: 0

Watchers: 2

Created: 2023/02/03 10:02 AM

Updated: 2025/05/21 2:21 PM
