-
Task
-
Resolution: Unresolved
-
Major
-
SaaS
-
With Sidekiq running on OCP, there is a risk that the pod running the enqueuer jobs gets killed. OCP may kill a pod for several reasons:
- HPA can create/kill a pod at any time
- kube-scheduler might decide to evict a pod and reschedule it on another node at any time
- If memory usage spikes and reaches the container memory limit, the OOM killer kills the container and it is restarted
- If a pod is not passing the healthcheck for whatever reason, k8s will restart the container
- During worker node maintenance or replacement, or an OCP upgrade, pods are moved to other nodes, and therefore killed
- etc...
This is especially critical for the Billing enqueuer job, which can take about 20 minutes.
We need to ensure that, even when OCP wants to kill the pod:
- it does so gracefully, allowing a reasonable termination grace period
- Sidekiq receives the signal properly, and its shutdown timeout is configured to be shorter than the OCP grace period
- if the job can't be completed within the grace period, the job is returned to Redis correctly
- rerunning the enqueuer job has no undesired side effects (e.g. it is safe to run Billing twice, or there is a mechanism that prevents running daily billing twice on the same day for the same account)
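The last requirement (no double billing when the enqueuer job is rerun) could be met with a per-account, per-day marker. The following is only a minimal sketch of that idea in Python, not the actual billing code: `InMemoryStore`, `bill_account_once`, the `charge` callback and the key format are all hypothetical names, and the in-memory set stands in for a shared store such as Redis.

```python
from datetime import date


class InMemoryStore:
    """Hypothetical stand-in for a shared store such as Redis."""

    def __init__(self):
        self._keys = set()

    def set_if_absent(self, key: str) -> bool:
        """Record the key; return True only the first time it is seen."""
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def bill_account_once(store, account_id: int, day: date, charge) -> bool:
    """Call charge(account_id, day) at most once per account and day.

    Returns True if the charge ran, False if it was skipped as a repeat.
    """
    key = f"billing:{account_id}:{day.isoformat()}"  # assumed key format
    if not store.set_if_absent(key):
        return False  # already billed for this day; a rerun is a no-op
    charge(account_id, day)
    return True
```

Because the marker is written before `charge` runs, this sketch gives at-most-once behavior: a job killed between the two steps would leave the account unbilled for that day. A real implementation would need to decide how to detect and recover from that case.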
Currently, to minimize the risk of the enqueuer job being killed, the HPA for the sidekiqDefault pod (the one where the enqueuer job runs) is pinned to a fixed 3 pods. This removes at least HPA from the list of reasons why a pod may be killed.
As an extra measure, the termination grace period can be increased on Sidekiq pods.
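When picking a grace period, it may help to check that the whole shutdown sequence fits inside it. In Kubernetes the grace period countdown also covers the time spent in a pre-stop hook, so the budget is roughly pre-stop wait + Sidekiq shutdown timeout + some margin. The sketch below expresses that check in Python; the function name, the parameter names and the 5-second default margin are assumptions made for illustration.

```python
def shutdown_budget_ok(grace_period_s: float,
                       sidekiq_timeout_s: float,
                       prestop_wait_s: float = 0,
                       margin_s: float = 5) -> bool:
    """Return True if the shutdown sequence fits in the pod grace period.

    The grace period has to cover the pre-stop hook (if any), the Sidekiq
    shutdown timeout, and a safety margin for pushing unfinished jobs
    back to Redis.
    """
    return prestop_wait_s + sidekiq_timeout_s + margin_s <= grace_period_s
```

For example, `shutdown_budget_ok(30, 25)` holds, while adding a 10-second pre-stop wait to the same values does not. For a job of about 20 minutes, a call such as `shutdown_budget_ok(1500, 25, prestop_wait_s=1200)` illustrates the order of magnitude the grace period would need; these numbers are illustrative, not measured values.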
Another idea suggested by rvazquez@redhat.com :
Regarding this, there's the option of adding a pre-stop script in k8s/ocp pods. You could write a script that waits until all running jobs are completed. We use that to wait for all connections to backend to be properly drained before stopping the pod. That could be a way to ensure long running jobs are not killed.
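The core of such a pre-stop script is a polling loop with a deadline. Below is a minimal Python sketch of that loop, under stated assumptions: `wait_for_idle` and its parameters are hypothetical, and `busy_count` is an injected callable that would have to be implemented separately (for instance by asking Sidekiq how many jobs the local process is currently running).

```python
import time


def wait_for_idle(busy_count, deadline_s, poll_s=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until no jobs are running or the deadline expires.

    busy_count: callable returning the number of jobs currently running
                (hypothetical; how to obtain it is left open here).
    deadline_s: maximum time to wait, in seconds.
    Returns True if the process became idle, False on timeout.
    """
    start = clock()
    while True:
        if busy_count() == 0:
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(poll_s)
```

For the wait to terminate, the Sidekiq process would also need to stop picking up new jobs before the loop starts (Sidekiq offers a "quiet" signal for this), and `deadline_s` has to stay below the pod's termination grace period, since the time spent in the pre-stop hook counts against it.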
Some references with more info:
https://github.com/mperham/sidekiq/wiki/Deployment#overview
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
- is related to: THREESCALE-3686 System SaaS migration to OCP (Closed)
- relates to: THREESCALE-8893 Split billing cron job in chunks (To Develop)