Bug
Resolution: Done
Normal
6.15.0
0
False
False
CLOSED
1,200
Endeavour
Moderate
Yes
Description of problem:
When the dynflow-sidekiq service is restarted while a job is running on multiple hosts sequentially, the run on the next host in line after the restart fails with: "Error loading data from Capsule: RestClient::NotFound - 404 Not Found"
Version-Release number of selected component (if applicable):
Reproduced on Sat Stream 6.15 snap 37.0.
I couldn't reproduce it on Sat 6.14, so this is a regression.
How reproducible:
Deterministic
Steps to Reproduce:
1. Have a Satellite with two hosts registered
2. Create a Job Template useful for debugging; I used the following contents:
echo $(date) >> /root/test-<%= @host %>; sleep 120; echo slept-$(date) >> /root/test-<%= @host %>
3. Monitor -> Jobs -> Run Job
4. Select that template
5. Set filter to match the two hosts
6. Set concurrency level to one
7. Submit
8. On Satellite:
# systemctl stop dynflow-sidekiq@*
9. Wait a few seconds...
10. # systemctl start dynflow-sidekiq@orchestrator.service dynflow-sidekiq@worker-1.service dynflow-sidekiq@worker-hosts-queue-1.service
11. Wait...
12. In the WebUI, watch the job run details. The run on the first host should end one way or another, depending on the phase in which the daemon was stopped: it either fails after ~15 minutes or succeeds, and which of the two doesn't matter. The run on the second host should then start.
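For repeated attempts, steps 8 to 10 above can be wrapped in a small shell function. This is only a minimal sketch: the unit names are the ones from this report, while the function name and the SYSTEMCTL and WAIT_SECONDS variables are hypothetical helpers added for illustration and are not part of Satellite.

```shell
#!/usr/bin/env bash
# Sketch of steps 8-10: stop all dynflow-sidekiq instances, wait, start them again.
# SYSTEMCTL and WAIT_SECONDS are hypothetical overrides (assumptions, not Satellite
# settings); they default to the real systemctl and a 5 second pause.

restart_dynflow_sidekiq() {
    local systemctl_cmd="${SYSTEMCTL:-systemctl}"
    local wait_seconds="${WAIT_SECONDS:-5}"

    # Step 8: stop every instance of the templated unit. The pattern is quoted
    # so that systemctl, not the shell, expands it.
    "$systemctl_cmd" stop 'dynflow-sidekiq@*' || return 1

    # Step 9: wait a few seconds.
    sleep "$wait_seconds"

    # Step 10: start the three instances named in the report.
    "$systemctl_cmd" start \
        dynflow-sidekiq@orchestrator.service \
        dynflow-sidekiq@worker-1.service \
        dynflow-sidekiq@worker-hosts-queue-1.service
}
```

Run it as root on the Satellite while the job is executing on the first host, e.g. `restart_dynflow_sidekiq`; setting `SYSTEMCTL=echo` prints the commands instead of executing them.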
Actual results:
The run on the second host never finishes, and its output shows the following error repeated indefinitely: "Error loading data from Capsule: RestClient::NotFound - 404 Not Found". No further hosts ever run the job.
Expected results:
The run on the second host should pass, and any remaining hosts should get their turn afterwards.