-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
6.17.0
-
False
-
Moderate
-
Satellite Endeavour Refinement
-
sat-endeavour
-
None
-
None
-
None
-
None
Description of problem:
User scenario:
- Satellite (or Capsule) is switched off for some time, including a regular time of log rotation
- Starting the system up, `foreman-proxy` service fails after receiving SIGUSR1 signal from logrotate before the `smart-proxy` process sets proper trap for catching the signal.
- `foreman-proxy` is down until a user intervention
How reproducible:
Rarely (in house) in the above scenario, 100% with bit artificial reproducer below.
Is this issue a regression from an earlier version:
No.
Steps to Reproduce:
The above generic scenario is tricky to reproduce, but it can be narrowed down to 95%-precise, 100% reliable reproducer:
systemctl restart foreman-proxy & sleep 0.9; logrotate -f /etc/logrotate.d/foreman-proxy
(the "sleep 0.9" might vary among systems, let tune it for you)
This leads to the service start failure every time, though this reproducer is not fully true to the user scenario as it does not mimic system start up (including logrotate startup) precisely.
Actual behavior:
foreman-proxy service fails to start and is stopped forever.
Customer scenario during system startup:
Nov 06 09:24:27 satellite systemd[1]: Starting Rotate log files...
Nov 06 09:24:29 satellite logrotate[1075]: Stopping logging:
Nov 06 09:24:30 satellite logrotate[1074]: Redirecting start to /bin/systemctl start auditd.service
Nov 06 09:24:32 satellite systemd[1]: Starting Foreman Proxy...
Nov 06 09:24:36 satellite systemd[1]: foreman-proxy.service: Sent signal SIGUSR1 to main process 1181 (smart-proxy) on client request.
Nov 06 09:24:36 satellite systemd[1]: foreman-proxy.service: Main process exited, code=killed, status=10/USR1
Nov 06 09:24:36 satellite systemd[1]: foreman-proxy.service: Failed with result 'signal'.
Nov 06 09:24:36 satellite systemd[1]: Failed to start Foreman Proxy.
Nov 06 09:24:44 satellite systemd[1]: logrotate.service: Deactivated successfully.
Nov 06 09:24:44 satellite systemd[1]: Finished Rotate log files.
My own logs from the "simplified scenario":
Nov 12 12:33:02 satellite systemd[1]: Stopping Foreman Proxy...
Nov 12 12:33:02 satellite systemd[1]: foreman-proxy.service: Deactivated successfully.
Nov 12 12:33:02 satellite systemd[1]: Stopped Foreman Proxy.
Nov 12 12:33:02 satellite systemd[1]: foreman-proxy.service: Consumed 1.375s CPU time.
Nov 12 12:33:02 satellite systemd[1]: Starting Foreman Proxy...
Nov 12 12:33:03 satellite systemd[1]: foreman-proxy.service: Sent signal SIGUSR1 to main process 9750 (smart-proxy) on client request.
Nov 12 12:33:03 satellite systemd[1]: foreman-proxy.service: Main process exited, code=killed, status=10/USR1
Nov 12 12:33:03 satellite systemd[1]: foreman-proxy.service: Failed with result 'signal'.
Nov 12 12:33:03 satellite systemd[1]: Failed to start Foreman Proxy.
Expected behavior:
The service is running at the end.
Business Impact / Additional info:
It can be tricky to fix the code. We can move the routine to catch the signal (`install_usr1_trap` method in `lib/proxy/signal_handler.rb`) as soon as possible, but there will be always a window of opportunity to hit this bug.
We shall probably rather modify the systemd service to behave in a more clever way. Here two workarounds are possible:
1) Wait nutil logrotate service isnt gone: By having `/etc/systemd/system/foreman-proxy.service.d/91-logrotate-wait.conf` :
[Unit] Wants=logrotate.service After=logrotate.service
(this idea does not prebvent the "simplified reproducer" that mimics logrotate restart by other means. But this idea is sufficient for the real use case, so..)
2) Add restarts of failed startups: add
Restart=on-failure RestartSec=5
to `[Service]` section of `/usr/lib/systemd/system/foreman-proxy.service`.