Ticket #120 (closed enhancement: fixed)

Opened 22 months ago

Last modified 16 months ago

automatically restart crashed services

Reported by: Toei Rei Owned by: roy
Priority: minor Milestone:
Component: rc Version:
Keywords: Cc: ohnobinki@ohnopublishing.net, hollow@gentoo.org

Description

It would be great for servers if openrc could automatically restart crashed services.

Change History

comment:1 in reply to: ↑ description Changed 21 months ago by anonymous

vote +1 for this one

comment:2 follow-up: ↓ 3 Changed 19 months ago by piavka@cs.bgu.ac.il

Only if you have a very reliable way for openrc to check that a service has actually crashed, this might be a good idea. I've got several services running perfectly but reported as crashed in openrc

(but ok with baselayout-1.*) So auto restarting them would be a bad idea in my case. And probably there are many services which are better not auto restarted anyway.

But adding support for an autorestart keyword (to auto restart services which have this keyword explicitly specified in /etc/init.d/service) is a good idea.

comment:3 in reply to: ↑ 2 ; follow-up: ↓ 4 Changed 19 months ago by roy

Replying to piavka@…:

Only if you have a very reliable way for openrc to check that a service has actually crashed, this might be a good idea. I've got several services running perfectly but reported as crashed in openrc

(but ok with baselayout-1.*) So auto restarting them would be a bad idea in my case.

The solution is to fix the init scripts :)
openrc-0.4.2 has very good detection these days and any errors are 99% of the time the fault of incorrect start-stop-daemon usage.

The option to restart them would not be on by default though.

And probably there are many services which are better not auto restarted anyway.

But adding support for an autorestart keyword (to auto restart services which have this keyword explicitly specified in /etc/init.d/service) is a good idea.

I don't see this myself. You either want to restart crashed services or you don't want to.

comment:4 in reply to: ↑ 3 Changed 19 months ago by anonymous

Replying to roy:

The solution is to fix the init scripts :)
openrc-0.4.2 has very good detection these days and any errors are 99% of the time the fault of incorrect start-stop-daemon usage.

Can you point me to a guide on correct start-stop-daemon usage or common pitfalls with it?

The option to restart them would not be on by default though.

And probably there are many services which are better not auto restarted anyway.

But adding support for an autorestart keyword (to auto restart services which have this keyword explicitly specified in /etc/init.d/service) is a good idea.

I don't see this myself. You either want to restart crashed services or you don't want to.

Usually I want a service to be auto restarted, but

1) sometimes there are specific services I don't want to auto restart - I want to check first and then restart them manually.
2) there are services that, when restarted, would trigger a restart of dependent services, which is not always desired.

So giving the admin an option to control which services are auto restarted and which are not is certainly better.

BTW I did not understand whether the auto restart option is implemented or not?

Thanks

comment:5 Changed 18 months ago by ohnobinki@ohnopublishing.net

  • Cc ohnobinki@ohnopublishing.net added

comment:6 Changed 17 months ago by hollow@gentoo.org

  • Cc hollow@gentoo.org added

comment:7 follow-up: ↓ 8 Changed 16 months ago by roy

  • Status changed from new to closed
  • Resolution set to fixed

r1547 will now start crashed services by default.
However, it still will not stop them, as that could bring down other critical services.
This is toggleable by rc_crashed_start=YES rc_crashed_stop=YES in /etc/rc.conf

You could automate this now by placing rc into a cron job.
The only question remaining is should we provide an option so that rc only deals with crashed services and doesn't stop any manually running services? Re-open or file a new bug if you want this.

comment:8 in reply to: ↑ 7 ; follow-up: ↓ 9 Changed 16 months ago by piavka@cs.bgu.ac.il

Replying to roy:

r1547 will now start crashed services by default.
However, it still will not stop them, as that could bring down other critical services.
This is toggleable by rc_crashed_start=YES rc_crashed_stop=YES in /etc/rc.conf

You could automate this now by placing rc into a cron job.

I did not understand whether crashed services will be restarted automatically when I have rc_crashed_start=YES, or whether a cron job should still do it by running rc, i.e. something like this:

  * * * * * /bin/rc 3

Doing this with a cron job definitely seems like a bad idea to me. (I would even prefer writing a cron job which parses the output of rc-status for crashed services, auto restarts only the services I want, and notifies me about the crashed services which I prefer to restart manually.)

I was thinking more about integrating the functionality of daemontools (or even better, OpenSolaris self-healing services with the SMF framework) into openrc,

which restarts a service as soon as it detects it has crashed
and so that I could specify explicitly for each service separately whether it should be monitored and restarted automatically or not, e.g. have a keyword for that in /etc/init.d/servicename:
depend() {
    keyword monitor
}
or something in /etc/rc.conf instead.

what do you think?

The only question remaining is should we provide an option so that rc only deals with crashed services and doesn't stop any manually running services? Re-open or file a new bug if you want this.

comment:9 in reply to: ↑ 8 ; follow-up: ↓ 10 Changed 16 months ago by roy

Replying to piavka@…:

I did not understand whether crashed services will be restarted automatically when I have rc_crashed_start=YES, or whether a cron job should still do it by running rc, i.e. something like this:

  * * * * * /bin/rc 3

Doing this with a cron job definitely seems like a bad idea to me. (I would even prefer writing a cron job which parses the output of rc-status for crashed services, auto restarts only the services I want, and notifies me about the crashed services which I prefer to restart manually.)

Well, you can now do this as well using r1550

for svc in $(rc-status --crashed); do
   rc-service $svc -- --nodeps restart
done

In turn you could do something more complete like this

echo "Checking for crashed services:"
rc_monitor_restart="apache dovecot postfix postgres"
for svc in $(rc-status --crashed); do
    for chk in $rc_monitor_restart; do
        if [ "$chk" = "$svc" ]; then
            # We use --nodeps as a restart could stop critical services that depend on us
            rc-service $svc -- --nodeps restart
            continue 2
        fi
    done
    echo "  $svc crashed and requires a manual restart"
done

That makes things a lot easier than parsing existing rc-status output :)

I was thinking more about integrating the functionality of daemontools (or even better, OpenSolaris self-healing services with the SMF framework) into openrc,

which restarts a service as soon as it detects it has crashed
and so that I could specify explicitly for each service separately whether it should be monitored and restarted automatically or not, e.g. have a keyword for that in /etc/init.d/servicename:
depend() {
    keyword monitor
}
or something in /etc/rc.conf instead.

what do you think?

I don't see what benefit having another running daemon would be.
Also, servers generally have some kind of heartbeat or system status daemon that monitors other things such as load, number of users, free disk space, etc. It would be better to try and integrate with that rather than providing yet another daemon.

The beauty of running via cron is that it's easy to get reports mailed to the right people.
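
As a rough sketch of the cron approach: the fragment below assumes a cron implementation that honours the MAILTO variable (Vixie cron and cronie do), and assumes the monitoring script above has been saved at the hypothetical path /usr/local/sbin/check-crashed. The address and the five-minute interval are placeholders as well.

```
# Mail any output of the jobs below to this address (placeholder).
MAILTO=admin@example.com
# Every five minutes, run the crashed-service check (hypothetical path).
*/5 * * * * /usr/local/sbin/check-crashed
```

To get mail only when something actually crashed, the script would need to stay silent otherwise, e.g. by dropping the unconditional "Checking for crashed services:" line.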

comment:10 in reply to: ↑ 9 ; follow-up: ↓ 11 Changed 16 months ago by piavka@cs.bgu.ac.il

Well, you can now do this as well using r1550

Great, thanks for this patch.

I don't see what benefit having another running daemon would be.
Also, servers generally have some kind of heartbeat or system status daemon that monitors other things such as load, number of users, free disk space, etc. It would be better to try and integrate with that rather than providing yet another daemon.

Well, the goal is to restart a service as soon as it fails. In other words, make the failure detection event based and not poll based (as with cron or another polling daemon).
One way to do this could be a monitoring daemon that uses inotify to watch for changes in the directory where openrc registers crashed services (it could even be a simple daemonized shell script utilizing inotify-tools). The question is how openrc detects crashed services. Does it notice the crash immediately, or only when the rc-status command is run manually?

comment:11 in reply to: ↑ 10 ; follow-ups: ↓ 12 ↓ 14 Changed 16 months ago by roy

Replying to piavka@…:

Well, the goal is to restart a service as soon as it fails. In other words, make the failure detection event based and not poll based (as with cron or another polling daemon).
One way to do this could be a monitoring daemon that uses inotify to watch for changes in the directory where openrc registers crashed services (it could even be a simple daemonized shell script utilizing inotify-tools). The question is how openrc detects crashed services. Does it notice the crash immediately, or only when the rc-status command is run manually?

OpenRC only notices a crash when asked to check.
It can be quite a bit of work for OpenRC to find out if a process is still running or not, especially on BSD systems where /proc may not exist and we have to make kernel calls.
We can't just find the pid and remember it, as some daemons re-fork and re-write their pids on SIGHUP.

We don't store the fact that it crashed on disk as the daemon could be restarting itself for some reason outside of OpenRC.

comment:12 in reply to: ↑ 11 Changed 16 months ago by piavka@cs.bgu.ac.il

Replying to roy:

We don't store the fact that it crashed on disk as the daemon could be restarting itself for some reason outside of OpenRC.

Then what is the /lib/rc/init.d/failed directory for?

comment:13 Changed 16 months ago by roy

So that when a service fails to start (i.e. it returns non-zero) OpenRC doesn't try to start it again.
We only check crashed for started services :)

comment:14 in reply to: ↑ 11 Changed 16 months ago by piavka@cs.bgu.ac.il

Replying to roy:

OpenRC only notices a crash when asked to check.

Then what checks are done by OpenRC to see if a service has crashed or not?

It can be quite a bit of work for OpenRC to find out if a process is still running or not, especially on BSD systems where /proc may not exist and we have to make kernel calls.
We can't just find the pid and remember it, as some daemons re-fork and re-write their pids on SIGHUP.

comment:15 Changed 16 months ago by roy

Depends on the parameters given to start-stop-daemon.

If a pidfile is given then we just check if the pid is still valid.
This is a fast test for all systems and doesn't require any special permissions beyond being able to read the pidfile.

Otherwise we scan the running processes and match process name and optionally command line arguments. This is the slow bit, and does require super user permissions on some platforms.
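
To make the pidfile case concrete, here is a minimal shell sketch of that kind of check. It is an illustration only, not OpenRC's implementation (OpenRC does this internally): the function name pidfile_alive is made up for this example, and it uses kill -0, which behaves slightly differently from the check described above in that a non-root caller gets a failure for a live process owned by another user.

```shell
#!/bin/sh
# Illustrative sketch only: OpenRC performs this check internally,
# not through a shell function. pidfile_alive is a hypothetical name.

# Return 0 if the first line of the pidfile names a live process, 1 otherwise.
pidfile_alive() {
    # A missing or unreadable pidfile counts as "not running".
    [ -r "$1" ] || return 1
    _pa_pid=$(head -n 1 "$1")
    # Reject empty, zero, or non-numeric content before handing it to kill
    # (PID 0 would address the caller's whole process group).
    case $_pa_pid in
        ''|0|*[!0-9]*) return 1 ;;
    esac
    # Signal 0 delivers nothing; it only reports whether the PID can be signalled.
    # Caveat: as non-root this also fails for live processes owned by other users.
    kill -0 "$_pa_pid" 2>/dev/null
}
```

It could be used as, for example, `pidfile_alive /var/run/sshd.pid || echo "sshd is not running"`, where the pidfile path is just an example. A stale pidfile whose PID has been reused by an unrelated process would still pass, which is one reason matching on process name and arguments is sometimes needed.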
