Date:      Fri, 15 Nov 2019 14:32:44 +1030
From:      "O'Connor, Daniel" <darius@dons.net.au>
To:        Eugene Grosbein <eugen@grosbein.net>
Cc:        Ian Lepore <ian@freebsd.org>, Daniel Braniss <danny@cs.huji.ac.il>, freebsd-hackers <freebsd-hackers@freebsd.org>
Subject:   Re: can the hardware watchdog reboot a hung kernel?
Message-ID:  <92134BA3-3BB3-4377-B9A7-1B1D702824F7@dons.net.au>
In-Reply-To: <eefaafea-54e0-a692-8588-a753b37b571c@grosbein.net>
References:  <EC4DB495-55D0-44BB-8D6A-0301785FADC7@cs.huji.ac.il> <9cded04a-9ae1-881e-3962-7ef0322e96ed@grosbein.net> <2AD912BF-97B0-421D-B561-722D74864DC9@cs.huji.ac.il> <828605fef472e04311c83a7de0d1f4df429ae717.camel@freebsd.org> <BEC1714A-2361-4B62-BEB9-82808920C269@cs.huji.ac.il> <ede820ea5c5f71cea2a98955d02b700b483e1899.camel@freebsd.org> <eefaafea-54e0-a692-8588-a753b37b571c@grosbein.net>

> On 15 Nov 2019, at 14:29, Eugene Grosbein <eugen@grosbein.net> wrote:
>
> 15.11.2019 1:19, Ian Lepore wrote:
>
>> One thing to be careful of here is multicore systems.  If you have a
>> critical app running on a multicore system, that app can hang (maybe it
>> tries to read from a device that has malfunctioned and essentially gets
>> hung forever in a device driver that doesn't implement timeouts very
>> well or something).  In that case, only one core is hung, so watchdogd
>> will be able to keep petting the dog to prevent a reboot, but since
>> your app is hung on a different core, you aren't really getting the
>> protection you need.
>>
>> The fix for that is to either turn your app into watchdogd (have it make
>> the periodic ioctl() calls to pet the dog), or use the '-e cmd' option
>> with watchdogd, and make 'cmd' be a script that somehow verifies that
>> your critical application is still running properly.
>
> I have not tried it myself, but there may be an easier way
> if the app is single-process and single-threaded: use cpuset(1) to bind
> both the app and watchdogd to the same core.

You can get watchdogd to run a script (the '-e cmd' option), so you could have that script check that the app is still alive somehow; if the check fails, watchdogd stops petting the dog and the dog will bite.
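
As a rough illustration of what such a check could look like, here is a minimal sketch in Python. It assumes the critical app touches a heartbeat file every few seconds; the path, the age limit and the function name are all made up for the example and are not anything watchdogd provides. The script exits 0 while the heartbeat is recent and non-zero otherwise, on the understanding that watchdogd treats a non-zero exit from the '-e' command as a failed check.

```python
#!/usr/bin/env python3
"""Liveness check intended to be run by watchdogd via '-e'.

Assumption: the critical app updates HEARTBEAT_PATH regularly.
Both constants below are hypothetical and need adjusting.
"""
import os
import sys
import time

HEARTBEAT_PATH = "/var/run/myapp.heartbeat"  # hypothetical path
MAX_AGE_SECONDS = 30.0                       # hypothetical limit


def heartbeat_is_fresh(path, max_age, now=None):
    """Return True if 'path' exists and was modified within 'max_age' seconds."""
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # Missing or unreadable heartbeat file counts as "not alive".
        return False
    if now is None:
        now = time.time()
    return (now - mtime) <= max_age


def main():
    # Exit status 0 means "healthy"; anything else means the check failed.
    return 0 if heartbeat_is_fresh(HEARTBEAT_PATH, MAX_AGE_SECONDS) else 1


if __name__ == "__main__":
    sys.exit(main())
```

If this were installed as, say, /usr/local/sbin/check_myapp.py (again a made-up name), it would be passed to watchdogd as the '-e' command. The age limit should be comfortably longer than the app's heartbeat interval and shorter than the watchdog timeout.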

--
Daniel O'Connor
"The nice thing about standards is that there
are so many of them to choose from."
 -- Andrew Tanenbaum




