Date:      Sun, 16 Jun 2013 11:46:17 +0100
From:      David Chisnall <theraven@FreeBSD.org>
To:        Florent Peterschmitt <florent@peterschmitt.fr>
Cc:        "freebsd-current@freebsd.org FreeBSD" <freebsd-current@FreeBSD.org>
Subject:   Re: Handle kernel module crashes
Message-ID:  <B4C001E3-86FC-409C-8B33-A52E7115E0C1@FreeBSD.org>
In-Reply-To: <51B5E575.1030006@peterschmitt.fr>
References:  <51B5E040.2030709@peterschmitt.fr> <CAFMmRNxPcmx4gtwQfLjaFnMhAxBcBzYBd45vxJDcAU55ZFirQw@mail.gmail.com> <51B5E575.1030006@peterschmitt.fr>

On 10 Jun 2013, at 15:40, Florent Peterschmitt <florent@peterschmitt.fr> wrote:

> Ok and isn't it a "bad" thing ? I mean, even if the video driver
> crashes, I still want to have the ability to reboot the right way,
> avoiding corrupted files and WIP lose.
>
> Another thing is a non-critical module that can crash, but because not
> used by all apps on the machine, letting them ones that can continue run.
>
> But I don't know what is the approach of FreeBSD and devs about that.

Yes, it's a bad thing.  If we had privilege domain crossing that was as
cheap as a function call (or, at least, almost as cheap) then we could
implement fine-grained separation within the kernel and not incur any
performance penalty.  Unfortunately, this is not possible without some
fairly significant changes to current CPU instruction sets (which,
actually, several of us in FreeBSD land are working on, but that's
unlikely to be seen in any mainstream processor for at least 5-10
years).

In the current world, we have a fairly poor selection of choices for
isolation.  On i386, we had 4 protection rings, but on the 486 and
newer, transitions to and from rings 1 and 2 became increasingly
expensive because most operating systems only used rings 0 and 3
(Netware and OS/2 are the two exceptions that I know of).  On other
architectures we just have privileged and unprivileged modes.  Code in
privileged mode can't be isolated from other code in privileged mode,
and code in unprivileged mode incurs some overhead for calls into
privileged mode.

There are some tricks that you can do to enforce some weaker protection.
For example, on 64-bit platforms every driver could be written to use
32-bit pointers and be given a 4GB segment of privileged-mode virtual
memory for its own use, and would have to go through special gates to do
anything with the whole kernel's address space.  You'd then end up with
a lot more TLB churn, but gain protection against a number of kinds of
pointer error (protection faults inside the 32-bit window would just
result in that module being killed and restarted).
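
A minimal sketch of the 32-bit window idea, not code from any FreeBSD
tree: the names drv_window, drv_ptr_t, drv_resolve and drv_contains are
all hypothetical, and the window base is assumed to be 4GB-aligned.  It
only shows why a 32-bit offset cannot name memory outside the driver's
own window; the gates, the fault handling and the restart logic are not
modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-driver window: 4GB of kernel virtual address space. */
#define DRV_WINDOW_SIZE (UINT64_C(1) << 32)

struct drv_window {
	uint64_t base;	/* start of the window, assumed 4GB-aligned */
};

/* A driver "pointer" is just a 32-bit offset into its own window. */
typedef uint32_t drv_ptr_t;

/*
 * Turn a driver pointer into a full 64-bit address.  Whatever value
 * the driver supplies, the result lies in [base, base + 4GB).
 */
static uint64_t
drv_resolve(const struct drv_window *w, drv_ptr_t p)
{
	/* The OR below is only equivalent to an add if base is aligned. */
	assert((w->base & (DRV_WINDOW_SIZE - 1)) == 0);
	return (w->base | (uint64_t)p);
}

/* Check whether a full 64-bit address falls inside the driver's window. */
static int
drv_contains(const struct drv_window *w, uint64_t addr)
{
	/* Unsigned wrap-around makes addresses below base fail the test. */
	return (addr - w->base < DRV_WINDOW_SIZE);
}
```

A wild value such as 0xdeadbeef still resolves to an address inside the
window, so the damage from that kind of pointer error stays within the
module that made it.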

Unfortunately, there are several problems with this.  The most obvious
is that killing a module is not always trivial.  For example, a module
may hold various locks, but it's not always clear which module owns a
lock.  Locks are held by kernel threads, but a thread can have a call
stack spanning several modules.  Working out exactly which driver holds
the lock is not always trivial, and there is also the question of what
you do about a thread that contains some call frames belonging to the
module that you've just killed.  You'd need to provide some
exception-like mechanism for handling this case (and unwinding the stack
in the case where it is potentially corrupt is also nontrivial).

An alternative is to run the driver entirely, or mostly, in userspace.
The 'mostly' option is often better.  For example, certain categories of
USB devices are exposed by the FreeBSD kernel as USB generic devices
(ugen driver) and some userspace component sends USB commands to it.
This involves some extra copying, but means that most of the
(potentially buggy) driver logic is in the application.  If it crashes,
you lose the application state (which, in a desktop setting, is only
slightly better than crashing the kernel), but not the whole kernel.

In the case of certain modern network interfaces (Infiniband in
particular) and modern GPUs, the kernel handles even less.  The device
has some hardware support for multiplexing and isolation and so all that
the kernel has to do is set up some memory that both the device and the
userspace code can access - including the device registers for
controlling a command queue - and then delegate most of the operation to
the userspace code.  This requires an IOMMU to actually provide
isolation, otherwise an errant DMA request can still result in accessing
or modifying kernel memory.
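
To make the command-queue arrangement concrete, here is a rough sketch
of a ring that userspace fills and a device drains.  Everything in it is
an assumption for illustration: struct cmd, struct cmd_ring, RING_SLOTS,
ring_submit and ring_consume are hypothetical names, the "device" side
is simulated by an ordinary function, and a real implementation would
need memory barriers and a device-specific doorbell register.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 8	/* hypothetical queue depth, a power of two */
static_assert((RING_SLOTS & (RING_SLOTS - 1)) == 0, "power of two");

struct cmd {
	uint32_t opcode;
	uint32_t arg;
};

/*
 * Stand-in for the memory the kernel would map into both the process
 * and the device.  head is advanced by the device, tail by userspace.
 */
struct cmd_ring {
	struct cmd slots[RING_SLOTS];
	uint32_t head;	/* next slot the device will consume */
	uint32_t tail;	/* next slot userspace will fill */
};

/* Queue one command; returns 0 on success, -1 if the ring is full. */
static int
ring_submit(struct cmd_ring *r, struct cmd c)
{
	if (r->tail - r->head == RING_SLOTS)
		return (-1);
	r->slots[r->tail % RING_SLOTS] = c;
	r->tail++;	/* on real hardware this would be a doorbell write */
	return (0);
}

/* Simulated device side: take one command; returns -1 if none queued. */
static int
ring_consume(struct cmd_ring *r, struct cmd *out)
{
	if (r->head == r->tail)
		return (-1);
	*out = r->slots[r->head % RING_SLOTS];
	r->head++;
	return (0);
}
```

Once the mapping exists, submitting work is just a store into shared
memory, so the kernel is not on the fast path at all.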

Even with this kind of isolation, there are still potential problems.
Many devices react poorly to bad input and can be left in a state that
is hard to recover from, even if the driver itself is easy to restart.
A lot of OS instability (I saw a number as high as 20% of OS crashes
quoted at MSR recently) is caused by drivers reacting poorly to
intermittent hardware errors.  Just restarting the driver (an approach
that they tried) solved some, but not all, of these cases.

Of course, there are a lot of things in the kernel that are not drivers.
For example, FUSE allows us to run filesystems in userspace instead of
in the kernel.  This comes with a performance penalty as a result of
having to copy data from the kernel's buffer cache into the filesystem
process, then back into the kernel, and then into the destination
process (for a read - the same sequence in the opposite order on write).
Similarly, we have CUSE for character devices, which is used by a lot
of webcam drivers.  These are a relatively good use-case for userspace
drivers, because they are typically a streaming interface (data comes
just from the device and there isn't a lot of need for latency-sensitive
round trips from the app to the driver) and the latency that users care
about is on the order of 1/24th of a second, which is a very long time
on a modern computer.  There are other examples, such as Netmap for
pushing network packets directly into userspace, which can be combined
with something like Ilias Marinos' userspace network stack to run the
entire TCP/IP stack in userspace.
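
To put a number on "a very long time", here is a small worked example of
the per-frame budget.  The 24 frames per second comes from the figure
above; the 3GHz clock in the usage note is purely an assumed value, and
the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds available per frame at a given frame rate. */
static uint64_t
frame_budget_ns(unsigned fps)
{
	assert(fps != 0);
	return (UINT64_C(1000000000) / fps);
}

/* CPU cycles that fit in one frame at a given clock rate in Hz. */
static uint64_t
cycles_per_frame(uint64_t clock_hz, unsigned fps)
{
	assert(fps != 0);
	return (clock_hz / fps);
}
```

At 24 frames per second the budget is a little under 42 milliseconds per
frame, and an assumed 3GHz core gets through 125 million cycles in that
time, which leaves plenty of room for a few extra copies and context
switches.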

Moving drivers into userspace is not a panacea.  It adds more
asynchronous behaviour, which makes reasoning about the code harder and
makes deadlocks far easier to introduce (for example, any userspace
process has a lot of implicit interactions with the VM subsystem, which
are more explicit in the kernel, and doesn't have a shared global
namespace for locks).  Most of the code in the kernel is there because,
when the code was written, it was the most sensible place for it.  In
most cases, that is still true, although as CPU and software
architectures evolve that may change.

David

