Date:      Wed, 27 Sep 2000 21:52:06 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        arch@freebsd.org
Cc:        tlambert@primenet.com
Subject:   Re: Mutexes and semaphores (was: cvs commit: src/sys/conf files
Message-ID:  <200009272152.OAA27175@usr02.primenet.com>
In-Reply-To: <200009271851.LAA07258@vashon.polstra.com> from "John Polstra" at Sep 27, 2000 11:51:54 AM

John Polstra writes:
> Terry Lambert  <tlambert@primenet.com> wrote:
> > The idea that you should ever go to sleep waiting for a mutex is
> > antithetical to the very idea.
> 
> Are you assuming that mutexes spin but semaphores sleep?  You haven't
> actually said so explicitly, but I'm starting to think that's your
> assumption.

[ ... ]

> > It does.  Semaphores can be held across a sleep (a wait).
> 
> You must be assuming that mutexes always spin and semaphores don't.  I
> don't agree with that assumption, but at least it would explain why we
> can't seem to communicate effectively on this topic.

Yes, this is exactly my assumption.  Here's why:

You have to ask "why do I use a mutex?"; the answer is "to protect
data".

Then you have to ask "why do I need to protect data?"; the answer
is more complicated, but boils down to "to prevent concurrent
access in various and sundry situations which may arise".

To my mind, there are only four cases which may result in an
attempt at concurrent access:

1)	SMP; concurrent access is attempted by another
	processor

2)	Kernel reentrancy as a result of an exception (e.g.
	a page fault)

3)	Kernel reentrancy as a result of an interrupt

4)	Kernel preemption, as part of a Real Time subsystem


In #1, it is not only acceptable to spin, it is preferred, since
we know that (A) a mutex is only held for a very short period of
time, and (B) we don't want to pay the penalty for going to
sleep, since we are talking about stalling a single processor for
a very short period of time, and we may have 8 processors, 7 of
which would have to pay the penalty for a signal delivery, if a
wakeup occurred.
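
As a concrete illustration of the spin case, here is a minimal
sketch in C11 (the names spin_mtx, spin_lock, and spin_unlock
are mine, not any existing FreeBSD interface); a real
implementation would add a CPU pause hint in the loop:

	#include <stdatomic.h>

	struct spin_mtx {
		atomic_flag locked;
	};
	#define	SPIN_MTX_INIT	{ ATOMIC_FLAG_INIT }

	static void
	spin_lock(struct spin_mtx *m)
	{
		/* Busy-wait; cheaper than a sleep/wakeup round trip
		 * when the hold time is a handful of instructions. */
		while (atomic_flag_test_and_set_explicit(&m->locked,
		    memory_order_acquire))
			;	/* spin */
	}

	static void
	spin_unlock(struct spin_mtx *m)
	{
		atomic_flag_clear_explicit(&m->locked,
		    memory_order_release);
	}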

In #2 and #3, the degenerate cases are identical.  One could
conceive of a shared PCI interrupt being handled by different
processors, given "bottom-end single threading" and "top-end
multithreading".  But devices own their own resources; only
if we were able to multithread our "bottom-end", the actual
hardware event handler that does nothing other than handle the
hardware event to the point that it is no longer needed (this
implies reenabling the interrupt outside of interrupt context),
would there be a contented resource issue to resolve.  Further,
when operating in interupt or exception context, we are running
in the context that was active at the time of the event.  This
bodes ill for recursive mutex acquisition, since it means that
we could change data out from under the active context, despite
it holding the mutex.  Thus we can not use a mutex to contend
resource between interrupt, exception, and normal contexts, if
by ownership, we mean one of the three.  NT resolves this by
turning interrupt and exception contexts into heavy-weight
contexts, on a par with a process (different page mapping, etc.).
If FreeBSD does not do the same, then there MUST be no resource
contention between these domains.  If it does the same, then a
modification is required: you still spin, but you do an explicit
yield to the mutex holder during each cycle through the spin; so
we are left needing an owner that is unique between all contexts,
not simply between kernel threads.  This would only end up being
useful on systems where the I/O bus was never contended between
drivers in a non-transparent way (e.g. all data movement is
based on bus mastered DMA, with hardware contention, and interrupt
signalling by the host when a DMA should be started, and by
the device, when host processing should be started).
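
The "spin plus explicit yield" variant could look something like
this sketch, where the owner identity is unique across normal,
interrupt, and exception contexts.  current_owner() and
yield_to() are hypothetical hooks, stubbed here only so the
sketch compiles:

	#include <stdatomic.h>

	typedef void *owner_t;		/* unique across all contexts */

	struct ymtx {
		_Atomic(owner_t) owner;	/* NULL when unowned */
	};

	/* Hypothetical hooks; a kernel would supply real ones. */
	static owner_t current_owner(void) { return (owner_t)1; }
	static void yield_to(owner_t o) { (void)o; }

	static void
	ymtx_lock(struct ymtx *m)
	{
		owner_t self = current_owner(), expect;

		for (;;) {
			expect = NULL;
			if (atomic_compare_exchange_strong(&m->owner,
			    &expect, self))
				return;		/* got it */
			/* Donate cycles to the holder so it can run
			 * long enough to release the mutex. */
			yield_to(expect);
		}
	}

	static void
	ymtx_unlock(struct ymtx *m)
	{
		atomic_store_explicit(&m->owner, (owner_t)NULL,
		    memory_order_release);
	}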

In #4, we have to cope with sleeping, since it is possible that
a context will be put to sleep involuntarily while holding a
mutex.  The easiest way to handle this is to merely delay the
sleep until all mutexes held by the unique owner have been
relinquished.
This implies a condition variable in the process structure,
an overall mutex hold count, and another mutex to protect the
hold count and the condition variable, associated with the
context structure unique to the identity "owner".  This will
permit priority lending that lasts only for the duration of the
held contended resource(s).  Use of this complicates matters,
but the benefit of RT support that would come with it is high,
if RT is what floats your boat.  A system utilizing this
approach could be conditionally compiled as "RT" or "non-RT",
using macro substitution, so the hit need not be taken in a
"GENERIC" kernel.

So for the most part, unless we are implementing RT and a
separate context for each potential concurrent interrupt to
the host, and for each concurrent exceptional condition (I see
a need for a minimum of 2: one for the F00F bug, and one for
386 kernel page write fault processing), mutexes should spin.


> > Even if you can make a case for this not being true (e.g. moving
> > a vnode from one list to another, using the same pointers in
> > the vnode to track state on both lists, which is really just an
> > acquire/remove/release/acquire/insert/release operation, where you
> > have a window between the removal and the reinsertion), it can be
> > handled by strictly controlling the order of operation on mutex
> > acquisition, and inverting the release order, and backing off in
> > case of conflict.
> 
> Yes, that's the standard way of avoiding deadlock.  Though as far as
> I can see, the release order doesn't actually matter, since releasing
> never blocks anybody.

The inversion, "acquire A/acquire B/process/release A/release B",
ensures that the acquisitions can occur concurrently in a forward
path, in the face of interrupt/exception/kernel preemption.  It
prevents a starvation deadlock.

In the RT priority lending case (to support kernel preemption),
it's possible to have a deadly embrace deadlock without this, as
well, based on a low priority task "hogging" a contended resource
using a two mutex strategy, and therefore raising its own priority
(this isn't a security issue, since user space mutex code is not
an externalization of kernel space mutex code).  By permitting the
lending context to acquire A and go back into the spin/yield loop
for resource B, you preclude this.
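
In code, the ordered acquisition with backoff might look like
this (pthreads again standing in for kernel mutexes; lock_both
and unlock_both are names I made up):

	#include <pthread.h>
	#include <sched.h>

	static void
	lock_both(pthread_mutex_t *a, pthread_mutex_t *b)
	{
		for (;;) {
			pthread_mutex_lock(a);	/* A always first */
			if (pthread_mutex_trylock(b) == 0)
				return;		/* have both */
			/* Conflict on B: back off by dropping A so
			 * the contending context can make progress,
			 * then retry from the top. */
			pthread_mutex_unlock(a);
			sched_yield();
		}
	}

	static void
	unlock_both(pthread_mutex_t *a, pthread_mutex_t *b)
	{
		/* Inverted release order: A before B. */
		pthread_mutex_unlock(a);
		pthread_mutex_unlock(b);
	}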

For constructs like:

	acquire A/diddle A/release A
	acquire B/diddle B/release B
	acquire A/diddle A/release A

You would have to recode them as:

	acquire A/diddle A
	acquire B/diddle B
	diddle A/release A
	release B

As previously pointed out, this is most likely to arise with
list manipulation involving variables shared between lists that
are protected by different mutexes.
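
For instance, moving a node between two lists protected by
different mutexes could be coded this way (types and names are
illustrative; sys/queue.h supplies the list macros):

	#include <pthread.h>
	#include <sys/queue.h>

	struct node {
		TAILQ_ENTRY(node) link;	/* same linkage, both lists */
	};
	TAILQ_HEAD(nlist, node);

	static struct nlist list_a = TAILQ_HEAD_INITIALIZER(list_a);
	static struct nlist list_b = TAILQ_HEAD_INITIALIZER(list_b);
	static pthread_mutex_t mtx_a = PTHREAD_MUTEX_INITIALIZER;
	static pthread_mutex_t mtx_b = PTHREAD_MUTEX_INITIALIZER;

	/* Move n from list_a to list_b with no window in which it
	 * is visible on neither list. */
	static void
	move_a_to_b(struct node *n)
	{
		pthread_mutex_lock(&mtx_a);	/* acquire A */
		pthread_mutex_lock(&mtx_b);	/* acquire B */
		TAILQ_REMOVE(&list_a, n, link);	/* diddle A */
		TAILQ_INSERT_TAIL(&list_b, n, link); /* diddle B */
		pthread_mutex_unlock(&mtx_a);	/* release A */
		pthread_mutex_unlock(&mtx_b);	/* release B */
	}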


> > > If the mutex is recursively held, there is a problem in that some
> > > other code grabbed the mutex and expected it to protect the data
> > > structure from being changed underfoot.
> >
> > Worst case, set an "IN_USE" flag on the data in a flags field to
> > bar reentry on a given data item.  Best case, fix the broken code.
> > The vnode locking code does this today (I'd argue that it's broken
> > code).
> 
> Well, we are dealing with a lot of legacy code that was never designed
> with threads in mind.  I personally believe that the recursive mutex
> is a reasonable primitive to deal with it, particularly during the
> transition phase.

Well, we have this flag _now_ in the vnode code, so it's not
like anything is being saved by not using it.  I really don't
see much code that has this problem.  There's the scheduler,
the VM system, and some process structure stuff having to do
with fork/exec/_exit, but that seems to be it, after the vnode
cruft.
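
For what it's worth, the flag approach amounts to something like
this sketch (names invented; the real vnode code has its own
flag names and sleep mechanism):

	#include <pthread.h>

	#define	IN_USE	0x0001

	struct item {
		pthread_mutex_t mtx;	/* guards flags only */
		pthread_cond_t	cv;	/* wait for IN_USE to clear */
		int		flags;
		/* ... the data the flag actually protects ... */
	};

	static void
	item_enter(struct item *ip)
	{
		pthread_mutex_lock(&ip->mtx);
		while (ip->flags & IN_USE)
			pthread_cond_wait(&ip->cv, &ip->mtx);
		ip->flags |= IN_USE;	/* bar reentry */
		pthread_mutex_unlock(&ip->mtx);
	}

	static void
	item_exit(struct item *ip)
	{
		pthread_mutex_lock(&ip->mtx);
		ip->flags &= ~IN_USE;
		pthread_cond_signal(&ip->cv);
		pthread_mutex_unlock(&ip->mtx);
	}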


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

