Date: Wed, 27 Sep 2000 21:52:06 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: arch@freebsd.org
Cc: tlambert@primenet.com
Subject: Re: Mutexes and semaphores (was: cvs commit: src/sys/conf files
Message-ID: <200009272152.OAA27175@usr02.primenet.com>
In-Reply-To: <200009271851.LAA07258@vashon.polstra.com> from "John Polstra" at Sep 27, 2000 11:51:54 AM
John Polstra writes:
> Terry Lambert <tlambert@primenet.com> wrote:
> > The idea that you should ever go to sleep waiting for a mutex is
> > antithetical to the very idea.
>
> Are you assuming that mutexes spin but semaphores sleep?  You
> haven't actually said so explicitly, but I'm starting to think
> that's your assumption.

[ ... ]

> > It does.  Semaphores can be held across a sleep (a wait).
>
> You must be assuming that mutexes always spin and semaphores don't.
> I don't agree with that assumption, but at least it would explain
> why we can't seem to communicate effectively on this topic.

Yes, this is exactly my assumption.  Here's why:

You have to ask "why do I use a mutex?"; the answer is "to protect
data".  Then you have to ask "why do I need to protect data?"; the
answer is more complicated, but boils down to "to prevent concurrent
access in various and sundry situations which may arise".

To my mind, there are only four cases which may result in an attempt
at concurrent access:

1)	SMP; concurrent access is attempted by another processor

2)	Kernel reentrancy as a result of an exception (e.g. a page
	fault)

3)	Kernel reentrancy as a result of an interrupt

4)	Kernel preemption, as part of a Real Time subsystem

In #1, it is not only acceptable to spin, it is preferred, since we
know that (A) a mutex is only held for a very short period of time,
and (B) we don't want to pay the penalty for going to sleep, since we
are talking about stalling a single processor for a very short period
of time, and we may have 8 processors, 7 of which would have to pay
the penalty for a signal delivery, if a wakeup occurred.

In #2 and #3, the degenerate cases are identical.  One could conceive
of a shared PCI interrupt being handled by different processors,
given "bottom-end single threading" and "top-end multithreading".
But devices own their own resources; only if we were able to
multithread our "bottom-end", the actual hardware event handler that
does nothing other than handle the hardware event to the point that
it is no longer needed (this implies reenabling the interrupt outside
of interrupt context), would there be a contended resource issue to
resolve.

Further, when operating in interrupt or exception context, we are
running in the context that was active at the time of the event.
This bodes ill for recursive mutex acquisition, since it means that
we could change data out from under the active context, despite it
holding the mutex.  Thus we can not use a mutex to contend for a
resource between interrupt, exception, and normal contexts, if by
"owner" we mean one of the three.

NT resolves this by turning interrupt and exception contexts into
heavy-weight contexts, on a par with a process (different page
mapping, etc.).  If FreeBSD does not do the same, then there MUST be
no resource contention between these domains.  If it does the same,
then a modification is required: you still spin, but you do an
explicit yield to the mutex holder during each cycle through the
spin; so we are left needing an owner that is unique across all
contexts, not simply between kernel threads.

This would only end up being useful on systems where the I/O bus was
never contended between drivers in a non-transparent way (e.g. all
data movement is based on bus-mastered DMA, with hardware contention,
and interrupt signalling by the host when a DMA should be started,
and by the device, when host processing should be started).

In #4, we have the need to sleep, since it is possible that you will
be put to sleep involuntarily while holding a mutex.  The easiest way
to handle this is to merely delay the sleep until all mutexes held by
the unique owner have been relinquished.
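As a minimal user-space sketch of the spin-with-yield acquisition I
describe for the interrupt/exception case: each failed pass through
the spin explicitly yields the CPU so the holder can run and release
the lock.  The names (spin_mtx, sched_yield() as the yield primitive)
are my own stand-ins for illustration, not FreeBSD's actual mutex
implementation:

```c
/* Sketch only: a spinning mutex that yields on each failed pass.
 * spin_mtx and the choice of sched_yield() are hypothetical, not
 * the kernel's real primitives. */
#include <stdatomic.h>
#include <sched.h>

struct spin_mtx {
	atomic_flag locked;
};

static void
spin_mtx_init(struct spin_mtx *m)
{
	atomic_flag_clear(&m->locked);
}

static void
spin_mtx_lock(struct spin_mtx *m)
{
	/* Spin; each failed attempt yields so a preempted holder on
	 * this CPU gets a chance to run and drop the lock. */
	while (atomic_flag_test_and_set_explicit(&m->locked,
	    memory_order_acquire))
		sched_yield();
}

static void
spin_mtx_unlock(struct spin_mtx *m)
{
	atomic_flag_clear_explicit(&m->locked, memory_order_release);
}
```

In a real kernel the yield would have to be directed at the unique
owner (a priority lend), not a blind reschedule, per the point above.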
This implies a condition variable in the process structure, an
overall mutex hold count, and another mutex to protect the hold count
and the condition variable, associated with the context structure
unique to the identity "owner".  This will permit priority lending
that lasts only for the duration of the held contended resource(s).

Use of this complicates matters, but the benefit of RT support that
would come with it is high, if RT is what floats your boat.  A system
utilizing this approach could be conditionally compiled as "RT" or
"non-RT", using macro substitution, so the hit need not be taken in a
"GENERIC" kernel.

-

So for the most part, unless we are implementing RT and a separate
context for each potential concurrent interrupt to the host, and each
concurrent exceptional condition (I see a need for a minimum of 2,
for the F00F bug, and for 386 kernel page write fault processing),
mutexes should spin.

> > Even if you can make a case for this not being true (e.g. moving
> > a vnode from one list to another, using the same pointers in the
> > vnode to track state on both lists, which is really just an
> > acquire/remove/release/acquire/insert/release operation, where
> > you have a window between the removal and the reinsertion), it
> > can be handled by strictly controlling the order of operation on
> > mutex acquisition, and inverting the release order, and backing
> > off in case of conflict.
>
> Yes, that's the standard way of avoiding deadlock.  Though as far
> as I can see, the release order doesn't actually matter, since
> releasing never blocks anybody.

The inverted release order in "acquire A/acquire B/process/release
A/release B" ensures that the acquisition can occur concurrently in a
forward path, in the face of interrupt/exception/kernel preemption.
It prevents a starvation deadlock.
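The per-owner bookkeeping I describe for the #4 (delayed sleep) case
could be sketched like this, using pthreads primitives in place of
kernel ones.  The names (owner_ctx, owner_note_acquire, etc.) are
hypothetical; the point is just the shape: a hold count guarded by
its own lock, and a condition variable that the involuntary sleep
waits on until the count drains to zero:

```c
/* Sketch only: per-"owner" hold-count bookkeeping so an involuntary
 * sleep can be delayed until all held mutexes are relinquished.
 * All names here are illustrative, not real FreeBSD structures. */
#include <pthread.h>

struct owner_ctx {
	pthread_mutex_t	meta;		/* protects hold_count */
	pthread_cond_t	all_free;	/* signalled at count == 0 */
	int		hold_count;	/* mutexes currently held */
};

static void
owner_init(struct owner_ctx *o)
{
	pthread_mutex_init(&o->meta, NULL);
	pthread_cond_init(&o->all_free, NULL);
	o->hold_count = 0;
}

/* Called whenever this owner acquires a mutex. */
static void
owner_note_acquire(struct owner_ctx *o)
{
	pthread_mutex_lock(&o->meta);
	o->hold_count++;
	pthread_mutex_unlock(&o->meta);
}

/* Called whenever this owner releases one. */
static void
owner_note_release(struct owner_ctx *o)
{
	pthread_mutex_lock(&o->meta);
	if (--o->hold_count == 0)
		pthread_cond_broadcast(&o->all_free);
	pthread_mutex_unlock(&o->meta);
}

/* The delayed sleep: block here until nothing is held. */
static void
owner_wait_until_free(struct owner_ctx *o)
{
	pthread_mutex_lock(&o->meta);
	while (o->hold_count > 0)
		pthread_cond_wait(&o->all_free, &o->meta);
	pthread_mutex_unlock(&o->meta);
}
```

The "another mutex" above is `meta`: it protects only the count and
the condition variable, never the contended data itself, so it can
stay a plain spin lock without recursing into this machinery.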
In the RT priority lending case (to support kernel preemption), it's
possible to have a deadly embrace deadlock without this as well,
based on a low priority task "hogging" a contended resource using a
two mutex strategy, and thereby raising its own priority (this isn't
a security issue, since user space mutex code is not an
externalization of kernel space mutex code).  By permitting the
lending context to acquire A and go back into the spin/yield loop for
resource B, you preclude this.

For constructs like:

	acquire A/diddle A/release A
	acquire B/diddle B/release B
	acquire A/diddle A/release A

You would have to recode them as:

	acquire A/diddle A
	acquire B/diddle B
	diddle A/release A
	release B

As previously pointed out, this is most likely with list manipulation
involving variables shared between lists that are protected by
different mutexes.

> > > If the mutex is recursively held, there is a problem in that
> > > some other code grabbed the mutex and expected it to protect
> > > the data structure from being changed underfoot.
> >
> > Worst case, set an "IN_USE" flag on the data in a flags field to
> > bar reentry on a given data item.  Best case, fix the broken
> > code.  The vnode locking code does this today (I'd argue that
> > it's broken code).
>
> Well, we are dealing with a lot of legacy code that was never
> designed with threads in mind.  I personally believe that the
> recursive mutex is a reasonable primitive to deal with it,
> particularly during the transition phase.

Well, we have this flag _now_ in the vnode code, so it's not like
anything is being saved by not using it.  I really don't see much
code that has this problem.  There's the scheduler, the VM system,
and some process structure stuff having to do with fork/exec/_exit,
but that seems to be it, after the vnode cruft.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message