Date: Thu, 09 Oct 2003 02:18:25 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Harti Brandt <brandt@fokus.fraunhofer.de>
Cc: freebsd-hackers@freebsd.org
Subject: Re: Dynamic reads without locking.
Message-ID: <3F8527E1.26ED0CF6@mindspring.com>
References: <20031008083059.GA520@garage.freebsd.pl> <20031008114506.I63940@beagle.fokus.fraunhofer.de>
Harti Brandt wrote:
> You need to lock when reading if you insist on consistent data. Even a
> simple read may be non-atomic (this should be the case for 64bit
> operations on all our platforms). So you need to do
>
> 	mtx_lock(&foo_mtx);
> 	bar = foo;
> 	mtx_unlock(&foo_mtx);
>
> if foo is a datatype that is not guaranteed to be read atomically. For
> 8-bit data you should be safe without the lock on any architecture. I'm
> not sure for 16 and 32 bit, but for 64-bit you need the lock for all
> our architectures, I think.
>
> If you don't care about occasionally reading false data (for statistics
> or such stuff) you can go without the lock.

For a read-before-write type read, this is always true.

For an interlock read, this is also true, if you intend to use the data
you read as a timing-dependent one-shot (e.g. for a condition variable,
etc.).

For certain uses, however, it's safe not to lock before the read *on
Intel architectures*. This can go out the window on SPARC or PPC, or on
any architecture where there is no guarantee against speculative
execution or out-of-order execution without explicit synchronization.
For instance, on a PPC there are two barrier instructions used to
implement a mutex, "isync" and "sync": one is used when checking it,
the other when taking it. Using these, it's a lot less expensive to
implement a mutex, and you are guaranteed that after you take a mutex,
there are no speculative-execution coherency issues left outstanding.

For things with an explicit periodicity, with no explicit guarantee of
order of operation, reading without a lock will do no damage.
For example, consider the case of a push-model migration of processes
from one CPU to another, and an (in the general case) lockless
implementation of a per-CPU scheduler queue with explicit migration
requirements -- this avoids the FreeBSD issue of having to take a lock
each time you enter the scheduler, and avoids the IPI that results, as
well as implying implicit CPU affinity; here's how it works:

Each CPU has 3 values associated with it:

	int	load
	ptr	migration queue
	ptr	run queue

When it's time to schedule a process on a given CPU, here is the order
of operation:

1)	Compare the migration queue ptr to NULL; this is a lockless
	operation.

2)	If the ptr is non-NULL,

	a)	take a lock on the migration queue.

	b)	move items on it to the per-CPU run queue -- LIFO;
		writing the per-CPU run queue is a lockless operation,
		since no other CPU is permitted access to it (no "pull"
		of work for migration).

	c)	NULL the migration queue head of the now-empty
		migration queue.

	d)	drop the migration queue lock.

3)	Take the entry off the top of my run queue; this is what I'm
	going to run next.

4)	Compare my 'load' value against that of other CPUs; since a CPU
	is the only one permitted to write its own load value, this is
	a lockless operation for the read of all other CPUs' 'load'
	values, so long as each value is accessible in a single
	bus/memory cycle.

5)	If my load is significantly higher than the lowest of all other
	CPUs', consider migrating work to the lowest-loaded CPU:

	a)	Locate any candidates for migration; these are at the
		current top of my own run queue *after* the removal of
		the next thing I'm going to run, *without* the "don't
		migrate" bit set.

	b)	If a candidate exists,

		i)	Remove the item from the local run queue
			(lockless).

		ii)	Take the lock on the target CPU migration
			queue.

		iii)	Put the item on the target migration queue.

		iv)	Drop the migration queue lock.

6)	Recalculate my local 'load' value for the benefit of other
	CPUs.

7)	Start executing the item obtained in step #3.
Because the operation occurs periodically, and because you skip items
that are marked non-migrate (you could have a bounded depth to the
search, if so desired), there is the ability to implement explicit CPU
affinity, and there is also the ability to implement implicit CPU
affinity (since there is hysteresis in the act of deciding to "push"
work off to another CPU).

The purpose of the LIFO ordering, and of examining the migration queue
before examining the local run queue, is that the work the other CPU
pushed to you was at the top of its run queue, and this avoids
penalizing migrated objects with one scheduler cycle of latency.

Note that all this works, even in a system that is decoherent on the
migration queue examination, since the worst case is that you delay a
migration for up to two scheduling cycles (one for the 'load' value on
the origin, one for the migration queue examination on the target), and
the common case is a 1.5 cycle delay (since the grab of the migration
queue mutex by the origin forces a coherency cycle). And you are
guaranteed to be delaying the thing to run until the next scheduling
cycle on the origin CPU anyway.

So there *are* cases where it's safe, as long as you have either:

1)	Multiple readers, single fixed writer, OR
2)	Multiple locked writers, single fixed reader, AND
3)	Repetition with implicit hysteresis, AND
4)	Insensitivity to order of operation

I'm pretty sure that the cases Jeffrey was thinking about all fall into
that description bucket. If they don't, then he's probably not caring
about SPARC and PPC, which would make sense, in terms of the model that
Dragonfly BSD is using -- though I expect that they will fail to
support NUMA or clustered systems properly, if that's the case (which I
doubt).

-- Terry
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3F8527E1.26ED0CF6>