Date: Tue, 23 Sep 2014 23:20:00 +0200
From: Jilles Tjoelker <jilles@stack.nl>
To: John Baldwin <jhb@freebsd.org>
Cc: adrian@freebsd.org, freebsd-threads@freebsd.org
Subject: Re: sem_post() performance
Message-ID: <20140923212000.GA78110@stack.nl>
In-Reply-To: <1531724.MPBlj40xOW@ralph.baldwin.cx>
References: <20140921213742.GA46868@stack.nl> <1531724.MPBlj40xOW@ralph.baldwin.cx>
On Mon, Sep 22, 2014 at 03:53:13PM -0400, John Baldwin wrote:
> On Sunday, September 21, 2014 11:37:42 PM Jilles Tjoelker wrote:
> > It has been reported that POSIX semaphores are slow, in contexts such as
> > Python. Note that POSIX semaphores are the only synchronization objects
> > that support use by different processes in shared memory; this does not
> > work for mutexes and condition variables because they are pointers to
> > the actual data structure.

> > In fact, sem_post() unconditionally performs an umtx system call.

> *sigh*  I was worried that that might be the case.

> > To avoid both lost wakeups and possible writes to a destroyed semaphore,
> > an uncontested sem_post() must check the _has_waiters flag atomically
> > with incrementing _count.

> > The proper way to do this would be to take one bit from _count and
> > use it for the _has_waiters flag; the definition of SEM_VALUE_MAX
> > permits this. However, this would require a new set of umtx
> > semaphore operations and would break the ABI of process-shared
> > semaphores (things may break if an old and a new libc access the same
> > semaphore over shared memory).

> > This diff only affects 32-bit aligned but 64-bit misaligned
> > semaphores on 64-bit systems, and changes _count and _has_waiters
> > atomically using a 64-bit atomic operation. It probably needs a
> > may_alias attribute for correctness, but <sys/cdefs.h> does not have
> > a wrapper for that.

> It wasn't clear on first reading, but you are using aliasing to get
> around the need for new umtx calls by using a 64-bit atomic op to
> adjust two ints at the same time, yes?  Note that since a failing
> semaphore op calls into the kernel for the "hard" case, you might in
> fact be able to change the ABI without breaking process-shared
> semaphores.  That is, suppose you left 'has_waiters' as always true
> and reused the high bit of count for has_waiters.

> Would old binaries always trap into the kernel?  (Not sure they will;
> especially in the case where an old binary creates the semaphore, a
> new binary would have to force has_waiters to true in every sem op,
> but even that might not be enough.)

I think that everything will break when binaries linked to old and new
libcs use the same semaphore. If the new contested bit is set, the old
sem_getvalue() will return garbage, the old sem_trywait() will fail even
if the real count is greater than 0, the old sem_wait() and
sem_timedwait() may spin if the real count is greater than 0, and the
old sem_post() will fail with [EOVERFLOW].

That the "hard" path always issues a system call does not help much,
since the system calls do not write to _count (this is a throughput
optimization, allowing a fast-path thread through while a slow-path
thread is entering or leaving the kernel).

> > Some x86 CPUs may cope with misaligned atomic ops without destroying
> > performance (especially if they do not cross a cache line), so the
> > alignment restriction could be relaxed to make the patch more
> > practical. Many CPUs in the i386 architecture have a 64-bit atomic op
> > (cmpxchg8b) which could be used here.

> > This appears to restore performance of 10-stable uncontested
> > semaphores with the strange alignment to 9-stable levels (a tight
> > loop with sem_wait and sem_post). I have not tested it in any real
> > workload.

> > Index: lib/libc/gen/sem_new.c
> > ===================================================================
> > --- lib/libc/gen/sem_new.c	(revision 269952)
> > +++ lib/libc/gen/sem_new.c	(working copy)
> > @@ -437,6 +437,32 @@ _sem_post(sem_t *sem)
> >  	if (sem_check_validity(sem) != 0)
> >  		return (-1);
> >  
> > +#ifdef __LP64__
> > +	if (((uintptr_t)&sem->_kern._count & 7) == 0) {
> > +		uint64_t oldval, newval;
> > +
> > +		while (!sem->_kern._has_waiters) {
> > +			count = sem->_kern._count;
> > +			if (count + 1 > SEM_VALUE_MAX)
> > +				return (EOVERFLOW);
> > +			/*
> > +			 * Expect _count == count and _has_waiters == 0.
> > +			 */
> > +#if BYTE_ORDER == LITTLE_ENDIAN
> > +			oldval = (uint64_t)count << 32;
> > +			newval = (uint64_t)(count + 1) << 32;
> > +#elif BYTE_ORDER == BIG_ENDIAN
> > +			oldval = (uint64_t)count;
> > +			newval = (uint64_t)(count + 1);
> > +#else
> > +#error Unknown byte order
> > +#endif
> > +			if (atomic_cmpset_rel_64(
> > +			    (uint64_t *)&sem->_kern._count,
> > +			    oldval, newval))
> > +				return (0);
> > +		}
> > +	}
> > +#endif

> I think this looks ok, but it's a shame we can't fix it more cleanly.

I don't like at all that you have to either force an exotic alignment or
use unaligned atomic ops.

I saw another interesting sem_ implementation in musl libc. It also uses
a separate waiters word (but an int instead of a boolean), and is based
on the futex calls (which we already have as wait_uint/wake umtx calls).
It looks rather complicated, but the musl libc people are usually good
at getting this kind of stuff right. I suppose this will also break
subtly when interoperating with an old libc, and adding new umtx ops
would simplify the algorithm.

Consideration: just declare mixing process-shared semaphores across
sufficiently different libcs unsupported, and change SEM_MAGIC to
enforce that? (This does not prevent running old binaries, as long as
they are dynamically linked to libc and you use a new libc.so.)

Consideration 2: use the new implementation only for process-private
semaphores and keep using the old slow one for process-shared
semaphores?

-- 
Jilles Tjoelker