Date:      Tue, 23 Sep 2014 14:42:27 -0700
From:      Alfred Perlstein <bright@mu.org>
To:        Jilles Tjoelker <jilles@stack.nl>, John Baldwin <jhb@freebsd.org>
Cc:        adrian@freebsd.org, freebsd-threads@freebsd.org
Subject:   Re: sem_post() performance
Message-ID:  <5421E943.3010107@mu.org>
In-Reply-To: <20140923212000.GA78110@stack.nl>
References:  <20140921213742.GA46868@stack.nl> <1531724.MPBlj40xOW@ralph.baldwin.cx> <20140923212000.GA78110@stack.nl>


On 9/23/14 2:20 PM, Jilles Tjoelker wrote:
> On Mon, Sep 22, 2014 at 03:53:13PM -0400, John Baldwin wrote:
>> On Sunday, September 21, 2014 11:37:42 PM Jilles Tjoelker wrote:
>>> It has been reported that POSIX semaphores are slow, in contexts such as
>>> Python. Note that POSIX semaphores are the only synchronization objects
>>> that support use by different processes in shared memory; this does not
>>> work for mutexes and condition variables because they are pointers to
>>> the actual data structure.
>>> In fact, sem_post() unconditionally performs an umtx system call.
>> *sigh*  I was worried that that might be the case.
>>> To avoid both lost wakeups and possible writes to a destroyed semaphore,
>>> an uncontested sem_post() must check the _has_waiters flag atomically
>>> with incrementing _count.
>>> The proper way to do this would be to take one bit from _count and
>>> use it for the _has_waiters flag; the definition of SEM_VALUE_MAX
>>> permits this. However, this would require a new set of umtx
>>> semaphore operations and will break ABI of process-shared semaphores
>>> (things may break if an old and a new libc access the same semaphore
>>> over shared memory).
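
To make sure I follow the bit-stealing idea: in C11-atomic terms the
single-word encoding would look something like this (untested sketch,
names made up by me; our atomic(9) ops would stand in for stdatomic):

```c
#include <stdint.h>
#include <stdatomic.h>

/*
 * Hypothetical packed layout: the high bit of _count doubles as the
 * has_waiters flag, the low 31 bits hold the count.  SEM_VALUE_MAX
 * already leaves the top bit free.
 */
#define SEM_WAITERS_BIT 0x80000000u
#define SEM_COUNT_MASK  0x7fffffffu

/*
 * Uncontested post: bump the count only while the waiters bit is
 * clear.  Returns 1 on success, 0 if there are waiters and the caller
 * must enter the kernel to wake them.  (A real implementation would
 * also check against SEM_VALUE_MAX before incrementing.)
 */
static int
sem_post_fast(_Atomic uint32_t *word)
{
	uint32_t old = atomic_load(word);

	while ((old & SEM_WAITERS_BIT) == 0) {
		/* On failure, 'old' is reloaded and the bit rechecked. */
		if (atomic_compare_exchange_weak(word, &old, old + 1))
			return (1);
	}
	return (0);
}
```

One word, one CAS, no aliasing games -- which is exactly why it needs
new umtx ops and breaks the old layout.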
>>> This diff only affects 32-bit aligned but 64-bit misaligned
>>> semaphores on 64-bit systems, and changes _count and _has_waiters
>>> atomically using a 64-bit atomic operation. It probably needs a
>>> may_alias attribute for correctness, but <sys/cdefs.h> does not have
>>> a wrapper for that.
>> It wasn't clear on first reading, but you are using aliasing to get
>> around the need for new umtx calls by using a 64-bit atomic op to
>> adjust two ints at the same time, yes?  Note that since a failing
>> semaphore op calls into the kernel for the "hard" case, you might in
>> fact be able to change the ABI without breaking process-shared
>> semaphores.  That is, suppose you left 'has_waiters' as always true
>> and reused the high bit of count for has_waiters.
>> Would old binaries always trap into the kernel?  (Not sure they will;
>> in particular, if an old binary creates the semaphore, a new binary
>> would have to force has_waiters to true in every sem op, and even
>> that might not be enough.)
> I think that everything will break when binaries linked to old and new
> libcs use the same semaphore. If the new contested bit is set, the old
> sem_getvalue() will return garbage, the old sem_trywait() will fail even
> if the real count is greater than 0, the old sem_wait() and
> sem_timedwait() may spin if the real count is greater than 0 and the old
> sem_post() will fail with [EOVERFLOW].
>
> That the "hard" path always issues a system call does not help much,
> since the system calls do not write to _count (this is a throughput
> optimization, allowing a fast-path thread through while a slow-path
> thread is entering or leaving the kernel).
>
>>> Some x86 CPUs may cope with misaligned atomic ops without destroying
>>> performance (especially if they do not cross a cache line), so the
>>> alignment restriction could be relaxed to make the patch more practical.
>>> Many CPUs in the i386 architecture have a 64-bit atomic op (cmpxchg8b)
>>> which could be used here.
>>> This appears to restore performance of 10-stable uncontested semaphores
>>> with the strange alignment to 9-stable levels (a tight loop with
>>> sem_wait and sem_post). I have not tested in any real workload.
>>> Index: lib/libc/gen/sem_new.c
>>> ===================================================================
>>> --- lib/libc/gen/sem_new.c	(revision 269952)
>>> +++ lib/libc/gen/sem_new.c	(working copy)
>>> @@ -437,6 +437,32 @@ _sem_post(sem_t *sem)
>>>   	if (sem_check_validity(sem) != 0)
>>>   		return (-1);
>>>
>>> +#ifdef __LP64__
>>> +	if (((uintptr_t)&sem->_kern._count & 7) == 0) {
>>> +		uint64_t oldval, newval;
>>> +
>>> +		while (!sem->_kern._has_waiters) {
>>> +			count = sem->_kern._count;
>>> +			if (count + 1 > SEM_VALUE_MAX)
>>> +				return (EOVERFLOW);
>>> +			/*
>>> +			 * Expect _count == count and _has_waiters == 0.
>>> +			 */
>>> +#if BYTE_ORDER == LITTLE_ENDIAN
>>> +			oldval = (uint64_t)count << 32;
>>> +			newval = (uint64_t)(count + 1) << 32;
>>> +#elif BYTE_ORDER == BIG_ENDIAN
>>> +			oldval = (uint64_t)count;
>>> +			newval = (uint64_t)(count + 1);
>>> +#else
>>> +#error Unknown byte order
>>> +#endif
>>> +			if (atomic_cmpset_rel_64((uint64_t *)&sem->_kern._count,
>>> +			    oldval, newval))
>>> +				return (0);
>>> +		}
>>> +	}
>>> +#endif
>> I think this looks ok, but it's a shame we can't fix it more cleanly.
> I don't like at all that you have to either force an exotic alignment or
> use unaligned atomic ops.
>
> I saw another interesting sem_ implementation in musl libc. It also uses
> a separate waiters word (but an int instead of a boolean), and is based
> on the futex calls (which we already have as wait_uint/wake umtx calls).
> It looks rather complicated but the musl libc people are usually good at
> getting this kind of stuff right. I suppose this will also break subtly
> when interoperating with old libc, and adding new umtx ops will simplify
> the algorithm.
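
If I understand the musl-style scheme right, the shape is roughly this
(very rough sketch, my own names; the umtx calls are stubbed out here,
and the real thing is considerably more careful about the waiters
count and cancellation):

```c
#include <stdatomic.h>

/*
 * Stand-ins for UMTX_OP_WAIT_UINT / UMTX_OP_WAKE; a real build would
 * issue _umtx_op(2) here.  Stubbed as no-ops so the shape is visible.
 */
static void umtx_wait(_Atomic int *addr, int expected) { (void)addr; (void)expected; }
static void umtx_wake(_Atomic int *addr, int n) { (void)addr; (void)n; }

struct xsem {
	_Atomic int count;	/* semaphore value, always >= 0 */
	_Atomic int waiters;	/* threads blocked (or about to block) */
};

/* Try to take one unit without blocking; 1 on success, 0 otherwise. */
static int
xsem_trydec(struct xsem *s)
{
	int c = atomic_load(&s->count);

	while (c > 0) {
		if (atomic_compare_exchange_weak(&s->count, &c, c - 1))
			return (1);
	}
	return (0);
}

static void
xsem_wait(struct xsem *s)
{
	while (!xsem_trydec(s)) {
		atomic_fetch_add(&s->waiters, 1);
		if (atomic_load(&s->count) == 0)
			umtx_wait(&s->count, 0);	/* sleep in kernel */
		atomic_fetch_sub(&s->waiters, 1);
	}
}

static void
xsem_post(struct xsem *s)
{
	atomic_fetch_add(&s->count, 1);
	/* Kernel call only in the contested case. */
	if (atomic_load(&s->waiters) > 0)
		umtx_wake(&s->count, 1);
}
```

The attraction is that the fast paths touch only their own word, so no
64-bit aliasing trick is needed -- but as you say, it interoperates no
better with the old layout.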
>
> Consideration: just declare mixing process-shared semaphores with
> sufficiently different libc unsupported, and change SEM_MAGIC to enforce
> that? (This does not prevent running old binaries, as long as they're
> dynamically linked to libc and you use a new libc.so.)
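
For the record, the flag-day check itself would be trivial on our side;
something like the following (sketch only, both magic values invented
here for illustration):

```c
#include <stdint.h>
#include <errno.h>

/* Hypothetical magics: bump the value so a new libc refuses
 * semaphores initialized by an old one instead of corrupting them. */
#define SEM_MAGIC_V1	0x73656d31	/* old format (made up) */
#define SEM_MAGIC_V2	0x73656d32	/* new format (made up) */

struct sem_hdr {
	uint32_t _magic;
};

static int
sem_check_validity2(struct sem_hdr *s)
{
	if (s->_magic == SEM_MAGIC_V2)
		return (0);
	/* Old-format or garbage semaphore: refuse rather than corrupt. */
	errno = EINVAL;
	return (-1);
}
```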
>
>
> Consideration 2: use the new implementation only for process-private
> semaphores and keep using the old slow one for process-shared
> semaphores?
>
I believe a flag day for new/old semaphores should be OK
(Consideration #1).

Would it be possible to compile world/kernel to keep old compat for 
those that *really* want to mix libraries and whatnot?

Imo, if people really wanted to, we could backpatch old libc and
distribute that as its own package.  Then we only break people derp
enough to use static .a files.

-Alfred


