Date:      Tue, 23 Oct 2012 11:35:03 +0200
From:      Andre Oppermann <andre@freebsd.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        src-committers@freebsd.org, svn-src-user@freebsd.org
Subject:   Re: svn commit: r241889 - in user/andre/tcp_workqueue/sys: arm/arm cddl/compat/opensolaris/kern cddl/contrib/opensolaris/uts/common/dtrace cddl/contrib/opensolaris/uts/common/fs/zfs ddb dev/acpica dev/...
Message-ID:  <508664C7.3080206@freebsd.org>
In-Reply-To: <20121023032743.K2804@besplex.bde.org>
References:  <201210221418.q9MEINkr026751@svn.freebsd.org> <20121023032743.K2804@besplex.bde.org>

On 22.10.2012 19:10, Bruce Evans wrote:
> On Mon, 22 Oct 2012, Andre Oppermann wrote:
>
>> Log:
>>  Second pass at making global mutexes use MTX_DEF_SYSINIT().
>
>> Modified: user/andre/tcp_workqueue/sys/arm/arm/busdma_machdep.c
>> ==============================================================================
>> --- user/andre/tcp_workqueue/sys/arm/arm/busdma_machdep.c    Mon Oct 22 14:10:17 2012    (r241888)
>> +++ user/andre/tcp_workqueue/sys/arm/arm/busdma_machdep.c    Mon Oct 22 14:18:22 2012    (r241889)
>> @@ -163,9 +163,7 @@ static TAILQ_HEAD(,bus_dmamap) dmamap_fr
>> #define BUSDMA_STATIC_MAPS    500
>> static struct bus_dmamap map_pool[BUSDMA_STATIC_MAPS];
>>
>> -static struct mtx busdma_mtx;
>> -
>> -MTX_SYSINIT(busdma_mtx, &busdma_mtx, "busdma lock", MTX_DEF);
>> +static MTX_DEF_SYSINIT(busdma_mtx, "busdma lock", MTX_DEF);
>
> The verboser (sic) name is uglier.
>
> But once it says DEF, the arg list shouldn't say DEF.  Either this is
> DEF for something unrelated to MTX_DEF, in which case it is confusing
> and the confusion is made worse by having MTX_ precede DEF in both,
> or it is the same DEF and the arg is redundant, or it implies MTX_DEF
> by default but this can be changed, in which case it is confusing.

You're right, it is a bit confusing.  However, the DEF in MTX_DEF_SYSINIT
stands for DEFining the global variable, whereas the DEF in MTX_DEF
stands for a default-type mutex.

Unfortunately I couldn't come up with a better short name for
MTX_DEF_SYSINIT.  I'm open to suggestions though.
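
Roughly, the idea behind the macro is the following (a sketch of the
concept only, not the exact text in the branch):

	/*
	 * DEFine the storage for the mutex, aligned to a cache line,
	 * and register the usual SYSINIT to initialize it at boot.
	 * The opts argument still selects the mutex type, e.g.
	 * MTX_DEF or MTX_SPIN.
	 */
	#define	MTX_DEF_SYSINIT(name, desc, opts)			\
		struct mtx name __aligned(CACHE_LINE_SIZE);		\
		MTX_SYSINIT(name, &name, desc, opts)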

> The first seems to apply, since some calls use MTX_SPIN:
>
>> ...
>> Modified: user/andre/tcp_workqueue/sys/dev/ath/ah_osdep.c
>> ==============================================================================
>> --- user/andre/tcp_workqueue/sys/dev/ath/ah_osdep.c    Mon Oct 22 14:10:17 2012    (r241888)
>> +++ user/andre/tcp_workqueue/sys/dev/ath/ah_osdep.c    Mon Oct 22 14:18:22 2012    (r241889)
>> @@ -70,9 +70,7 @@
>>  * XXX This is a global lock for now; it should be pushed to
>>  * a per-device lock in some platform-independent fashion.
>>  */
>> -struct mtx ah_regser_mtx;
>> -MTX_SYSINIT(ah_regser, &ah_regser_mtx, "Atheros register access mutex",
>> -    MTX_SPIN);
>> +MTX_DEF_SYSINIT(ah_regser_mtx, "Atheros register access mutex", MTX_SPIN);
>
> So this is an MTX_DEF that gives MTX_SPIN but not MTX_DEF?!
>
> Always aligning to the cache line size can't be quite right.  It over-uses
> addresses that are zero modulo the line size, especially if the alignment
> is used for objects smaller than the line size.  It doesn't give
> determinism, since cache collisions are still possible due to shortages
> of tags, especially with lots of objects at the same alignment, and
> especially with caches of limited associativity.  And without a special
> section for the global aligned objects, you might be unlucky and have
> them all have offset 0 modulo PAGE_SIZE or similar, and thus have more
> than the average number of tag collisions.

The alignment is to a cache line.  It doesn't over-use addresses that
are zero modulo the line size, because for every address within a
cache line the lower bits are truncated anyway.  If the alignment were
to page size I'd agree on the collision part, but it's not.

Getting unlucky with two mutexes colliding through limited
associativity is much less likely than having two mutexes on the same
cache line.  And should this rare event happen, associativity keeps
improving with every generation of high-powered CPUs.  That is where
it matters, because of the high rates of items processed and thus
locks obtained.
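
To illustrate the truncation (a sketch with a hypothetical geometry of
64-byte lines and 64 sets; real caches vary):

	#include <sys/types.h>

	#define	LINE_SIZE	64	/* hypothetical cache line size */
	#define	NSETS		64	/* hypothetical number of sets */

	/*
	 * All bytes within one line map to the same set, so line
	 * alignment only fixes which line an object occupies.  Two
	 * line-aligned objects contend for the same set only when
	 * their addresses are congruent modulo LINE_SIZE * NSETS.
	 */
	static inline unsigned
	cache_set(uintptr_t addr)
	{
		return ((addr / LINE_SIZE) % NSETS);
	}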

> In a test program in userland, I noticed many very large performance
> differences from alignment, and sometimes even larger performance
> losses from trying too hard to line things up.  The program is a
> relatively simple floating point benchmark.  The problems are largest
> in long double precision.  Long doubles take 16 bytes each, and are
> rather excessively aligned by default, and gcc uses far too much stack
> space for them.  Despite the stack being 16-byte aligned, modifying
> the environment using something like CFLAGS=-O would change the alignment
> mod 64 or something significantly enough to give performance differences
> of up to 50%.  The following didn't quite work to give determinism:
> - -mpreferred-stack-boundary=6 or more.  This gives 64-byte or more
>    stack alignment, except it is not supported by clang (clang generally
>    handles stack alignment better, but you can't control it).  This also
>    takes much more stack, and sometimes makes the problem worse.  But it
>    does make the problem harder to hit by modifying the environment.
>    Now larger environment changes are normally needed to move the stack
>    to the next 64-byte boundary.
> - align the stack initially to PAGE_SIZE, then try all adjustments of
>    it that are multiples of 8 mod PAGE_SIZE.  Don't use
>    -mpreferred-stack-boundary now.  The best multiples seemed to be
>    44 * 8 for amd64 and 43 * 8 for i386.  (This is on core2 (ref*).).
>    This works better than the above, and is fairly deterministic.
>    But there are still some subtests that run much slower than they
>    should and run at full speed with a different multiple.  Their
>    stack use is not significantly different, so I suspect some collision
>    with aligned addresses in the data or text sections.  The tests are
>    very repetitive, so any cache collisions would probably give the
>    20-50% slowdowns if they repeated.

These tests are neither relevant nor correct for the issue at hand.

> Some caching schemes use random replacement since nothing is worse than
> accidentally excessive alignment that causes a cache collision
> repeatedly.

The reason for mutex cache line alignment is totally different.

Whenever a mutex is acquired, a value is written to its memory
location with appropriate memory bus locking semantics.  Depending on
the architecture this is handled in the cache coherency hardware, or
it hits main memory and invalidates all other CPUs' caches.  Both are
very expensive operations.  When a lock is successfully obtained, its
cache line becomes invalid on all other CPUs.  The same happens when
it is released.  A lock that is frequently obtained on multiple CPUs
causes a great deal of cache line invalidation and bouncing, coupled
with expensive bus-locked instructions.  With a spin lock it gets even
worse, as all blocked CPUs constantly try to write to the memory
location.
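
On amd64, for example, the uncontested acquire boils down to a single
bus-locked compare-and-swap on the lock word.  A simplified sketch of
the fast path (the real code is in sys/mutex.h and kern/kern_mutex.c):

	/*
	 * Atomically swing mtx_lock from MTX_UNOWNED to the owning
	 * thread pointer.  The locked cmpxchg this compiles to
	 * invalidates the line in every other CPU's cache.
	 */
	if (!atomic_cmpset_acq_ptr(&m->mtx_lock, MTX_UNOWNED,
	    (uintptr_t)curthread))
		_mtx_lock_sleep(m, (uintptr_t)curthread, opts,
		    file, line);	/* contested slow path */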

The worst case is two or even more distinct locks sharing the same
cache line.  Then every operation on one lock evicts the other from
the cache as well.  Add in some contention for the other lock and
performance goes down the drain.

This false sharing of cache lines by multiple locks is what the cache
line alignment is meant to prevent.
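
For example (hypothetical lock names; struct mtx is 32 bytes on amd64
and lines are 64 bytes):

	/*
	 * Without alignment the linker will likely place these two
	 * unrelated locks back to back on one 64-byte line, so every
	 * acquire/release of one also evicts the other on all other
	 * CPUs.
	 */
	struct mtx	foo_mtx;	/* shares a line with bar_mtx */
	struct mtx	bar_mtx;

	/* With cache line alignment each lock owns a line of its own: */
	struct mtx	baz_mtx __aligned(CACHE_LINE_SIZE);
	struct mtx	qux_mtx __aligned(CACHE_LINE_SIZE);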

-- 
Andre



