Date:      Tue, 23 Oct 2012 04:10:31 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Andre Oppermann <andre@freebsd.org>
Cc:        src-committers@freebsd.org, svn-src-user@freebsd.org
Subject:   Re: svn commit: r241889 - in user/andre/tcp_workqueue/sys: arm/arm cddl/compat/opensolaris/kern cddl/contrib/opensolaris/uts/common/dtrace cddl/contrib/opensolaris/uts/common/fs/zfs ddb dev/acpica dev/...
Message-ID:  <20121023032743.K2804@besplex.bde.org>
In-Reply-To: <201210221418.q9MEINkr026751@svn.freebsd.org>
References:  <201210221418.q9MEINkr026751@svn.freebsd.org>

On Mon, 22 Oct 2012, Andre Oppermann wrote:

> Log:
>  Second pass at making global mutexes use MTX_DEF_SYSINIT().

> Modified: user/andre/tcp_workqueue/sys/arm/arm/busdma_machdep.c
> ==============================================================================
> --- user/andre/tcp_workqueue/sys/arm/arm/busdma_machdep.c	Mon Oct 22 14:10:17 2012	(r241888)
> +++ user/andre/tcp_workqueue/sys/arm/arm/busdma_machdep.c	Mon Oct 22 14:18:22 2012	(r241889)
> @@ -163,9 +163,7 @@ static TAILQ_HEAD(,bus_dmamap) dmamap_fr
> #define BUSDMA_STATIC_MAPS	500
> static struct bus_dmamap map_pool[BUSDMA_STATIC_MAPS];
>
> -static struct mtx busdma_mtx;
> -
> -MTX_SYSINIT(busdma_mtx, &busdma_mtx, "busdma lock", MTX_DEF);
> +static MTX_DEF_SYSINIT(busdma_mtx, "busdma lock", MTX_DEF);

The verboser (sic) name is uglier.

But once the macro name says DEF, the arg list shouldn't also say DEF.
Either this DEF is unrelated to MTX_DEF, in which case it is confusing
and the confusion is made worse by having MTX_ precede DEF in both;
or it is the same DEF and the arg is redundant; or it makes MTX_DEF
the default but overridable, in which case it is also confusing.
The first seems to apply, since some calls use MTX_SPIN:

> ...
> Modified: user/andre/tcp_workqueue/sys/dev/ath/ah_osdep.c
> ==============================================================================
> --- user/andre/tcp_workqueue/sys/dev/ath/ah_osdep.c	Mon Oct 22 14:10:17 2012	(r241888)
> +++ user/andre/tcp_workqueue/sys/dev/ath/ah_osdep.c	Mon Oct 22 14:18:22 2012	(r241889)
> @@ -70,9 +70,7 @@
>  * XXX This is a global lock for now; it should be pushed to
>  * a per-device lock in some platform-independent fashion.
>  */
> -struct mtx ah_regser_mtx;
> -MTX_SYSINIT(ah_regser, &ah_regser_mtx, "Atheros register access mutex",
> -    MTX_SPIN);
> +MTX_DEF_SYSINIT(ah_regser_mtx, "Atheros register access mutex", MTX_SPIN);

So this is an MTX_DEF that gives MTX_SPIN but not MTX_DEF?!

Always aligning to the cache line size can't be quite right.  It over-uses
addresses that are zero modulo the line size, especially if the alignment
is used for objects smaller than the line size.  It doesn't give
determinism, since cache collisions are still possible due to shortages
of tags, especially with lots of objects at the same alignment, and
especially with caches of limited associativity.  And without a special
section for the global aligned objects, you might be unlucky and have
them all at offset 0 modulo PAGE_SIZE or similar, and thus have more
than the average number of tag collisions.

In a test program in userland, I noticed many very large performance
differences from alignment, and sometimes even larger performance
losses from trying too hard to line things up.  The program is a
relatively simple floating point benchmark.  The problems are largest
in long double precision.  Long doubles take 16 bytes each, are
rather excessively aligned by default, and gcc uses far too much stack
space for them.  Despite the stack being 16-byte aligned, modifying
the environment using something like CFLAGS=-O would change the alignment
mod 64 or so, significantly enough to give performance differences
of up to 50%.  The following didn't quite work to give determinism:
- -mpreferred-stack-boundary=6 or more.  This gives 64-byte or more
   stack alignment, except it is not supported by clang (clang generally
   handles stack alignment better, but you can't control it).  This also
   takes much more stack, and sometimes makes the problem worse.  But it
   does make the problem harder to hit by modifying the environment.
   Now larger environment changes are normally needed to move the stack
   to the next 64-byte boundary.
- align the stack initially to PAGE_SIZE, then try all adjustments
   that are multiples of 8 mod PAGE_SIZE.  Don't use
   -mpreferred-stack-boundary now.  The best multiples seemed to be
   44 * 8 for amd64 and 43 * 8 for i386.  (This is on core2 (ref*).)
   This works better than the above, and is fairly deterministic.
   But there are still some subtests that run much slower than they
   should and run at full speed with a different multiple.  Their
   stack use is not significantly different, so I suspect some collision
   with aligned addresses in the data or text sections.  The tests are
   very repetitive, so any cache collisions would probably give the
   20-50% slowdowns when repeated.

Some caching schemes use random replacement, since nothing is worse than
accidental excessive alignment that causes a cache collision
repeatedly.

Bruce


