From owner-svn-src-user@FreeBSD.ORG Mon Oct 22 17:10:38 2012
Date: Tue, 23 Oct 2012 04:10:31 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Andre Oppermann
Cc: src-committers@freebsd.org, svn-src-user@freebsd.org
Subject: Re: svn commit: r241889 - in user/andre/tcp_workqueue/sys: arm/arm
 cddl/compat/opensolaris/kern cddl/contrib/opensolaris/uts/common/dtrace
 cddl/contrib/opensolaris/uts/common/fs/zfs ddb dev/acpica dev/...
Message-ID: <20121023032743.K2804@besplex.bde.org>
In-Reply-To: <201210221418.q9MEINkr026751@svn.freebsd.org>
References: <201210221418.q9MEINkr026751@svn.freebsd.org>

On Mon, 22 Oct 2012, Andre Oppermann wrote:

> Log:
>   Second pass at making global mutexes use MTX_DEF_SYSINIT().
>
> Modified: user/andre/tcp_workqueue/sys/arm/arm/busdma_machdep.c
> ==============================================================================
> --- user/andre/tcp_workqueue/sys/arm/arm/busdma_machdep.c	Mon Oct 22 14:10:17 2012	(r241888)
> +++ user/andre/tcp_workqueue/sys/arm/arm/busdma_machdep.c	Mon Oct 22 14:18:22 2012	(r241889)
> @@ -163,9 +163,7 @@ static TAILQ_HEAD(,bus_dmamap) dmamap_fr
>  #define BUSDMA_STATIC_MAPS	500
>  static struct bus_dmamap map_pool[BUSDMA_STATIC_MAPS];
>
> -static struct mtx busdma_mtx;
> -
> -MTX_SYSINIT(busdma_mtx, &busdma_mtx, "busdma lock", MTX_DEF);
> +static MTX_DEF_SYSINIT(busdma_mtx, "busdma lock", MTX_DEF);

The more verbose name is uglier.  But once the name says DEF, the arg
list shouldn't say DEF.  Either this DEF means something unrelated to
MTX_DEF, in which case it is confusing, and the confusion is made worse
by having MTX_ precede DEF in both; or it is the same DEF and the arg
is redundant; or it implies MTX_DEF by default but can be overridden,
in which case it is also confusing.
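For concreteness, here is a sketch of the wrapper that the new call
sites imply, reconstructed from usage alone (the branch's actual
definition may differ; MTX_SYSINIT() is the stock macro from
<sys/mutex.h>):

	/*
	 * Guessed reconstruction, not the committed definition.  Here
	 * DEF apparently means "define the storage too", which is
	 * exactly what collides with the MTX_DEF lock-class flag.
	 */
	#define	MTX_DEF_SYSINIT(name, desc, opts)		\
		struct mtx name;				\
		MTX_SYSINIT(name, &name, desc, opts)

	/* Usage as in the diff; 'static' attaches to the variable: */
	static MTX_DEF_SYSINIT(busdma_mtx, "busdma lock", MTX_DEF);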
The first interpretation seems to apply, since some calls use MTX_SPIN:

> ...
>
> Modified: user/andre/tcp_workqueue/sys/dev/ath/ah_osdep.c
> ==============================================================================
> --- user/andre/tcp_workqueue/sys/dev/ath/ah_osdep.c	Mon Oct 22 14:10:17 2012	(r241888)
> +++ user/andre/tcp_workqueue/sys/dev/ath/ah_osdep.c	Mon Oct 22 14:18:22 2012	(r241889)
> @@ -70,9 +70,7 @@
>   * XXX This is a global lock for now; it should be pushed to
>   * a per-device lock in some platform-independent fashion.
>   */
> -struct mtx ah_regser_mtx;
> -MTX_SYSINIT(ah_regser, &ah_regser_mtx, "Atheros register access mutex",
> -    MTX_SPIN);
> +MTX_DEF_SYSINIT(ah_regser_mtx, "Atheros register access mutex", MTX_SPIN);

So this is an MTX_DEF that gives MTX_SPIN but not MTX_DEF?!

Always aligning to the cache line size can't be quite right.  It
over-uses addresses that are zero modulo the line size, especially if
the alignment is used for objects smaller than the line size.  It
doesn't give determinism, since cache collisions are still possible due
to shortages of tags, especially with lots of objects at the same
alignment, and especially with caches of limited associativity.  And
without a special section for the aligned global objects, you might be
unlucky and have them all at offset 0 modulo PAGE_SIZE or similar, and
thus have more than the average number of tag collisions.  (The PS
below makes this arithmetic concrete.)

In a test program in userland, I noticed many very large performance
differences from alignment, and sometimes even larger performance
losses from trying too hard to line things up.  The program is a
relatively simple floating point benchmark.  The problems are largest
in long double precision.  Long doubles take 16 bytes each and are
rather excessively aligned by default, and gcc uses far too much stack
space for them.  Despite the stack being 16-byte aligned, modifying the
environment using something like CFLAGS=-O would change the alignment
mod 64 or so by enough to give performance differences of up to 50%.

The following didn't quite work to give determinism:

- -mpreferred-stack-boundary=6 or more.  This gives 64-byte or more
  stack alignment, except it is not supported by clang (clang generally
  handles stack alignment better, but you can't control it).  This also
  takes much more stack, and sometimes makes the problem worse.  But it
  does make the problem harder to hit by modifying the environment:
  larger environment changes are now normally needed to move the stack
  to the next 64-byte boundary.

- Align the stack initially to PAGE_SIZE, then try all adjustments that
  are multiples of 8 mod PAGE_SIZE.  Don't use
  -mpreferred-stack-boundary now.  The best multiples seemed to be
  44 * 8 for amd64 and 43 * 8 for i386.  (This is on core2 (ref*).)
  This works better than the above, and is fairly deterministic.  But
  there are still some subtests that run much slower than they should
  and run at full speed with a different multiple.  Their stack use is
  not significantly different, so I suspect some collision with aligned
  addresses in the data or text sections.  (The PPS below sketches the
  mechanism.)

The tests are very repetitive, so any cache collisions would probably
give the 20-50% slowdowns when repeated.  Some caching schemes use
random replacement, since nothing is worse than accidentally excessive
alignment that causes a cache collision repeatedly.

Bruce
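PS: to make the set-collision arithmetic concrete, here is a toy
userland program.  The cache parameters are assumptions for a
core2-class L1 data cache (32K, 8-way, 64-byte lines, so 64 sets),
not measurements from the benchmark:

	#include <stdio.h>

	#define	LINE_SIZE	64
	#define	N_WAYS		8
	#define	N_SETS		64	/* 32768 / (LINE_SIZE * N_WAYS) */
	#define	PAGE_SIZE	4096

	static unsigned
	set_of(unsigned long addr)
	{

		return ((addr / LINE_SIZE) % N_SETS);
	}

	int
	main(void)
	{
		unsigned long addr;
		int n;

		/*
		 * Nine hot objects, each at offset 0 of its page.
		 * N_SETS * LINE_SIZE == PAGE_SIZE here, so they all
		 * map to set 0; with only 8 ways, at least one is
		 * evicted on every pass over the working set.
		 */
		n = 0;
		for (addr = 0; addr < 9 * PAGE_SIZE; addr += PAGE_SIZE)
			if (set_of(addr) == 0)
				n++;
		printf("%d objects compete for set 0 (capacity %d lines)\n",
		    n, N_WAYS);
		return (0);
	}

Sets repeat every N_SETS * LINE_SIZE = 4096 bytes, so page alignment is
the worst case: the extra tag collisions start as soon as there are
more same-offset objects than ways.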
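PPS: a sketch of the stack-offset experiment from the second item
above.  This is my reconstruction of the method, not the actual
benchmark driver; run_benchmark() is a stand-in for the real subtests:

	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>	/* alloca() on FreeBSD */
	#include <string.h>

	#define	PAGE_SIZE	4096

	/* Stand-in for a real floating point subtest. */
	static double
	run_benchmark(void)
	{
		volatile double x = 1.0;
		int i;

		for (i = 0; i < 1000000; i++)
			x *= 1.0000001;
		return (x);
	}

	static double
	run_at_offset(unsigned off)
	{
		char *p;

		/*
		 * Move the stack pointer down to (approximately) a
		 * page boundary, then 'off' bytes further, so the
		 * callee's frame starts at a controlled offset mod
		 * PAGE_SIZE.  Compilers may re-align the frame, so
		 * this is approximate; as noted above,
		 * -mpreferred-stack-boundary fights with it.
		 */
		p = alloca(((uintptr_t)&p % PAGE_SIZE) + off);
		memset(p, 0, 1);   /* keep alloca() from being elided */
		return (run_benchmark());
	}

	int
	main(void)
	{
		unsigned off;

		/* Try every adjustment that is a multiple of 8. */
		for (off = 0; off < PAGE_SIZE; off += 8)
			printf("offset %4u: %g\n", off,
			    run_at_offset(off));
		return (0);
	}

Timing each offset and looking for the fast ones is what produced the
44 * 8 and 43 * 8 multiples mentioned above.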