From owner-freebsd-hackers@FreeBSD.ORG Sat Mar 15 22:59:37 2014 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 25F7021B; Sat, 15 Mar 2014 22:59:37 +0000 (UTC) Received: from mx1.stack.nl (relay04.stack.nl [IPv6:2001:610:1108:5010::107]) (using TLSv1 with cipher ADH-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id D9E4CE4B; Sat, 15 Mar 2014 22:59:36 +0000 (UTC) Received: from snail.stack.nl (snail.stack.nl [IPv6:2001:610:1108:5010::131]) by mx1.stack.nl (Postfix) with ESMTP id A0A86B808E; Sat, 15 Mar 2014 23:59:34 +0100 (CET) Received: by snail.stack.nl (Postfix, from userid 1677) id 849B228497; Sat, 15 Mar 2014 23:59:34 +0100 (CET) Date: Sat, 15 Mar 2014 23:59:34 +0100 From: Jilles Tjoelker To: "Meyer, Conrad" Subject: Re: [PATCH] amd64/pcpu.h: Use Clang builtins for clarity when referencing thread's pcpu Message-ID: <20140315225934.GA22964@stack.nl> References: <1394821826-19412-1-git-send-email-conrad.meyer@isilon.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1394821826-19412-1-git-send-email-conrad.meyer@isilon.com> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: "freebsd-hackers@freebsd.org" , Bryan Drewery X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 15 Mar 2014 22:59:37 -0000 On Fri, Mar 14, 2014 at 06:31:08PM +0000, Meyer, Conrad wrote: > We can efficiently reference thread-local pcpu members via the %gs > register with Clang-annotated C code, in place of inline GNU assembly. > Motivations: > - Use C in leiu of inline assembly for clarity > - Clang's static analyser may be better able to understand PCPU_* > macros using the C constructs rather than inline assembly > (unverified) > Sponsored by: EMC/Isilon storage division > Signed-off-by: Conrad Meyer > Reviewed-by: Max Laier > --- > This is more of a "what do you think?" than a pull request. It seems > like using annotated C instead of asm is nice (in particular, Clang > detects casts from pointers typed with one segment to another, or > unsegmented type). On the other hand, this is code that doesn't change > frequently, and we may still need to support GCC for some time. So > adding a second, parallel implementation just doubles room for bugs. > Open questions: > - How long is GCC intended to be supported as a compiler? > - How atomic does PCPU_INC() need to be? It looks like it updates > cpu-local counters; as long as it's a single asm instruction, should > it be fine w.r.t. interrupts? The existing implementation does NOT > use the 'lock; ' prefix. > [snip] > +/* > + * Adds the value to the per-cpu counter name. The implementation > + * must be atomic with respect to interrupts. > + */ > +#define __PCPU_ADD(name, val) do { \ > + __pcpu_type(name) __val; \ > + volatile __pcpu_type(name) __GS_RELATIVE *__ptr; \ > + \ > + __val = (val); \ > + __ptr = __PCPU_PTR(name); \ > + *__ptr += __val; \ > +} while (0) > + > +/* > + * Increments the value of the per-cpu counter name. The implementation > + * must be atomic with respect to interrupts. > + */ > +#define __PCPU_INC(name) __PCPU_ADD(name, 1) This code does not force the compiler to generate the required read-modify-write instruction. The compiler may generate a load, add and store, which may corrupt the counter if the thread is preempted between the load and the store. With the %gs: prefix, a plain read-modify-write instruction is indeed sufficient. It is not possible to preempt the instruction between determining the linear address of the counter and incrementing the counter, or between reading and writing the value; therefore, each core only touches its own counter and a lock prefix is not required. (This is basically the same reason why the lock prefixes are #ifdef'ed out in a uniprocessor kernel.) If a compiler wanted to support this from C, I suggest extending the memory_order type with one or more values for synchronization within a core or within a thread which generate read-modify-write instructions without lock prefix on x86. -- Jilles Tjoelker