From: Jung-uk Kim
To: Bruce Evans
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Date: Mon, 20 Jun 2011 19:40:53 -0400
Subject: Re: svn commit: r223211 - head/sys/x86/x86
Message-Id: <201106201940.56104.jkim@FreeBSD.org>
In-Reply-To: <20110618195655.M889@besplex.bde.org>
References: <201106172141.p5HLf6Rx009154@svn.freebsd.org> <20110618195655.M889@besplex.bde.org>

On Saturday 18 June 2011 07:08 am, Bruce Evans wrote:
> On Fri, 17 Jun 2011, Jung-uk Kim wrote:
> > Log:
> >   Teach the compiler how to shift TSC value efficiently.  As noted
> >   in r220631, some times compiler inserts redundant instructions to
> >   preserve unused upper 32 bits even when it is casted to a 32-bit
> >   value.  Unfortunately, it seems the problem becomes more serious
> >   when it is shifted, especially on amd64.
>
> Er, I tried to point out how to optimize this code before (but
> didn't reply to your reply), and it's not by using more asm.
>
> > Modified: head/sys/x86/x86/tsc.c
> > ==============================================================================
> > --- head/sys/x86/x86/tsc.c	Fri Jun 17 21:31:13 2011	(r223210)
> > +++ head/sys/x86/x86/tsc.c	Fri Jun 17 21:41:06 2011	(r223211)
> > @@ -461,7 +461,7 @@ init_TSC_tc(void)
> >  	tsc_timecounter.tc_quality = 1000;
> >
> >  init:
> > -	for (shift = 0; shift < 32 && (tsc_freq >> shift) > max_freq; shift++)
> > +	for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
> >  		;
> >  	if (shift > 0) {
> >  		tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
>
> shift == 32 (or even shift == 31) is unreachable.  A shift of 31
> will shift 2GHz down to 1 Hz, or support physically impossible
> frequencies like 2**33 GHz.  OTOH, shifts of up to 63 are supported
> by the slow gcc code.

It is unreachable, but it is a safety belt.  Please see below.

> > @@ -579,6 +579,9 @@ tsc_get_timecount(struct timecounter *tc
> >  static u_int
> >  tsc_get_timecount_low(struct timecounter *tc)
> >  {
> > +	uint32_t rv;
> >
> > -	return (rdtsc() >> (int)(intptr_t)tc->tc_priv);
> > +	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> > +	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>
> Lexical style bug (indentation of second line of the asm).

I followed the sys/amd64/include/atomic.h style.  OTOH,
sys/amd64/include/cpufunc.h has a slightly different style.  I wasn't
sure which was correct because style(9) does not mention anything
specific to inline assembly. :-(

> > +	return (rv);
> >  }
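For background, what the committed asm computes (a minimal C model,
not code from tsc.c; the helper name is illustrative): rdtsc leaves
the 64-bit counter split across %edx:%eax, and "shrd %cl, %edx, %eax"
shifts %eax right by %cl while filling the vacated high bits from
%edx, so %eax ends up holding the low 32 bits of (tsc >> shift).  The
count is masked to 5 bits for 32-bit operands, so the asm is only
correct for shift counts of 0..31, which the 32 -> 31 loop bound
above guarantees.

	#include <stdint.h>

	/* C model of "rdtsc; shrd %cl, %edx, %eax" for 0 <= shift <= 31. */
	static inline uint32_t
	shrd_model(uint64_t tsc, int shift)
	{

		return ((uint32_t)(tsc >> shift));
	}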
> Just return the shift of the low 32 bits (and change
> tc_counter_mask to match) like I said.  This loses only the
> accidental ability for the timecounter to work for more than a few
> seconds when interrupts are stopped by something like ddb, since
> any shift count that loses too many of the low 32 bits will not
> work for other reasons.  For example, suppose that the TSC
> frequency is 8G-1Hz, which is unavailable except possible in
> research labs.  This must be shifted by 1 to fit in 32 bits.  If we
> use only the low 32 bits, then we end up with only 31 significant
> bits and tsc_get_timecount_low() wraps after ~2 seconds instead of
> after the best possible for this shift of ~4 seconds.  If we shift
> by 7 more, as we do in the SMP case, then if we start with 32 bits
> then we end up with 24 bits, but the wrap still takes 2 seconds; if
> we start with 64 bits then we end up with 32 bits and the wrap
> takes 4*2**7 = 512 seconds.  But wrap times longer than 1/HZ times
> a few are not needed.  2 seconds is already at least 100 or 1000
> times longer than needed, depending on HZ.  The case where the
> unscaled frequency is 4G-1Hz and !SMP gives a shift count of 0 and
> a wrap time of ~4 seconds.  Whatever is done to make that case work
> (say, not allowing a fully tickless kernel with HZ = 0), works
> almost as well up to an unscaled frequency of 8GHz which is still
> far off.
>
> No one will notice these micro-optimizations, but I agree that the
> redundant instructions are ugly.  I get the following on i386 for
> the original version with an old source tree:
>
> % #APP
> % 	rdtsc
> % #NO_APP
> % 	movl	8(%ebp), %ecx
> % 	movl	28(%ecx), %ecx
> % 	shrdl	%edx, %eax
> % 	shrl	%cl, %edx
> % 	testb	$32, %cl
> % 	je	.L3
> % 	movl	%edx, %eax
> % 	xorl	%edx, %edx
> % .L3:
>
> The last 4 instructions are not redundant, but are needed to
> support shifts of up to 63 (maybe 64).

They are not redundant, but I wanted to get rid of the check
entirely.  That's why the 32 -> 31 change was made in the first
place.

FYI, with amd64, we get this from the old code:

	rdtsc
	movzbl	48(%rdi), %ecx
	salq	$32, %rdx
	mov	%eax, %eax
	orq	%rax, %rdx
	shrq	%cl, %rdx
	movl	%edx, %eax
	ret

This is slightly better than the i386 version, i.e., there is no
conditional jump, but it is still ugly.

> I tried masking the shift count with 0x1f so that the shift count
> is known to be < 32, this just gave an extra instruction for the
> masking.

Yup.

> It's even worse with rdtsc() converted to u_int first like I want:
>
> % 	movl	%ebx, (%esp)
> % 	movl	%esi, 4(%esp)
> % #APP
> % 	rdtsc
> % #NO_APP
> % 	movl	%eax, %ebx
> % 	movl	8(%ebp), %eax
> % 	movl	4(%esp), %esi
> % 	movl	28(%eax), %ecx
> % 	movl	%ebx, %eax
> % 	movl	(%esp), %ebx
> % 	# some frame pointer epilogue reordered here
> % 	shrl	%cl, %eax
>
> The second case may be what you already fixed on amd64 (only?) --
> use rdtsc32() instead of (u_int)rdtsc().
>
> I've always thought that the dynamic shift is overengineered, and
> now like it even less.  The following is efficent and works well
> enough in all currently physically possible cases:
>
> % /*
> %  * Don't really need a separate one for `low', but now it costs less
> %  * (1 shift instruction at runtime and some space).  Must change
> %  * tc_counter_mask to match.
> %  */
> % u_int
> % tsc_get_timecount_low(struct timecounter *tc)
> % {
> % #ifdef SMP
> % 	/*
> % 	 * Works up to 1024 GHz, assuming that nontemporalness scales with
> % 	 * freq.  I think 8 is too many.  But now do extra for SMP indep.
> % 	 * of freq.
> % 	 */
> % 	return (((u_int)rdtsc()) >> 8);	/* gens rdtsc; shrl $8,%eax */
> % #else
> % 	/* Works up to 8 GHz. */
> % 	return (((u_int)rdtsc()) >> 1);	/* gens rdtsc; shrl %eax */
> % #endif
> % }
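For concreteness, the "shift only the low 32 bits and change
tc_counter_mask to match" variant, kept in dynamic-shift form, would
look something like the sketch below.  This is not code from the
thread; the function name is illustrative and it assumes tc_priv
still carries the shift count:

	static u_int
	tsc_get_timecount_lo32(struct timecounter *tc)
	{

		/* Only 32 - shift significant bits remain after this. */
		return ((u_int)rdtsc() >> (int)(intptr_t)tc->tc_priv);
	}

The matching change in init_TSC_tc() would be along the lines of
tsc_timecounter.tc_counter_mask = 0xffffffffu >> shift, so that the
timecounter code knows how many bits of the returned value are
significant.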
Dynamic shift may be overengineered, but it is useful.  For example,
r223211 gives us a fairly good optimization on amd64:

	movq	48(%rdi), %rcx
	rdtsc
	shrd	%cl, %edx, %eax
	ret

i386 isn't too impressive, though:

	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
	movl	28(%eax), %ecx
	rdtsc
	shrd	%cl, %edx, %eax
	popl	%ebp
	ret

I just couldn't convince GCC to generate code like this without
inline asm, unfortunately. :-(

Jung-uk Kim
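To reproduce listings like the ones above outside the kernel, a
minimal stand-alone harness along these lines works; the file name
and the struct are stand-ins (so the load offset will differ from
the 48(%rdi) and 28(%eax) seen with the real struct timecounter).
Build it with "gcc -O2 -S tsctest.c" and read the generated .s file:

	/* tsctest.c -- userland stand-in for the committed function. */
	#include <stdint.h>

	struct fake_timecounter {
		void	*tc_priv;	/* holds the shift count, as in tsc.c */
	};

	unsigned int
	tsc_get_timecount_low(struct fake_timecounter *tc)
	{
		uint32_t rv;

		__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
		    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
		return (rv);
	}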