Date: Mon, 2 Feb 2004 19:45:57 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Andy Farkas <andyf@speednet.com.au>
Cc: John Baldwin <jhb@FreeBSD.org>
Subject: Re: cvs commit: src/sys/i386/i386 apic_vector.s src/sys/i386/isa atpic_vector.s
Message-ID: <20040202175017.W1579@gamplex.bde.org>
In-Reply-To: <20040202110437.N88162@hewey.af.speednet.com.au>
References: <200401282044.i0SKi8Y6063747@repoman.freebsd.org> <20040202110437.N88162@hewey.af.speednet.com.au>

On Mon, 2 Feb 2004, Andy Farkas wrote:

> On Wed, 28 Jan 2004, John Baldwin wrote:
>
> >   Modified files:
> >     sys/i386/i386        apic_vector.s
> >     sys/i386/isa         atpic_vector.s
> >   Log:
> >   Optimize the i386 interrupt entry code to not reload the segment registers
> >   if they already contain the correct kernel selectors.
>
> What effect on performance does this change have? It seems to be a rather
> significant change to an important code path, or am I totally confused..?

I measured it in userland and saw about -1 cycles/interrupt on an AthlonXP and about -22 cycles/interrupt on an old Celeron (negative means a pessimization). This optimization is hard to measure because it depends on branch prediction, and interrupts give very unpredictable branches. I tested with a predictable pattern of (branch, !branch, branch, !branch, ...) in userland to see the -1 and -22 cycle results.

In any case, this optimization is not worth doing on these machines, since loading segment registers (at least with the same value) is not slow. It takes 3 cycles on AthlonXP's and not many more on old Celerons (6 at most). So a whole 9 or so cycles per interrupt is up for optimization on these machines. P4's are said to be much slower (100 cycles for a segreg load) but this disagrees with pentopt.pdf, which says that the load takes 6-8 cycles.

> Also, you've changed:
>
>     movl $KDSEL, %eax ; /* reload with kernel's data segment */
>
> and,
>
>     movl $KPSEL, %eax ; /* reload with per-CPU data segment */
>
> to:
>
>     mov $KDSEL,%ax ; /* load kernel ds, es and fs */
>
> and,
>
>     mov $KPSEL,%ax ;
>
>
> Is this part of the optimisations? Or, could you briefly explain this
> change? Thank you.

This gives most of the -22 cycle optimization on Celerons. It is a small negative optimization on old machines and a relatively large negative optimization on PentiumPro class machines (PPro and Celeron, and probably P2 and P3 but not P4), but is harmless or a small positive optimization on Athlons.
It gives an operand size prefix on all machines and partial register stalls on PPros. The partial register stalls are due to a gas bug assembling the segment register moves in the next instructions:

	mov	$KDSEL, %ax
	mov	%ax, %ds	# partial register stall
	mov	%ax, %es	# already stalled; probably not another one
	mov	$KPSEL, %ax
	mov	%ax, %fs	# partial register stall

Gas misassembles the apparent 16-bit moves to 32-bit ones. See objdump or gdb output for a correct disassembly of the generated code -- it doesn't match the source code. Segment registers have only 16 bits, so the top 16 bits are thrown away by the CPU; however, this is apparently done in a late stage of the pipeline after the stall occurs. On old Celerons, each partial register stall takes longer than loading all 3 segment registers without stalling. This gives the -22 cycle optimization.

Old code avoided the stalls accidentally by being optimized to avoid the operand size prefixes (since these just waste cycles on old CPUs; on current CPUs they are usually free because deep pipelines optimize them away). There was a movl to %eax to avoid a prefix for this instruction, and a hack to get the same result as the gas bug (no prefix and thus a 32-bit move). The hack was needed because gas bugs in this area used to be larger. The movl to %eax was already unoptimized for the !SMP case due to wrong fixes for warnings about the hack that gas started emitting when it started understanding operand sizes better.

The gas bug and the operand size prefixes are now easy to avoid using 32-bit moves for everything:

	movl	$KDSEL,%eax
	movl	%eax,%ds	# actually assembled correctly

Gas assembles 32-bit moves _from_ segment registers correctly. The gas bug is presumably the result of incomplete and confusing documentation about this. The i386 and i486 manuals barely mention the effect of the prefix.
My assembler gets it wrong for both directions by never generating a prefix, and it doesn't permit moves between segment registers and 32-bit general registers. This is based on a literal reading of the opcodes in the table of mov's in the i386 manual (there are no prefixes there). However, with no prefix such moves are actually 32-bit in 32-bit mode (except that for moves to segment registers, the non-no-op-ness of the missing prefix is only visible as a partial register stall).

Current Intel manuals still don't mention prefixes in the table, but have a lot of notes about them. They warn that some assemblers insert a useless prefix and recommend using "MOV DS,EAX" (Intel Syntax) to avoid it. However, a prefix for "MOV DS,AX" is not always useless since it may avoid a partial register stall, so "MOV DS,AX" should give the prefix unconditionally and "MOV DS,EAX" for avoiding it should be more than a recommendation.

Current Intel manuals also document the effect of 32-bit moves from segment registers (the top bits are undefined for old CPUs and 0 for new ones). See another thread and the commit logs for the i386 cpufunc.h about undoing old operand size prefix optimizations for this direction.

Bruce