Date: Sun, 18 Jun 2006 03:30:29 GMT
From: Bruce Evans <bde@zeta.org.au>
To: freebsd-bugs@FreeBSD.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Message-ID: <200606180330.k5I3UTWJ036967@freefall.freebsd.org>
The following reply was made to PR kern/98460; it has been noted by GNATS.

From: Bruce Evans <bde@zeta.org.au>
To: Rostislav Krasny <rosti.bsd@gmail.com>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Sun, 18 Jun 2006 13:30:09 +1000 (EST)

On Sun, 18 Jun 2006, Rostislav Krasny wrote:

> On Sat, 17 Jun 2006 17:01:27 +1000 (EST)
> Bruce Evans <bde@zeta.org.au> wrote:
>
>> On Fri, 16 Jun 2006, Rostislav Krasny wrote:
>>> ...
>>> I think it is a matter of principle. AMD saved a few microcommands in
>>> their incorrect implementation of two Pentium III instructions, and now
>>> buyers of their processors are paying much more than those few
>>> microcommands.
>>
>> No, the non-AMD users pay much less (unless the cost of branch prediction
>> is very large). When I tried to measure the overhead for the fix, I found
>> that fxsave+fxrstor takes almost twice as long on a P4 (Xeon) as on an
>> Athlon (XP, 64). That's about 150 cycles longer, IIRC. The fix costs only
>> 14 cycles.
>
> Yes, according to
> http://security.freebsd.org/advisories/FreeBSD-SA-06:14-amd.txt
> the "FXRSTOR-centric" method takes 14 cycles on an AMD Opteron processor.
> That is the minimum which AMD users need to pay now. Non-AMD users have
> four options:

I confirmed the ~14 cycle value in a micro-benchmark but don't really
believe it. The difficulty of accounting for cache misses of various
types (perhaps mainly branch-target-cache misses here) is shown partly
by the AMD statement not even mentioning caches.

> 1. run the same instructions down the drain
> 2. test some flag
> 3. jump over these instructions
> 4. disable these instructions in the kernel build configuration

5. Replace these instructions by no-op instructions. (This can be done
   at no cost for many bytes of instructions on CPUs with micro-ops,
   but costs up to 2 (?) cycles per byte on old i386's.)
6. Change the pointer to Xdna in the IDT to a pointer to a version
   without these instructions.
7. Change Xdna (and/or routines that it calls, preferably none) to a
   version without these or hundreds of other instructions.
8. Do some of the above for all branches and/or routines in the kernel
   to avoid hundreds of thousands of branches and other instructions.
9. Use another method to exploit parallelism better. fldl after fxsave
   is probably better for parallelism.

> Now, how much it will cost them:
>
> 1. same 14 cycles (?)
> 2. minimum 20 cycles on NetBurst or about 15 cycles on Pentium III
>    http://www.intel.com/cd/ids/developer/asmo-na/eng/44010.htm?prn=Y
>    plus 1 or 2 microcommands for a BT or TEST instruction.
> 3. 1 microcommand for one direct JMP
> 4. nothing

1. Possibly 14, probably more, but possibly less due to parallelism.
2. Now at most 2 on modern CPUs, under the same bad assumptions that
   give 14 for (1).
3. Direct jumps sometimes take just as long as conditional jumps on
   some CPUs (I think due to them not being cached), but if something
   is sure to take only a single micro-op then there's a good chance
   of parallelism.
4. Probably, but possibly not, since the extra code might accidentally
   improve instruction scheduling :-).
5. Like (3), except no-ops may reduce to 0 micro-ops instead of 1 and
   thus take 0 execution resources but some prefetch resources.
6. Like (4).
7. Like (6) repeated 50 times. Xdna could take 20 times fewer
   instructions but wouldn't be 20 times faster, because the slow
   fxrstor instruction would dominate.
8. I think the potential savings from this huge task are about 10% for
   the kernel and some fraction of this for the system.
9. "fxsave; testl $FLAG,cpu_fxsr; jz 1f; fnstsw ...; cmp ...; jz;
   fnclex; fldl ...; 1:". Now the cpu_fxsr test and even the status
   test might be free even if there is a branch misprediction, since
   there are no important data dependencies.
If the CPU has enough execution units then it can do the following in
parallel:

   FPU1     ALU1                  FPU2     ALU[2-]      FPU[2-]
   ----     ----                  ----     -------      -------
   fxsave   testl $FLAG,cpu_fxsr  idle     runs ahead   runs ahead
   ...      jz 1f                 idle     ...          ...
   ...      ...                   fnstsw   ...          cmp
   ...      ...                   jz       ...          runs ahead
   fnclex   ...                   fldl     runs ahead   ...

   Some serializing instruction, probably iret:

   iret     iret                  iret     iret         iret

If the CPU soon returns to user mode then it will hit a serializing
instruction soon, so it is important to start the slow fxsave instruction
as early as possible so that everything doesn't have to wait for it. The
npxsave() call in cpu_switch() was written about 13 years ago, and the
i386 cpu_switch() is more like 20 years old. It knows nothing about
multiple execution units and happens to schedule the npx switch (actually
the save half of a switch) almost perfectly pessimally by doing it near
the end. However, mi_switch() has a lot of bloat, so this probably
doesn't matter -- the fxsave+fnclex sequence will complete before the
bloat gets through the integer ALUs.

I don't know if modern CPUs have this much parallelism. My (old, paper)
AthlonXP optimization manual says that fnstsw runs in the FSTORE pipe and
doesn't say which pipe(s) fxsave runs in, so I guess fnstsw has to wait
for fxsave. You would like this, since AthlonXPs would have to wait but
Pentiums would proceed on all except ALU1 and FPU1 :-).

> The last option has the best performance cost but kernel build options
> are unhandy. Implementation of the third option is simple. Why not
> do it? Only one byte of the code will be self-modified.

Because modifying only 1 byte in a 5MB library (the kernel) for a larger
application (userland) would make little difference.

>> 14 cycles is a lot from one point of view, but from a practical point
>> of view it is the same as 0. Suppose that the kernel does 1000 context
>> switches per second per CPU (too many for efficiency, since it thrashes
>> caches), and that an FPU switch occurs on all of these (it would
>> normally be much less than that, since half of all context switches are
>> often to kernel threads (and half back), and many threads don't use the
>> FPU). We then waste 14000 cycles per second + more for branch
>> misprediction and other cache effects. At 2 GHz, 14000 cycles is a
>> whole 7 us.
>
> How many cycles does a context switch normally take? About 1000 cycles?
> Then 14-20 additional cycles take 1.4% - 2% of the previous context
> switch time. Why waste it?

More like 2000 (best case). It was more like 1000 as recently as
RELENG_4, but there have been many branches since then. On my AthlonXP
@ 2223 MHz with a TSC timecounter, according to LMbench:

% L M B E N C H  2 . 0   S U M M A R Y
% ------------------------------------
%
% Context switching - times in microseconds - smaller is better
% -------------------------------------------------------------
% Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
%                          ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
% --------- ------------- ----- ------ ------ ------ ------ ------- -------
% epsplex.b FreeBSD 4.10- 0.370 0.6800 7.9100 2.2800   14.1 4.62000    55.9
% epsplex.b FreeBSD 5.2-C 0.830 1.3600 8.6200 3.2900   24.7 4.28000    58.5

0.370 us is 823 cycles and 0.830 us is 1845 cycles. The variance of
these times is about 5%. LMbench's context switching doesn't exercise
the FPU.

Bruce