Date: Wed, 05 May 2004 19:54:53 -0600
From: Scott Long <scottl@freebsd.org>
To: Gerrit Nagelhout <gnagelhout@sandvine.com>
Cc: 'Andrew Gallatin' <gallatin@cs.duke.edu>
Subject: Re: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID: <40999AED.9080606@freebsd.org>
In-Reply-To: <FE045D4D9F7AED4CBFF1B3B813C85337021AB38C@mail.sandvine.com>
References: <FE045D4D9F7AED4CBFF1B3B813C85337021AB38C@mail.sandvine.com>

Gerrit Nagelhout wrote:
> Andrew Gallatin wrote:
>
>>Bruce Evans writes:
>>
>> >
>> > Athlon XP2600 UP system:  !SMP case: 22 cycles  SMP case: 37 cycles
>> > Celeron 366 SMP system:              35                   48
>> >
>> > The extra cycles for the SMP case are just the extra cost of one lock
>> > instruction.  Note that SMP should cost twice as much extra, but the
>> > non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl
>> > which always locks the bus.  After fixing this:
>> >
>> > Athlon XP2600 UP system:  !SMP case:  6 cycles  SMP case: 37 cycles
>> > Celeron 366 SMP system:              10                   48
>> >
>> > Mutexes take longer than simple locks, but not much longer unless the
>> > lock is contested.  In particular, they don't lock the bus any more
>> > and the extra cycles for locking dominate (even in the !SMP case due
>> > to the pessimization).
>> >
>> > So there seems to be something wrong with your benchmark.  Locking the
>> > bus for the SMP case always costs about 20+ cycles, but this hasn't
>> > changed since RELENG_4 and mutexes can't be made much faster in the
>> > uncontested case since their overhead is dominated by the bus lock
>> > time.
>> >
>>
>>Actually, I think his tests are accurate and bus locked instructions
>>take an eternity on P4.  See
>>http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html
>>
>>For example, with your test above, I see 212 cycles for the UP case on
>>a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with a
>>simple slock = 0; reduces that count to 18 cycles.
>>
>>If it's really safe to remove the xchg* from non-SMP atomic_store_rel*,
>>then I think you should do it.  Of course, that still leaves mutexes
>>as very expensive on SMP (253 cycles on the 2.53GHz from above).
>>
>>Drew
>>
>
>
> I wonder if there is anything that can be done to make the locking more
> efficient for the Xeon.  Are there any other locking types that could
> be used instead?
> This might also explain why we are seeing much worse system call
> performance under 4.7 in SMP versus UP.  Here is a table of results
> for some system call tests I ran.  (The numbers are calls/s)

Int 0x80 system calls are known to be extremely expensive on a P4.  I
think that Jeff Roberson measured them as taking 300 cycles on average.
Some work was done on implementing the alternate sysenter/sysexit
method, but I don't think it was ever finished.  I think that it was
shown to have a modest speed improvement, but there was still a lot of
overhead that made it slow on a P4.

There are other optimizations that can be done like having a shared
page that lets you avoid calls like getpid and gettimeofday, but it
opens some security risks that have to be dealt with.

Scott
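
To get a rough feel for the per-call cost discussed above, something like
the following user-space sketch could be used. It is only an illustration
under stated assumptions, not code from this thread or from the FreeBSD
tree: the names now_ns, getpid_cached, bench_syscall and bench_cached are
hypothetical, and the cached pid is merely a stand-in for a value that a
kernel-exported shared page might provide. It times real kernel entries
via syscall(SYS_getpid) against reads of a value kept in ordinary process
memory, reporting nanoseconds per call rather than cycles.

```c
#define _DEFAULT_SOURCE   /* lets glibc's <unistd.h> declare syscall() */
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: monotonic clock reading in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Assumption: stand-in for a shared page.  The pid is fetched from the
 * kernel once and afterwards read from plain memory.  0 marks "not yet
 * fetched", since no user process has pid 0.
 */
static pid_t cached_pid;

pid_t getpid_cached(void)
{
    if (cached_pid == 0)
        cached_pid = (pid_t)syscall(SYS_getpid);
    return cached_pid;
}

/* Average nanoseconds per call over n real kernel entries. */
double bench_syscall(unsigned long n)
{
    volatile pid_t sink = 0;   /* volatile keeps the loop from being removed */
    uint64_t t0, t1;

    assert(n > 0);
    t0 = now_ns();
    for (unsigned long i = 0; i < n; i++)
        sink = (pid_t)syscall(SYS_getpid);
    t1 = now_ns();
    (void)sink;
    return (double)(t1 - t0) / (double)n;
}

/* Average nanoseconds per call over n reads of the cached value. */
double bench_cached(unsigned long n)
{
    volatile pid_t sink = 0;
    uint64_t t0, t1;

    assert(n > 0);
    (void)getpid_cached();     /* fill the cache outside the timed loop */
    t0 = now_ns();
    for (unsigned long i = 0; i < n; i++)
        sink = getpid_cached();
    t1 = now_ns();
    (void)sink;
    return (double)(t1 - t0) / (double)n;
}
```

Calling bench_syscall and bench_cached with the same large n (a million,
say) and comparing the two results gives a crude picture of what each
kernel entry costs on a given machine. The numbers depend on the CPU, the
kernel and the system call entry method in use, so they are not comparable
to the cycle counts quoted above. Note also that this naive cache goes
stale in the child after fork(), which is one example of the kind of
detail any real mechanism of this sort would have to handle.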