From: Jeff Roberson
Date: Mon, 10 Mar 2008 16:26:17 -1000 (HST)
To: arch@freebsd.org
Subject: amd64 cpu_switch in C.
Message-ID: <20080310161115.X1091@desktop>
List-Id: Discussion related to FreeBSD architecture

http://people.freebsd.org/~jeff/amd64.diff

At the above address there is an implementation of cpu_switch() and cpu_throw() for amd64 almost entirely in C. I'm posting this for discussion and eventual commit.

There are numerous reasons to do this; I will outline some of them.

Implementing the bulk of the code in C allows us to add or modify higher-level features more easily. For example, we can change the pmap active bits to use a cpuset_t so we can support more than 64 CPUs.

It makes the code faster because we can do more complicated checks to save time, such as avoiding writing the fs/gsbase MSRs if they have not changed.

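As a rough illustration of the fs/gsbase point above, here is a minimal userland sketch of skipping a write when the value has not changed. It is not code from amd64.diff: the names pcb_sketch, wrmsr_stub, switch_bases, and the SKETCH_MSR_* identifiers are hypothetical, and the stub only counts calls instead of executing a real wrmsr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical identifiers for this sketch; not the names used in amd64.diff. */
enum sketch_msr { SKETCH_MSR_FSBASE, SKETCH_MSR_GSBASE };

/* Only the two per-thread base values that matter for this example. */
struct pcb_sketch {
	uint64_t fsbase;
	uint64_t gsbase;
};

/* Counts simulated MSR writes so the saving can be observed. */
static unsigned long msr_writes;

/* Stand-in for the privileged wrmsr instruction; it only counts calls. */
static void
wrmsr_stub(enum sketch_msr msr, uint64_t value)
{
	(void)msr;
	(void)value;
	msr_writes++;
}

/* Load the new thread's bases, skipping any write whose value is unchanged. */
static void
switch_bases(const struct pcb_sketch *oldpcb, const struct pcb_sketch *newpcb)
{
	assert(oldpcb != NULL && newpcb != NULL);
	if (oldpcb->fsbase != newpcb->fsbase)
		wrmsr_stub(SKETCH_MSR_FSBASE, newpcb->fsbase);
	if (oldpcb->gsbase != newpcb->gsbase)
		wrmsr_stub(SKETCH_MSR_GSBASE, newpcb->gsbase);
}
```

With this shape, a switch between two threads that share the same bases issues no writes at all, and a switch that changes only fsbase issues exactly one.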
It makes the code faster because infrequently used options can be moved out of the normal code paths. In fact, the C version is ~10% faster than the assembly version at a two-thread sched_yield() test on a single-CPU Opteron:

x asm.yield
+ csw.yield
[distribution plot: all csw.yield (+) samples lie to the left of the asm.yield (x) samples]
    N           Min           Max        Median           Avg        Stddev
x  10          5.17          5.88           5.5         5.479    0.19272606
+  15          4.58          5.16          4.71     4.8126667    0.20738049
Difference at 95.0% confidence
        -0.666333 +/- 0.170431
        -12.1616% +/- 3.11062%
        (Student's t, pooled s = 0.201773)

This test measures the total time to call sched_yield() 10,000,000 times between two threads. Two threads are needed to be sure that the scheduler doesn't pick the same thread twice and skip cpu_switch(). The 10% speedup is notable because the cpu_switch() routine was consuming less than 40% of the CPU prior to the speedup, so the switch routine itself is almost 1/3 faster.

Peter also suggested that we can delay portions of the switch until the user boundary. For workloads that involve heavy kernel activity on the user's part, with multiple switches per syscall, this would be a big saving.

We could also use this as a framework to implement custom switch routines if we want to switch directly to ithreads or taskqueue threads in the future.

The C routine is supplemented by two assembly routines which are responsible for saving the core architecture state and manipulating the stack. These total approximately 50 assembly instructions and are similar to savecontext/swapcontext. The C code saves the old thread's context but still runs on its stack as it continues the switch. This is safe because the old thread is locked until we call "cpu_switchin()", which is similar to swapcontext.

The only appreciable downside is that it lowers the barrier to entry for modifying a very sensitive piece of code. Still, I think the flexibility it gives us outweighs those concerns.

Comments?

Thanks,
Jeff