From owner-freebsd-arch@FreeBSD.ORG  Tue Mar 11 02:25:27 2008
Date: Mon, 10 Mar 2008 16:26:17 -1000 (HST)
From: Jeff Roberson <jroberson@chesapeake.net>
X-X-Sender: jroberson@desktop
To: arch@freebsd.org
Message-ID: <20080310161115.X1091@desktop>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: 
Subject: amd64 cpu_switch in C.

http://people.freebsd.org/~jeff/amd64.diff

At the above address there is an implementation of cpu_switch() and
cpu_throw() for amd64 written almost entirely in C.  I'm posting this for
discussion and eventual commit.  There are numerous reasons to do this; I
will outline some of them.

Implementing the bulk of the code in C allows us to add or modify higher
level features more easily.  For example, we can change the pmap active
bits to use a cpuset_t so we can support more than 64 CPUs.  It makes the
code faster because we can do more complicated checks to save time, such
as avoiding writing the fs/gsbase MSRs if they have not changed.  It also
makes the code faster because infrequently used options can be moved out
of the normal code paths.
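
To illustrate the MSR check mentioned above, here is a minimal userland
sketch.  It is not taken from the diff: struct fake_pcb, fake_wrmsr() and
switch_segbases() are hypothetical names, and fake_wrmsr() only counts
calls where real kernel code would execute a wrmsr instruction.  The two
MSR numbers are the architectural FS.base and KernelGSbase values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_FSBASE  0xc0000100u /* architectural FS.base MSR */
#define MSR_KGSBASE 0xc0000102u /* architectural KernelGSbase MSR */

/* Hypothetical per-thread state; only the fields this sketch needs. */
struct fake_pcb {
	uint64_t pcb_fsbase;
	uint64_t pcb_gsbase;
};

static unsigned long msr_writes; /* number of simulated MSR writes */

/* Stand-in for the privileged wrmsr instruction: it only counts calls. */
static void
fake_wrmsr(uint32_t msr, uint64_t val)
{
	(void)msr;
	(void)val;
	msr_writes++;
}

/* Load the new thread's segment bases, skipping values that are unchanged. */
static void
switch_segbases(const struct fake_pcb *oldpcb, const struct fake_pcb *newpcb)
{
	assert(oldpcb != NULL && newpcb != NULL);
	if (oldpcb->pcb_fsbase != newpcb->pcb_fsbase)
		fake_wrmsr(MSR_FSBASE, newpcb->pcb_fsbase);
	if (oldpcb->pcb_gsbase != newpcb->pcb_gsbase)
		fake_wrmsr(MSR_KGSBASE, newpcb->pcb_gsbase);
}
```

Switching between two threads with identical bases performs no simulated
MSR writes at all, which is the saving the comparison is meant to buy.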

In fact, the C version is ~10% faster than the assembly version in a
two-thread sched_yield() test on a single-CPU Opteron:

x asm.yield
+ csw.yield
+------------------------------------------------------------------------------+
|     ++                                              x  x                     |
|+ ++ ++ +  + +          +  +   ++ +x    x     x      x  xxx                  x|
| |______M_____A___________|               |__________AM__________|            |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10          5.17          5.88           5.5         5.479    0.19272606
+  15          4.58          5.16          4.71     4.8126667    0.20738049
Difference at 95.0% confidence
         -0.666333 +/- 0.170431
         -12.1616% +/- 3.11062%
         (Student's t, pooled s = 0.201773)

This test measures the total time to call sched_yield() 10,000,000 times
between two threads.  Two threads are needed to be sure that the scheduler
doesn't pick the same thread twice and skip cpu_switch().  The 10% speedup
is notable because the cpu_switch() routine was consuming less than 40% of
the CPU prior to the speedup.  A ~12% drop in total time taken out of a
share of at most 40% (0.12 / 0.40 = 0.30) means cpu_switch() itself is
almost 1/3rd faster.

Peter also suggested that we can delay portions of the switch until the
user boundary.  For workloads that involve heavy kernel activity on the
user's part, with multiple switches per syscall, this would be a big
savings.  We could also use this as a framework to implement custom switch
routines if we want to switch directly to ithreads or taskqueue threads in
the future.
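
One way to picture delaying work until the user boundary is the sketch
below.  It is an assumption about how such a scheme could look, not
something from the diff: struct fake_thread, FAKE_USERSTATE_STALE,
fake_switch() and fake_userret() are all hypothetical, and the
"expensive" load of user-only state is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_USERSTATE_STALE 0x1 /* hypothetical: user-only state not loaded */

struct fake_thread {
	int td_flags;
};

static struct fake_thread *user_state_owner; /* whose user state is loaded */
static unsigned long user_state_loads;       /* simulated expensive loads */

/* Kernel-side switch: do no user-only work, just mark it as pending. */
static void
fake_switch(struct fake_thread *newtd)
{
	assert(newtd != NULL);
	if (user_state_owner != newtd)
		newtd->td_flags |= FAKE_USERSTATE_STALE;
}

/* Return to user mode: load the user-only state only if it is stale. */
static void
fake_userret(struct fake_thread *td)
{
	assert(td != NULL);
	if (td->td_flags & FAKE_USERSTATE_STALE) {
		user_state_loads++;
		user_state_owner = td;
		td->td_flags &= ~FAKE_USERSTATE_STALE;
	}
}
```

In this model a thread that is switched out and back in several times
during one syscall pays for the user-only state at most once, when it
actually returns to user mode, and not at all if its state is still
loaded.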

The C routine is supplemented by two assembly routines which are 
responsible for saving the core architecture state and manipulating the 
stack.  These total approximately 50 assembly instructions and are similar 
to savecontext/swapcontext.

The C code saves the old thread's context but still runs on its stack as
it continues the switch.  This is safe because the old thread is locked
until we call "cpu_switchin()", which is similar to swapcontext.
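
For readers less familiar with the userland calls used in the comparison,
here is a small sketch using the POSIX ucontext routines getcontext(),
makecontext() and swapcontext().  It is only an analogy for the two
assembly helpers, not the kernel code: swapcontext() saves and switches
in one call, whereas the scheme described above saves first and switches
later.  worker(), run_switch_demo() and the trace array are invented for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

static ucontext_t main_ctx, worker_ctx;
static char worker_stack[64 * 1024];
static int trace[4]; /* order in which the steps ran */
static int ntrace;

static void
worker(void)
{
	trace[ntrace++] = 2;                 /* now on worker_stack */
	swapcontext(&worker_ctx, &main_ctx); /* save self, resume main */
	trace[ntrace++] = 4;                 /* resumed a second time */
}

/* Switch to the worker twice; returns the number of recorded steps. */
static int
run_switch_demo(void)
{
	ntrace = 0;
	if (getcontext(&worker_ctx) == -1)
		return (-1);
	worker_ctx.uc_stack.ss_sp = worker_stack;
	worker_ctx.uc_stack.ss_size = sizeof(worker_stack);
	worker_ctx.uc_link = &main_ctx; /* resume main when worker returns */
	makecontext(&worker_ctx, worker, 0);

	trace[ntrace++] = 1;
	swapcontext(&main_ctx, &worker_ctx); /* save main, run worker */
	trace[ntrace++] = 3;
	swapcontext(&main_ctx, &worker_ctx); /* resume worker after its swap */
	return (ntrace);
}
```

Each swapcontext() call stores the caller's registers and stack pointer
in its first argument and resumes the context in its second, so control
alternates between the two stacks in the order recorded in trace.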

The only appreciable downside is that it lowers the barrier to entry for
modifying a very sensitive piece of code.  Still, I think the flexibility
it gives us outweighs that concern.

Comments?

Thanks,
Jeff