From: Scott Long
Date: Sun, 21 Mar 2010 10:13:01 -0600
To: Alexander Motin
Cc: freebsd-current@freebsd.org, Ivan Voras, freebsd-arch@freebsd.org
Subject: Re: Increasing MAXPHYS

On Mar 21, 2010, at 8:05 AM, Alexander Motin wrote:
> Ivan Voras wrote:
>> Julian Elischer wrote:
>>> You can get better throughput by using TSC for timing because the geom
>>> and devstat code does a bit of timing. Geom can be told to turn off
>>> its timing but devstat can't. The 170 ktps is with TSC as timer,
>>> and geom timing turned off.
>>
>> I see. I just ran randomio on a gzero device and with 10 userland
>> threads (this is a slow 2xquad machine) I get g_up and g_down saturated
>> fast with ~120 ktps. Randomio uses gettimeofday() for measurements.
>
> I've just got 140 ktps from two real Intel X25-M SSDs on an ICH10R AHCI
> controller and a single Core2Quad CPU. So at least in synthetic tests it
> is potentially reachable even with commodity hardware, though it
> completely saturated the quad-core CPU.
>
>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
>> barring specific class behaviour, it has a fair chance of working out of
>> the box), but the incoming queue will also need to be broken up for
>> greater effect.
>
> According to the notes, it looks like there is a good chance of races, as
> some places expect only one up and one down thread.

I agree that more threads just create many more race complications. Even
if they didn't, the storage driver is a serialization point; it doesn't
matter if you have a dozen g_* threads if only one of them can be in the
top half of the driver at a time. No amount of fine-grained locking is
going to help this. I'd like to go in the opposite direction.
The queue-dispatch-queue model of GEOM is elegant and easy to extend, but
it is very wasteful for the simple case, where "simple" means one or two
partition transforms (mbr, bsdlabel) and/or a simple stripe/mirror
transform. None of these needs a dedicated dispatch context in order to
operate. What I'd like to explore is compiling the GEOM stack at creation
time into a linear array of operations that happen without a g_down/g_up
context switch. As providers and consumers taste each other and build a
stack, that stack gets compiled into a graph, and that graph gets executed
directly from the calling context, both from the dev_strategy() side on
the top and the bio_done() side on the bottom. GEOM classes that need a
detached context can mark themselves as such; doing so will prevent a
graph from being created, and the current dispatch model will be retained.

I expect that this will reduce I/O latency by a great margin, directly
addressing the performance problem that FusionIO makes an example of. I'd
also like to explore having the g_bio model not require a malloc at every
stage in the stack/graph; even though going through UMA is fairly fast, it
still represents overhead that can be eliminated, and it represents an
out-of-memory failure case that can be prevented.

I might try to work on this over the summer. It's really a research
project in my head at this point, but I'm hopeful that it'll show results.

Scott
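
[Editorial sketch] To make the compiled-graph idea a bit more concrete, here
is a rough, hedged C sketch of what a "compiled" stack might look like. All
of the names here (struct xbio, compiled_stage, compiled_graph_start, the
offsets) are hypothetical and are not part of the existing geom(4) code; a
real implementation would also need the completion (bio_done()) path, error
handling, and locking that this sketch omits.

/*
 * Hypothetical sketch only -- none of these names exist in geom(4).
 * The idea: at taste/attach time, flatten a stack of simple transforms
 * (mbr, bsdlabel, stripe offset, ...) into a small preallocated array
 * that is walked entirely in the caller's context, so the common case
 * needs no g_down/g_up queue hop and no per-stage allocation.
 */
#include <stddef.h>
#include <stdint.h>

struct xbio {                           /* stand-in for struct bio */
    uint64_t    offset;                 /* byte offset into the provider */
    size_t      length;                 /* transfer length */
    void        *data;                  /* caller-supplied buffer */
};

/* One compiled stage: adjusts the request in place (e.g. adds an offset). */
typedef void compiled_stage_fn(struct xbio *bp, void *arg);

struct compiled_stage {
    compiled_stage_fn   *fn;
    void                *arg;
};

struct compiled_graph {
    int                     nstages;
    struct compiled_stage   stages[8];  /* fixed storage: no malloc per I/O */
    void                    (*driver_start)(struct xbio *bp); /* bottom */
};

/* Example transform: a partition layer is just an offset shift. */
static void
offset_stage(struct xbio *bp, void *arg)
{
    bp->offset += *(const uint64_t *)arg;
}

/* Example "compile" step: an mbr slice with a bsdlabel partition inside. */
static uint64_t slice_off = 63ULL * 512;      /* hypothetical slice start */
static uint64_t part_off  = 16384ULL * 512;   /* hypothetical partition start */

static void
compile_example(struct compiled_graph *gp, void (*driver_start)(struct xbio *))
{
    gp->nstages = 2;
    gp->stages[0] = (struct compiled_stage){ offset_stage, &slice_off };
    gp->stages[1] = (struct compiled_stage){ offset_stage, &part_off };
    gp->driver_start = driver_start;
}

/*
 * Dispatch: run every stage inline and hand the request to the driver,
 * all from the calling thread (the dev_strategy() side).
 */
static void
compiled_graph_start(struct compiled_graph *gp, struct xbio *bp)
{
    for (int i = 0; i < gp->nstages; i++)
        gp->stages[i].fn(bp, gp->stages[i].arg);
    gp->driver_start(bp);
}

Completion would presumably walk the same preallocated array back up from
the driver's done path; because the compiled graph owns its stage storage,
the per-stage g_bio allocation, and its out-of-memory failure mode, could be
avoided for the common case, matching the goals described above.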