Date:      Sun, 21 Mar 2010 10:13:01 -0600
From:      Scott Long <scottl@samsco.org>
To:        Alexander Motin <mav@freebsd.org>
Cc:        freebsd-current@freebsd.org, Ivan Voras <ivoras@freebsd.org>, freebsd-arch@freebsd.org
Subject:   Re: Increasing MAXPHYS
Message-ID:  <D9D66012-16FD-4FB6-AB6A-9A8D17727901@samsco.org>
In-Reply-To: <4BA6279E.3010201@FreeBSD.org>
References:  <1269109391.00231800.1269099002@10.7.7.3>	<1269120182.00231865.1269108002@10.7.7.3>	<1269120188.00231888.1269109203@10.7.7.3>	<1269123795.00231922.1269113402@10.7.7.3>	<1269130981.00231933.1269118202@10.7.7.3>	<1269130986.00231939.1269119402@10.7.7.3> <1269134581.00231948.1269121202@10.7.7.3> <1269134585.00231959.1269122405@10.7.7.3> <4BA6279E.3010201@FreeBSD.org>


On Mar 21, 2010, at 8:05 AM, Alexander Motin wrote:

> Ivan Voras wrote:
>> Julian Elischer wrote:
>>> You can get better throughput by using TSC for timing because the geom
>>> and devstat code does a bit of timing.. Geom can be told to turn off
>>> it's timing but devstat can't. The 170 ktps is with TSC as timer,
>>> and geom timing turned off.
>>
>> I see. I just ran randomio on a gzero device and with 10 userland
>> threads (this is a slow 2xquad machine) I get g_up and g_down saturated
>> fast with ~~ 120 ktps. Randomio uses gettimeofday() for measurements.
>
> I've just got 140Ktps from two real Intel X25-M SSDs on ICH10R AHCI
> controller and single Core2Quad CPU. So at least on synthetic tests it
> is potentially reachable even with casual hardware, while it completely
> saturated quad-core CPU.
>
>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
>> barring specific class behaviour, it has a fair chance of working out of
>> the box) but the incoming queue will need to also be broken up for
>> greater effect.
>
> According to "notes", looks there is a good chance to obtain races, as
> some places expect only one up and one down thread.
>

I agree that more threads just create many more race complications.
Even if they didn't, the storage driver is a serialization point; it
doesn't matter if you have a dozen g_* threads if only one of them can
be in the top half of the driver at a time.  No amount of fine-grained
locking is going to help this.
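
To illustrate what I mean, here is a purely hypothetical driver fragment
(the names are made up and don't correspond to any real controller
driver): every dispatch thread ends up funneling through the same
per-instance lock at the top of the driver, so adding g_* threads only
adds contention at that lock:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/bio.h>

    /* Made-up controller softc, for illustration only. */
    struct fakedrv_softc {
            struct mtx      sc_lock;        /* the single top-half entry point */
            /* ... hardware queue state ... */
    };

    static void
    fakedrv_strategy(struct fakedrv_softc *sc, struct bio *bp)
    {
            mtx_lock(&sc->sc_lock);         /* a dozen g_down threads all park here */
            /* program the hardware and queue bp */
            mtx_unlock(&sc->sc_lock);
    }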

I'd like to go in the opposite direction.  The queue-dispatch-queue
model of GEOM is elegant and easy to extend, but very wasteful for the
simple case, where the simple case is one or two partition transforms
(mbr, bsdlabel) and/or a simple stripe/mirror transform.  None of these
need a dedicated dispatch context in order to operate.  What I'd like to
explore is compiling the GEOM stack at creation time into a linear array
of operations that happen without a g_down/g_up context switch.  As
providers and consumers taste each other and build a stack, that stack
gets compiled into a graph, and that graph gets executed directly from
the calling context, both from the dev_strategy() side on the top and
the bio_done() side on the bottom.  GEOM classes that need a detached
context can mark themselves as such; doing so will prevent a graph from
being created, and the current dispatch model will be retained.
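
To make the idea concrete, here is a very rough sketch.  None of these
structures or functions exist today, the names are invented, and real
code would obviously need completion handling, but it shows the shape:
each simple class contributes a transform callback when the stack is
tasted, and the compiled graph is just an array of those callbacks
walked in the caller's context before the bio is handed to the disk
driver:

    #include <sys/param.h>
    #include <sys/bio.h>

    typedef void (*g_xform_t)(struct bio *bp, void *arg);

    struct g_compiled_graph {
            int              gg_flags;
    #define GG_NEEDS_CONTEXT 0x01           /* class opted out; keep g_up/g_down */
            int              gg_count;
            g_xform_t        gg_xform[8];   /* one simple transform per layer */
            void            *gg_arg[8];
            void            (*gg_disk_start)(struct bio *bp); /* bottom of stack */
    };

    /* Called directly from the dev_strategy() path, no context switch. */
    static void
    g_graph_start(struct g_compiled_graph *gg, struct bio *bp)
    {
            int i;

            for (i = 0; i < gg->gg_count; i++)
                    gg->gg_xform[i](bp, gg->gg_arg[i]); /* e.g. shift bio_offset */
            gg->gg_disk_start(bp);          /* straight into the disk driver */
    }

A class like bsdlabel would register a callback that just adds the
partition offset to bio_offset; anything that can't be reduced to that
sets GG_NEEDS_CONTEXT and the stack keeps the current dispatch model.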

I expect that this will reduce I/O latency by a great margin, thus
directly addressing the performance problem that FusionIO makes an
example of.  I'd also like to explore having the g_bio model not require
a malloc at every stage in the stack/graph; even though going through
UMA is fairly fast, it still represents overhead that can be eliminated.
It also represents an out-of-memory failure case that can be prevented.
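
One possible shape for that, again only a sketch that assumes the
compiled-graph idea above: the simple transforms modify the one original
bio in place and record what they changed in a small fixed-size array
carried with the request, so completion unwinds without a g_clone_bio()
or malloc at each layer, and without an ENOMEM path:

    #include <sys/param.h>
    #include <sys/bio.h>

    /* Per-request state, sized when the graph is compiled; no per-layer clones. */
    struct g_graph_req {
            struct bio      *gr_bp;                 /* the only bio for this request */
            int              gr_depth;
            off_t            gr_saved_offset[8];    /* what each layer changed */
    };

    /* A layer records its change instead of cloning the bio. */
    static void
    g_graph_push_offset(struct g_graph_req *gr, off_t new_offset)
    {
            gr->gr_saved_offset[gr->gr_depth++] = gr->gr_bp->bio_offset;
            gr->gr_bp->bio_offset = new_offset;
    }

    /* Completion unwinds the same array in reverse; nothing to free. */
    static void
    g_graph_unwind(struct g_graph_req *gr)
    {
            while (gr->gr_depth > 0)
                    gr->gr_bp->bio_offset = gr->gr_saved_offset[--gr->gr_depth];
    }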

I might try to work on this over the summer.  It's really a research
project in my head at this point, but I'm hopeful that it'll show
results.

Scott



