Date: Sun, 20 Dec 2009 12:13:46 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
To: Harti Brandt
Cc: Ulrich Spörlein, Hans Petter Selasky, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP

On Sat, 19 Dec 2009, Harti Brandt wrote:

> To be honest, I'm lost now. Couldn't we just use the largest atomic type
> for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
> the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
> thread?
>
> Are the 5-6 atomic operations really that costly given the many
> operations done on an IP packet?
> Are they more costly than a heavyweight sync for each ++ or +=?

Frequent writes to the same cache line across multiple cores are
remarkably expensive, as they trigger the cache coherency protocol
(mileage may vary). For example, a single non-atomically incremented
counter cut gettimeofday() to 1/6th of its normal performance on an
8-core system when parallel system calls were made across all cores. On
many current systems, the cost of an "atomic" operation is now fairly
reasonable as long as the cache line is held exclusively by the current
CPU. However, if we can avoid them that has value, as we update quite a
few global stats on the way through the network stack.

> Or we could use the PCPU stuff, use just ++ and += for modifying the
> statistics (32bit) and do the 32->64 bit stuff for all platforms with a
> kernel thread per CPU (do we have this?). Between that thread and the
> sysctl we could use a heavy sync.

The current short-term plan is to do this, but without a syncer thread:
we'll just aggregate the results when they need to be reported, in the
sysctl path. How best to scale to 64-bit counters is an interesting
question, but one we can address after per-CPU stats are in place, which
address an immediate performance (rather than statistics accuracy)
concern.

> Using 32 bit stats may fail if you put in several 10GBit/s adapters
> into a machine and do routing at link speed, though. This might
> overflow the IP input/output byte counter (which we don't have yet)
> too fast.

For byte counters, assuming one 10gbps stream, a 32-bit counter wraps in
about three seconds. Systems processing 40gbps are now quite realistic,
although typically workloads of that sort will be distributed over 16+
cores and using multiple 10gbps NICs.
My thinking is that we get the switch to per-CPU stats done in 9.x in
the next month sometime, and also get it merged to 8.x a month or so
later (I merged the wrapper macros necessary to do that before 8.0, but
didn't have time to fully evaluate the performance implications of the
implementation switch). There are two known problem areas here:

(1) The cross-product issue with virtual network stacks

(2) The cross-product issue with network interfaces for per-interface
    stats

I propose to ignore (1) for now by simply having only vnet0 use per-CPU
stats, and other vnets use single-instance per-vnet stats. We can solve
the larger problem there at a future date.

I don't have a good proposal for (2) -- the answer may be using DPCPU
memory, but that will require us to support more dynamic allocation
ranges, which may add cost. (Right now, the DPCPU allocator relies on
relatively static allocations over time.) This means that, for now, we
may also ignore that issue and leave interface counters as-is. This is
probably a good idea because we also need to deal with multi-queue
interfaces better, and perhaps the stats should be per-queue rather
than per-ifnet, which may itself help address the cache line issue.
Robert N M Watson
Computer Laboratory
University of Cambridge

Date: Sun, 20 Dec 2009 14:19:34 +0100 (CET)
From: Harti Brandt <Hartmut.Brandt@dlr.de>
To: Robert Watson
Subject: Re: network statistics in SMP

On Sun, 20 Dec 2009, Robert Watson wrote:

RW>On Sat, 19 Dec 2009, Harti Brandt wrote:
RW>
RW>> To be honest, I'm lost now.
Couldn't we just use the largest atomic type
RW>> for the given platform and atomic_inc/atomic_add/atomic_fetch and
RW>> handle the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as
RW>> a kernel thread?
RW>>
RW>> Are the 5-6 atomic operations really that costly given the many
RW>> operations done on an IP packet? Are they more costly than a
RW>> heavyweight sync for each ++ or +=?
RW>
RW>Frequent writes to the same cache line across multiple cores are
RW>remarkably expensive, as they trigger the cache coherency protocol
RW>(mileage may vary). For example, a single non-atomically incremented
RW>counter cut performance of gettimeofday() to 1/6th performance on an
RW>8-core system when parallel system calls were made across all cores.
RW>On many current systems, the cost of an "atomic" operation is now
RW>fairly reasonable as long as the cache line is held exclusively by
RW>the current CPU. However, if we can avoid them that has value, as we
RW>update quite a few global stats on the way through the network stack.

Hmm. I'm not sure that gettimeofday() is comparable to forwarding an IP
packet. I would expect that a single increment is a good percentage of
the entire processing (in terms of number of operations) for
gettimeofday(), while in IP forwarding this is somewhere in the noise
floor. In the simplest case the packet is acted upon by the receiving
driver, the IP input function, the IP output function and the sending
driver -- not talking about IP filters, firewalls, tunnels, dummynet
and what else. The relative cost of the increment should be much less.
But I may be wrong, of course.

RW>> Or we could use the PCPU stuff, use just ++ and += for modifying
RW>> the statistics (32bit) and do the 32->64 bit stuff for all
RW>> platforms with a kernel thread per CPU (do we have this?). Between
RW>> that thread and the sysctl we could use a heavy sync.
RW>
RW>The current short-term plan is to do this but without a syncer
RW>thread: we'll just aggregate the results when they need to be
RW>reported, in the sysctl path. How best to scale to 64-bit counters
RW>is an interesting question, but one we can address after per-CPU
RW>stats are in place, which address an immediate performance (rather
RW>than statistics accuracy) concern.

Well, the user side of our statistics is in a very bad shape and I have
problems handling this in the SNMP daemon. Just a number of examples:

interface statistics:
 - they use u_long, so are either 32-bit or 64-bit depending on the
   platform
 - a number of required statistics are missing
 - send drops are somewhere else and are 'int'
 - statistics are embedded into struct ifnet (bad for ABI stability)
   and not versioned
 - accessed together with other unrelated information via sysctl()

IPv4 statistics:
 - also u_long (hence different size on the platforms)
 - a lot of fields required by SNMP are missing
 - not versioned
 - accessed via sysctl()
 - per-interface statistics totally missing

IPv6 statistics:
 - u_quad_t! so they are subject to race conditions on 32-bit
   platforms and, maybe?, on 64-bit platforms
 - a lot of fields required by SNMP are missing
 - not versioned
 - accessed via sysctl(); per-interface statistics via ioctl()

Ethernet statistics:
 - u_long
 - some fields missing
 - implemented in only 3! drivers; some drivers use the corresponding
   field for something else
 - not versioned

I think TCP and UDP statistics are in equally bad shape.

I would really like to sort that out before any kind of ABI freeze
happens. Ideally all the statistics would be accessible via sysctl(),
have a version number and have all or most of the required statistics,
with a simple way to add new fields without breaking anything. Also the
field sizes (64 vs. 32 bit) should be correct on the kernel - user
interface.
My current feeling after reading this thread is that the low-level
kernel side stuff is probably out of what I could do with the time I
have, and would sidetrack me too far from the work on bsnmp. What I
would like to do is fix the kernel/user interface and let the people
that know how to do it handle the low-level side. I would really not
like to have to deal with a changing user/kernel interface in current
if we go in several steps with the kernel stuff.

RW>> Using 32 bit stats may fail if you put in several 10GBit/s
RW>> adapters into a machine and do routing at link speed, though. This
RW>> might overflow the IP input/output byte counter (which we don't
RW>> have yet) too fast.
RW>
RW>For byte counters, assuming one 10gbps stream, a 32-bit counter
RW>wraps in about three seconds. Systems processing 40gbps are now
RW>quite realistic, although typically workloads of that sort will be
RW>distributed over 16+ cores and using multiple 10gbps NICs.
RW>
RW>My thinking is that we get the switch to per-CPU stats done in 9.x
RW>in the next month sometime, and also get it merged to 8.x a month or
RW>so later (I merged the wrapper macros necessary to do that before
RW>8.0 but didn't have time to fully evaluate the performance
RW>implications of the implementation switch).

I will try to come up with a patch for the kernel/user interface in the
mean time. This will be for 9.x only, obviously.

RW>There are two known problem areas here:
RW>
RW>(1) The cross-product issue with virtual network stacks
RW>(2) The cross-product issue with network interfaces for
RW>    per-interface stats
RW>
RW>I propose to ignore (1) for now by simply having only vnet0 use
RW>per-CPU stats, and other vnets use single-instance per-vnet stats.
RW>We can solve the larger problem there at a future date.

This sounds reasonable if we wrap all the statistics stuff into macros
and/or functions.
RW>I don't have a good proposal for (2) -- the answer may be using
RW>DPCPU memory, but that will require us to support more dynamic
RW>allocation ranges, which may add cost. (Right now, the DPCPU
RW>allocator relies on relatively static allocations over time.) This
RW>means that, for now, we may also ignore that issue and leave
RW>interface counters as-is. This is probably a good idea because we
RW>also need to deal with multi-queue interfaces better, and perhaps
RW>the stats should be per-queue rather than per-ifnet, which may
RW>itself help address the cache line issue.

Doesn't this help for output only? For the input statistics there will
still be per-ifnet statistics.

An interesting question from the SNMP point of view is what happens to
the statistics if you move around interfaces between vimages. In any
case it would be good if we could abstract from all the complications
while going kernel->userland.

harti

Date: Sun, 20 Dec 2009 13:46:54 +0000
From: "Robert N. M.
Watson" <rwatson@FreeBSD.org>
To: Harti Brandt
Subject: Re: network statistics in SMP

On 20 Dec 2009, at 13:19, Harti Brandt wrote:

> RW>Frequent writes to the same cache line across multiple cores are
> RW>remarkably expensive, as they trigger the cache coherency protocol
> RW>(mileage may vary). For example, a single non-atomically
> RW>incremented counter cut performance of gettimeofday() to 1/6th
> RW>performance on an 8-core system when parallel system calls were
> RW>made across all cores. On many current systems, the cost of an
> RW>"atomic" operation is now fairly reasonable as long as the cache
> RW>line is held exclusively by the current CPU. However, if we can
> RW>avoid them that has value, as we update quite a few global stats on
> RW>the way through the network stack.
>
> Hmm. I'm not sure that gettimeofday() is comparable to forwarding an
> IP packet. I would expect that a single increment is a good percentage
> of the entire processing (in terms of number of operations) for
> gettimeofday(), while in IP forwarding this is somewhere in the noise
> floor. In the simplest case the packet is acted upon by the receiving
> driver, the IP input function, the IP output function and the sending
> driver. Not talking about IP filters, firewalls, tunnels, dummynet and
> what else. The relative cost of the increment should be much less.
> But, I may be wrong of course.

If processing is occurring on multiple CPUs -- for example, you are
receiving UDP from two ithreads -- then 4-8 cache lines being contended
due to stats is a lot. Our goal should be (for 9.0) to avoid having any
contended cache lines in the common case when processing independent
streams on different CPUs.

> I would really like to sort that out before any kind of ABI freeze
> happens. Ideally all the statistics would be accessible per sysctl(),
> have a version number and have all or most of the required statistics
> with a simple way to add new fields without breaking anything. Also
> the field sizes (64 vs. 32 bit) should be correct on the kernel - user
> interface.
>
> My current feeling after reading this thread is that the low-level
> kernel side stuff is probably out of what I could do with the time I
> have and would sidetrack me too far from the work on bsnmp. What I
> would like to do is to fix the kernel/user interface and let the
> people that know how to do it handle the low-level side.
>
> I would really not like to have to deal with a changing user/kernel
> interface in current if we go in several steps with the kernel stuff.

I think we should treat the statistics gathering and statistics
reporting interfaces as entirely separable problems. Statistics are
updated far more frequently than they are queried, so making the query
process a bit more expensive (reformatting from an efficient 'update'
format to an application-friendly 'report' format) should be fine.
One question to think about is whether or not simply cross-CPU
summaries are sufficient, or whether we actually also want to be able
to directly monitor per-CPU statistics at the IP layer. The former
would maintain the status quo, making per-CPU behavior purely part of
the 'update' step; the latter would change the 'report' format as well.
I've been focused primarily on 'update', but at least for my work it
would be quite helpful to have per-CPU stats in the 'report' format as
well.

> I will try to come up with a patch for the kernel/user interface in
> the mean time. This will be for 9.x only, obviously.

Sounds good -- and the kernel stats capture can "grow into" the full
report format as it matures.

> Doesn't this help for output only? For the input statistics there
> still will be per-ifnet statistics.

Most ifnet-layer stats should really be per-queue, both for input and
output, which may help.

> An interesting question from the SNMP point of view is, what happens
> to the statistics if you move around interfaces between vimages. In
> any case it would be good if we could abstract from all the
> complications while going kernel->userland.

At least for now, the interface is effectively recreated when it moves
vimage, and only the current vimage is able to monitor it. That could
be considered a bug, but it might also be a simplifying assumption or
even a feature. Likewise, it's worth remembering that the ifnet index
space is per-vimage.
Robert

Date: Sun, 20 Dec 2009 15:18:11 +0100 (CET)
From: Harti Brandt <Hartmut.Brandt@dlr.de>
To: "Robert N. M. Watson" <rwatson@FreeBSD.org>
Subject: Re: network statistics in SMP

On Sun, 20 Dec 2009, Robert N. M.
Watson wrote:

RNMW>On 20 Dec 2009, at 13:19, Harti Brandt wrote:
RNMW>
RNMW>> RW>Frequent writes to the same cache line across multiple cores
RNMW>> RW>are remarkably expensive, as they trigger the cache coherency
RNMW>> RW>protocol (mileage may vary). For example, a single
RNMW>> RW>non-atomically incremented counter cut performance of
RNMW>> RW>gettimeofday() to 1/6th performance on an 8-core system when
RNMW>> RW>parallel system calls were made across all cores. On many
RNMW>> RW>current systems, the cost of an "atomic" operation is now
RNMW>> RW>fairly reasonable as long as the cache line is held
RNMW>> RW>exclusively by the current CPU. However, if we can avoid them
RNMW>> RW>that has value, as we update quite a few global stats on the
RNMW>> RW>way through the network stack.
RNMW>>
RNMW>> Hmm. I'm not sure that gettimeofday() is comparable to forwarding
RNMW>> an IP packet. I would expect that a single increment is a good
RNMW>> percentage of the entire processing (in terms of number of
RNMW>> operations) for gettimeofday(), while in IP forwarding this is
RNMW>> somewhere in the noise floor. In the simplest case the packet is
RNMW>> acted upon by the receiving driver, the IP input function, the IP
RNMW>> output function and the sending driver. Not talking about IP
RNMW>> filters, firewalls, tunnels, dummynet and what else. The relative
RNMW>> cost of the increment should be much less. But, I may be wrong of
RNMW>> course.
RNMW>
RNMW>If processing is occurring on multiple CPUs -- for example, you are
RNMW>receiving UDP from two ithreads -- then 4-8 cache lines being
RNMW>contended due to stats is a lot. Our goal should be (for 9.0) to
RNMW>avoid having any contended cache lines in the common case when
RNMW>processing independent streams on different CPUs.
RNMW>
RNMW>> I would really like to sort that out before any kind of ABI
RNMW>> freeze happens.
RNMW>> Ideally all the statistics would be accessible per sysctl(), have
RNMW>> a version number and have all or most of the required statistics
RNMW>> with a simple way to add new fields without breaking anything.
RNMW>> Also the field sizes (64 vs. 32 bit) should be correct on the
RNMW>> kernel - user interface.
RNMW>>
RNMW>> My current feeling after reading this thread is that the
RNMW>> low-level kernel side stuff is probably out of what I could do
RNMW>> with the time I have and would sidetrack me too far from the work
RNMW>> on bsnmp. What I would like to do is to fix the kernel/user
RNMW>> interface and let the people that know how to do it handle the
RNMW>> low-level side.
RNMW>>
RNMW>> I would really not like to have to deal with a changing
RNMW>> user/kernel interface in current if we go in several steps with
RNMW>> the kernel stuff.
RNMW>
RNMW>I think we should treat the statistics gathering and statistics
RNMW>reporting interfaces as entirely separable problems. Statistics are
RNMW>updated far more frequently than they are queried, so making the
RNMW>query process a bit more expensive (reformatting from an efficient
RNMW>'update' format to an application-friendly 'report' format) should
RNMW>be fine.
RNMW>
RNMW>One question to think about is whether or not simply cross-CPU
RNMW>summaries are sufficient, or whether we actually also want to be
RNMW>able to directly monitor per-CPU statistics at the IP layer. The
RNMW>former would maintain the status quo making per-CPU behavior purely
RNMW>part of the 'update' step; the latter would change the 'report'
RNMW>format as well. I've been focused primarily on 'update', but at
RNMW>least for my work it would be quite helpful to have per-CPU stats
RNMW>in the 'report' format as well.

No problem. I can even add that in a private SNMP MIB if it seems
useful.

RNMW>
RNMW>> I will try to come up with a patch for the kernel/user interface
RNMW>> in the mean time. This will be for 9.x only, obviously.
RNMW>
RNMW>Sounds good -- and the kernel stats capture can "grow into" the
RNMW>full report format as it matures.
RNMW>
RNMW>> Doesn't this help for output only? For the input statistics there
RNMW>> still will be per-ifnet statistics.
RNMW>
RNMW>Most ifnet-layer stats should really be per-queue, both for input
RNMW>and output, which may help.

As far as I can see, currently the driver just calls if_input, which is
the interface-dependent input function. There seems to be no
driver-independent abstraction of input queues. (The hatm driver I
wrote several years ago has two input queues in hardware corresponding
to 4 (or 8?) interrupt queues, but somewhere in the driver you put all
of this through the single if_input hook.) Or is there something I'm
missing?

RNMW>> An interesting question from the SNMP point of view is, what
RNMW>> happens to the statistics if you move around interfaces between
RNMW>> vimages. In any case it would be good if we could abstract from
RNMW>> all the complications while going kernel->userland.
RNMW>
RNMW>At least for now, the interface is effectively recreated when it
RNMW>moves vimage, and only the current vimage is able to monitor it.
RNMW>That could be considered a bug but it might also be a simplifying
RNMW>assumption or even a feature. Likewise, it's worth remembering that
RNMW>the ifnet index space is per-vimage.

I was already thinking about how to fit the vimage stuff into the SNMP
model. The simplest way is to run one SNMP daemon per vimage. Next comes
having one daemon that has one context per vimage. Bsnmpd does its own
mapping of system ifnet indexes to SNMP interface indexes, because the
allocation of system ifnet indexes does not fit the RFC requirements.
This means it will detect when an interface is moved away from a vimage
and comes back later. If the kernel statistics are stable over these
movements, there is no need to declare a counter discontinuity via SNMP.
On the other hand, these operations are probably seldom enough ...

harti

Date: Sun, 20 Dec 2009 14:35:26 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
To: Harti Brandt
Subject: Re: network statistics in SMP

On Sun, 20 Dec 2009, Harti Brandt wrote:

> RNMW>> I will try to come up with a patch for the kernel/user
> RNMW>> interface in the mean time. This will be for 9.x only,
> RNMW>> obviously.
> RNMW>
> RNMW>Sounds good -- and the kernel stats capture can "grow into" the
> RNMW>full report format as it matures.
> RNMW>
> RNMW>> Doesn't this help for output only? For the input statistics
> RNMW>> there still will be per-ifnet statistics.
> RNMW>
> RNMW>Most ifnet-layer stats should really be per-queue, both for
> RNMW>input and output, which may help.
>
> As far as I can see currently the driver just calls if_input which is
> the interface-dependent input function. There seems to be no
> driver-independent abstraction of input queues. (The hatm driver I
> wrote several years ago has two input queues in hardware corresponding
> to 4 (or 8?) interrupt queues, but somewhere in the driver you put all
> of this through the single if_input hook). Or is there something I'm
> missing?

You're not missing anything; it's the code for what I describe that's
missing :-). Adding an ifnet-layer abstraction for input and output
queues (if only to hold stats in a cross-driver way) is something
that's come up at the last devsummit or two, and something we need to
sort out for 9.0.

> I was already thinking about how to fit the vimage stuff into the SNMP
> model. The simplest way is to run one SNMP daemon per vimage. Next
> comes having one daemon that has one context per vimage. Bsnmpd does
> its own mapping of system ifnet indexes to SNMP interface indexes,
> because the allocation of system ifnet indexes does not fit the RFC
> requirements. This means it will detect when an interface is moved
> away from a vimage and comes back later. If the kernel statistics are
> stable over these movements, there is no need to declare a counter
> discontinuity via SNMP. On the other hand these operations are
> probably seldom enough ...

For a system with thousands of virtual network stacks, it would be nice
to avoid requiring one process per vimage, but given the way we
currently link processes to vimages, arranging that is currently
awkward.
We're also well past the point where a 16-bit integer can describe what
is required out of our interface system; perhaps we just bite the
bullet and roll to 32-bit ifnet indexes, but also give each interface a
uuid (or the like) that is stable over vimage moves.

Robert N M Watson
Computer Laboratory
University of Cambridge

Date: Mon, 21 Dec 2009 07:17:56 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Harti Brandt
Subject: Re: network statistics in SMP
List-Help: List-Subscribe: , X-List-Received-Date: Sun, 20 Dec 2009 20:18:01 -0000

On Sat, 19 Dec 2009, Harti Brandt wrote:

> On Sun, 20 Dec 2009, Bruce Evans wrote:
> BE>...
> BE>I don't see why using atomic or locks for just the 64 bit counters is good.
> BE>We will probably end up with too many 64-bit counters, especially if they
> BE>don't cost much when not read.
>
> On a 32-bit arch when reading a 32-bit value on one CPU while the other CPU is
> modifying it, the read will probably always be correct, given the variable is
> correctly aligned.

We assume this. Otherwise the per-CPU counter optimization wouldn't work so well.

> On a 64-bit arch when reading a 64-bit value on one CPU
> while the other one is adding to it, do I always get the correct value? I'm
> not sure about this, which is why I put atomic_*() there, assuming that they
> will make this correct.

You have to use the PCPU_INC()/PCPU_GET() interface and not worry about this. It must supply any necessary atomicness on an arch-specific basis.

> The idea is (for 32-bit platforms):
>
> struct pcpu_stats {
>     uint32_t in_bytes;
>     uint32_t in_packets;
> };
>
> struct pcpu_hc_stats {
>     uint64_t hc_in_bytes;
>     uint64_t hc_in_packets;
> };
>
> /* driver; IP stack; ... */
> ...
> pcpu_stats->in_bytes += bytes;
> pcpu_stats->in_packets++;
> ...
>
> /* per CPU kernel thread for 32-bit arch */
> lock(pcpu_hc_stats);
> ...
> val = pcpu_stats->in_bytes;
> if ((uint32_t)pcpu_hc_stats->hc_in_bytes > val)
>     pcpu_hc_stats->hc_in_bytes += 0x100000000;
> pcpu_hc_stats->hc_in_bytes = (pcpu_hc_stats->hc_in_bytes &
>     0xffffffff00000000ULL) | val;

Why not just `pcpu_hc_stats->hc_in_bytes |= val' for the second statement?

> ...
> unlock(pcpu_hc_stats);

> BE>> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
> BE>> machine and do routing at link speed, though. This might overflow the IP
> BE>> input/output byte counter (which we don't have yet) too fast.
> BE>
> BE>Not with a mere 10GB/S.
That's ~1GB/S so it takes 4 seconds to overflow > BE>a 32-bit byte counter. A bit counter would take a while to overflow too. > BE>Are there any faster incrementors? TSCs also take O(1) seconds to overflow, > BE>and timecounter logic depends on no timecounter overflowing much faster > BE>than that. > > If you have 4 10GBit/s adapters each operating full-duplex at link speed you > wrap in under 0.5 seconds, maybe even faster if you have some kind of tunnels > where each packet counts several times. But I suppose this will be not so easy > with IA32 to implement :-) I was only thinking of per-interface counters, but we can handle the 64-bit fixup for thousands of these if we can handle thousands of fixup threads each running thousands of times per second. Actually, thousands of adaptors would need thousands of CPUS to handle and each per-CPU counter would be limited by what 1 CPU could handle so we're back nearer to the original 4 seconds than to 4/4000. Bruce From owner-freebsd-arch@FreeBSD.ORG Mon Dec 21 11:06:51 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3753910656A4 for ; Mon, 21 Dec 2009 11:06:51 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 0C7BD8FC1A for ; Mon, 21 Dec 2009 11:06:51 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.3/8.14.3) with ESMTP id nBLB6oWA004017 for ; Mon, 21 Dec 2009 11:06:50 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.3/8.14.3/Submit) id nBLB6olq004015 for freebsd-arch@FreeBSD.org; Mon, 21 Dec 2009 11:06:50 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 21 Dec 2009 11:06:50 GMT Message-Id: <200912211106.nBLB6olq004015@freefall.freebsd.org> 
X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-arch@FreeBSD.org Cc: Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 21 Dec 2009 11:06:51 -0000 Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total. From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 10:58:50 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8F1251065696; Fri, 25 Dec 2009 10:58:50 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-fx0-f227.google.com (mail-fx0-f227.google.com [209.85.220.227]) by mx1.freebsd.org (Postfix) with ESMTP id EE61D8FC19; Fri, 25 Dec 2009 10:58:49 +0000 (UTC) Received: by fxm27 with SMTP id 27so8778359fxm.3 for ; Fri, 25 Dec 2009 02:58:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:sender:message-id:date:from :user-agent:mime-version:to:subject:x-enigmail-version:content-type :content-transfer-encoding; bh=dOGAq0fldWTFVfl3MvbKRfDkuvsbY7wkE6GaZL8ERHg=; b=dwcbHs0bSEmbDz+AsLd/7eq7p7R52mMaMOb6EBkh3RtYthc+vC5HX6XUk1rkv5RhpR f6AioFXme82Mr/Pgy6tMkSL91YTtb+xagYB13AdzswxQZjxdzH8ZRZAkNmRnozEoY2vA 
G7vt+wSXnSJNJ03sX9mhshUXEj2WPGLYU9CnA= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=sender:message-id:date:from:user-agent:mime-version:to:subject :x-enigmail-version:content-type:content-transfer-encoding; b=omM1nlgF+hviwCCxtgDgzXupprgNaE+hw+AGdFQC52Ezh82cmYYgnU+UNNvaDR2laB cqK+pYjJbm1YLpWq0i5Mm9HcSnIRwpgMlMS45ByFAVTMprgs/xx+O2UOg28QxXFu7fQm 2rvEfJMFyYleL0umo6Udyw4nFJAIx69SQdhoE= Received: by 10.223.95.72 with SMTP id c8mr6014885fan.73.1261738728801; Fri, 25 Dec 2009 02:58:48 -0800 (PST) Received: from mavbook.mavhome.dp.ua (pc.mavhome.dp.ua [212.86.226.226]) by mx.google.com with ESMTPS id 16sm3145902fxm.8.2009.12.25.02.58.47 (version=SSLv3 cipher=RC4-MD5); Fri, 25 Dec 2009 02:58:48 -0800 (PST) Sender: Alexander Motin Message-ID: <4B349ABF.2070800@FreeBSD.org> Date: Fri, 25 Dec 2009 12:58:07 +0200 From: Alexander Motin User-Agent: Thunderbird 2.0.0.23 (X11/20090901) MIME-Version: 1.0 To: freebsd-arch@freebsd.org, FreeBSD-Current X-Enigmail-Version: 0.96.0 Content-Type: text/plain; charset=KOI8-R Content-Transfer-Encoding: 7bit Cc: Subject: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 10:58:50 -0000

Hi.

Recently WD released the first series of ATA disks with an increased physical sector size. It makes writes that do not match 4K blocks inefficient there. So I propose to get back to the question of optimal FS block alignment. This topic is also important for most RAIDs of a striped nature, such as RAID0/3/5/..., and for flash drives with a simple controller (such as MMC/SD cards). As I have none of those WD disks yet, I have made a series of tests with RAID0, made by geom_stripe, to check the general idea.
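The 4K-physical-sector problem described above reduces to a simple alignment check; a small sketch (illustrative helpers, not a FreeBSD API, assuming 512-byte logical sectors on a drive with 4 KB physical sectors):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Does a partition starting at a given 512-byte LBA line up with 4 KB
 * physical sectors?  The classic fdisk default of starting a slice at
 * sector 63 fails this; rounding up to a multiple of 8 sectors fixes it.
 */
static int
is_4k_aligned(uint64_t lba)
{
	return (lba * 512) % 4096 == 0;	/* equivalently: lba % 8 == 0 */
}

static uint64_t
round_up_4k(uint64_t lba)
{
	return (lba + 7) & ~(uint64_t)7;	/* next multiple of 8 sectors */
}
```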
I've tested the most descriptive case: 2-disk RAID0 with 16K stripe, 16K FS block and many 16K random I/Os (reads in this test, to avoid FS locking). I had the same load pattern, but with writes, on my busy disk-bound MySQL servers, so it is quite realistic.

Test one, default partitioning.

%gstripe label -s 16384 data /dev/ada1 /dev/ada2
%fdisk -I /dev/stripe/data
%disklabel -w /dev/stripe/datas1
%disklabel /dev/stripe/datas1
# /dev/stripe/datas1:
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a: 1250274611       16    unused        0     0
  c: 1250274627        0    unused        0     0  # "raw" part, don't edit
%diskinfo -v /dev/stripe/datas1a
/dev/stripe/datas1a
        512             # sectorsize
        640140600832    # mediasize in bytes (596G)
        1250274611      # mediasize in sectors
        16384           # stripesize
        7680            # stripeoffset
        77825           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.

As you can see, fdisk aligned the partition to the "track length" of 63 sectors, and disklabel added an offset of 16 sectors. As a result, the file system will start at a quite odd place in the RAID stripe. I created a UFS file system, pre-wrote a 4GB file and ran tests (raidtest was patched to generate only 16K requests):

%raidtest test -d /mnt/qqq -n 1
Requests per second: 112
%raidtest test -d /mnt/qqq -n 64
Requests per second: 314

Before each test the FS was unmounted to flush caches.

Test two, FS manually aligned with disklabel.

%disklabel /dev/stripe/datas1
# /dev/stripe/datas1:
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a: 1250274578       33    unused        0     0
  c: 1250274627        0    unused        0     0  # "raw" part, don't edit
%diskinfo -v /dev/stripe/datas1a
/dev/stripe/datas1a
        512             # sectorsize
        640140583936    # mediasize in bytes (596G)
        1250274578      # mediasize in sectors
        16384           # stripesize
        0               # stripeoffset
        77825           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.

File system aligned with stripe.
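The stripeoffset values diskinfo reports for the two layouts follow from simple arithmetic; a sketch (hypothetical helper, assuming 512-byte sectors and the 16K stripe used here, with the partition starting at slice start plus label offset, both in sectors):

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of the partition start within its RAID stripe. */
static uint64_t
stripe_offset(uint64_t slice_start, uint64_t label_off, uint64_t stripesize)
{
	return ((slice_start + label_off) * 512) % stripesize;
}
```

With the fdisk slice at sector 63, a label offset of 16 lands 7680 bytes into a stripe, while an offset of 33 lands exactly on a stripe boundary, matching the two diskinfo outputs.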
%raidtest test -d /mnt/qqq -n 1
Requests per second: 133
%raidtest test -d /mnt/qqq -n 64
Requests per second: 594

The difference is quite significant. Unaligned RAID0 access causes two disks to be involved in its handling, while an aligned one leaves one of the disks free for another request, doubling performance. As we now have a mechanism for reporting stripe size and offset for any partition to user level, it should be easy to make disk partitioning and file system creation tools use it automatically. Stripe size/offset reporting is now supported by the ada and mmcsd disk drivers and most GEOM modules. It would be nice to fetch that info from hardware RAIDs also, where possible. -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 11:27:43 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0EA2A10656A4; Fri, 25 Dec 2009 11:27:43 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-fx0-f227.google.com (mail-fx0-f227.google.com [209.85.220.227]) by mx1.freebsd.org (Postfix) with ESMTP id 6AA618FC1B; Fri, 25 Dec 2009 11:27:42 +0000 (UTC) Received: by fxm27 with SMTP id 27so8788129fxm.3 for ; Fri, 25 Dec 2009 03:27:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:sender:message-id:date:from :user-agent:mime-version:to:cc:subject:references:in-reply-to :x-enigmail-version:content-type:content-transfer-encoding; bh=wkz+a8gEXMEDZkF74xrcCKTqK2ep/dI9sOSVAH0mAs8=; b=JWU99z8Z6Q+WNtUvChyGlbjrkNoEHbLPbgiPjKFUGuMOllANhI7SoKq18kbX6gG6pz b8dXdzPU0PZtkZf4r3DAEJAR0hYwdXeuNLsmYhELaAPV2/K9OgdpwTGMiLhBFZ7gjXU8 i66kRYzD2MA/AeDHQjGwIL/Ri30mL+IErhGm0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:x-enigmail-version:content-type :content-transfer-encoding;
b=Rd28odWfAhYh6BHS5lFK5sW841UQcOzXPCu6a255Vsvq9ECc7dZsOvJUKNGsd+HgEL M9X5NEuCs+kCkyOERr9GwppzMvY8+LhjwfXelzis/eJt+FHSfexxnJ90WrM2t5ByrCvL Axujpk5+JZPjLOBs7HwTAm2+ISeq/ewsxJn2w= Received: by 10.223.14.150 with SMTP id g22mr10842718faa.14.1261740461308; Fri, 25 Dec 2009 03:27:41 -0800 (PST) Received: from mavbook.mavhome.dp.ua (pc.mavhome.dp.ua [212.86.226.226]) by mx.google.com with ESMTPS id 14sm3159895fxm.7.2009.12.25.03.27.39 (version=SSLv3 cipher=RC4-MD5); Fri, 25 Dec 2009 03:27:40 -0800 (PST) Sender: Alexander Motin Message-ID: <4B34A183.7000909@FreeBSD.org> Date: Fri, 25 Dec 2009 13:26:59 +0200 From: Alexander Motin User-Agent: Thunderbird 2.0.0.23 (X11/20090901) MIME-Version: 1.0 To: Thomas Backman References: <4B349ABF.2070800@FreeBSD.org> <469FFFC8-514B-41B9-AEEC-E4B7AB6CB886@exscape.org> In-Reply-To: <469FFFC8-514B-41B9-AEEC-E4B7AB6CB886@exscape.org> X-Enigmail-Version: 0.96.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: FreeBSD-Current , freebsd-arch@freebsd.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 11:27:43 -0000 Thomas Backman wrote: > On Dec 25, 2009, at 11:58 AM, Alexander Motin wrote: >> Recently WD released first series of ATA disks with increased physical >> sector size. It makes writes not matching with 4K blocks inefficient >> there. > They don't expose this to the OS, though (not by default, anyway), but chop it up into 8 512-byte sectors for compatibility reasons. > Just thought I'd point that out - I'm not even sure if you can get them to *not* do the compatibility thing and expose 4k-sized sectors. The latest ATA-8 specification allows a drive to report both logical (512B) and physical (4KB) sector sizes. The ada driver is able to fetch and report that info to GEOM.
If these drives are not reporting it yet (are you really sure?), it is only a question of their firmware. -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 11:38:09 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A8EAE106566B for ; Fri, 25 Dec 2009 11:38:09 +0000 (UTC) (envelope-from serenity@exscape.org) Received: from ch-smtp01.sth.basefarm.net (ch-smtp01.sth.basefarm.net [80.76.149.212]) by mx1.freebsd.org (Postfix) with ESMTP id 653748FC0A for ; Fri, 25 Dec 2009 11:38:09 +0000 (UTC) Received: from c83-253-248-99.bredband.comhem.se ([83.253.248.99]:58571 helo=mx.exscape.org) by ch-smtp01.sth.basefarm.net with esmtp (Exim 4.68) (envelope-from ) id 1NO8FN-0007Ak-46; Fri, 25 Dec 2009 12:22:18 +0100 Received: from [192.168.1.5] (macbookpro [192.168.1.5]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by mx.exscape.org (Postfix) with ESMTPSA id C0FFE1F57F8; Fri, 25 Dec 2009 12:22:14 +0100 (CET) Mime-Version: 1.0 (Apple Message framework v1077) Content-Type: text/plain; charset=us-ascii From: Thomas Backman In-Reply-To: <4B349ABF.2070800@FreeBSD.org> Date: Fri, 25 Dec 2009 12:22:08 +0100 Content-Transfer-Encoding: quoted-printable Message-Id: <469FFFC8-514B-41B9-AEEC-E4B7AB6CB886@exscape.org> References: <4B349ABF.2070800@FreeBSD.org> To: Alexander Motin X-Mailer: Apple Mail (2.1077) X-Originating-IP: 83.253.248.99 X-Scan-Result: No virus found in message 1NO8FN-0007Ak-46.
X-Scan-Signature: ch-smtp01.sth.basefarm.net 1NO8FN-0007Ak-46 71b17cad7b06e00831995badaaabf3d8 Cc: FreeBSD-Current , freebsd-arch@freebsd.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 11:38:09 -0000 On Dec 25, 2009, at 11:58 AM, Alexander Motin wrote: > Hi. > > Recently WD released first series of ATA disks with increased physical > sector size. It makes writes not matching with 4K blocks inefficient > there. They don't expose this to the OS, though (not by default, anyway), but chop it up into 8 512-byte sectors for compatibility reasons. Just thought I'd point that out - I'm not even sure if you can get them to *not* do the compatibility thing and expose 4k-sized sectors. I'm sure your work is important for other setups, though, as proved. :) Regards, Thomas From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 11:39:26 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2054D106566B; Fri, 25 Dec 2009 11:39:26 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id D8DA08FC16; Fri, 25 Dec 2009 11:39:25 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 9983C7E9A3; Fri, 25 Dec 2009 11:39:24 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.3/8.14.3) with ESMTP id nBPBe2S0027358; Fri, 25 Dec 2009 11:40:02 GMT (envelope-from phk@critter.freebsd.dk) To: Alexander Motin From: "Poul-Henning Kamp" In-Reply-To: Your message of "Fri, 25 Dec 2009 12:58:07 +0200."
<4B349ABF.2070800@FreeBSD.org> Date: Fri, 25 Dec 2009 11:40:02 +0000 Message-ID: <27357.1261741202@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: FreeBSD-Current , freebsd-arch@FreeBSD.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 11:39:26 -0000 In message <4B349ABF.2070800@FreeBSD.org>, Alexander Motin writes: >The difference is quite significant. Unaligned RAID0 access causes two >disks involved in its handling, while aligned one leaves one of disks >free for another request, doubling performance. You will find RAID5 writes to be an even better test: Optimal filesystem block-size is a RAID5 stripe width, and if you do not get the offset right you instantly lose at least 50% of your write bandwidth. My practical experience says often more like 75% is lost. >As we have now mechanism for reporting stripe size and offset for any >partition to user-level, it should be easy to make disk partitioning and >file system creation tools to use it automatically. For MBRs there are compat requirement worries: slices must be track-aligned for strict compat with (old?) funky BIOSes. BSDlabel has no such fine details, so that is probably the best place to align to stripe offsets. Be aware that stripe-widths may be ridiculously large: you should not use them as blocksizes, just make sure that blocksizes divide cleanly into them. >Stripe size/offset reporting now supported by ada and mmcsd disk drivers >and most of GEOM modules. It would be nice to fetch that info from >hardware RAIDs also, where possible. Indeed. Good work, keep at it!
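phk's two rules above (align the label to the stripe, and make sure the FS block size divides cleanly into the stripe width) can be sketched as small arithmetic helpers. This is an illustrative sketch, not any partitioning tool's actual API:

```c
#include <assert.h>
#include <stdint.h>

/* Round a partition start sector up so it lands on a stripe boundary. */
static uint64_t
align_to_stripe(uint64_t start_lba, uint64_t stripesize, uint64_t sectorsize)
{
	uint64_t spb = stripesize / sectorsize;	/* sectors per stripe */

	return (start_lba + spb - 1) / spb * spb;
}

/* A block size is usable only if it divides the stripe width cleanly. */
static int
blocksize_ok(uint64_t stripesize, uint64_t fs_bsize)
{
	return fs_bsize != 0 && stripesize % fs_bsize == 0;
}
```

With the 16K stripe and 512-byte sectors from this thread, the classic start sector 63 rounds up to 64, and a 16K (or smaller power-of-two) FS block passes the divisibility check while a 32K block does not.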
Poul-Henning -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 11:44:44 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E40C31065670; Fri, 25 Dec 2009 11:44:44 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id A80688FC19; Fri, 25 Dec 2009 11:44:44 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 9945F7E995; Fri, 25 Dec 2009 11:44:43 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.3/8.14.3) with ESMTP id nBPBjLS0027417; Fri, 25 Dec 2009 11:45:21 GMT (envelope-from phk@critter.freebsd.dk) To: Thomas Backman From: "Poul-Henning Kamp" In-Reply-To: Your message of "Fri, 25 Dec 2009 12:22:08 +0100." <469FFFC8-514B-41B9-AEEC-E4B7AB6CB886@exscape.org> Date: Fri, 25 Dec 2009 11:45:20 +0000 Message-ID: <27416.1261741520@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Alexander Motin , FreeBSD-Current , freebsd-arch@freebsd.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 11:44:45 -0000 In message <469FFFC8-514B-41B9-AEEC-E4B7AB6CB886@exscape.org>, Thomas Backman writes: >On Dec 25, 2009, at 11:58 AM, Alexander Motin wrote: >They don't expose this to the OS, though (not by default, anyway), but >chop it up into 8 512-byte sectors for compatibility reasons.
>Just thought I'd point that out - I'm not even sure if you can get them >to *not* do the compatibility thing and expose 4k-sized sectors. While that is true, it is worth noting that the same Windows-compat idiocy is what doomed the world to RAID5 instead of RAID3. The recent article in Queue Magazine shows how deeply ingrained the 512-byte mindset has become: the author goes to great lengths to praise RAID6 and higher for their ability to have multiple-bit ECC without ever recognizing (author not knowing?) that RAID3 has had this ability from day one. UFS runs incredibly well on 4k blocks, and we should exploit that to the fullest extent, and if we really want to jerk chains, we should push RAID3 in 4+2 and 8+3 configs aggressively; it performs great, both under read and write, and Windows cannot do it. Poul-Henning PS: Merry X-mas everybody! -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 14:03:08 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B439010656C1; Fri, 25 Dec 2009 14:03:08 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-fx0-f227.google.com (mail-fx0-f227.google.com [209.85.220.227]) by mx1.freebsd.org (Postfix) with ESMTP id 1E7898FC28; Fri, 25 Dec 2009 14:03:07 +0000 (UTC) Received: by fxm27 with SMTP id 27so8840276fxm.3 for ; Fri, 25 Dec 2009 06:03:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:sender:message-id:date:from :user-agent:mime-version:to:cc:subject:references:in-reply-to :x-enigmail-version:content-type:content-transfer-encoding; bh=R2BJlZNHt3AdKWRU5CjVdwHNBknRyMYX3YkeFXpqZFg=; b=PVkmPU+1PctRe78u4jal5YkdiqjvgBzbQqt7gNRmhqrP1vJvWEzs8cjFhyuRZnvIpE jzVXYfdZo44LiCwmkKetWPICSZUw9U7/ZJ0Jd2UgW+chClU+6LtFS624z4Wv+l+NOU1v pmTaohyXCHS60CEvbEWgMY6rZvhdYX31rlDnU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:x-enigmail-version:content-type :content-transfer-encoding; b=EgdinLOY749Ydsq4tWqSZI7SixhvO/l5hOp9pC/wD8c6xdE1sg9fv5Oq2CwkHm8E34 9y/GAj1W0lAuSHVisZY6uDPfitqieBido6kokHAQi3z3utFLC9g3NorozDR8E4aHUIyR A0Gm5Zw4iEhmvaztvw36Ease7ivY2rmDpZOxM= Received: by 10.223.62.11 with SMTP id v11mr4553173fah.60.1261749787066; Fri, 25 Dec 2009 06:03:07 -0800 (PST) Received: from mavbook.mavhome.dp.ua (pc.mavhome.dp.ua [212.86.226.226]) by mx.google.com with ESMTPS id 13sm3185428fxm.5.2009.12.25.06.03.05 (version=SSLv3 cipher=RC4-MD5); Fri, 25 Dec 2009 06:03:06 -0800 (PST) Sender: Alexander Motin Message-ID: <4B34C619.7070505@FreeBSD.org> Date: Fri, 25 Dec 2009 16:03:05 +0200 From: Alexander Motin User-Agent: Thunderbird 2.0.0.23 (X11/20091212) MIME-Version: 1.0 To: 
Poul-Henning Kamp References: <27357.1261741202@critter.freebsd.dk> In-Reply-To: <27357.1261741202@critter.freebsd.dk> X-Enigmail-Version: 0.96.0 Content-Type: text/plain; charset=KOI8-R Content-Transfer-Encoding: 7bit Cc: FreeBSD-Current , freebsd-arch@FreeBSD.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 14:03:08 -0000

Poul-Henning Kamp wrote:
> In message <4B349ABF.2070800@FreeBSD.org>, Alexander Motin writes:
>> The difference is quite significant. Unaligned RAID0 access causes two
>> disks involved in its handling, while aligned one leaves one of disks
>> free for another request, doubling performance.
>
> You will find RAID5 writes to be an even better test: Optimal filesystem
> block-size is a RAID5 stripe width, and if you do not get the offset
> right you instantly lose at least 50% of your write bandwidth. My
> practical experience says often more like 75% is lost.

Sure, I just had no trusted RAID5 nearby to benchmark. Actually, with RAID5 the situation is even more complicated, as there are actually two optimal transaction sizes:

- First is the stripe size - the amount of data written sequentially to one disk. If you are not aligned with it, it gives the same results as I have just shown.
- Second is the row size - stripe size * number of data disks. You may freely read less information than a full row, but a short write causes the RAID to handle a read-modify-write scenario. If you have 3 disks and no battery-backed cache - you will definitely lose. But if there are 15 disks and a good cache, I believe the ability to execute multiple requests independently in parallel will compensate for the penalty.
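The two RAID5 sizes described above can be put into a small sketch (illustrative helpers, not a driver API): the row is the stripe size times the number of data disks, and a write avoids the read-modify-write path only when it covers whole rows exactly.

```c
#include <assert.h>
#include <stdint.h>

/* RAID5: one disk's worth of each row is parity, the rest holds data. */
static uint64_t
row_size(uint64_t stripesize, unsigned ndisks)
{
	return stripesize * (ndisks - 1);
}

/* A write triggers read-modify-write unless it covers whole rows. */
static int
needs_rmw(uint64_t off, uint64_t len, uint64_t row)
{
	return (off % row) != 0 || (len % row) != 0;
}
```

For a 3-disk array with a 16K stripe the row is 32K, so a lone 16K write always pays the RMW penalty, while an aligned 32K write does not.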
Also, with 15 disks it would be impractical to increase the FS block size, as in that case the OS will have to do that read-modify-write instead of the controller, and you may lose even more. With RAID5 I think the best practice would be to align the FS to the stripe size and instruct it to write data in maximal bursts, in the best case a full row at a time. -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 17:30:59 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0381210656A3; Fri, 25 Dec 2009 17:30:59 +0000 (UTC) (envelope-from gpalmer@freebsd.org) Received: from noop.in-addr.com (mail.in-addr.com [IPv6:2001:470:8:162::1]) by mx1.freebsd.org (Postfix) with ESMTP id CF14B8FC1D; Fri, 25 Dec 2009 17:30:58 +0000 (UTC) Received: from gjp by noop.in-addr.com with local (Exim 4.54 (FreeBSD)) id 1NOE09-000LYr-79; Fri, 25 Dec 2009 12:30:57 -0500 Date: Fri, 25 Dec 2009 12:30:57 -0500 From: Gary Palmer To: Alexander Motin Message-ID: <20091225173057.GA75881@in-addr.com> References: <4B349ABF.2070800@FreeBSD.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4B349ABF.2070800@FreeBSD.org> Cc: FreeBSD-Current , freebsd-arch@freebsd.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 17:30:59 -0000 On Fri, Dec 25, 2009 at 12:58:07PM +0200, Alexander Motin wrote: > Hi. > > Recently WD released first series of ATA disks with increased physical > sector size. It makes writes not matching with 4K blocks inefficient > there. So I propose to get back to the question of optimal FS block > alignment.
This topic is also important for most of RAIDs having striped > nature, such as RAID0/3/5/... and flash drives with simple controller > (such as MMC/SD cards). This is also a critical issue on certain SAN systems. NetApp, I suspect as a result of them layering a virtual LUN on top of another filesystem (WAFL), is very sensitive to filesystem alignment on the LUN. If the I/Os to the LUN are not 4k aligned, performance suffers a serious hit. I'm not sure which other SAN vendors suffer similar alignment restrictions. Regards, Gary From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 18:18:25 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 51948106568F; Fri, 25 Dec 2009 18:18:25 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 14C748FC15; Fri, 25 Dec 2009 18:18:24 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 229B57E831; Fri, 25 Dec 2009 18:18:24 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.3/8.14.3) with ESMTP id nBPIJ1tH028376; Fri, 25 Dec 2009 18:19:01 GMT (envelope-from phk@critter.freebsd.dk) To: Alexander Motin From: "Poul-Henning Kamp" In-Reply-To: Your message of "Fri, 25 Dec 2009 16:03:05 +0200."
<4B34C619.7070505@FreeBSD.org> Date: Fri, 25 Dec 2009 18:19:01 +0000 Message-ID: <28375.1261765141@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: FreeBSD-Current , freebsd-arch@FreeBSD.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 18:18:25 -0000 In message <4B34C619.7070505@FreeBSD.org>, Alexander Motin writes: >Poul-Henning Kamp wrote: >- Second is a row size - stripe size * number of data disks. You may >freely read less information than full row, but short write cause RAID >to handle read-modify-write scenario. There is a far worse scenario: a stripe-spanning write forces an RMW cycle over two different RAID5 stripes. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
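The stripe-spanning case is easy to see with a little arithmetic; a sketch (illustrative helper): a write smaller than one stripe but not stripe-aligned touches two stripes, paying phk's worst-case RMW cost in each.

```c
#include <assert.h>
#include <stdint.h>

/* Number of stripes a write of len bytes at byte offset off touches. */
static uint64_t
stripes_touched(uint64_t off, uint64_t len, uint64_t stripesize)
{
	if (len == 0)
		return 0;
	return (off + len - 1) / stripesize - off / stripesize + 1;
}
```

With a 16K stripe, a 16K write at offset 0 stays inside one stripe, while the same write at offset 8K spans two.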