From owner-freebsd-performance@FreeBSD.ORG  Mon Jun 26 07:05:38 2006
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: performance@freebsd.org
Delivered-To: freebsd-performance@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id D22C316A406;
	Mon, 26 Jun 2006 07:05:38 +0000 (UTC)
	(envelope-from mv@thebeastie.org)
Received: from p4.roq.com (ns1.ecoms.com [207.44.130.137])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 090C243DA4;
	Mon, 26 Jun 2006 07:05:29 +0000 (GMT)
	(envelope-from mv@thebeastie.org)
Received: from p4.roq.com (localhost.roq.com [127.0.0.1])
	by p4.roq.com (Postfix) with ESMTP id CCCAA4CD95;
	Mon, 26 Jun 2006 07:05:31 +0000 (GMT)
Received: from vaulte.jumbuck.com (ppp166-27.static.internode.on.net
	[150.101.166.27]) by p4.roq.com (Postfix) with ESMTP id 2AEFD4CA34;
	Mon, 26 Jun 2006 07:05:31 +0000 (GMT)
Received: from vaulte.jumbuck.com (localhost [127.0.0.1])
	by vaulte.jumbuck.com (Postfix) with ESMTP id 82FCC8A062;
	Mon, 26 Jun 2006 17:05:27 +1000 (EST)
Received: from [192.168.46.102] (unknown [192.168.46.250])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by vaulte.jumbuck.com (Postfix) with ESMTP id 797F68A01F;
	Mon, 26 Jun 2006 17:05:27 +1000 (EST)
Message-ID: <449F8736.3080508@thebeastie.org>
Date: Mon, 26 Jun 2006 17:05:26 +1000
From: Michael Vince <mv@thebeastie.org>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US;
	rv:1.7.12) Gecko/20060404
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Nikolas Britton <nikolas.britton@gmail.com>
References: <ef10de9a0606250157jce24553h52e67db7a9f76b03@mail.gmail.com>	<f6791cc60606250835p51c966e7xa12fb241c9aaab8d@mail.gmail.com>	<ef10de9a0606250930k6b655e2bkb81694905454bf58@mail.gmail.com>
	<ef10de9a0606251523h4102e782m1fe2403c57c80e57@mail.gmail.com>
In-Reply-To: <ef10de9a0606251523h4102e782m1fe2403c57c80e57@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: ClamAV using ClamSMTP
X-Virus-Scanned: ClamAV using ClamSMTP
Cc: performance@freebsd.org, freebsd-stable@freebsd.org,
	Sean Bryant <bryants@gmail.com>
Subject: Re: Gigabit ethernet very slow.
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 26 Jun 2006 07:05:38 -0000

Nikolas Britton wrote:

> On 6/25/06, Nikolas Britton <nikolas.britton@gmail.com> wrote:
>
>> On 6/25/06, Sean Bryant <bryants@gmail.com> wrote:
>> > /dev/zero not exactly the best way to test sending data across the
>> > network. Especially since you'll be reading a 8k chunks.
>> >
>> > I could be wrong, strong possibility that I am. I only got 408mb when
>> > doing a /dev/zero test. I've managed to saturate though. Using other
>> > software that I wrote.
>> > On 6/25/06, Nikolas Britton <nikolas.britton@gmail.com> wrote:
>> > > What's up with my computer, it's only getting 30MB/s?
>> > >
>> > > hostB: nc -4kl port > /dev/null
>> > > hostA: nc host port < /dev/zero
>> > >
>>
>> 408MByte/s or 408Mbit/s and what measuring stick are you using? I'm
>> trying to rule in/out problems with the disks, I'm only getting
>> ~25MB/s on a 6 disk RAID0 over the network... would it be better to
>> setup an memory backed disk, md(4) , to read from?
>>
>>
>
> Now I'm getting 523.2Mbit/s (65.4MB/s) with netcat, I wiped out the
> FreeBSD 6.1/amd64 install with FreeBSD 6.1/i386... and...
>
> After a kernel rebuild (recompiled nc too):
> CPUTYPE?=athlon-mp
> CFLAGS+= -mtune=athlon64
> COPTFLAGS+= -mtune=athlon64
>
> I'm up to 607.2Mbit/s (75.9MB/s). What else can I do to get that
> number higher, and how can I get interrupts lower?
>
> Before recompile:
> load averages:  0.94,  0.91,  0.66
> CPU states:  2.6% user,  0.0% nice, 21.5% system, 64.6% interrupt, 
> 11.3% idle
> -------------------
> After recompile:
> load averages:  0.99,  0.96,  0.76
> CPU states:  3.0% user,  0.0% nice, 33.7% system, 58.2% interrupt,  
> 5.1% idle
>
Out of interest I tried the same test with nc, measuring the rate either 
with dd in the pipe or by watching it in pftop.

According to pftop (with modulate state rules) I am able to get about 
85megs/sec when I don't have dd running. dd does indeed eat a fair 
amount of CPU (40%) on the AMD64 6-stable machine.

With a dd pipe I am able to get roughly 70megs/sec between 2 Dell 
machines, one of them being AMD64 (I ran dd on this one as it has 2 
CPUs). pftop confirms this figure as well.

cat /dev/zero | dd | nc host 3000
2955297+0 records in
2955297+0 records out
1513112064 bytes transferred in 20.733547 secs (72978930 bytes/sec)

These machines are also doing regular work and not idle.
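
For comparison with the Mbit/s figures quoted earlier in the thread, dd's bytes/sec figure can be converted with plain shell arithmetic. This is only a sketch: the variable names are made up for illustration, and integer division truncates the result to whole Mbit/s.

```shell
# Rate reported by dd in the run above (bytes/sec).
bytes_per_sec=72978930

# 8 bits per byte, 1,000,000 bits per Mbit; integer division truncates.
mbit_per_sec=$((bytes_per_sec * 8 / 1000000))

echo "${mbit_per_sec} Mbit/s"
```

That puts this run a little under the 607.2Mbit/s Nikolas reported.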

I also tested on another, remote network with three FreeBSD machines: 
one client, one FreeBSD gateway and a server behind it 
(host-A ---- host-B ---- host-C). Host-A is the only one running 
6-stable; all the others are 6.1.
None of these machines use polling, and all have em devices (Dell servers).


Going from C to A (via B) gives 50megs/sec
host-C#cat /dev/zero | dd | nc host-A 3000
15000154+0 records in
15000153+0 records out
7680078336 bytes transferred in 152.320171 secs (50420626 bytes/sec)


Between them directly they all appear to give around 55-85megs/sec.

The shocker I found was sending data from hostA to hostC which appears 
to only give 1 meg/sec
host-A#cat /dev/zero | dd | nc host-C 3000
40135+0 records in
40134+0 records out
20548608 bytes transferred in 19.250176 secs (1067450 bytes/sec)

Host-A to Host-B's internal interface: in fact, every test sending data 
from outside to anything at or past Host-B's internal network interface 
showed a massive drop in performance, to about 800kbytes/sec
host-A#cat /dev/zero | dd | nc host-B(internal interface ip) 3000
58041+0 records in
58040+0 records out
29716480 bytes transferred in 36.137952 secs (822307 bytes/sec)

Going from Host-A to Host-B's external interface still gives fast 
results, around 60megs/sec
host-A#cat /dev/zero | dd | nc host-B(external interface ip) 3000
4984545+0 records in
4984544+0 records out
2552086528 bytes transferred in 40.569696 secs (62906227 bytes/sec)

Speed from host-B (gateway) to Host-A is still ok at around 50megs/sec
host-B#cat /dev/zero | dd | nc host-A 3000
8826036+0 records in
8826035+0 records out
4518929920 bytes transferred in 80.471211 secs (56155858 bytes/sec)

Connecting from the internal server to the internal gateway ip gives a 
good speed around 70megs/sec
host-C#cat /dev/zero | dd | nc host-B(internal interface ip) 3000
6176688+0 records in
6176688+0 records out
3162464256 bytes transferred in 42.100412 secs (75117181 bytes/sec)

Interestingly connecting to the external interface of the gateway from 
the internal machine still gave good speeds around 70megs/sec
host-C# cat /dev/zero | dd | nc host-B(external interface ip) 3000
7107351+0 records in
7107351+0 records out
3638963712 bytes transferred in 49.451670 secs (73586265 bytes/sec)

I used to run the gateway with polling but ditched it when upgrading 
from 6.0 to 6.1, since the improved em driver came in with 6.1.
Would anyone have an explanation as to why incoming data from Host-A 
through B to its most distant interface from Host-A would give such poor 
performance (1meg/sec) while going the other way seems to be fine?
Obviously it's something going on inside the FreeBSD kernel, as 
interface-to-interface tests are fine.

It's a Dell 1850 with 6.1-release-amd64 with pf rules enabled. The 
only special kernel change is FAST_IPSEC.
I tested with each of these sysctls set to 0 and to 1, and they made no 
difference.
net.isr.direct=1
net.inet.ip.fastforwarding=1
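
To make the asymmetry easier to see, the dd figures from the tests above can be tabulated in MB/s. This is just a sketch: the function name and the path labels are shorthand invented here, and the numbers are copied straight from the dd output.

```shell
# Rates as reported by dd in the tests above (bytes/sec).
# Labels are shorthand for the paths described in the text.
summarize_rates() {
    awk '{ printf "%-18s %6.1f MB/s\n", $1, $2 / 1000000 }' <<'EOF'
C->A(via-B)     50420626
A->C(via-B)     1067450
A->B(internal)  822307
A->B(external)  62906227
B->A            56155858
C->B(internal)  75117181
C->B(external)  73586265
EOF
}

summarize_rates
```

Only the two paths that enter at Host-B's external side and go to or past its internal interface fall to around 1 MB/s; every other path stays at 50 MB/s or more.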

Mike


From owner-freebsd-performance@FreeBSD.ORG  Mon Jun 26 08:07:46 2006
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: freebsd-performance@freebsd.org
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8D51A16A40E
	for <freebsd-performance@freebsd.org>;
	Mon, 26 Jun 2006 08:07:46 +0000 (UTC)
	(envelope-from leo.huang.list@gmail.com)
Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.173])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 30EAD43D46
	for <freebsd-performance@freebsd.org>;
	Mon, 26 Jun 2006 08:07:33 +0000 (GMT)
	(envelope-from leo.huang.list@gmail.com)
Received: by ug-out-1314.google.com with SMTP id m3so843828uge
	for <freebsd-performance@freebsd.org>;
	Mon, 26 Jun 2006 01:07:30 -0700 (PDT)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com;
	h=received:message-id:date:from:to:subject:mime-version:content-type;
	b=SXYXnBvfNuVSFQXSpzrx+Ff/E0gTcnbn6XvyrQlWK9InVBBNh3LCLRSjP5+12/4q4Hf5l1JQpgnL5xoTDMKZaDw50fk5ff0AVV+3DiXlykMwPGmvp49UOniAB3UyVoMLuelaa/Ayd9wCpy6BPR54x1rUSEqtD7oMTGLP+EN2a9s=
Received: by 10.78.117.10 with SMTP id p10mr1931785huc;
	Mon, 26 Jun 2006 01:07:30 -0700 (PDT)
Received: by 10.78.32.18 with HTTP; Mon, 26 Jun 2006 01:07:30 -0700 (PDT)
Message-ID: <14a4a8480606260107g3c23456bke40bbb11f09a7917@mail.gmail.com>
Date: Mon, 26 Jun 2006 16:07:30 +0800
From: "leo huang" <leo.huang.list@gmail.com>
To: freebsd-performance@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=GB2312; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Subject: Is the fsync() fake on FreeBSD6.1?
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 26 Jun 2006 08:07:46 -0000

Hi,

I benchmarked MySQL 4.1.18 on FreeBSD 6.1 and Debian 3.1 using Super Smack
1.3 some days ago.

The benchmark table is
CREATE TABLE `Account` (
  `aid` int(11) NOT NULL auto_increment,
  `name` char(20) NOT NULL default '',
  `flag` int(11) NOT NULL default '0',
  `uidcount` int(11) NOT NULL default '0',
  `balance` int(11) NOT NULL default '0',
  `point` int(11) NOT NULL default '0',
  `blocktm` int(11) NOT NULL default '0',
  `ipnum` int(10) unsigned default NULL,
  `newdate` datetime default NULL,
  PRIMARY KEY  (`aid`),
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

And it has 10,000,000 rows.

The SQL statement is
update Account set balance= balance + 1 where aid=?;

The results are as follows:
OS            Clients    Result (queries per second)    TPS (from iostat)
FreeBSD6.1    50         516.1                          about 2000
Debian3.1     50         49.8                           about 200

The result surprises me. MySQL performance on FreeBSD 6.1 is about 10
times that on Debian 3.1, and the output of iostat shows the same.

I know that MySQL uses fsync() to flush both the data and log files by
default when using the InnoDB engine (
http://dev.mysql.com/doc/refman/4.1/en/innodb-parameters.html). Our
test machine has only a 10000RPM SCSI hard disk. I think it can do
about 200 sequential fsync() calls per second if fsync() is real.

Is fsync() on FreeBSD 6.1 fake? I mean that the data is only written to
the drive's memory and so can be lost if power goes down. And how can I
confirm this?

If fsync() is fake, how can I get a real fsync?

Any comment is welcome!

PS:
1. Our test machine is a DELL PowerEdge 1650. Its hardware configuration
is as follows:
    CPU: 2 * Intel Pentium III 1.33GHz 512KB Level 2 Cache(smp)
    Memory: 1024MB ECC SDRAM
    HD: SEAGATE ST336706LC (36GB Ultra160 SCSI 10000RPM)
    NIC    : Intel(R) PRO/1000 Network Connection

2. Some important parameters in MySQL configuration file are here:
    log-bin
    sync_binlog=1
    innodb_safe_binlog
    innodb_buffer_pool_size = 384M
    innodb_additional_mem_pool_size = 20M
    innodb_log_file_size = 100M
    innodb_log_buffer_size = 8M
    innodb_flush_log_at_trx_commit = 1
    innodb_lock_wait_timeout = 50


regards,
Leo Huang

From owner-freebsd-performance@FreeBSD.ORG  Mon Jun 26 10:34:14 2006
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: performance@freebsd.org
Delivered-To: freebsd-performance@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id AB0BB16A40B
	for <performance@freebsd.org>; Mon, 26 Jun 2006 10:34:14 +0000 (UTC)
	(envelope-from tataz@tataz.chchile.org)
Received: from smtp3-g19.free.fr (smtp3-g19.free.fr [212.27.42.29])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 3D9AB444C0
	for <performance@freebsd.org>; Mon, 26 Jun 2006 10:34:03 +0000 (GMT)
	(envelope-from tataz@tataz.chchile.org)
Received: from tatooine.tataz.chchile.org (tataz.chchile.org [82.233.239.98])
	by smtp3-g19.free.fr (Postfix) with ESMTP id 37A5449439;
	Mon, 26 Jun 2006 12:34:02 +0200 (CEST)
Received: from obiwan.tataz.chchile.org (unknown [192.168.1.25])
	by tatooine.tataz.chchile.org (Postfix) with ESMTP id 5EB5F9B904;
	Mon, 26 Jun 2006 10:34:34 +0000 (UTC)
Received: by obiwan.tataz.chchile.org (Postfix, from userid 1000)
	id 4FBA840AA; Mon, 26 Jun 2006 12:34:31 +0200 (CEST)
Date: Mon, 26 Jun 2006 12:34:31 +0200
From: Jeremie Le Hen <jeremie@le-hen.org>
To: Lucas Holt <luke@foolishgames.com>
Message-ID: <20060626103431.GC10272@obiwan.tataz.chchile.org>
References: <446CCE1C.1050200@fer.hr> <446CD873.9080903@stevehodgson.co.uk>
	<446CE6CE.50009@fer.hr> <446D8994.3070709@thebeastie.org>
	<446D9DEE.4050300@fer.hr> <446DE1F2.4020602@thebeastie.org>
	<446DE927.2060909@fer.hr>
	<BD967927-49D5-4AFC-AC43-C52538EFAEE9@foolishgames.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <BD967927-49D5-4AFC-AC43-C52538EFAEE9@foolishgames.com>
User-Agent: Mutt/1.5.11
Cc: performance@freebsd.org, Michael Vince <mv@thebeastie.org>,
	Ivan Voras <ivoras@fer.hr>
Subject: Re: [fbsd] Re: (Another) simple benchmark
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 26 Jun 2006 10:34:14 -0000

Hi,

> I think most people just want you to use the exact same configuration  
> file on both systems and verify that both use the same pre-compiled  
> options as well.  Probably the best thing to do is build from source  
> on both linux and freebsd with the same options.  Make sure the  
> defaults for each system are fair as well.  Then use the same config  
> file or as close as possible on both operating systems.  Otherwise,  
> you're not testing the same thing.   If you were testing against  
> another architecture like say windows, then you would need to use the  
> default worker type for windows.

While I agree this kind of benchmark is worth doing in order to compare
the kernel code paths exercised by a particular program on multiple OSes,
I think Ivan's test is still valuable, as it compares the out-of-the-box
performance of the Apache package and the underlying OS.

It is the administrator's job to tune any relevant build-time or run-time
option in order to get the best performance, but we can't assume this is
how things go in the real world.  When third parties run benchmarks to
find the best OS for running an HTTP server, the configuration is not
likely to be the best one.  And given that those benchmark results are
often published on wide-audience websites such as Slashdot, this is
really important from a marketing standpoint.

Best regards,
-- 
Jeremie Le Hen
< jeremie at le-hen dot org >< ttz at chchile dot org >

From owner-freebsd-performance@FreeBSD.ORG  Mon Jun 26 21:26:19 2006
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: performance@freebsd.org
Delivered-To: freebsd-performance@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A622916A402;
	Mon, 26 Jun 2006 21:26:19 +0000 (UTC)
	(envelope-from fullermd@over-yonder.net)
Received: from mail.localelinks.com (web.localelinks.com [64.39.75.54])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 74E6143D79;
	Mon, 26 Jun 2006 21:26:11 +0000 (GMT)
	(envelope-from fullermd@over-yonder.net)
Received: from draco.over-yonder.net
	(adsl-072-148-013-213.sip.jan.bellsouth.net [72.148.13.213])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mail.localelinks.com (Postfix) with ESMTP id 8DBD13E6;
	Mon, 26 Jun 2006 15:26:49 -0500 (CDT)
Received: by draco.over-yonder.net (Postfix, from userid 100)
	id 8BB1E61C2B; Mon, 26 Jun 2006 14:32:26 -0500 (CDT)
Date: Mon, 26 Jun 2006 14:32:26 -0500
From: "Matthew D. Fuller" <fullermd@over-yonder.net>
To: Michael Vince <mv@thebeastie.org>
Message-ID: <20060626193226.GF74292@over-yonder.net>
References: <ef10de9a0606250157jce24553h52e67db7a9f76b03@mail.gmail.com>
	<f6791cc60606250835p51c966e7xa12fb241c9aaab8d@mail.gmail.com>
	<ef10de9a0606250930k6b655e2bkb81694905454bf58@mail.gmail.com>
	<ef10de9a0606251523h4102e782m1fe2403c57c80e57@mail.gmail.com>
	<449F8736.3080508@thebeastie.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <449F8736.3080508@thebeastie.org>
X-Editor: vi
X-OS: FreeBSD <http://www.freebsd.org/>
User-Agent: Mutt/1.5.11-fullermd.3
Cc: freebsd-stable@freebsd.org, performance@freebsd.org,
	Nikolas Britton <nikolas.britton@gmail.com>,
	Sean Bryant <bryants@gmail.com>
Subject: Re: Gigabit ethernet very slow.
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 26 Jun 2006 21:26:19 -0000

On Mon, Jun 26, 2006 at 05:05:26PM +1000 I heard the voice of
Michael Vince, and lo! it spake thus:
> 
> According to pftop (with modulate state rules) I am able to get
> about 85megs/sec when I don't have dd running. dd does indeed eats a
> fair amount of cpu (40%) on the AMD64 6-stable machine.

dd does ridiculously small (512 byte?) read/writes, so it's gotta do a
LOT of system calls and a lot of context switching when you don't give
it a bigger blocksize.
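
The 512-byte guess can be checked against the dd output quoted earlier in this thread, and the cost of small blocks can be seen without a network at all. What follows is a rough sketch, not a benchmark: the 64k block size is an arbitrary choice for illustration, and /dev/null is used as the sink so only dd's own read/write overhead is measured.

```shell
# Byte and record counts from the dd output quoted earlier in the thread.
bytes=1513112064
records=2955297
echo "implied block size: $((bytes / records)) bytes"

# Copy the same 64 MiB locally, first with 512-byte blocks, then with
# 64k blocks. dd prints its timing summary on stderr after each run.
dd if=/dev/zero of=/dev/null bs=512 count=131072
dd if=/dev/zero of=/dev/null bs=64k count=1024

# In the pipeline from the thread, the equivalent change would be:
#   dd if=/dev/zero bs=64k | nc host 3000
```

The first dd issues 131072 read/write pairs and the second only 1024 for the same amount of data, which is where the extra system time goes.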


-- 
Matthew Fuller     (MF4839)   |  fullermd@over-yonder.net
Systems/Network Administrator |  http://www.over-yonder.net/~fullermd/
           On the Internet, nobody can hear you scream.

From owner-freebsd-performance@FreeBSD.ORG  Tue Jun 27 02:58:36 2006
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: freebsd-performance@freebsd.org
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id E284016A400
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 02:58:35 +0000 (UTC)
	(envelope-from anderson@centtech.com)
Received: from mh1.centtech.com (moat3.centtech.com [207.200.51.50])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 8E99443D6A
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 02:58:35 +0000 (GMT)
	(envelope-from anderson@centtech.com)
Received: from [192.168.42.24] (andersonbox4.centtech.com [192.168.42.24])
	by mh1.centtech.com (8.13.1/8.13.1) with ESMTP id k5R2wYdT063052;
	Mon, 26 Jun 2006 21:58:34 -0500 (CDT)
	(envelope-from anderson@centtech.com)
Message-ID: <44A09EE7.1030405@centtech.com>
Date: Mon, 26 Jun 2006 21:58:47 -0500
From: Eric Anderson <anderson@centtech.com>
User-Agent: Thunderbird 1.5.0.4 (X11/20060612)
MIME-Version: 1.0
To: leo huang <leo.huang.list@gmail.com>
References: <14a4a8480606261918q39b51f7bkd69958c5a7b05021@mail.gmail.com>
In-Reply-To: <14a4a8480606261918q39b51f7bkd69958c5a7b05021@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
X-Virus-Scanned: ClamAV 0.87.1/1564/Mon Jun 26 09:55:16 2006 on
	mh1.centtech.com
X-Virus-Status: Clean
Cc: freebsd-performance@freebsd.org
Subject: Re: Is the fsync() fake on FreeBSD6.1?
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Jun 2006 02:58:36 -0000

leo huang wrote:
> Hi,
> 
> I benchmarked MySQL 4.1.18 on FreeBSD 6.1 and Debian 3.1 using Super Smack
> 1.3 some days ago.
> 
> The benchmark table  is
> CREATE TABLE `Account` (
>  `aid` int(11) NOT NULL auto_increment,
>  `name` char(20) NOT NULL default '',
>  `flag` int(11) NOT NULL default '0',
>  `uidcount` int(11) NOT NULL default '0',
>  `balance` int(11) NOT NULL default '0',
>  `point` int(11) NOT NULL default '0',
>  `blocktm` int(11) NOT NULL default '0',
>  `ipnum` int(10) unsigned default NULL,
>  `newdate` datetime default NULL,
>  PRIMARY KEY  (`aid`),
>  UNIQUE KEY `name` (`name`)
> ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
> 
> And it has 10,000,000 rows.
> 
> The SQL statement is
> update Account set balance= balance + 1 where aid=?;
> 
> The results are as follows:
> OS            Clients    Result (queries per second)    TPS (from iostat)
> FreeBSD6.1    50         516.1                          about 2000
> Debian3.1     50         49.8                           about 200
> 
> The result surprises me. MySQL performance on FreeBSD 6.1 is about 10
> times that on Debian 3.1, and the output of iostat shows the same.
> 
> I know that MySQL uses fsync() to flush both the data and log files by
> default when using the InnoDB engine (
> http://dev.mysql.com/doc/refman/4.1/en/innodb-parameters.html). Our
> test machine has only a 10000RPM SCSI hard disk. I think it can do
> about 200 sequential fsync() calls per second if fsync() is real.
> 
> Is fsync() on FreeBSD 6.1 fake? I mean that the data is only written to
> the drive's memory and so can be lost if power goes down. And how can I
> confirm this?
> 
> If fsync() is fake, how can I get a real fsync?
> 
> Any comment is welcome!
> 
> PS:
> 1. Our test machine is a DELL PowerEdge 1650. Its hardware configuration
> is as follows:
>    CPU: 2 * Intel Pentium III 1.33GHz 512KB Level 2 Cache(smp)
>    Memory: 1024MB ECC SDRAM
>    HD: SEAGATE ST336706LC(36GB Ultra160 SCSI 10000RPM)
>    NIC    : Intel(R) PRO/1000 Network Connection
> 
> 2. Some important parameters in MySQL configuration file are here:
>    log-bin
>    sync_binlog=1
>    innodb_safe_binlog
>    innodb_buffer_pool_size = 384M
>    innodb_additional_mem_pool_size = 20M
>    innodb_log_file_size = 100M
>    innodb_log_buffer_size = 8M
>    innodb_flush_log_at_trx_commit = 1
>    innodb_lock_wait_timeout = 50


Hi Leo,

I think we've all received this message twice now. If you are 
resending because you didn't get any responses, that's probably because 
most people on this list are here to discuss performance, and while this 
is related, it is really a filesystem question, so many people here 
just won't know the answer.  You might get a better response on 
the freebsd-fs@ list.

Anyway, there are a lot of variables here, but it could be softupdates.  
Have you tried turning softupdates off on the filesystem you are 
running this on, and/or enabling the 'sync' mount option on that filesystem?

Also, you didn't mention anything about which filesystems these were 
using on both occasions.
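
As a rough sanity check on the numbers in the quoted message: assuming the drive's write cache is off and each commit has to wait for the platter to come around, a 10000RPM disk manages on the order of one synchronous write per rotation. That is a rule of thumb, not a measurement, and the variable names below are made up for illustration. The device and mount point in the commented-out commands are hypothetical too.

```shell
# Figures taken from the quoted message.
rpm=10000
observed_qps=516          # FreeBSD 6.1 result, rounded down

# One rotation per synchronous write, as a rough upper bound.
rotations_per_sec=$((rpm / 60))

echo "rotations/sec:        $rotations_per_sec"
echo "observed queries/sec: $observed_qps"

if [ "$observed_qps" -gt "$rotations_per_sec" ]; then
    echo "more queries per second than platter rotations"
fi

# The two things suggested above, with a hypothetical device and mount
# point. tunefs needs the filesystem unmounted or mounted read-only.
#   tunefs -n disable /dev/da0s1f
#   mount -u -o sync /var/db/mysql
```

This does not prove anything by itself, but it shows why the FreeBSD result is worth a second look while the Debian result (49.8) is not.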

Eric


-- 
------------------------------------------------------------------------
Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------

From owner-freebsd-performance@FreeBSD.ORG  Tue Jun 27 03:34:19 2006
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: freebsd-performance@freebsd.org
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 0253116A405
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 03:34:19 +0000 (UTC) (envelope-from grog@lemis.com)
Received: from wantadilla.lemis.com (wantadilla.lemis.com [192.109.197.135])
	by mx1.FreeBSD.org (Postfix) with ESMTP id C0BB543D6B
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 03:34:13 +0000 (GMT) (envelope-from grog@lemis.com)
Received: by wantadilla.lemis.com (Postfix, from userid 1004)
	id E1B229B482; Tue, 27 Jun 2006 13:04:12 +0930 (CST)
Date: Tue, 27 Jun 2006 13:04:12 +0930
From: Greg 'groggy' Lehey <grog@FreeBSD.org>
To: leo huang <leo.huang.list@gmail.com>
Message-ID: <20060627033412.GQ10845@wantadilla.lemis.com>
References: <14a4a8480606261918q39b51f7bkd69958c5a7b05021@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="U/5EjKfnYgGK6hcj"
Content-Disposition: inline
In-Reply-To: <14a4a8480606261918q39b51f7bkd69958c5a7b05021@mail.gmail.com>
User-Agent: Mutt/1.4.2.1i
Organization: The FreeBSD Project
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-418-838-708
VoIP: sip:0871270137@sip.internode.on.net
WWW-Home-Page: http://www.FreeBSD.org/
X-PGP-Fingerprint: 9A1B 8202 BCCE B846 F92F  09AC 22E6 F290 507A 4223
Cc: freebsd-performance@freebsd.org
Subject: Re: Is the fsync() fake on FreeBSD6.1?
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Jun 2006 03:34:19 -0000


--U/5EjKfnYgGK6hcj
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tuesday, 27 June 2006 at 10:18:47 +0800, leo huang wrote:
> Hi,
>
> I benchmarked MySQL 4.1.18 on FreeBSD 6.1 and Debian 3.1 using Super Smack
> 1.3 some days ago.
>
> ...
>
> The result surprise me. The MySQL Performance on FreeBSD6.1 is about
> 10 times of on Debian3.1, and the output of iostat also shows it.
>
> I know that MySQL uses fsync() to flush both the data and log files
> at default when using innodb engine(
> http://dev.mysql.com/doc/refman/4.1/en/innodb-parameters.html). Our
> evaluating computer only has a 10000RPM SCSI hard disk. I think it
> can do about 200 sequential fsync() calls per second if the fsync()
> is real.
>
> Is the fsync() on FreeBSD6.1 fake?

My understanding from the last time I looked at the code was that
fsync does the right thing:

     The fsync() system call causes all modified data and attributes of fd to
     be moved to a permanent storage device.  This normally results in all in-
     core modified copies of buffers for the associated file to be written to
     a disk.

This is not the case for Linux, where fsync syncs the entire file
system.  That could explain some of the performance difference, but
not all of it.  I suppose it's worth noting that, in general, people
report much better performance with MySQL on Linux than on FreeBSD.

> I mean than the data is only written to the drives memory and so can
> be lost if power goes down.

I don't believe that fsync is required to flush the drive buffers.  It
would be nice to have a function that did, though.

> And how I can confirm this?

Trial and error?
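
One way to make the trial and error concrete is sketched below; it is a
rough measurement, not a definitive test. A 10000 RPM disk completes
about 167 rotations per second, so a loop of small appends to one file,
each followed by fsync(), should not commit much faster than that if
every fsync() really waits for the platter. The script (Python is used
only for brevity; the function names, file name, record size and
iteration count are all assumptions made for illustration) measures the
achieved rate and compares it with that rotational bound.

```python
import os
import sys
import tempfile
import time


def rotational_bound(rpm):
    """Upper bound on commits per second if each commit waits one rotation."""
    return rpm / 60.0


def measure_fsync_rate(path, count=200, record=b"x" * 512):
    """Append `count` small records, calling fsync() after each one.

    Returns the achieved number of fsync() calls per second.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, record)
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    bound = rotational_bound(10000)  # the 10000 RPM disk from the original post
    if len(sys.argv) > 1:
        # Test file on the filesystem under test, given on the command line.
        rate = measure_fsync_rate(sys.argv[1])
    else:
        # Fallback: a scratch directory, which may not be on the disk of interest.
        with tempfile.TemporaryDirectory() as tmp:
            rate = measure_fsync_rate(os.path.join(tmp, "fsync-test.dat"))
    print("rotational bound: %.0f fsync/s" % bound)
    print("measured rate:    %.0f fsync/s" % rate)
    if rate > 2 * bound:
        print("rate is well above the bound; a write cache is probably involved")
```

Pass a path on the filesystem that holds the MySQL data as the first
argument. A rate far above the bound only suggests that some cache is
acknowledging writes early; it does not say which layer is doing so.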

Greg
--
See complete headers for address and phone numbers.

--U/5EjKfnYgGK6hcj
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (FreeBSD)

iD8DBQFEoKc0IubykFB6QiMRAuteAKCHeyDhqwRJxhOPrJnGps4o+xnqbACdHJpU
Mvsi7Crmh6+JcQtJAmGoZR4=
=Wad+
-----END PGP SIGNATURE-----

--U/5EjKfnYgGK6hcj--

From owner-freebsd-performance@FreeBSD.ORG  Tue Jun 27 07:36:13 2006
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: freebsd-performance@freebsd.org
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 6F74316A54C
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 07:36:13 +0000 (UTC)
	(envelope-from leo.huang.list@gmail.com)
Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.169])
	by mx1.FreeBSD.org (Postfix) with ESMTP id DB52944568
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 07:16:51 +0000 (GMT)
	(envelope-from leo.huang.list@gmail.com)
Received: by ug-out-1314.google.com with SMTP id m3so1374037uge
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 00:16:50 -0700 (PDT)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com;
	h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
	b=HVVCd22mYVWtdirCq31Wl0TSaX+/alo4Snj2ejpAlR8t1T5ums3tgMrPm28Dc4iMIXEYjozierzvCHrHH58dXlawJZiFqm9K9S9pmU8MBLePkwyo8ItHq1mS/nu/uiUse7Df4D3Ih1BoCs9aPNHRDy5yVS3iOS8yvYLv3k7SEMY=
Received: by 10.78.178.5 with SMTP id a5mr2391253huf;
	Tue, 27 Jun 2006 00:16:50 -0700 (PDT)
Received: by 10.78.32.18 with HTTP; Tue, 27 Jun 2006 00:16:50 -0700 (PDT)
Message-ID: <14a4a8480606270016i1cadf037ue4818ccfecc22265@mail.gmail.com>
Date: Tue, 27 Jun 2006 15:16:50 +0800
From: "leo huang" <leo.huang.list@gmail.com>
To: "Eric Anderson" <anderson@centtech.com>
In-Reply-To: <44A09EE7.1030405@centtech.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <14a4a8480606261918q39b51f7bkd69958c5a7b05021@mail.gmail.com>
	<44A09EE7.1030405@centtech.com>
Cc: freebsd-performance@freebsd.org
Subject: Re: Is the fsync() fake on FreeBSD6.1?
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Jun 2006 07:36:13 -0000

Thanks Eric, Greg and Joseph.

I will get more information from freebsd-fs@freebsd.org.

2006/6/27, Eric Anderson <anderson@centtech.com>:
> leo huang wrote:
> > Hi,
> >
> > I benchmarked MySQL 4.1.18 on FreeBSD 6.1 and Debian 3.1 using Super Smack
> > 1.3 some days ago.
> >
> > The benchmark table  is
> > CREATE TABLE `Account` (
> >  `aid` int(11) NOT NULL auto_increment,
> >  `name` char(20) NOT NULL default '',
> >  `flag` int(11) NOT NULL default '0',
> >  `uidcount` int(11) NOT NULL default '0',
> >  `balance` int(11) NOT NULL default '0',
> >  `point` int(11) NOT NULL default '0',
> >  `blocktm` int(11) NOT NULL default '0',
> >  `ipnum` int(10) unsigned default NULL,
> >  `newdate` datetime default NULL,
> >  PRIMARY KEY  (`aid`),
> >  UNIQUE KEY `name` (`name`)
> > ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
> >
> > And it has 10,000,000 rows.
> >
> > The SQL statement is
> > update Account set balance= balance + 1 where aid=?;
> >
> > The result is followed:
> > OS           Clients   Result (queries per second)   TPS (from iostat)
> > FreeBSD6.1   50        516.1                         about 2000
> > Debian3.1    50        49.8                          about 200
> >
> > The result surprise me. The MySQL Performance on FreeBSD6.1 is about 10
> > times of on Debian3.1,and the output of iostat also shows it.
> >
> > I know that MySQL uses fsync() to flush both the data and log files at
> > default when using innodb engine(
> > http://dev.mysql.com/doc/refman/4.1/en/innodb-parameters.html). Our
> > evaluating computer only has a 10000RPM SCSI hard disk. I think it can do
> > about 200 sequential fsync() calls per second if the fsync() is real.
> >
> > Is the fsync() on FreeBSD6.1 fake? I mean than the data is only written to
> > the drives memory and so can be lost if power goes down. And how I can
> > confirm this?
> >
> > If the fsync() is fake, how can I get the real fsync?
> >
> > Any comment is welcome!
> >
> > PS:
> > 1. Our evaluating computer is DELL PowerEdge 1650. Its hardware
> > configuration
> > is followed:
> >    CPU: 2 * Intel Pentium III 1.33GHz 512KB Level 2 Cache(smp)
> >    Memory: 1024MB ECC SDRAM
> >    HD: SEAGATE ST336706LC(36GB Ultra160 SCSI 10000RPM)
> >    NIC    : Intel(R) PRO/1000 Network Connection
> >
> > 2. Some important parameters in MySQL configuration file are here:
> >    log-bin
> >    sync_binlog=1
> >    innodb_safe_binlog
> >    innodb_buffer_pool_size = 384M
> >    innodb_additional_mem_pool_size = 20M
> >    innodb_log_file_size = 100M
> >    innodb_log_buffer_size = 8M
> >    innodb_flush_log_at_trx_commit = 1
> >    innodb_lock_wait_timeout = 50
>
>
> Hi Leo,
>
> I think we've all received this message twice now, however if you are
> resending because you didn't get any responses, it's probably because
> most people on this list are here to discuss performance, and while this
> is related, it is really a filesystem question, and so many people here
> just won't know the answer for you.  You might get a better response on
> the freebsd-fs@ list.
>
> Anyway, there are a lot of variables here, but it could be softupdates.
>   Have you tried turning softupdates off on the filesystem you are
> running this on, and/or enabling the 'sync' option to the filesystem?
>
> Also, you didn't mention anything about which filesystems these were
> using on both occasions.
>
> Eric
>
>
> --
> ------------------------------------------------------------------------
> Eric Anderson        Sr. Systems Administrator        Centaur Technology
> Anything that works is better than anything that doesn't.
> ------------------------------------------------------------------------
>

From owner-freebsd-performance@FreeBSD.ORG  Tue Jun 27 07:41:53 2006
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: freebsd-performance@freebsd.org
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 1A43A16A405
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 07:41:53 +0000 (UTC)
	(envelope-from arne_woerner@yahoo.com)
Received: from web30311.mail.mud.yahoo.com (web30311.mail.mud.yahoo.com
	[68.142.201.229])
	by mx1.FreeBSD.org (Postfix) with SMTP id 57B9343D69
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 07:41:46 +0000 (GMT)
	(envelope-from arne_woerner@yahoo.com)
Received: (qmail 7402 invoked by uid 60001); 27 Jun 2006 07:41:46 -0000
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com;
	h=Message-ID:Received:Date:From:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding;
	b=wDIcNkOxpZ3cjcDVf2vAu5t4Js+BnRD1TpDnXOnCr95Sm6TN5B6hRGk7TNMbEOxS7I9OCa3t6b4hGCaZshHZUKjO1gjlSesMGa8xlIwUrOadMqzhbxxU3GG7WVwYi5EQEKRz0fJtmwISvwC+fAjjVkc4e+bQY3ontKzUXnQsT40=
	; 
Message-ID: <20060627074146.7400.qmail@web30311.mail.mud.yahoo.com>
Received: from [213.54.88.121] by web30311.mail.mud.yahoo.com via HTTP;
	Tue, 27 Jun 2006 00:41:46 PDT
Date: Tue, 27 Jun 2006 00:41:46 -0700 (PDT)
From: "R. B. Riddick" <arne_woerner@yahoo.com>
To: leo huang <leo.huang.list@gmail.com>
In-Reply-To: <20060627033412.GQ10845@wantadilla.lemis.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Cc: freebsd-performance@freebsd.org
Subject: Re: Is the fsync() fake on FreeBSD6.1?
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Jun 2006 07:41:53 -0000

On Tuesday, 27 June 2006 at 10:18:47 +0800, leo huang wrote:
> And how I can confirm this?
> 
You could do this test:
1. write some data with dd if=/dev/zero of=/tmp/a bs=1m count=100
2. then fsync /tmp/a
3. then listen to the hard disc
4. repeat until you are sure whether the disc reacts to the fsync command...
:-)
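
Steps 1 and 2 can also be timed rather than listened to. The following
is a rough sketch under stated assumptions: Python is used only for
brevity, and the function names and the scratch path are made up for
illustration. It writes 100 MB of zeros, as the dd command above does,
and then times the fsync() call on its own. The result is only a hint,
because the kernel may already have written part of the data before
fsync() is called.

```python
import os
import sys
import tempfile
import time

MIB = 1024 * 1024


def write_all(fd, data):
    """Write all of `data` to fd, retrying after short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_then_fsync(path, blocks=100, block_size=MIB):
    """Write `blocks` zero-filled blocks, then time a single fsync().

    Returns (seconds spent writing, seconds spent in fsync).
    """
    zeros = bytes(block_size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        t0 = time.perf_counter()
        for _ in range(blocks):
            write_all(fd, zeros)
        t1 = time.perf_counter()
        os.fsync(fd)
        t2 = time.perf_counter()
    finally:
        os.close(fd)
    return t1 - t0, t2 - t1


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Scratch file on the filesystem under test; it is left in place.
        write_s, fsync_s = write_then_fsync(sys.argv[1])
    else:
        # Fallback: a temporary directory, which may not be on the disk of interest.
        with tempfile.TemporaryDirectory() as tmp:
            write_s, fsync_s = write_then_fsync(os.path.join(tmp, "a"))
    print("write: %.3f s, fsync: %.3f s" % (write_s, fsync_s))
```

If fsync() returns almost immediately after a large buffered write,
the data has probably not reached the platters yet.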

I do not know whether it is possible to turn off the write cache of a
SCSI disc.

There was something on the freebsd-geom@ list about a new feature (the
BIO_FLUSH request) that can flush the write caches of hard discs, but it
currently works only for ata and amr discs. See:
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=3624+0+archive/2006/freebsd-geom/20060625.freebsd-geom

-Arne



From owner-freebsd-performance@FreeBSD.ORG  Tue Jun 27 08:15:37 2006
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: freebsd-performance@freebsd.org
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id D4DF916A401
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 08:15:37 +0000 (UTC)
	(envelope-from leo.huang.gd@gmail.com)
Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.168])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 10B5743D69
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 08:15:32 +0000 (GMT)
	(envelope-from leo.huang.gd@gmail.com)
Received: by ug-out-1314.google.com with SMTP id m3so1394975uge
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 01:15:31 -0700 (PDT)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com;
	h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
	b=Y4cQ90V23PC60n0/L48D14+elORtdlksx3m0dLaY8LmBj1HAVZkin09wtnrDWdJ/FWZScPAiWurSHcMIzyiX3VD0ETdFFeKkYI20nlnVN8C2sWahv12qg1JMYYizK0Lh65xpJYd2RC94HtJ2TAfx6Q/5E3MTK9TLDrLmrbsrU5Q=
Received: by 10.67.93.6 with SMTP id v6mr5791403ugl;
	Tue, 27 Jun 2006 01:15:31 -0700 (PDT)
Received: by 10.67.27.12 with HTTP; Tue, 27 Jun 2006 01:15:31 -0700 (PDT)
Message-ID: <a0cd7c070606270115w4cc61d63xd7ecdd46b035b84c@mail.gmail.com>
Date: Tue, 27 Jun 2006 16:15:31 +0800
From: "Leo Huang" <leo.huang.gd@gmail.com>
To: "R. B. Riddick" <arne_woerner@yahoo.com>
In-Reply-To: <20060627074146.7400.qmail@web30311.mail.mud.yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <20060627033412.GQ10845@wantadilla.lemis.com>
	<20060627074146.7400.qmail@web30311.mail.mud.yahoo.com>
Cc: freebsd-performance@freebsd.org, leo huang <leo.huang.list@gmail.com>
Subject: Re: Is the fsync() fake on FreeBSD6.1?
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Jun 2006 08:15:37 -0000

hi, Arne,

> You could do this test:
> 1. write some data with dd if=/dev/zero of=/tmp/a bs=1m count=100
> 2. then fsync /tmp/a
> 3. then listen to the hard disc
> 4. repeat it until u r sure, if the hard disc reacts on the fsync command...
> :-)
Good idea, but the room is noisy and I don't think I could hear it.  :-(


> I do not know, if it is possible to turn off the write cache of a SCSI disc...
How can I know whether the write cache is turned off?

2006/6/27, R. B. Riddick <arne_woerner@yahoo.com>:
> On Tuesday, 27 June 2006 at 10:18:47 +0800, leo huang wrote:
> > And how I can confirm this?
> >
> You could do this test:
> 1. write some data with dd if=/dev/zero of=/tmp/a bs=1m count=100
> 2. then fsync /tmp/a
> 3. then listen to the hard disc
> 4. repeat it until u r sure, if the hard disc reacts on the fsync command...
> :-)
>
> I do not know, if it is possible to turn off the write cache of a SCSI disc...
>
> On the freebsd-geom@ list was something about a new feature (the BIO_FLUSH
> request), that can flush the write caches of hard discs. But it just works for
> ata and amr discs currently... See:
> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=3624+0+archive/2006/freebsd-geom/20060625.freebsd-geom
>
> -Arne
>
>
>

From owner-freebsd-performance@FreeBSD.ORG  Tue Jun 27 08:31:19 2006
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: freebsd-performance@freebsd.org
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 097B816A400;
	Tue, 27 Jun 2006 08:31:19 +0000 (UTC)
	(envelope-from mv@thebeastie.org)
Received: from p4.roq.com (ns1.ecoms.com [207.44.130.137])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 6851D43D73;
	Tue, 27 Jun 2006 08:31:13 +0000 (GMT)
	(envelope-from mv@thebeastie.org)
Received: from p4.roq.com (localhost.roq.com [127.0.0.1])
	by p4.roq.com (Postfix) with ESMTP id A38BB4CDA6;
	Tue, 27 Jun 2006 08:31:17 +0000 (GMT)
Received: from vaulte.jumbuck.com (ppp166-27.static.internode.on.net
	[150.101.166.27]) by p4.roq.com (Postfix) with ESMTP id EA8084CD63;
	Tue, 27 Jun 2006 08:31:16 +0000 (GMT)
Received: from vaulte.jumbuck.com (localhost [127.0.0.1])
	by vaulte.jumbuck.com (Postfix) with ESMTP id CE6C28A063;
	Tue, 27 Jun 2006 18:31:10 +1000 (EST)
Received: from [192.168.46.102] (unknown [192.168.46.250])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by vaulte.jumbuck.com (Postfix) with ESMTP id C9A1E8A062;
	Tue, 27 Jun 2006 18:31:10 +1000 (EST)
Message-ID: <44A0ECCE.6070802@thebeastie.org>
Date: Tue, 27 Jun 2006 18:31:10 +1000
From: Michael Vince <mv@thebeastie.org>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US;
	rv:1.7.12) Gecko/20060404
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Greg 'groggy' Lehey <grog@FreeBSD.org>
References: <14a4a8480606261918q39b51f7bkd69958c5a7b05021@mail.gmail.com>
	<20060627033412.GQ10845@wantadilla.lemis.com>
In-Reply-To: <20060627033412.GQ10845@wantadilla.lemis.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: ClamAV using ClamSMTP
X-Virus-Scanned: ClamAV using ClamSMTP
Cc: freebsd-performance@freebsd.org, leo huang <leo.huang.list@gmail.com>
Subject: Re: Is the fsync() fake on FreeBSD6.1?
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Jun 2006 08:31:19 -0000

Greg 'groggy' Lehey wrote:

>On Tuesday, 27 June 2006 at 10:18:47 +0800, leo huang wrote:
>  
>
>>Hi,
>>
>>I benchmarked MySQL 4.1.18 on FreeBSD 6.1 and Debian 3.1 using Super Smack
>>1.3 some days ago.
>>
>>...
>>
>>The result surprise me. The MySQL Performance on FreeBSD6.1 is about
>>10 times of on Debian3.1, and the output of iostat also shows it.
>>
>>I know that MySQL uses fsync() to flush both the data and log files
>>at default when using innodb engine(
>>http://dev.mysql.com/doc/refman/4.1/en/innodb-parameters.html). Our
>>evaluating computer only has a 10000RPM SCSI hard disk. I think it
>>can do about 200 sequential fsync() calls per second if the fsync()
>>is real.
>>
>>Is the fsync() on FreeBSD6.1 fake?
>>    
>>
>
>My understanding from the last time I looked at the code was that
>fsync does the right thing:
>
>     The fsync() system call causes all modified data and attributes of fd to
>     be moved to a permanent storage device.  This normally results in all in-
>     core modified copies of buffers for the associated file to be written to
>     a disk.
>
>This is not the case for Linux, where fsync syncs the entire file
>system.  That could explain some of the performance difference, but
>not all of it.  I suppose it's worth noting that, in general, people
>report much better performance with MySQL on Linux than on FreeBSD.
>
>  
>
>>I mean than the data is only written to the drives memory and so can
>>be lost if power goes down.
>>    
>>
>
>I don't believe that fsync is required to flush the drive buffers.  It
>would be nice to have a function that did, though.
>
>  
>
>>And how I can confirm this?
>>    
>>
>
>Trial and error?
>
>Greg
>--
>See complete headers for address and phone numbers.
>  
>
I once tried mounting /var/db/mysql on an md device to see whether it
made any difference in performance.
After some supersmack benchmarking I concluded that it made somewhere
between no difference and a few percentage points. You are using a
larger data set in your benchmark, though, so maybe there would be a
noticeable difference.

A few months ago I saw a post reporting that on -current there was
about a 90% performance increase per CPU core on a quad-core system,
which is very good scaling.
I don't know whether whatever gave that kind of scaling was merged into
6, but his numbers looked very good.

The standard Dell server I bought in late 2005, with 2 sockets and 2
cores in total, now costs the same as a Dell server with 2 sockets and
4 CPU cores. Early next year Intel will release its CPUs with 4 cores
each, so the standard server will have 8 cores, and it will not take
long for that to cost the same again.
With performance increasing at such a fast rate, I have the daydream
belief that it is going to be hard to have db performance problems
CPU-wise in the future, at least for the kind of load that supersmack
benchmarks tend to test.

Mike


From owner-freebsd-performance@FreeBSD.ORG  Tue Jun 27 08:42:45 2006
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: freebsd-performance@freebsd.org
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 4A23C16A413
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 08:42:45 +0000 (UTC)
	(envelope-from joseph.koshy@gmail.com)
Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.172])
	by mx1.FreeBSD.org (Postfix) with ESMTP id C35B94415E
	for <freebsd-performance@freebsd.org>;
	Tue, 27 Jun 2006 03:53:09 +0000 (GMT)
	(envelope-from joseph.koshy@gmail.com)
Received: by ug-out-1314.google.com with SMTP id m3so1315368uge
	for <freebsd-performance@freebsd.org>;
	Mon, 26 Jun 2006 20:53:08 -0700 (PDT)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com;
	h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
	b=ZPaeVN+zHhV/xqk+MzHqX+ElZHJ5sZwvPVtPopkevJrWJ77fSc8oAz9ART0d4/Apr1Z7hdM7mQvJEkWwBji3DshtNcNjYsCEioHrchNkUbK/KAa62NQl7k/2M3DBshluxOKEo5ELcx9q4Qm6ZIfuMI8iriTz/PAu8Y0AYwSTGDw=
Received: by 10.78.159.7 with SMTP id h7mr2355846hue;
	Mon, 26 Jun 2006 20:53:08 -0700 (PDT)
Received: by 10.78.50.15 with HTTP; Mon, 26 Jun 2006 20:53:08 -0700 (PDT)
Message-ID: <84dead720606262053m29bb923by79662b9288060892@mail.gmail.com>
Date: Tue, 27 Jun 2006 09:23:08 +0530
From: "Joseph Koshy" <joseph.koshy@gmail.com>
To: "leo huang" <leo.huang.list@gmail.com>
In-Reply-To: <14a4a8480606261918q39b51f7bkd69958c5a7b05021@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <14a4a8480606261918q39b51f7bkd69958c5a7b05021@mail.gmail.com>
Cc: freebsd-performance@freebsd.org
Subject: Re: Is the fsync() fake on FreeBSD6.1?
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Jun 2006 08:42:45 -0000

lh> Is the fsync() on FreeBSD6.1 fake?

It doesn't appear to be:

sys/kern/vfs_syscalls.c: 3194: fsync(td, uap)
sys/ufs/ffs/ffs_vnops.c: 175: ffs_fsync(struct vop_fsync_args *ap)

-- 
FreeBSD Volunteer,     http://people.freebsd.org/~jkoshy

From owner-freebsd-performance@FreeBSD.ORG  Thu Jun 29 23:15:00 2006
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: performance@FreeBSD.org
Delivered-To: freebsd-performance@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id AD49216AE51
	for <performance@FreeBSD.org>; Thu, 29 Jun 2006 23:15:00 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 3497343D55
	for <performance@FreeBSD.org>; Thu, 29 Jun 2006 23:14:33 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id 860E146C41
	for <performance@FreeBSD.org>; Thu, 29 Jun 2006 19:14:32 -0400 (EDT)
Date: Fri, 30 Jun 2006 00:14:32 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: performance@FreeBSD.org
Message-ID: <20060630001142.Y67344@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: 
Subject: Updated fine-grain locking patch for UNIX domain sockets
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 29 Jun 2006 23:15:01 -0000


Attached, and at the below URL, find an updated copy of the UNIX domain socket 
fine-grained locking patch.  Since the last revision, I've updated the patch 
to close several race conditions historically present in UNIX domain sockets 
(which should be merged regardless of the rest of the patch), as well as move 
to an rwlock for the global lock.

     http://www.watson.org/~robert/freebsd/netperf/20060630-uds-fine-grain.diff

This patch increases locking overhead, but decreases contention.  Depending on 
the number of CPUs, it may improve (or not) performance to varying degrees; 
very good reports on sun4v; middling reports on 2-proc, etc.  Stability and 
performance results for UNIX domain socket sensitive workloads, such as MySQL, 
X11, etc, would be appreciated.  Micro-benchmark performance should show a 
small loss, but load under contention and scalability are (ideally) improved.

Robert N M Watson
Computer Laboratory
University of Cambridge

--- kern/uipc_usrreq.c.old	2006/06/26 16:20:56
+++ kern/uipc_usrreq.c	2006/06/29 23:09:34
@@ -54,6 +54,7 @@
  #include <sys/proc.h>
  #include <sys/protosw.h>
  #include <sys/resourcevar.h>
+#include <sys/rwlock.h>
  #include <sys/socket.h>
  #include <sys/socketvar.h>
  #include <sys/signalvar.h>
@@ -88,32 +89,124 @@
  struct mbuf *unp_addsockcred(struct thread *, struct mbuf *);

  /*
- * Currently, UNIX domain sockets are protected by a single subsystem lock,
- * which covers global data structures and variables, the contents of each
- * per-socket unpcb structure, and the so_pcb field in sockets attached to
- * the UNIX domain.  This provides for a moderate degree of paralellism, as
- * receive operations on UNIX domain sockets do not need to acquire the
- * subsystem lock.  Finer grained locking to permit send() without acquiring
- * a global lock would be a logical next step.
+ * Both send and receive buffers are allocated PIPSIZ bytes of buffering
+ * for stream sockets, although the total for sender and receiver is
+ * actually only PIPSIZ.
+ * Datagram sockets really use the sendspace as the maximum datagram size,
+ * and don't really want to reserve the sendspace.  Their recvspace should
+ * be large enough for at least one max-size datagram plus address.
+ */
+#ifndef PIPSIZ
+#define	PIPSIZ	8192
+#endif
+static u_long	unpst_sendspace = PIPSIZ;
+static u_long	unpst_recvspace = PIPSIZ;
+static u_long	unpdg_sendspace = 2*1024;	/* really max datagram size */
+static u_long	unpdg_recvspace = 4*1024;
+
+static int	unp_rights;			/* file descriptors in flight */
+
+SYSCTL_DECL(_net_local_stream);
+SYSCTL_ULONG(_net_local_stream, OID_AUTO, sendspace, CTLFLAG_RW,
+	   &unpst_sendspace, 0, "");
+SYSCTL_ULONG(_net_local_stream, OID_AUTO, recvspace, CTLFLAG_RW,
+	   &unpst_recvspace, 0, "");
+SYSCTL_DECL(_net_local_dgram);
+SYSCTL_ULONG(_net_local_dgram, OID_AUTO, maxdgram, CTLFLAG_RW,
+	   &unpdg_sendspace, 0, "");
+SYSCTL_ULONG(_net_local_dgram, OID_AUTO, recvspace, CTLFLAG_RW,
+	   &unpdg_recvspace, 0, "");
+SYSCTL_DECL(_net_local);
+SYSCTL_INT(_net_local, OID_AUTO, inflight, CTLFLAG_RD, &unp_rights, 0, "");
+
+/*-
+ * Locking and synchronization:
+ *
+ * A global UNIX domain socket rwlock protects all global variables in the
+ * implementation, as well as the linked lists tracking the set of allocated
+ * UNIX domain sockets.  These variables/fields may be read lockless using
+ * atomic operations if stale or mutually inconsistent values are
+ * permissible; otherwise the global rwlock is required to read or
+ * read-modify-write.  The global rwlock also serves to prevent deadlock when
+ * multiple PCB locks may be acquired at once (see below).  Finally, the
+ * global rwlock protects uncounted references from vnodes to sockets bound
+ * to those vnodes: to safely dereference the v_socket pointer, the global
+ * rwlock must be held while a full reference is acquired.  Some cases:
+ *
+ * - For consistent read-modify-write of global state, hold the global lock
+ *   writable.
+ *
+ * - For consistent multiple-read and non-staleness of global state, hold the
+ *   global lock writable.
+ *
+ * - To prevent changes to the global linkage, hold the global lock readable.
+ *
+ * - When modifying global linkage, hold the global lock writable.
+ *
+ * - When acquiring multiple unpcb locks at a time, hold the global lock
+ *   writable.
+ *
+ * UNIX domain sockets each have one unpcb PCB associated with them from
+ * pru_attach() to pru_detach() via the so_pcb pointer.  The validity of that
+ * reference is an invariant for the lifetime of the socket, so no lock is
+ * required to dereference the so_pcb pointer if a valid socket reference is
+ * held.
+ *
+ * Each PCB has a back-pointer to its socket, unp_socket.  This pointer may
+ * only be safely dereferenced as long as a valid reference to the PCB is
+ * held.  Typically, this reference will be from the socket, or from another
+ * PCB when the referring PCB's lock is held (in order that the reference not
+ * be invalidated during use).  In particular, to follow
+ * unp->unp_conn->unp_socket, you need to hold the lock on unp, not unp_conn.
+ *
+ * Fields of PCBs are locked using a per-unpcb lock, unp_mtx.  Individual
+ * atomic reads without the lock may be performed "lockless", but more
+ * complex reads and read-modify-writes require the mutex to be held.  No
+ * lock order is defined between PCB locks -- multiple PCB locks may be
+ * acquired at the same time only when holding the global UNIX domain socket
+ * rwlock, which prevents deadlocks.  To prevent inter-PCB references from
+ * becoming invalid, either the per-unpcb lock of the unpcb holding the
+ * reference must be held, or the global rwlock must be held readable for the
+ * lifetime of the use of the reference.
   *
- * The UNIX domain socket lock preceds all socket layer locks, including the
- * socket lock and socket buffer lock, permitting UNIX domain socket code to
- * call into socket support routines without releasing its locks.
+ * Blocking with UNIX domain sockets is a tricky issue: unlike most network
+ * protocols, bind() is a non-atomic operation, and connect() requires
+ * potential sleeping in the protocol, due to potentially waiting on local or
+ * distributed file systems.  We try to separate "lookup" operations, which
+ * may sleep, and the IPC operations themselves, which typically can occur
+ * with relative atomicity as locks can be held over the entire operation.
   *
- * Some caution is required in areas where the UNIX domain socket code enters
- * VFS in order to create or find rendezvous points.  This results in
- * dropping of the UNIX domain socket subsystem lock, acquisition of the
- * Giant lock, and potential sleeping.  This increases the chances of races,
- * and exposes weaknesses in the socket->protocol API by offering poor
- * failure modes.
+ * Another tricky issue is simultaneous multi-threaded or multi-process
+ * access to a single UNIX domain socket.  These are handled by the flags
+ * UNP_CONNECTING and UNP_BINDING.
   */
-static struct mtx unp_mtx;
-#define	UNP_LOCK_INIT() \
-	mtx_init(&unp_mtx, "unp", NULL, MTX_DEF)
-#define	UNP_LOCK()		mtx_lock(&unp_mtx)
-#define	UNP_UNLOCK()		mtx_unlock(&unp_mtx)
-#define	UNP_LOCK_ASSERT()	mtx_assert(&unp_mtx, MA_OWNED)
-#define	UNP_UNLOCK_ASSERT()	mtx_assert(&unp_mtx, MA_NOTOWNED)
+static struct rwlock	unp_global_rwlock;
+
+#define	UNP_GLOBAL_LOCK_INIT()		rw_init(&unp_global_rwlock,	\
+					    "unp_global_rwlock")
+
+#define	UNP_GLOBAL_LOCK_ASSERT()	rw_assert(&unp_global_rwlock,	\
+					    RA_LOCKED)
+#define	UNP_GLOBAL_UNLOCK_ASSERT()	rw_assert(&unp_global_rwlock,	\
+					    RA_UNLOCKED)
+
+#define	UNP_GLOBAL_WLOCK()		rw_wlock(&unp_global_rwlock)
+#define	UNP_GLOBAL_WUNLOCK()		rw_wunlock(&unp_global_rwlock)
+#define	UNP_GLOBAL_WLOCK_ASSERT()	rw_assert(&unp_global_rwlock,	\
+					    RA_WLOCKED)
+
+#define	UNP_GLOBAL_RLOCK()		rw_rlock(&unp_global_rwlock)
+#define	UNP_GLOBAL_RUNLOCK()		rw_runlock(&unp_global_rwlock)
+#define	UNP_GLOBAL_RLOCK_ASSERT()	rw_assert(&unp_global_rwlock,	\
+					    RA_RLOCKED)
+
+#define UNP_PCB_LOCK_INIT(unp)		mtx_init(&(unp)->unp_mtx,	\
+					    "unp_mtx", "unp_mtx",	\
+					    MTX_DUPOK|MTX_DEF|MTX_RECURSE)
+#define	UNP_PCB_LOCK_DESTROY(unp)	mtx_destroy(&(unp)->unp_mtx)
+#define	UNP_PCB_LOCK(unp)		mtx_lock(&(unp)->unp_mtx)
+#define	UNP_PCB_UNLOCK(unp)		mtx_unlock(&(unp)->unp_mtx)
+#define	UNP_PCB_LOCK_ASSERT(unp)	mtx_assert(&(unp)->unp_mtx, MA_OWNED)

  /*
   * Garbage collection of cyclic file descriptor/socket references occurs
@@ -123,12 +216,10 @@
   */
  static struct task	unp_gc_task;

-static int     unp_attach(struct socket *);
  static void    unp_detach(struct unpcb *);
-static int     unp_bind(struct unpcb *,struct sockaddr *, struct thread *);
  static int     unp_connect(struct socket *,struct sockaddr *, struct thread *);
  static int     unp_connect2(struct socket *so, struct socket *so2, int);
-static void    unp_disconnect(struct unpcb *);
+static void    unp_disconnect(struct unpcb *unp, struct unpcb *unp2);
  static void    unp_shutdown(struct unpcb *);
  static void    unp_drop(struct unpcb *, int);
  static void    unp_gc(__unused void *, int);
@@ -137,8 +228,6 @@
  static void    unp_discard(struct file *);
  static void    unp_freerights(struct file **, int);
  static int     unp_internalize(struct mbuf **, struct thread *);
-static int     unp_listen(struct socket *, struct unpcb *, int,
-		   struct thread *);

  static void
  uipc_abort(struct socket *so)
@@ -147,54 +236,203 @@

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_abort: unp == NULL"));
-	UNP_LOCK();
+
+	UNP_GLOBAL_WLOCK();
+	UNP_PCB_LOCK(unp);
  	unp_drop(unp, ECONNABORTED);
  	unp_detach(unp);
-	UNP_UNLOCK_ASSERT();
+	UNP_GLOBAL_UNLOCK_ASSERT();
  }

  static int
  uipc_accept(struct socket *so, struct sockaddr **nam)
  {
-	struct unpcb *unp;
+	struct unpcb *unp, *unp2;
  	const struct sockaddr *sa;

  	/*
-	 * Pass back name of connected socket,
-	 * if it was bound and we are still connected
-	 * (our peer may have closed already!).
+	 * Pass back name of connected socket, if it was bound and we are
+	 * still connected (our peer may have closed already!).
  	 */
  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_accept: unp == NULL"));
+
  	*nam = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK);
-	UNP_LOCK();
-	if (unp->unp_conn != NULL && unp->unp_conn->unp_addr != NULL)
-		sa = (struct sockaddr *) unp->unp_conn->unp_addr;
-	else
+	UNP_GLOBAL_RLOCK();
+	unp2 = unp->unp_conn;
+	if (unp2 != NULL && unp2->unp_addr != NULL) {
+		UNP_PCB_LOCK(unp2);
+		sa = (struct sockaddr *) unp2->unp_addr;
+		bcopy(sa, *nam, sa->sa_len);
+		UNP_PCB_UNLOCK(unp2);
+	} else {
  		sa = &sun_noname;
-	bcopy(sa, *nam, sa->sa_len);
-	UNP_UNLOCK();
+		bcopy(sa, *nam, sa->sa_len);
+	}
+	UNP_GLOBAL_RUNLOCK();
  	return (0);
  }

  static int
  uipc_attach(struct socket *so, int proto, struct thread *td)
  {
+	u_long sendspace, recvspace;
+	struct unpcb *unp;
+	int error;
+
+	KASSERT(so->so_pcb == NULL, ("uipc_attach: so_pcb != NULL"));
+	if (so->so_snd.sb_hiwat == 0 || so->so_rcv.sb_hiwat == 0) {
+		switch (so->so_type) {
+		case SOCK_STREAM:
+			sendspace = unpst_sendspace;
+			recvspace = unpst_recvspace;
+			break;
+
+		case SOCK_DGRAM:
+			sendspace = unpdg_sendspace;
+			recvspace = unpdg_recvspace;
+			break;

-	return (unp_attach(so));
+		default:
+			panic("uipc_attach");
+		}
+		error = soreserve(so, sendspace, recvspace);
+		if (error)
+			return (error);
+	}
+	unp = uma_zalloc(unp_zone, M_WAITOK | M_ZERO);
+	if (unp == NULL)
+		return (ENOBUFS);
+	LIST_INIT(&unp->unp_refs);
+	UNP_PCB_LOCK_INIT(unp);
+	unp->unp_socket = so;
+	so->so_pcb = unp;
+
+	UNP_GLOBAL_WLOCK();
+	unp->unp_gencnt = ++unp_gencnt;
+	unp_count++;
+	LIST_INSERT_HEAD(so->so_type == SOCK_DGRAM ? &unp_dhead
+			 : &unp_shead, unp, unp_link);
+	UNP_GLOBAL_WUNLOCK();
+
+	return (0);
  }

  static int
  uipc_bind(struct socket *so, struct sockaddr *nam, struct thread *td)
  {
+	struct sockaddr_un *soun = (struct sockaddr_un *)nam;
+	struct vnode *vp;
+	struct mount *mp;
+	struct vattr vattr;
+	int error, namelen;
+	struct nameidata nd;
  	struct unpcb *unp;
-	int error;
+	char *buf;

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_bind: unp == NULL"));
-	UNP_LOCK();
-	error = unp_bind(unp, nam, td);
-	UNP_UNLOCK();
+
+	namelen = soun->sun_len - offsetof(struct sockaddr_un, sun_path);
+	if (namelen <= 0)
+		return (EINVAL);
+
+	/*
+	 * We don't allow simultaneous bind() calls on a single UNIX domain
+	 * socket, so flag in-progress operations, and return an error if an
+	 * operation is already in progress.
+	 *
+	 * Historically, we have not allowed a socket to be rebound, so this
+	 * also returns an error.  Not allowing re-binding certainly
+	 * simplifies the implementation and avoids a great many possible
+	 * failure modes.
+	 */
+	UNP_PCB_LOCK(unp);
+	if (unp->unp_vnode != NULL) {
+		UNP_PCB_UNLOCK(unp);
+		return (EINVAL);
+	}
+	if (unp->unp_flags & UNP_BINDING) {
+		UNP_PCB_UNLOCK(unp);
+		return (EALREADY);
+	}
+	unp->unp_flags |= UNP_BINDING;
+	UNP_PCB_UNLOCK(unp);
+
+	buf = malloc(namelen + 1, M_TEMP, M_WAITOK);
+	strlcpy(buf, soun->sun_path, namelen + 1);
+
+	mtx_lock(&Giant);
+restart:
+	mtx_assert(&Giant, MA_OWNED);
+	NDINIT(&nd, CREATE, NOFOLLOW | LOCKPARENT | SAVENAME, UIO_SYSSPACE,
+	    buf, td);
+/* SHOULD BE ABLE TO ADOPT EXISTING AND wakeup() ALA FIFO's */
+	error = namei(&nd);
+	if (error)
+		goto error;
+	vp = nd.ni_vp;
+	if (vp != NULL || vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) {
+		NDFREE(&nd, NDF_ONLY_PNBUF);
+		if (nd.ni_dvp == vp)
+			vrele(nd.ni_dvp);
+		else
+			vput(nd.ni_dvp);
+		if (vp != NULL) {
+			vrele(vp);
+			error = EADDRINUSE;
+			goto error;
+		}
+		error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH);
+		if (error)
+			goto error;
+		goto restart;
+	}
+	VATTR_NULL(&vattr);
+	vattr.va_type = VSOCK;
+	vattr.va_mode = (ACCESSPERMS & ~td->td_proc->p_fd->fd_cmask);
+#ifdef MAC
+	error = mac_check_vnode_create(td->td_ucred, nd.ni_dvp, &nd.ni_cnd,
+	    &vattr);
+#endif
+	if (error == 0) {
+		VOP_LEASE(nd.ni_dvp, td, td->td_ucred, LEASE_WRITE);
+		error = VOP_CREATE(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vattr);
+	}
+	NDFREE(&nd, NDF_ONLY_PNBUF);
+	vput(nd.ni_dvp);
+	if (error) {
+		vn_finished_write(mp);
+		goto error;
+	}
+	vp = nd.ni_vp;
+	ASSERT_VOP_LOCKED(vp, "uipc_bind");
+	soun = (struct sockaddr_un *)sodupsockaddr(nam, M_WAITOK);
+
+	/*
+	 * XXXRW: handle race against another consumer also frobbing
+	 * v_socket?  Or not.
+	 */
+	UNP_GLOBAL_WLOCK();
+	UNP_PCB_LOCK(unp);
+	vp->v_socket = unp->unp_socket;
+	unp->unp_vnode = vp;
+	unp->unp_addr = soun;
+	unp->unp_flags &= ~UNP_BINDING;
+	UNP_PCB_UNLOCK(unp);
+	UNP_GLOBAL_WUNLOCK();
+	VOP_UNLOCK(vp, 0, td);
+	vn_finished_write(mp);
+	mtx_unlock(&Giant);
+	free(buf, M_TEMP);
+	return (0);
+
+error:
+	UNP_PCB_LOCK(unp);
+	unp->unp_flags &= ~UNP_BINDING;
+	UNP_PCB_UNLOCK(unp);
+	mtx_unlock(&Giant);
+	free(buf, M_TEMP);
  	return (error);
  }

@@ -204,23 +442,30 @@
  	int error;

  	KASSERT(td == curthread, ("uipc_connect: td != curthread"));
-	UNP_LOCK();
+
+	UNP_GLOBAL_WLOCK();
  	error = unp_connect(so, nam, td);
-	UNP_UNLOCK();
+	UNP_GLOBAL_WUNLOCK();
  	return (error);
  }

  int
  uipc_connect2(struct socket *so1, struct socket *so2)
  {
-	struct unpcb *unp;
+	struct unpcb *unp, *unp2;
  	int error;

-	unp = sotounpcb(so1);
+	UNP_GLOBAL_WLOCK();
+	unp = so1->so_pcb;
  	KASSERT(unp != NULL, ("uipc_connect2: unp == NULL"));
-	UNP_LOCK();
+	UNP_PCB_LOCK(unp);
+	unp2 = so2->so_pcb;
+	KASSERT(unp2 != NULL, ("uipc_connect2: unp2 == NULL"));
+	UNP_PCB_LOCK(unp2);
  	error = unp_connect2(so1, so2, PRU_CONNECT2);
-	UNP_UNLOCK();
+	UNP_PCB_UNLOCK(unp2);
+	UNP_PCB_UNLOCK(unp);
+	UNP_GLOBAL_WUNLOCK();
  	return (error);
  }

@@ -233,21 +478,31 @@

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_detach: unp == NULL"));
-	UNP_LOCK();
+
+	UNP_GLOBAL_WLOCK();
+	UNP_PCB_LOCK(unp);
  	unp_detach(unp);
-	UNP_UNLOCK_ASSERT();
+	UNP_GLOBAL_UNLOCK_ASSERT();
  }

  static int
  uipc_disconnect(struct socket *so)
  {
-	struct unpcb *unp;
+	struct unpcb *unp, *unp2;

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_disconnect: unp == NULL"));
-	UNP_LOCK();
-	unp_disconnect(unp);
-	UNP_UNLOCK();
+
+	UNP_GLOBAL_WLOCK();
+	UNP_PCB_LOCK(unp);
+	unp2 = unp->unp_conn;
+	if (unp2 != NULL) {
+		UNP_PCB_LOCK(unp2);
+		unp_disconnect(unp, unp2);
+		UNP_PCB_UNLOCK(unp2);
+	}
+	UNP_PCB_UNLOCK(unp);
+	UNP_GLOBAL_WUNLOCK();
  	return (0);
  }

@@ -259,81 +514,108 @@

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_listen: unp == NULL"));
-	UNP_LOCK();
+
+	UNP_PCB_LOCK(unp);
  	if (unp->unp_vnode == NULL) {
-		UNP_UNLOCK();
+		UNP_PCB_UNLOCK(unp);
  		return (EINVAL);
  	}
-	error = unp_listen(so, unp, backlog, td);
-	UNP_UNLOCK();
+
+	SOCK_LOCK(so);
+	error = solisten_proto_check(so);
+	if (error == 0) {
+		cru2x(td->td_ucred, &unp->unp_peercred);
+		unp->unp_flags |= UNP_HAVEPCCACHED;
+		solisten_proto(so, backlog);
+	}
+	SOCK_UNLOCK(so);
+	UNP_PCB_UNLOCK(unp);
  	return (error);
  }

  static int
  uipc_peeraddr(struct socket *so, struct sockaddr **nam)
  {
-	struct unpcb *unp;
+	struct unpcb *unp, *unp2;
  	const struct sockaddr *sa;

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_peeraddr: unp == NULL"));
+
  	*nam = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK);
-	UNP_LOCK();
-	if (unp->unp_conn != NULL && unp->unp_conn->unp_addr!= NULL)
-		sa = (struct sockaddr *) unp->unp_conn->unp_addr;
-	else {
-		/*
-		 * XXX: It seems that this test always fails even when
-		 * connection is established.  So, this else clause is
-		 * added as workaround to return PF_LOCAL sockaddr.
-		 */
+	UNP_PCB_LOCK(unp);
+	/*
+	 * XXX: It seems that this test always fails even when connection is
+	 * established.  So, this else clause is added as workaround to
+	 * return PF_LOCAL sockaddr.
+	 */
+	unp2 = unp->unp_conn;
+	if (unp2 != NULL) {
+		UNP_PCB_LOCK(unp2);
+		if (unp2->unp_addr != NULL)
+			sa = (struct sockaddr *) unp2->unp_addr;
+		else
+			sa = &sun_noname;
+		bcopy(sa, *nam, sa->sa_len);
+		UNP_PCB_UNLOCK(unp2);
+	} else {
  		sa = &sun_noname;
+		bcopy(sa, *nam, sa->sa_len);
  	}
-	bcopy(sa, *nam, sa->sa_len);
-	UNP_UNLOCK();
+	UNP_PCB_UNLOCK(unp);
  	return (0);
  }

  static int
  uipc_rcvd(struct socket *so, int flags)
  {
-	struct unpcb *unp;
+	struct unpcb *unp, *unp2;
  	struct socket *so2;
+	u_int mbcnt, sbcc;
  	u_long newhiwat;

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_rcvd: unp == NULL"));
-	UNP_LOCK();
-	switch (so->so_type) {
-	case SOCK_DGRAM:
+
+	if (so->so_type == SOCK_DGRAM)
  		panic("uipc_rcvd DGRAM?");
-		/*NOTREACHED*/

-	case SOCK_STREAM:
-		if (unp->unp_conn == NULL)
-			break;
-		so2 = unp->unp_conn->unp_socket;
-		SOCKBUF_LOCK(&so2->so_snd);
-		SOCKBUF_LOCK(&so->so_rcv);
-		/*
-		 * Adjust backpressure on sender
-		 * and wakeup any waiting to write.
-		 */
-		so2->so_snd.sb_mbmax += unp->unp_mbcnt - so->so_rcv.sb_mbcnt;
-		unp->unp_mbcnt = so->so_rcv.sb_mbcnt;
-		newhiwat = so2->so_snd.sb_hiwat + unp->unp_cc -
-		    so->so_rcv.sb_cc;
-		(void)chgsbsize(so2->so_cred->cr_uidinfo, &so2->so_snd.sb_hiwat,
-		    newhiwat, RLIM_INFINITY);
-		unp->unp_cc = so->so_rcv.sb_cc;
-		SOCKBUF_UNLOCK(&so->so_rcv);
-		sowwakeup_locked(so2);
-		break;
+	if (so->so_type != SOCK_STREAM)
+		panic("uipc_rcvd unknown socktype");

-	default:
-		panic("uipc_rcvd unknown socktype");
+	/*
+	 * Adjust backpressure on sender and wakeup any waiting to write.
+	 *
+	 * The consistency requirements here are a bit complex: we must
+	 * acquire the lock for our own unpcb in order to prevent it from
+	 * disconnecting while in use, changing the unp_conn peer.  We do not
+	 * need unp2's lock, since the unp2->unp_socket pointer will remain
+	 * static as long as the unp2 pcb is valid, which it will be until we
+	 * release unp's lock to allow a disconnect.  We do need the socket
+	 * buffer mutexes for both endpoints since we access fields in both;
+	 * the local receive counts are snapshotted first, so the two
+	 * mutexes are never held at once.
+	 */
+	SOCKBUF_LOCK(&so->so_rcv);
+	mbcnt = so->so_rcv.sb_mbcnt;
+	sbcc = so->so_rcv.sb_cc;
+	SOCKBUF_UNLOCK(&so->so_rcv);
+	UNP_PCB_LOCK(unp);
+	unp2 = unp->unp_conn;
+	if (unp2 == NULL) {
+		UNP_PCB_UNLOCK(unp);
+		return (0);
  	}
-	UNP_UNLOCK();
+	so2 = unp2->unp_socket;
+	SOCKBUF_LOCK(&so2->so_snd);
+	so2->so_snd.sb_mbmax += unp->unp_mbcnt - mbcnt;
+	newhiwat = so2->so_snd.sb_hiwat + unp->unp_cc - sbcc;
+	(void)chgsbsize(so2->so_cred->cr_uidinfo, &so2->so_snd.sb_hiwat,
+	    newhiwat, RLIM_INFINITY);
+	sowwakeup_locked(so2);
+	unp->unp_cc = sbcc;
+	unp->unp_mbcnt = mbcnt;
+	UNP_PCB_UNLOCK(unp);
  	return (0);
  }

@@ -343,13 +625,15 @@
  uipc_send(struct socket *so, int flags, struct mbuf *m, struct sockaddr *nam,
      struct mbuf *control, struct thread *td)
  {
-	int error = 0;
-	struct unpcb *unp;
+	struct unpcb *unp, *unp2;
  	struct socket *so2;
+	u_int mbcnt, sbcc;
  	u_long newhiwat;
+	int error = 0;

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_send: unp == NULL"));
+
  	if (flags & PRUS_OOB) {
  		error = EOPNOTSUPP;
  		goto release;
@@ -358,32 +642,44 @@
  	if (control != NULL && (error = unp_internalize(&control, td)))
  		goto release;

-	UNP_LOCK();
+	if ((nam != NULL) || (flags & PRUS_EOF))
+		UNP_GLOBAL_WLOCK();
+	else
+		UNP_GLOBAL_RLOCK();
+
  	switch (so->so_type) {
  	case SOCK_DGRAM:
  	{
  		const struct sockaddr *from;

+		unp2 = unp->unp_conn;
  		if (nam != NULL) {
-			if (unp->unp_conn != NULL) {
+			if (unp2 != NULL) {
  				error = EISCONN;
+				UNP_PCB_LOCK(unp);
  				break;
  			}
  			error = unp_connect(so, nam, td);
+			UNP_PCB_LOCK(unp);
  			if (error)
  				break;
+			unp2 = unp->unp_conn;
  		} else {
-			if (unp->unp_conn == NULL) {
+			UNP_PCB_LOCK(unp);
+			if (unp2 == NULL) {
  				error = ENOTCONN;
+				/* unp is already locked above. */
  				break;
  			}
  		}
-		so2 = unp->unp_conn->unp_socket;
+		UNP_PCB_LOCK_ASSERT(unp);
+		UNP_PCB_LOCK(unp2);
+		so2 = unp2->unp_socket;
  		if (unp->unp_addr != NULL)
  			from = (struct sockaddr *)unp->unp_addr;
  		else
  			from = &sun_noname;
-		if (unp->unp_conn->unp_flags & UNP_WANTCRED)
+		if (unp2->unp_flags & UNP_WANTCRED)
  			control = unp_addsockcred(td, control);
  		SOCKBUF_LOCK(&so2->so_rcv);
  		if (sbappendaddr_locked(&so2->so_rcv, from, m, control)) {
@@ -395,43 +691,57 @@
  			error = ENOBUFS;
  		}
  		if (nam != NULL)
-			unp_disconnect(unp);
+			unp_disconnect(unp, unp2);
+		UNP_PCB_UNLOCK(unp2);
  		break;
  	}

  	case SOCK_STREAM:
  		/* Connect if not connected yet. */
  		/*
-		 * Note: A better implementation would complain
-		 * if not equal to the peer's address.
+		 * Note: A better implementation would complain if not equal
+		 * to the peer's address.
  		 */
  		if ((so->so_state & SS_ISCONNECTED) == 0) {
  			if (nam != NULL) {
  				error = unp_connect(so, nam, td);
+				UNP_PCB_LOCK(unp);
  				if (error)
  					break;	/* XXX */
  			} else {
  				error = ENOTCONN;
+				UNP_PCB_LOCK(unp);
  				break;
  			}
-		}
+		} else
+			UNP_PCB_LOCK(unp);
+		UNP_PCB_LOCK_ASSERT(unp);

-		SOCKBUF_LOCK(&so->so_snd);
  		if (so->so_snd.sb_state & SBS_CANTSENDMORE) {
-			SOCKBUF_UNLOCK(&so->so_snd);
  			error = EPIPE;
  			break;
  		}
-		if (unp->unp_conn == NULL)
-			panic("uipc_send connected but no connection?");
-		so2 = unp->unp_conn->unp_socket;
+		/*
+		 * Lock order here has to be handled carefully: we hold the
+		 * global lock, so acquiring two unpcb locks is OK.  We must
+		 * acquire both before acquiring any socket mutexes.  We must
+		 * also acquire the local socket send mutex before the remote
+		 * socket receive mutex.  The only tricky thing is making
+		 * sure to acquire the unp2 lock before the local socket send
+		 * lock, or we will experience deadlocks.
+		 */
+		unp2 = unp->unp_conn;
+		KASSERT(unp2 != NULL,
+		    ("uipc_send connected but no connection?"));
+		UNP_PCB_LOCK(unp2);
+		so2 = unp2->unp_socket;
  		SOCKBUF_LOCK(&so2->so_rcv);
-		if (unp->unp_conn->unp_flags & UNP_WANTCRED) {
+		if (unp2->unp_flags & UNP_WANTCRED) {
  			/*
  			 * Credentials are passed only once on
  			 * SOCK_STREAM.
  			 */
-			unp->unp_conn->unp_flags &= ~UNP_WANTCRED;
+			unp2->unp_flags &= ~UNP_WANTCRED;
  			control = unp_addsockcred(td, control);
  		}
  		/*
@@ -442,19 +752,20 @@
  		if (control != NULL) {
  			if (sbappendcontrol_locked(&so2->so_rcv, m, control))
  				control = NULL;
-		} else {
+		} else
  			sbappend_locked(&so2->so_rcv, m);
-		}
-		so->so_snd.sb_mbmax -=
-			so2->so_rcv.sb_mbcnt - unp->unp_conn->unp_mbcnt;
-		unp->unp_conn->unp_mbcnt = so2->so_rcv.sb_mbcnt;
-		newhiwat = so->so_snd.sb_hiwat -
-		    (so2->so_rcv.sb_cc - unp->unp_conn->unp_cc);
+		mbcnt = so2->so_rcv.sb_mbcnt;
+		sbcc = so2->so_rcv.sb_cc;
+		sorwakeup_locked(so2);
+		SOCKBUF_LOCK(&so->so_snd);
+		newhiwat = so->so_snd.sb_hiwat - (sbcc - unp2->unp_cc);
  		(void)chgsbsize(so->so_cred->cr_uidinfo, &so->so_snd.sb_hiwat,
  		    newhiwat, RLIM_INFINITY);
+		so->so_snd.sb_mbmax -= mbcnt - unp2->unp_mbcnt;
  		SOCKBUF_UNLOCK(&so->so_snd);
-		unp->unp_conn->unp_cc = so2->so_rcv.sb_cc;
-		sorwakeup_locked(so2);
+		unp2->unp_mbcnt = mbcnt;
+		unp2->unp_cc = sbcc;
+		UNP_PCB_UNLOCK(unp2);
  		m = NULL;
  		break;

@@ -470,7 +781,12 @@
  		socantsendmore(so);
  		unp_shutdown(unp);
  	}
-	UNP_UNLOCK();
+	UNP_PCB_UNLOCK(unp);
+
+	if ((nam != NULL) || (flags & PRUS_EOF))
+		UNP_GLOBAL_WUNLOCK();
+	else
+		UNP_GLOBAL_RUNLOCK();

  	if (control != NULL && error != 0)
  		unp_dispose(control);
@@ -486,22 +802,26 @@
  static int
  uipc_sense(struct socket *so, struct stat *sb)
  {
-	struct unpcb *unp;
+	struct unpcb *unp, *unp2;
  	struct socket *so2;

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_sense: unp == NULL"));
-	UNP_LOCK();
+
  	sb->st_blksize = so->so_snd.sb_hiwat;
-	if (so->so_type == SOCK_STREAM && unp->unp_conn != NULL) {
-		so2 = unp->unp_conn->unp_socket;
+	UNP_GLOBAL_RLOCK();
+	UNP_PCB_LOCK(unp);
+	unp2 = unp->unp_conn;
+	if (so->so_type == SOCK_STREAM && unp2 != NULL) {
+		so2 = unp2->unp_socket;
  		sb->st_blksize += so2->so_rcv.sb_cc;
  	}
  	sb->st_dev = NODEV;
  	if (unp->unp_ino == 0)
  		unp->unp_ino = (++unp_ino == 0) ? ++unp_ino : unp_ino;
  	sb->st_ino = unp->unp_ino;
-	UNP_UNLOCK();
+	UNP_PCB_UNLOCK(unp);
+	UNP_GLOBAL_RUNLOCK();
  	return (0);
  }

@@ -512,10 +832,13 @@

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_shutdown: unp == NULL"));
-	UNP_LOCK();
+
+	UNP_GLOBAL_WLOCK();
+	UNP_PCB_LOCK(unp);
  	socantsendmore(so);
  	unp_shutdown(unp);
-	UNP_UNLOCK();
+	UNP_PCB_UNLOCK(unp);
+	UNP_GLOBAL_WUNLOCK();
  	return (0);
  }

@@ -527,14 +850,15 @@

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_sockaddr: unp == NULL"));
+
  	*nam = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK);
-	UNP_LOCK();
+	UNP_PCB_LOCK(unp);
  	if (unp->unp_addr != NULL)
  		sa = (struct sockaddr *) unp->unp_addr;
  	else
  		sa = &sun_noname;
  	bcopy(sa, *nam, sa->sa_len);
-	UNP_UNLOCK();
+	UNP_PCB_UNLOCK(unp);
  	return (0);
  }

@@ -571,12 +895,13 @@

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("uipc_ctloutput: unp == NULL"));
-	UNP_LOCK();
+
  	error = 0;
  	switch (sopt->sopt_dir) {
  	case SOPT_GET:
  		switch (sopt->sopt_name) {
  		case LOCAL_PEERCRED:
+			UNP_PCB_LOCK(unp);
  			if (unp->unp_flags & UNP_HAVEPC)
  				xu = unp->unp_peercred;
  			else {
@@ -585,22 +910,31 @@
  				else
  					error = EINVAL;
  			}
+			UNP_PCB_UNLOCK(unp);
  			if (error == 0)
  				error = sooptcopyout(sopt, &xu, sizeof(xu));
  			break;
+
  		case LOCAL_CREDS:
+			UNP_PCB_LOCK(unp);
  			optval = unp->unp_flags & UNP_WANTCRED ? 1 : 0;
+			UNP_PCB_UNLOCK(unp);
  			error = sooptcopyout(sopt, &optval, sizeof(optval));
  			break;
+
  		case LOCAL_CONNWAIT:
+			UNP_PCB_LOCK(unp);
  			optval = unp->unp_flags & UNP_CONNWAIT ? 1 : 0;
+			UNP_PCB_UNLOCK(unp);
  			error = sooptcopyout(sopt, &optval, sizeof(optval));
  			break;
+
  		default:
  			error = EOPNOTSUPP;
  			break;
  		}
  		break;
+
  	case SOPT_SET:
  		switch (sopt->sopt_name) {
  		case LOCAL_CREDS:
@@ -610,19 +944,24 @@
  			if (error)
  				break;

-#define	OPTSET(bit) \
-	if (optval) \
-		unp->unp_flags |= bit; \
-	else \
-		unp->unp_flags &= ~bit;
+#define	OPTSET(bit) do {						\
+	UNP_PCB_LOCK(unp);						\
+	if (optval)							\
+		unp->unp_flags |= bit;					\
+	else								\
+		unp->unp_flags &= ~bit;					\
+	UNP_PCB_UNLOCK(unp);						\
+} while (0)

  			switch (sopt->sopt_name) {
  			case LOCAL_CREDS:
  				OPTSET(UNP_WANTCRED);
  				break;
+
  			case LOCAL_CONNWAIT:
  				OPTSET(UNP_CONNWAIT);
  				break;
+
  			default:
  				break;
  			}
@@ -633,117 +972,60 @@
  			break;
  		}
  		break;
+
  	default:
  		error = EOPNOTSUPP;
  		break;
  	}
-	UNP_UNLOCK();
  	return (error);
  }

-/*
- * Both send and receive buffers are allocated PIPSIZ bytes of buffering
- * for stream sockets, although the total for sender and receiver is
- * actually only PIPSIZ.
- * Datagram sockets really use the sendspace as the maximum datagram size,
- * and don't really want to reserve the sendspace.  Their recvspace should
- * be large enough for at least one max-size datagram plus address.
- */
-#ifndef PIPSIZ
-#define	PIPSIZ	8192
-#endif
-static u_long	unpst_sendspace = PIPSIZ;
-static u_long	unpst_recvspace = PIPSIZ;
-static u_long	unpdg_sendspace = 2*1024;	/* really max datagram size */
-static u_long	unpdg_recvspace = 4*1024;
-
-static int	unp_rights;			/* file descriptors in flight */
-
-SYSCTL_DECL(_net_local_stream);
-SYSCTL_ULONG(_net_local_stream, OID_AUTO, sendspace, CTLFLAG_RW,
-	   &unpst_sendspace, 0, "");
-SYSCTL_ULONG(_net_local_stream, OID_AUTO, recvspace, CTLFLAG_RW,
-	   &unpst_recvspace, 0, "");
-SYSCTL_DECL(_net_local_dgram);
-SYSCTL_ULONG(_net_local_dgram, OID_AUTO, maxdgram, CTLFLAG_RW,
-	   &unpdg_sendspace, 0, "");
-SYSCTL_ULONG(_net_local_dgram, OID_AUTO, recvspace, CTLFLAG_RW,
-	   &unpdg_recvspace, 0, "");
-SYSCTL_DECL(_net_local);
-SYSCTL_INT(_net_local, OID_AUTO, inflight, CTLFLAG_RD, &unp_rights, 0, "");
-
-static int
-unp_attach(struct socket *so)
-{
-	struct unpcb *unp;
-	int error;
-
-	KASSERT(so->so_pcb == NULL, ("unp_attach: so_pcb != NULL"));
-	if (so->so_snd.sb_hiwat == 0 || so->so_rcv.sb_hiwat == 0) {
-		switch (so->so_type) {
-
-		case SOCK_STREAM:
-			error = soreserve(so, unpst_sendspace, unpst_recvspace);
-			break;
-
-		case SOCK_DGRAM:
-			error = soreserve(so, unpdg_sendspace, unpdg_recvspace);
-			break;
-
-		default:
-			panic("unp_attach");
-		}
-		if (error)
-			return (error);
-	}
-	unp = uma_zalloc(unp_zone, M_WAITOK | M_ZERO);
-	if (unp == NULL)
-		return (ENOBUFS);
-	LIST_INIT(&unp->unp_refs);
-	unp->unp_socket = so;
-	so->so_pcb = unp;
-
-	UNP_LOCK();
-	unp->unp_gencnt = ++unp_gencnt;
-	unp_count++;
-	LIST_INSERT_HEAD(so->so_type == SOCK_DGRAM ? &unp_dhead
-			 : &unp_shead, unp, unp_link);
-	UNP_UNLOCK();
-
-	return (0);
-}
-
  static void
  unp_detach(struct unpcb *unp)
  {
+	int local_unp_rights;
  	struct vnode *vp;
-	int local_unp_rights;
+	struct unpcb *unp2;

-	UNP_LOCK_ASSERT();
+	UNP_GLOBAL_WLOCK_ASSERT();
+	UNP_PCB_LOCK_ASSERT(unp);

  	LIST_REMOVE(unp, unp_link);
  	unp->unp_gencnt = ++unp_gencnt;
  	--unp_count;
+
+	/*
+	 * XXXRW: What if v_socket != our socket?
+	 */
  	if ((vp = unp->unp_vnode) != NULL) {
-		/*
-		 * XXXRW: should v_socket be frobbed only while holding
-		 * Giant?
-		 */
  		unp->unp_vnode->v_socket = NULL;
  		unp->unp_vnode = NULL;
  	}
-	if (unp->unp_conn != NULL)
-		unp_disconnect(unp);
+	unp2 = unp->unp_conn;
+	if (unp2 != NULL) {
+		UNP_PCB_LOCK(unp2);
+		unp_disconnect(unp, unp2);
+		UNP_PCB_UNLOCK(unp2);
+	}
+
+	/*
+	 * We hold the global lock, so it's OK to acquire multiple pcb locks
+	 * at a time.
+	 */
  	while (!LIST_EMPTY(&unp->unp_refs)) {
  		struct unpcb *ref = LIST_FIRST(&unp->unp_refs);
+
+		UNP_PCB_LOCK(ref);
  		unp_drop(ref, ECONNRESET);
+		UNP_PCB_UNLOCK(ref);
  	}
+	UNP_GLOBAL_WUNLOCK();
  	soisdisconnected(unp->unp_socket);
  	unp->unp_socket->so_pcb = NULL;
  	local_unp_rights = unp_rights;
-	UNP_UNLOCK();
  	if (unp->unp_addr != NULL)
  		FREE(unp->unp_addr, M_SONAME);
+	UNP_PCB_LOCK_DESTROY(unp);
  	uma_zfree(unp_zone, unp);
  	if (vp) {
  		int vfslocked;
@@ -757,97 +1039,6 @@
  }

  static int
-unp_bind(struct unpcb *unp, struct sockaddr *nam, struct thread *td)
-{
-	struct sockaddr_un *soun = (struct sockaddr_un *)nam;
-	struct vnode *vp;
-	struct mount *mp;
-	struct vattr vattr;
-	int error, namelen;
-	struct nameidata nd;
-	char *buf;
-
-	UNP_LOCK_ASSERT();
-
-	/*
-	 * XXXRW: This test-and-set of unp_vnode is non-atomic; the
-	 * unlocked read here is fine, but the value of unp_vnode needs
-	 * to be tested again after we do all the lookups to see if the
-	 * pcb is still unbound?
-	 */
-	if (unp->unp_vnode != NULL)
-		return (EINVAL);
-
-	namelen = soun->sun_len - offsetof(struct sockaddr_un, sun_path);
-	if (namelen <= 0)
-		return (EINVAL);
-
-	UNP_UNLOCK();
-
-	buf = malloc(namelen + 1, M_TEMP, M_WAITOK);
-	strlcpy(buf, soun->sun_path, namelen + 1);
-
-	mtx_lock(&Giant);
-restart:
-	mtx_assert(&Giant, MA_OWNED);
-	NDINIT(&nd, CREATE, NOFOLLOW | LOCKPARENT | SAVENAME, UIO_SYSSPACE,
-	    buf, td);
-/* SHOULD BE ABLE TO ADOPT EXISTING AND wakeup() ALA FIFO's */
-	error = namei(&nd);
-	if (error)
-		goto done;
-	vp = nd.ni_vp;
-	if (vp != NULL || vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) {
-		NDFREE(&nd, NDF_ONLY_PNBUF);
-		if (nd.ni_dvp == vp)
-			vrele(nd.ni_dvp);
-		else
-			vput(nd.ni_dvp);
-		if (vp != NULL) {
-			vrele(vp);
-			error = EADDRINUSE;
-			goto done;
-		}
-		error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH);
-		if (error)
-			goto done;
-		goto restart;
-	}
-	VATTR_NULL(&vattr);
-	vattr.va_type = VSOCK;
-	vattr.va_mode = (ACCESSPERMS & ~td->td_proc->p_fd->fd_cmask);
-#ifdef MAC
-	error = mac_check_vnode_create(td->td_ucred, nd.ni_dvp, &nd.ni_cnd,
-	    &vattr);
-#endif
-	if (error == 0) {
-		VOP_LEASE(nd.ni_dvp, td, td->td_ucred, LEASE_WRITE);
-		error = VOP_CREATE(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vattr);
-	}
-	NDFREE(&nd, NDF_ONLY_PNBUF);
-	vput(nd.ni_dvp);
-	if (error) {
-		vn_finished_write(mp);
-		goto done;
-	}
-	vp = nd.ni_vp;
-	ASSERT_VOP_LOCKED(vp, "unp_bind");
-	soun = (struct sockaddr_un *)sodupsockaddr(nam, M_WAITOK);
-	UNP_LOCK();
-	vp->v_socket = unp->unp_socket;
-	unp->unp_vnode = vp;
-	unp->unp_addr = soun;
-	UNP_UNLOCK();
-	VOP_UNLOCK(vp, 0, td);
-	vn_finished_write(mp);
-done:
-	mtx_unlock(&Giant);
-	free(buf, M_TEMP);
-	UNP_LOCK();
-	return (error);
-}
-
-static int
  unp_connect(struct socket *so, struct sockaddr *nam, struct thread *td)
  {
  	struct sockaddr_un *soun = (struct sockaddr_un *)nam;
@@ -859,15 +1050,25 @@
  	char buf[SOCK_MAXADDRLEN];
  	struct sockaddr *sa;

-	UNP_LOCK_ASSERT();
+	UNP_GLOBAL_WLOCK_ASSERT();

  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("unp_connect: unp == NULL"));
+
  	len = nam->sa_len - offsetof(struct sockaddr_un, sun_path);
  	if (len <= 0)
  		return (EINVAL);
  	strlcpy(buf, soun->sun_path, len + 1);
-	UNP_UNLOCK();
+
+	UNP_PCB_LOCK(unp);
+	if (unp->unp_flags & UNP_CONNECTING) {
+		UNP_PCB_UNLOCK(unp);
+		return (EALREADY);
+	}
+	UNP_GLOBAL_WUNLOCK();
+	unp->unp_flags |= UNP_CONNECTING;
+	UNP_PCB_UNLOCK(unp);
+
  	sa = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK);
  	mtx_lock(&Giant);
  	NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF, UIO_SYSSPACE, buf, td);
@@ -889,9 +1090,15 @@
  	if (error)
  		goto bad;
  	mtx_unlock(&Giant);
-	UNP_LOCK();
+
  	unp = sotounpcb(so);
  	KASSERT(unp != NULL, ("unp_connect: unp == NULL"));
+
+	/*
+	 * Acquire the global lock for two reasons: to keep v_socket stable,
+	 * and to protect simultaneous locking of multiple pcbs.
+	 */
+	UNP_GLOBAL_WLOCK();
  	so2 = vp->v_socket;
  	if (so2 == NULL) {
  		error = ECONNREFUSED;
@@ -904,14 +1111,18 @@
  	if (so->so_proto->pr_flags & PR_CONNREQUIRED) {
  		if (so2->so_options & SO_ACCEPTCONN) {
  			/*
-			 * NB: drop locks here so unp_attach is entered
+			 * NB: drop locks here so uipc_attach is entered
  			 *     w/o locks; this avoids a recursive lock
  			 *     of the head and holding sleep locks across
  			 *     a (potentially) blocking malloc.
+			 *
+			 * XXXRW: This is actually a non-blocking alloc.
+			 * Lock order issue is real.  Releasing lock here
+			 * may invalidate so2 pointer, however.
  			 */
-			UNP_UNLOCK();
+			UNP_GLOBAL_WUNLOCK();
  			so3 = sonewconn(so2, 0);
-			UNP_LOCK();
+			UNP_GLOBAL_WLOCK();
  		} else
  			so3 = NULL;
  		if (so3 == NULL) {
@@ -921,6 +1132,9 @@
  		unp = sotounpcb(so);
  		unp2 = sotounpcb(so2);
  		unp3 = sotounpcb(so3);
+		UNP_PCB_LOCK(unp);
+		UNP_PCB_LOCK(unp2);
+		UNP_PCB_LOCK(unp3);
  		if (unp2->unp_addr != NULL) {
  			bcopy(unp2->unp_addr, sa, unp2->unp_addr->sun_len);
  			unp3->unp_addr = (struct sockaddr_un *) sa;
@@ -938,7 +1152,7 @@
  		/*
  		 * The receiver's (server's) credentials are copied
  		 * from the unp_peercred member of socket on which the
-		 * former called listen(); unp_listen() cached that
+		 * former called listen(); uipc_listen() cached that
  		 * process's credentials at that time so we can use
  		 * them now.
  		 */
@@ -949,6 +1163,9 @@
  		unp->unp_flags |= UNP_HAVEPC;
  		if (unp2->unp_flags & UNP_WANTCRED)
  			unp3->unp_flags |= UNP_WANTCRED;
+		UNP_PCB_UNLOCK(unp3);
+		UNP_PCB_UNLOCK(unp2);
+		UNP_PCB_UNLOCK(unp);
  #ifdef MAC
  		SOCK_LOCK(so);
  		mac_set_socket_peer_from_socket(so, so3);
@@ -958,9 +1175,17 @@

  		so2 = so3;
  	}
+	unp = sotounpcb(so);
+	KASSERT(unp != NULL, ("unp_connect: unp == NULL"));
+	unp2 = sotounpcb(so2);
+	KASSERT(unp2 != NULL, ("unp_connect: unp2 == NULL"));
+	UNP_PCB_LOCK(unp);
+	UNP_PCB_LOCK(unp2);
  	error = unp_connect2(so, so2, PRU_CONNECT);
+	UNP_PCB_UNLOCK(unp2);
+	UNP_PCB_UNLOCK(unp);
  bad2:
-	UNP_UNLOCK();
+	UNP_GLOBAL_WUNLOCK();
  	mtx_lock(&Giant);
  bad:
  	mtx_assert(&Giant, MA_OWNED);
@@ -968,25 +1193,33 @@
  		vput(vp);
  	mtx_unlock(&Giant);
  	free(sa, M_SONAME);
-	UNP_LOCK();
+	UNP_GLOBAL_WLOCK();
+	UNP_PCB_LOCK(unp);
+	unp->unp_flags &= ~UNP_CONNECTING;
+	UNP_PCB_UNLOCK(unp);
  	return (error);
  }

  static int
  unp_connect2(struct socket *so, struct socket *so2, int req)
  {
-	struct unpcb *unp = sotounpcb(so);
+	struct unpcb *unp;
  	struct unpcb *unp2;

-	UNP_LOCK_ASSERT();
+	unp = sotounpcb(so);
+	KASSERT(unp != NULL, ("unp_connect2: unp == NULL"));
+	unp2 = sotounpcb(so2);
+	KASSERT(unp2 != NULL, ("unp_connect2: unp2 == NULL"));
+
+	UNP_GLOBAL_WLOCK_ASSERT();
+	UNP_PCB_LOCK_ASSERT(unp);
+	UNP_PCB_LOCK_ASSERT(unp2);

  	if (so2->so_type != so->so_type)
  		return (EPROTOTYPE);
-	unp2 = sotounpcb(so2);
-	KASSERT(unp2 != NULL, ("unp_connect2: unp2 == NULL"));
  	unp->unp_conn = unp2;
+
  	switch (so->so_type) {
-
  	case SOCK_DGRAM:
  		LIST_INSERT_HEAD(&unp2->unp_refs, unp, unp_reflink);
  		soisconnected(so);
@@ -1009,15 +1242,16 @@
  }

  static void
-unp_disconnect(struct unpcb *unp)
+unp_disconnect(struct unpcb *unp, struct unpcb *unp2)
  {
-	struct unpcb *unp2 = unp->unp_conn;
  	struct socket *so;

-	UNP_LOCK_ASSERT();
+	KASSERT(unp2 != NULL, ("unp_disconnect: unp2 == NULL"));
+
+	UNP_GLOBAL_WLOCK_ASSERT();
+	UNP_PCB_LOCK_ASSERT(unp);
+	UNP_PCB_LOCK_ASSERT(unp2);

-	if (unp2 == NULL)
-		return;
  	unp->unp_conn = NULL;
  	switch (unp->unp_socket->so_type) {
  	case SOCK_DGRAM:
@@ -1074,10 +1308,10 @@
  	 * OK, now we're committed to doing something.
  	 */
  	xug = malloc(sizeof(*xug), M_TEMP, M_WAITOK);
-	UNP_LOCK();
+	UNP_GLOBAL_RLOCK();
  	gencnt = unp_gencnt;
  	n = unp_count;
-	UNP_UNLOCK();
+	UNP_GLOBAL_RUNLOCK();

  	xug->xug_len = sizeof *xug;
  	xug->xug_count = n;
@@ -1091,23 +1325,36 @@

  	unp_list = malloc(n * sizeof *unp_list, M_TEMP, M_WAITOK);

-	UNP_LOCK();
+	/*
+	 * XXXRW: Note, this code relies very explicitly on pcbs being type
+	 * stable.
+	 */
+	UNP_GLOBAL_RLOCK();
  	for (unp = LIST_FIRST(head), i = 0; unp && i < n;
  	     unp = LIST_NEXT(unp, unp_link)) {
+		UNP_PCB_LOCK(unp);
  		if (unp->unp_gencnt <= gencnt) {
  			if (cr_cansee(req->td->td_ucred,
-			    unp->unp_socket->so_cred))
+			    unp->unp_socket->so_cred)) {
+				UNP_PCB_UNLOCK(unp);
  				continue;
+			}
  			unp_list[i++] = unp;
  		}
+		UNP_PCB_UNLOCK(unp);
  	}
-	UNP_UNLOCK();
+	UNP_GLOBAL_RUNLOCK();
  	n = i;			/* in case we lost some during malloc */

+	/*
+	 * XXXRW: The logic below assumes that it is OK to lock a mutex in
+	 * an unpcb that may have been freed.
+	 */
  	error = 0;
  	xu = malloc(sizeof(*xu), M_TEMP, M_WAITOK | M_ZERO);
  	for (i = 0; i < n; i++) {
  		unp = unp_list[i];
+		UNP_PCB_LOCK(unp);
  		if (unp->unp_gencnt <= gencnt) {
  			xu->xu_len = sizeof *xu;
  			xu->xu_unpp = unp;
@@ -1125,8 +1370,10 @@
  				      unp->unp_conn->unp_addr->sun_len);
  			bcopy(unp, &xu->xu_unp, sizeof *unp);
  			sotoxsocket(unp->unp_socket, &xu->xu_socket);
+			UNP_PCB_UNLOCK(unp);
  			error = SYSCTL_OUT(req, xu, sizeof *xu);
-		}
+		} else
+			UNP_PCB_UNLOCK(unp);
  	}
  	free(xu, M_TEMP);
  	if (!error) {
@@ -1157,24 +1404,37 @@
  static void
  unp_shutdown(struct unpcb *unp)
  {
+	struct unpcb *unp2;
  	struct socket *so;

-	UNP_LOCK_ASSERT();
+	UNP_GLOBAL_WLOCK_ASSERT();
+	UNP_PCB_LOCK_ASSERT(unp);

-	if (unp->unp_socket->so_type == SOCK_STREAM && unp->unp_conn &&
-	    (so = unp->unp_conn->unp_socket))
-		socantrcvmore(so);
+	unp2 = unp->unp_conn;
+	if (unp->unp_socket->so_type == SOCK_STREAM && unp2 != NULL) {
+		so = unp2->unp_socket;
+		if (so != NULL)
+			socantrcvmore(so);
+	}
  }

  static void
  unp_drop(struct unpcb *unp, int errno)
  {
  	struct socket *so = unp->unp_socket;
+	struct unpcb *unp2;

-	UNP_LOCK_ASSERT();
+	UNP_GLOBAL_WLOCK_ASSERT();
+	UNP_PCB_LOCK_ASSERT(unp);

  	so->so_error = errno;
-	unp_disconnect(unp);
+	unp2 = unp->unp_conn;
+	if (unp2 == NULL)
+		return;
+
+	UNP_PCB_LOCK(unp2);
+	unp_disconnect(unp, unp2);
+	UNP_PCB_UNLOCK(unp2);
  }

  static void
@@ -1184,7 +1444,6 @@
  	struct file *fp;

  	for (i = 0; i < fdcount; i++) {
-		fp = *rp;
  		/*
  		 * zero the pointer before calling
  		 * unp_discard since it may end up
@@ -1192,7 +1451,8 @@
  		 *
  		 * XXXRW: This is less true than it used to be.
  		 */
-		*rp++ = 0;
+		fp = *rp;
+		*rp++ = NULL;
  		unp_discard(fp);
  	}
  }
@@ -1212,7 +1472,7 @@
  	int f;
  	u_int newlen;

-	UNP_UNLOCK_ASSERT();
+	UNP_GLOBAL_UNLOCK_ASSERT();

  	error = 0;
  	if (controlp != NULL) /* controlp == NULL => free control messages */
@@ -1317,6 +1577,7 @@
  void
  unp_init(void)
  {
+
  	unp_zone = uma_zcreate("unpcb", sizeof(struct unpcb), NULL, NULL,
  	    NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
  	if (unp_zone == NULL)
@@ -1327,7 +1588,7 @@
  	LIST_INIT(&unp_dhead);
  	LIST_INIT(&unp_shead);
  	TASK_INIT(&unp_gc_task, 0, unp_gc, NULL);
-	UNP_LOCK_INIT();
+	UNP_GLOBAL_LOCK_INIT();
  }

  static int
@@ -1347,7 +1608,7 @@
  	int error, oldfds;
  	u_int newlen;

-	UNP_UNLOCK_ASSERT();
+	UNP_GLOBAL_UNLOCK_ASSERT();

  	error = 0;
  	*controlp = NULL;
@@ -1741,25 +2002,6 @@
  		unp_scan(m, unp_discard);
  }

-static int
-unp_listen(struct socket *so, struct unpcb *unp, int backlog,
-    struct thread *td)
-{
-	int error;
-
-	UNP_LOCK_ASSERT();
-
-	SOCK_LOCK(so);
-	error = solisten_proto_check(so);
-	if (error == 0) {
-		cru2x(td->td_ucred, &unp->unp_peercred);
-		unp->unp_flags |= UNP_HAVEPCCACHED;
-		solisten_proto(so, backlog);
-	}
-	SOCK_UNLOCK(so);
-	return (error);
-}
-
  static void
  unp_scan(struct mbuf *m0, void (*op)(struct file *))
  {
@@ -1812,6 +2054,9 @@
  static void
  unp_mark(struct file *fp)
  {
+
+	/* XXXRW: Should probably assert file list lock here. */
+
  	if (fp->f_gcflag & FMARK)
  		return;
  	unp_defer++;
@@ -1821,11 +2066,12 @@
  static void
  unp_discard(struct file *fp)
  {
-	UNP_LOCK();
+
+	UNP_GLOBAL_WLOCK();
  	FILE_LOCK(fp);
  	fp->f_msgcount--;
  	unp_rights--;
  	FILE_UNLOCK(fp);
-	UNP_UNLOCK();
+	UNP_GLOBAL_WUNLOCK();
  	(void) closef(fp, (struct thread *)NULL);
  }
--- sys/unpcb.h.old	2005/04/13 00:05:41
+++ sys/unpcb.h	2006/04/27 20:47:29
@@ -78,6 +78,7 @@
  	unp_gen_t unp_gencnt;		/* generation count of this instance */
  	int	unp_flags;		/* flags */
  	struct	xucred unp_peercred;	/* peer credentials, if applicable */
+	struct	mtx unp_mtx;		/* mutex */
  };

  /*
@@ -98,6 +99,14 @@
  #define	UNP_WANTCRED			0x004	/* credentials wanted */
  #define	UNP_CONNWAIT			0x008	/* connect blocks until accepted */

+/*
+ * These flags are used to handle non-atomicity in connect() and bind()
+ * operations on a socket: in particular, to avoid races between multiple
+ * threads or processes operating simultaneously on the same socket.
+ */
+#define	UNP_CONNECTING			0x010	/* Currently connecting. */
+#define	UNP_BINDING			0x020	/* Currently binding. */
+
  #define	sotounpcb(so)	((struct unpcb *)((so)->so_pcb))

  /* Hack alert -- this structure depends on <sys/socketvar.h>. */
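
For readers unfamiliar with the in-progress flag idea behind UNP_CONNECTING and UNP_BINDING, here is a minimal userland sketch of the pattern, assuming pthread mutexes in place of the kernel's per-pcb mutex. The names demo_pcb, DEMO_CONNECTING, demo_connect_begin and demo_connect_end are hypothetical and do not appear in the patch. The idea is only that the flag is tested and set under the per-pcb lock, the lock is dropped for the slow (possibly sleeping) part of the operation, and a second caller arriving in the meantime gets EALREADY.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical userland stand-ins for struct unpcb and its flag bits. */
#define	DEMO_CONNECTING	0x010

struct demo_pcb {
	pthread_mutex_t	mtx;	/* plays the role of unp_mtx */
	int		flags;	/* plays the role of unp_flags */
};

/*
 * Claim the connect operation.  Returns 0 if the caller now owns it,
 * or EALREADY if another thread is already in the middle of a connect.
 */
static int
demo_connect_begin(struct demo_pcb *pcb)
{
	int error = 0;

	pthread_mutex_lock(&pcb->mtx);
	if (pcb->flags & DEMO_CONNECTING)
		error = EALREADY;
	else
		pcb->flags |= DEMO_CONNECTING;
	pthread_mutex_unlock(&pcb->mtx);
	return (error);
}

/* Release the claim once the slow part of the operation has finished. */
static void
demo_connect_end(struct demo_pcb *pcb)
{
	pthread_mutex_lock(&pcb->mtx);
	assert(pcb->flags & DEMO_CONNECTING);
	pcb->flags &= ~DEMO_CONNECTING;
	pthread_mutex_unlock(&pcb->mtx);
}
```

A caller would bracket the unlocked lookup work between demo_connect_begin() and demo_connect_end(), in the same way unp_connect() sets the flag before namei() and clears it just before returning.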