Date: Tue, 28 Feb 2006 22:55:57 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Danny Braniss
Cc: freebsd-scsi@FreeBSD.org
Subject: Re: Qlogic fibre channel support questions

On Tue, 28 Feb 2006, Danny Braniss wrote:

>> On Mon, 27 Feb 2006, Matthew Jacob wrote:
>>
>>> Okay- let me ask why diskless booting doesn't work for you?
>>
>> Because NFS is slow.  A local disk (or a SAN-attached disk, which is
>> essentially the same to FreeBSD) is going to be faster than NFS, no
>> matter what.
>
> don't be too hasty with conclusions :-)
> 2- as to speed, it all depends, especially on how deep your pockets are.
>    i've been running several 'benchmarks' lately and disk speed is not
>    everything.
>
> sample:
>    host is a Sun Fire X4200 (dual dual-core Opteron) with SAS disks
>    OS is FreeBSD 6.1-PRERELEASE amd64.
>
>    make buildworld:
>    diskless:               40m16.71s real  54m18.55s user  17m54.69s sys
>        (using only 1 server*)
>    nondiskless:            20m51.58s real  51m13.19s user  12m59.84s sys
>    " but /usr/obj is iSCSI:
>                            28m23.29s real  52m17.27s user  14m23.06s sys
>    " but /usr/src and /usr/obj is iSCSI:
>                            20m38.20s real  52m10.19s user  14m48.74s sys
>    diskless but /usr/src and /usr/obj is iSCSI:
>                            20m22.66s real  50m56.14s user  13m8.20s sys
>
> *: server in this case is a Xeon running in 64-bit mode, but with not
>    very fast ethernet - em0 at 1Gbps but at about 50% efficiency.
>    this server will 'make buildworld' in about 40 min. using the onboard
>    LSILogic v3 MegaRAID RAID0.

I recently tried to use 1Gbps ethernet more (instead of 100Mbps) and
hoped to get better makeworld performance, but actually got less.  The
problem seems to be just that nfs3 does too many attribute cache
refreshes, so although all the data fits in the VMIO cache there is a
lot of network activity, and 1Gbps ends up slower because my 1Gbps
NICs have a slightly higher latency than my 100Mbps NICs.  The 100Mbps
ones are fxp's and have a ping latency of about 100us, and the 1Gbps
ones are a bge and an sk and have a ping latency of 140us.  I think
these latencies are lower than average, but they are too large for
good makeworld-over-nfs performance.  makeworld generates about 2000
(or is it 5000?) packets/second, and waiting just 40us longer for each
of 2000 replies per second reduces performance by 8%, or about 120
seconds of the total buildworld time.
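Spelling that estimate out (a back-of-the-envelope check only; the
~1500 second base is the mostly-local-disk real time reported below):

    # ~2000 RPC replies/s, each arriving ~40us later on the 1Gbps NICs:
    echo "2000 * 40" | bc                  # 80000us lost per second, i.e. 8%
    echo "scale=1; 0.08 * 1476.58" | bc    # ~118 seconds of the build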
The faster NICs are better for bandwidth.  I get a max of 40MB/s for
read/write using tcp and about 25MB/s using udp.  tcp is apparently
faster because the latency is so bad that streaming in tcp reduces its
effects significantly.  However, using tcp for makeworld is a
pessimization.

All systems are 2-3GHz AthlonXPs with only 33MHz PCI buses, running a
2-year-old version of FreeBSD-current with local optimizations, with
/usr (including /usr/src) nfs3-mounted and local object and root trees
(initially empty).  "world" is actually only about 95% of the world.

100Mbps:
--------
     31532  maximum resident set size
      2626  average shared memory size
      1762  average unshared data size
       128  average unshared stack size
  15521898  page reclaims
     14904  page faults
         0  swaps
      1932  block input operations    <--- few of these, since bins and
                                           srcs come via nfs
     11822  block output operations   <--- it's not disk-bound
   1883576  messages sent
   1883480  messages received
     33448  signals received
   2104163  voluntary context switches
    472277  involuntary context switches

1Gbps/tcp:
----------
   1930.89 real  1222.87 user  184.10 sys   <--- way slower (real)

1Gbps/udp:
----------
   1909.86 real  1225.25 user  181.22 sys

mostly local disks (except /usr, not including /usr/src):
---------------------------------------------------------
   1476.58 real  1224.70 user  161.30 sys   <--- this is almost a properly
                                                 configured system, with
                                                 disks fast enough for
                                                 real = user + sys + epsilon

1Gbps/udp + the best tuning/hacking I could find:
    nfs access timeout 2 -> 60 (probably wrong for general use)
    sk interrupt moderation 100 -> 10 (reduces latency)
    delete zapping of attribute cache on open in nfs (probably a bug for
        general use; a PR says that this should always be done for ro
        mounts)
----------------------------------------------------------------------------
   1630.86 real  1227.86 user  175.09 sys
   ...
   1342791  messages sent       <--- tuning seems to work mainly by
   1343111  messages received   <--- reducing these; they are still large

1Gbps/udp + the best tuning I could find:
    nfs access timeout 2 -> 60
    sk interrupt moderation 100 -> 10
    no zapping of attribute cache on open in nfs
    -j4
-----------------------------------------------------------
   1599.74 real  1276.18 user  262.04 sys
   ...
   1727832  messages sent
   1726818  messages received

-j is normally bad for UP systems, but here it helps by using cycles
that would otherwise be idle.

Bruce
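For reference, the first of those knobs is a stock sysctl; a minimal
sketch of setting it (the value and caveat are as above; the sk(4)
interrupt moderation change and the removal of attribute-cache zapping
on open were local driver/kernel hacks with no stock knob, so they
appear only as comments):

    # nfs access cache timeout (stock sysctl, default 2 here);
    # probably wrong for general use, as noted above
    sysctl vfs.nfs.access_cache_timeout=60
    # sk(4) interrupt moderation 100 -> 10: local driver hack, no stock knob
    # no zapping of the attribute cache on open: local nfs patch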