Date:      Fri, 17 Dec 1999 12:17:51 -0500 (EST)
From:      Andrew Gallatin <gallatin@cs.duke.edu>
To:        "Kenneth D. Merry" <ken@kdm.org>
Cc:        Matthew Dillon <dillon@apollo.backplane.com>, anderson@cs.duke.edu, Poul-Henning Kamp <phk@critter.freebsd.dk>, freebsd-current@FreeBSD.ORG
Subject:   Re: Serious server-side NFS problem
Message-ID:  <14426.25577.295630.812426@grasshopper.cs.duke.edu>
In-Reply-To: <19991216205554.A20410@panzer.kdm.org>
References:  <199912160758.BAA87332@celery.dragondata.com> <199912160801.AAA50074@apollo.backplane.com> <14425.33053.359447.429215@grasshopper.cs.duke.edu> <199912170328.TAA57721@apollo.backplane.com> <19991216205554.A20410@panzer.kdm.org>

Kenneth D. Merry writes:
 > 
 > 
 > Another advantage with gigabit ethernet is that if you can do jumbo frames,
 > you can fit an entire 8K NFS packet in one frame.
 > 
 > I'd like to see NFS numbers from two 21264 Alphas with GigE cards, zero
 > copy, checksum offloading and a big striped array on one end at least.  I

Well.. maybe this will work for you ;-)

Two 21264 Alphas (500MHz XP1000S), 640MB RAM each, Myrinet/Trapeze
using 64-bit Myrinet cards, 8K cluster mbufs, and UDP checksums
disabled (we can do checksum offloading at the receiver only).  We
have a 56K MTU, so an entire 8K NFS packet fits in a single frame with
room to spare.  Using this setup, *without* zero copy, we get roughly
140MB/sec out of TCP:

% netperf -Hbroil-my
TCP STREAM TEST to broil-my : histogram
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

524288 524288 524288    10.01    1135.20   

And about 900Mb/sec (112MB/sec) out of UDP using an 8k message size
(the first row below is the sender, the second what the receiver
actually saw -- roughly 17% of the sends never make it):

% netperf -Hbroil-my -tUDP_STREAM -- -m 8192
UDP UNIDIRECTIONAL SEND TEST to broil-my : histogram
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

 57344    8192   10.00      165619      0    1084.94
 65535           10.00      137338            899.68


I have exported a local disk on broil-my and created a 512MB file
(zot).  Both machines have 640MB of RAM, so the test file stays fully
cached on the server.  When reading the file from the client, the best
I can do is roughly 57MB/sec.
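
The server side was set up roughly like this (the exports line is
illustrative, not copied from the real config; mountd rereads
/etc/exports on a HUP):

# echo '/var/tmp -maproot=root' >> /etc/exports
# kill -HUP `cat /var/run/mountd.pid`
# dd if=/dev/zero of=/var/tmp/zot bs=1m count=512

On the client: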

# mount_nfs -a 3 -r 16384 broil-my:/var/tmp /mnt
# dd if=/mnt/zot of=/dev/null bs=64k
8192+0 records in
8192+0 records out
536870912 bytes transferred in 9.658521 secs (55585209 bytes/sec)
# umount /mnt
# mount_nfs -a 3 -r 32768 broil-my:/var/tmp /mnt
# dd if=/mnt/zot of=/dev/null bs=64k
8192+0 records in
8192+0 records out
536870912 bytes transferred in 9.513517 secs (56432433 bytes/sec)

Empirically, it seems that -a 3 performs better than -a 2 or -a 4.
Also, the bandwidth seems to max out at a 16k read size; increasing it
much beyond that doesn't help.  Varying the number of nfsiods between
2, 4 and 20 doesn't seem to matter much.
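
The sweep is easy to script; something like this (same host and paths
as above) covers the combinations I tried:

#!/bin/sh
# Sweep NFS readahead (-a) and read size (-r), timing each with dd.
for ra in 2 3 4; do
    for rs in 8192 16384 32768; do
        echo "readahead $ra, read size $rs:"
        mount_nfs -a $ra -r $rs broil-my:/var/tmp /mnt
        dd if=/mnt/zot of=/dev/null bs=64k
        umount /mnt
    done
done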

Running iprobe on the client (http://www.cs.duke.edu/ari/iprobe.html)
shows us that we are spending:

- 29.4% in bcopy -- this doesn't change much whether vfs_ioopt is
enabled or disabled, so I suspect it comes from bcopy'ing data out of
mbufs, not from crossing the user/kernel boundary.  Either way,
there's not much that can be done to reduce this in a generic manner.

-  5.5% tsleep (contention between nfsiods?)

The "top" functions/components are:

Name                                     Count   Pct   Pct
--                                       -----   ---   ---
kernel                                    4128        90.0 
--------
bcopy_samealign_lp                        1347  32.6  29.4 
procrunnable                               279   6.8   6.1 
tsleep                                     256   6.2   5.6 
Lidle2                                     195   4.7   4.3 
m_freem                                     89   2.2   1.9 
soreceive                                   73   1.8   1.6 
lockmgr                                     63   1.5   1.4 
brelse                                      60   1.5   1.3 
vm_page_free_toq                            55   1.3   1.2 
ovbcopy                                     51   1.2   1.1 
wakeup                                      43   1.0   0.9 
acquire                                     42   1.0   0.9 
bcopy_da_lp                                 42   1.0   0.9 
nfs_request                                 41   1.0   0.9 
ip_input                                    40   1.0   0.9 
biodone                                     39   0.9   0.9 
nfs_readrpc                                 38   0.9   0.8 
vm_page_alloc                               36   0.9   0.8 
<...>
----------
/modules/tpz.ko                            435         9.5 

tpz.ko is the Myrinet device driver.  This is saying that the system
spent 90% of its time in the static kernel, 9.5% in the device driver,
and 0.5% in userland.  (In the table, the first Pct column is relative
to the component's own samples -- bcopy_samealign_lp accounts for
32.6% of the kernel's 4128 samples -- and the second is relative to
all samples.)

The server is also close to maxed-out.  I can provide an iprobe
breakdown for it as well, and/or complete breakdowns for the client
and server.  


Cheers,

Drew


------------------------------------------------------------------------------
Andrew Gallatin, Sr Systems Programmer	http://www.cs.duke.edu/~gallatin
Duke University				Email: gallatin@cs.duke.edu
Department of Computer Science		Phone: (919) 660-6590

