From owner-freebsd-fs@FreeBSD.ORG Mon Apr 15 10:28:40 2013
Date: Mon, 15 Apr 2013 20:28:18 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Rick Macklem
Subject: Re: FreeBSD 9.1 NFSv4 client attribute cache not caching ?
In-Reply-To: <1091296771.826148.1365989521302.JavaMail.root@erie.cs.uoguelph.ca>
Message-ID: <20130415184639.V1081@besplex.bde.org>
References: <1091296771.826148.1365989521302.JavaMail.root@erie.cs.uoguelph.ca>
Cc: freebsd-fs@freebsd.org
List-Id: Filesystems

On Sun, 14 Apr 2013, Rick Macklem wrote:

> Paul van der Zwan wrote:
>> On 14 Apr 2013, at 5:00 , Rick Macklem wrote:
>>
>> Thanks for taking the effort to send such an extensive reply.
>>
>>> Paul van der Zwan wrote:
>>>> On 12 Apr 2013, at 16:28 , Paul van der Zwan wrote:
> ...
>>> In NFSv3, each RPC is defined and usually includes attributes for
>>> files before and after the operation (implicit getattrs not counted
>>> in the RPC counts reported by nfsstat).
>>>
>>> For NFSv4, every RPC is a compound built up of a list of Operations
>>> like Getattr. Since the NFSv4 server doesn't know what the compound
>>> is doing, nfsstat reports the counts of Operations for the NFSv4
>>> server, so the counts will be much higher than with NFSv3, but do
>>> not reflect the number of RPCs being done.
>>> To get NFSv4 nfsstat output that can be compared to NFSv3, you need
>>> to do the command on the client(s), and it still is only roughly the
>>> same. (I just realized this should be documented in man nfsstat.)
>>>
>> I ran nfsstat -s -v 4 on the server and saw the number of requests
>> being done.
>> They were in the order of a few thousand per second for a single
>> FreeBSD 9.1 client doing a make buildworld.
>>
> Yes, but as I noted above, for NFSv4, these are counts of operations,
> not RPCs. Each RPC in NFSv4 consists of several operations. For
> example, for read it is something like:
> - PutFH, Read, Getattr
>
> As such, you need to do "nfsstat -e -c" on the client in order to
> see how many RPCs are happening.

Does it show the number of physical RPCs, or only something "roughly
the same"?

>>> For the FreeBSD NFSv4 client, the compounds include Getattr
>>> operations similar to what NFSv3 does. It doesn't do a Getattr on
>>> the directory for Lookup, because that would have made the compound
>>> much more complex. I don't think this will have a significant
>>> performance impact, but it will result in some additional Getattr
>>> RPCs.
>>>
>> I ran snoop on port 2049 on the server and I saw a large number of
>> lookups. A lot of them seem to be for directories which are part of
>> the filenames of the compiler and include files on the nfs-mounted
>> /usr/obj. The same names keep reappearing, so it looks like no
>> caching is being done on the client.

When I worked on this in ~2007, unnecessary RPCs for lookups were a
large cause of slowness. This was fixed, at least in nfsv3. Almost all
RPCs for makeworld (closer to 99% than 90%) should now be for open of
the excessively layered and polluted include files, since they are
opened so often compared with other files and every open goes to the
server (though "nocto" should fix this). There are lots of lookups for
the include files too, but the lookups are properly cached.

>> I tried the nocto option in /etc/fstab but it does not show when
>> mount shows the mounted filesystems, so I am not sure if it is being
>> used.
> Head (and I think stable9) is patched so that "nfsstat -m" shows
> all the options actually being used. For 9.1, you just have to trust
> that it has been set.
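The operations-vs-RPCs distinction above can be sketched as a toy tally
(hypothetical compounds, not traffic captured from this thread):

```python
# Sketch: why server-side NFSv4 operation counts overstate the RPC rate.
# Each NFSv4 RPC is a compound of several operations, e.g. a read is
# roughly PutFH + Read + Getattr.  One list per client RPC:
rpcs = [
    ["PutFH", "Read", "Getattr"],      # read
    ["PutFH", "Lookup", "Getattr"],    # lookup
    ["PutFH", "Access", "Getattr"],    # access check
]

rpc_count = len(rpcs)                      # what the client counts
op_count = sum(len(ops) for ops in rpcs)   # what the server counts

print(f"client RPCs: {rpc_count}, server operations: {op_count}")
# -> client RPCs: 3, server operations: 9
```

With three operations per compound, "a few thousand operations per
second" on the server corresponds to only about a third as many actual
RPCs, which is why the comparable numbers come from "nfsstat -e -c" on
the client.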
This doesn't work on ref10-amd64 running 10.0-CURRENT Apr 5. "nfsstat
-m" gives null output. Plain nfsstat confirms that there are some nfs
mounts, with so much activity on them that many of the cache counts are
negative after 9 days of uptime.

> ...
>> I tried a make buildworld buildkernel with /usr/obj a local FS in the
>> Vbox VM that completed in about 2 hours. With /usr/obj on an NFSv4
>> filesystem it takes about a day. A twelvefold increase in elapsed
>> time makes NFSv4 unusable for this use case.

That is extremely slow. Here I am unhappy with the makeworld time over
nfs staying about 13 minutes despite attempts to improve this, but I
only have old slow hardware (2-core 2GHz Turion laptop). I also have a
modified FreeBSD-5, which avoids some of the bloat in -current. My best
time without excessive tuning was:

@ --------------------------------------------------------------
@ >>> make world completed on Fri Nov  2 23:35:11 EST 2007
@        (started Fri Nov  2 23:21:27 EST 2007)
@ --------------------------------------------------------------
@       823.53 real      1295.80 user       192.46 sys
@
@   Lookup    Read  Access  Fsstat   Other   Total
@   127134   23214  624060   24764      99  799271

The kernel was current at the time, but userland was ~5.2. Newer
kernels (1-2 years old) are only a bit slower and don't require any
modifications to get similar RPC counts (with Getattr instead of
Access). /usr, including /usr/bin and /usr/src, was on nfs, but /bin
and /usr/obj were local. Everything fit in RAM caches, so there was no
disk activity except for new reads and new writes. Network latency was
tuned to 60 usec (the minimum for ping).

When nfs was pessimized, the above RPC counts blew out to no more than
2 million. Suppose you have 2 million RPCs with a latency of just 65
usec. That gives a total latency of 130 seconds. Not too bad, but large
compared with 823 seconds. The latency is amortized by having more than
1 CPU and/or building concurrently.
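The latency arithmetic above is just RPC count times round-trip time,
assuming every RPC waits serially for one round trip (the latency
values are the ones quoted in this discussion):

```python
# Back-of-envelope: total blocked time if 2 million RPCs each wait one
# network round trip, with no overlap between them.
rpcs = 2_000_000

for latency_us in (65, 250, 1000):
    total_s = rpcs * latency_us / 1_000_000
    print(f"{latency_us:>5} usec/RPC -> {total_s:>6.0f} s total")
# 65 usec gives 130 s and 1 msec gives 2000 s, matching the figures in
# the text; concurrency (-jN) is what hides this wait time.
```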
Then progress can usually be made in some threads while others are
blocked waiting for the RPCs. However, many networks have latencies
much larger than 65 usec. On the freebsd cluster now, the min latency
is about 250 usec, and since it has multiple users the latency is
sometimes over 1 msec. 2 million RPCs with a latency of 1 msec take
2000 seconds, which is a lot compared with a build time of 823 seconds.

I consider "nocto" excessive tuning, since although it would help
makeworld benchmarks it is unsafe in general. Of course, I tried my
version of it in the above. (The above RPC counts are with the
following critical modifications that weren't in FreeBSD at the time:
- negative caching
- a fix for broken dotdot caching
- a fix for broken "cto"; it did twice as many RPCs as needed.)
Adding the equivalent of "nocto" reduced the RPC counts significantly,
but only reduced the real time by about 20 (?) seconds.

> Source builds on NFS mounts are notoriously slow. A big part of this
> is

Only when misconfigured. The nfs build time in the above is between 5%
and 10% slower than the local build time.

> the synchronous writes that get done because there is only one dirty
> byte range for a block and the loader loves to write small
> non-contiguous areas of its output file.

Writing to nfs would be slow, but I made /usr/obj local to avoid it.
Also, in other (kernel build) tests where object files are written to
the current directory, which is on nfs, the non-separate object
directory is mounted async on the server, so it is fast enough. Now my
reference is building a FreeBSD-4 kernel. My best times were:
- 32+ seconds (src and obj on nfs, async, -j4)
- 30- seconds (src and obj on ffs, async, -j4)
- 64+ (?) seconds (src and obj on nfs, async, -j1)
- 58 (?) seconds (src and obj on ffs, async, -j1)
(/usr on nfs, /bin on ffs). Without parallelism, everything has to wait
for the RPCs, and even with low network latency this costs 5-10%.
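The nfs-over-ffs overhead in those kernel-build times can be worked out
directly (the "+", "-", and "(?)" qualifiers on the measurements are
dropped here, so these percentages are only rough):

```python
# Rough nfs-vs-local overhead for the FreeBSD-4 kernel build times
# quoted above (seconds; qualifiers on the raw numbers ignored).
times = {
    ("nfs", "-j4"): 32.0, ("ffs", "-j4"): 30.0,
    ("nfs", "-j1"): 64.0, ("ffs", "-j1"): 58.0,
}

for j in ("-j4", "-j1"):
    overhead = (times[("nfs", j)] / times[("ffs", j)] - 1) * 100
    print(f"{j}: nfs ~{overhead:.0f}% slower than ffs")
# -j4 comes out near 7% and -j1 near 10%: parallelism amortizes the
# RPC waits, which is the point made in the text.
```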
>> Too bad the server hangs when I use an nfsv3 mount for /usr/obj.
> Try this mount command:
>   mount -t nfs -o nfsv3,nolockd ...
> (I do builds of the src tree NFS mounted, so the only reason I can
> think of that it would hang would be a rpc.lockd issue.)
> If this works, I suspect it will still be slow, but it would be nice
> to find out how much slower NFSv4 is for your case.

That is needed to localize the slowness anyway. It might be just in the
server.

Bruce