Date: Sat, 25 Jan 2014 20:55:47 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: J David <j.david.lists@gmail.com>
Cc: freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <278396201.16318356.1390701347722.JavaMail.root@uoguelph.ca>
In-Reply-To: <CABXB=RSGhshBe3CWDiQcis4fYYHqRbyQr70QiXM1nLMTSyCQvQ@mail.gmail.com>
J David wrote:
> On Fri, Jan 24, 2014 at 7:10 PM, Rick Macklem <rmacklem@uoguelph.ca>
> wrote:
> > I would like to hear if you find Linux doing read before write when
> > you use "-r 2k", since I think that is writing less than a page.
>
> It doesn't. As I reported in the original test, I used an 8k
> rsize/wsize and a 4k write size on the Linux test and no
> read-before-write was observed. And just now I did as you asked, a 2k
> test with Linux mounting with 32k rsize/wsize. No extra reads,
> excellent performance. FreeBSD, with the same mount options, does
> reads even on the appends in this case and can't.
>
Well, when I get home in April, I'll try the fairly recent Linux client
I have at home and see what it does. I'm not sure what trick they could
use to avoid the read before write for partial pages. (I suppose I
could look at their sources, but that could be pretty scary;-)

If I understand the 15-year-old commit message, the main problem with
not doing the read before write for a partial buffer is that mmap()'d
file access looks at entire pages and potentially gets garbage if the
entire page isn't valid. At this time, there is a single B_CACHE flag
to indicate that the buffer cache entry has been filled in. I think it
would be possible to add a bitmap that marks which pages are actually
valid in the buffer cache entry, but I suspect the coding would be
non-trivial. This would help for the case of page-size writes on page
boundaries, but would still require pages to be read in before a write
when the writes are not page-sized on page boundaries.

Well, one application I do have some experience with is software
builds, and the "ld" stage tends to write lots of chunks of odd sizes
at any byte offset. (When I did testing of some code that extended the
single dirty byte range to a list of dirty byte ranges, I discovered
that "ld" often generates 100+ of these odd-sized, non-contiguous
writes before resulting in a completely written block.
I recently added a mount option called "noncontigwr" that allows the
single dirty byte range to cover these non-contiguous writes.)

Bottom line: if the pages were read in individually, the "ld" case
would result in several (up to 16 for 4K pages in a 64K buffer) small
reads against the server, which isn't nearly as efficient as one
larger 64K read.

As mentioned above, I don't know how Linux avoids the read before
write for partial blocks/pages being written.

rick

> >                                               random   random
> >             KB  reclen   write  rewrite  read  reread   read    write
> > Linux  1048576       2  281082   358672               125687   121964
> > FreeBSD 1048576      2   59042    22624                10304     1933
>
> For comparison, here's the same test with 32k reclen (again, both
> Linux and FreeBSD using 32k rsize/wsize):
>
> >                                               random   random
> >             KB  reclen   write  rewrite  read  reread   read    write
> > Linux  1048576      32  319387   373021               411106   364393
> > FreeBSD 1048576     32   74892    73703                34889    66350
>
> Unfortunately it sounds like this state of affairs isn't really going
> to improve, at least in the near future. If there was one area where
> I never thought Linux would surpass us, it was NFS. :(
>
> Thanks!