Date:      Sat, 25 Jan 2014 20:55:47 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        J David <j.david.lists@gmail.com>
Cc:        freebsd-net@freebsd.org
Subject:   Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID:  <278396201.16318356.1390701347722.JavaMail.root@uoguelph.ca>
In-Reply-To: <CABXB=RSGhshBe3CWDiQcis4fYYHqRbyQr70QiXM1nLMTSyCQvQ@mail.gmail.com>

J David wrote:
> On Fri, Jan 24, 2014 at 7:10 PM, Rick Macklem <rmacklem@uoguelph.ca>
> wrote:
> > I would like to hear if you find Linux doing read before write when
> > you use "-r 2k", since I think that is writing less than a page.
> 
> It doesn't.  As I reported in the original test, I used an 8k
> rsize/wsize and a 4k write size on the Linux test and no
> read-before-write was observed.  And just now I did as you asked, a 2k
> test with Linux mounting with 32k rsize/wsize.  No extra reads,
> excellent performance.  FreeBSD, with the same mount options, does
> reads even on the appends in this case and can't.
> 
Well, when I get home in April, I'll try the fairly recent Linux client
I have at home and see what it does. Not sure what trick they could use
to avoid the read before write for partial pages. (I suppose I can
look at their sources, but that could be pretty scary;-)

If I understand the 15-year-old commit message, the main problem with
not doing the read before write for a partial buffer is that mmap()'d
file access will look at entire pages and potentially get garbage if
the entire page isn't valid.
At this time, there is a single B_CACHE flag to indicate the buffer cache
entry has been filled in. I think it would be possible to add a bitmap
that marks which pages are actually allocated to the buffer cache entry,
but I suspect the coding would be non-trivial. This would help for the
case of page size writes on page boundaries, but would require the pages
to be read in before write when the writes are not of page size on page
boundaries.
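
To make the bitmap idea concrete, here is a minimal sketch in Python (not
FreeBSD code). It assumes a hypothetical per-page "valid" list in place of
the single B_CACHE flag, a 4K page and a 64K buffer, and it ignores
end-of-file handling. The name pages_needing_read is made up for
illustration.

```python
PAGE_SIZE = 4096        # assumed page size
BUF_SIZE = 64 * 1024    # assumed buffer cache block size


def pages_needing_read(valid, offset, length, page_size=PAGE_SIZE):
    """Return the page indices that would have to be read from the server
    before writing `length` bytes at byte `offset` within the buffer.

    `valid` is a hypothetical per-page bitmap (list of bools). A page needs
    a read only if it is not yet valid and the write covers it partially.
    """
    if length <= 0:
        return []
    end = offset + length
    need = []
    for page in range(offset // page_size, (end - 1) // page_size + 1):
        page_start = page * page_size
        fully_covered = offset <= page_start and end >= page_start + page_size
        if not fully_covered and not valid[page]:
            need.append(page)
    return need


valid = [False] * (BUF_SIZE // PAGE_SIZE)
print(pages_needing_read(valid, 0, 4096))     # page-aligned, page-sized write
print(pages_needing_read(valid, 0, 2048))     # 2K write: partial page
print(pages_needing_read(valid, 2048, 4096))  # straddles two pages
```

Under these assumptions only the page-aligned, page-sized write avoids a
read; the 2K write and the straddling write each touch invalid pages
partially.
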
Well, one application I do have some experience with is software builds
and the "ld" stage tends to write lots of chunks of odd sizes at any
byte offset. (When I did testing of some code that extended the single
dirty byte range to a list of dirty byte ranges, I discovered that "ld"
often generates 100+ of these odd-sized non-contiguous writes before resulting
in a completely written block. I recently added a mount option called
"noncontigwr" that would allow the single dirty byte range to cover these
non-contiguous writes.)
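
The difference between a strict single dirty byte range and one that is
allowed to span gaps can be sketched roughly as follows. This is an
illustration only, not the client's code: apply_write is a hypothetical
helper, ranges are half-open (start, end) byte offsets within one buffer,
and it assumes that without "noncontigwr" a non-contiguous write forces
the existing dirty range to be written out first.

```python
def apply_write(dirty, write, noncontig=False):
    """Update a buffer's single dirty byte range for a new write.

    dirty: current dirty range as (start, end), or None if the buffer is clean.
    write: byte range of the new write as (start, end).
    Returns (new_dirty, flushed), where `flushed` is the old range that had
    to be written to the server first, or None if nothing was flushed.
    """
    if dirty is None:
        return write, None
    d0, d1 = dirty
    w0, w1 = write
    touches = w0 <= d1 and w1 >= d0   # overlapping or adjacent ranges
    if touches or noncontig:
        # Grow the single range; with noncontig=True it also covers the gap.
        return (min(d0, w0), max(d1, w1)), None
    # Assumed behaviour: flush the old range, then start a new one.
    return write, dirty


print(apply_write((0, 10), (50, 60)))
print(apply_write((0, 10), (50, 60), noncontig=True))
```

Covering the gap is only safe when the bytes in the gap are already valid
in the buffer, which is one reason the read before write matters here.
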
Bottom line, if the pages were read in individually, the "ld" case would
result in several (up to 16 for 4K pages in a 64K buffer) small reads
against the server, which isn't nearly as efficient as one larger 64K read.
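
The "up to 16" figure can be checked with a small model. This is a sketch
under stated assumptions, not a measurement: count_reads is a hypothetical
function, each read of an invalid unit is counted as one read against the
server, and reading the whole buffer is modelled by setting the read unit
equal to the buffer size.

```python
def count_reads(writes, buf_size=64 * 1024, unit=4096):
    """Count server reads needed for a sequence of (offset, length) writes
    into one buffer when invalid data is read in `unit`-sized pieces."""
    valid = [False] * (buf_size // unit)
    reads = 0
    for offset, length in writes:
        if length <= 0:
            continue
        end = offset + length
        for u in range(offset // unit, (end - 1) // unit + 1):
            if valid[u]:
                continue
            start = u * unit
            if not (offset <= start and end >= start + unit):
                reads += 1        # partial write to an invalid unit
            valid[u] = True       # now valid: read in or fully overwritten
    return reads


# One small odd-sized write in each 4K page of a 64K buffer, "ld"-style.
writes = [(page * 4096 + 100, 200) for page in range(16)]
print(count_reads(writes, unit=4096))       # per-page reads
print(count_reads(writes, unit=64 * 1024))  # one whole-buffer read
```

With one small write landing in every page, the per-page scheme issues 16
reads where the whole-buffer scheme issues 1.
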

As mentioned above, I don't know how Linux would avoid the read before
write for partial blocks/pages being written.

rick

>                                                           random  random
>               KB  reclen   write rewrite    read  reread    read   write
> 
> Linux    1048576       2  281082  358672                  125687  121964
> FreeBSD  1048576       2   59042   22624                   10304    1933
> 
> 
> For comparison, here's the same test with 32k reclen (again, both
> Linux and FreeBSD using 32k rsize/wsize):
> 
>                                                           random  random
>               KB  reclen   write rewrite    read  reread    read   write
> 
> Linux    1048576      32  319387  373021                  411106  364393
> FreeBSD  1048576      32   74892   73703                   34889   66350
> 
> 
> Unfortunately it sounds like this state of affairs isn't really going
> to improve, at least in the near future.  If there was one area where
> I never thought Linux would surpass us, it was NFS. :(
> 
> Thanks!
> 


