Date:      Mon, 29 Jul 2013 18:48:37 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-fs@FreeBSD.org, rmacklem@FreeBSD.org, Ali Niknam <ali@transip.nl>
Subject:   Re: nfsclient: incorrect st_blksize (bug?)
Message-ID:  <728189143.3592552.1375138117697.JavaMail.root@uoguelph.ca>
In-Reply-To: <20130729235447.S1849@besplex.bde.org>

Bruce Evans wrote:
> On Mon, 29 Jul 2013, Ali Niknam wrote:
> 
> > I've come across a problem that has proven unsolvable for me so far.
> > It might be a bug in the NFS client code; it might also be my general
> > lack of knowledge :). Can someone please give me a hint in the right
> > direction?
> >
> > This is the case:
> >
> > mount_nfs -o rsize=32768 -o wsize=32768 -o nfsv4 -o tcp host:/path /mnt/nfs
> >
> > stat /mnt/nfs gives st_blksize of 4096 bytes.
> > statfs /mnt/nfs gives an iosize of 4096 bytes.
> >
This value is not the rsize/wsize being used for I/O RPCs on the wire.
On recent FreeBSD systems, you can find out what is actually being used
on the wire via "nfsstat -m", executed as root.

For older FreeBSD systems, you can look at the read and write RPCs with
wireshark.

If you have specified 32768, that is what you should get unless the NFS
server has specified a smaller value.

The 4096 value is just the page size and it is set that way (rather bogusly)
to make the NFS paging code work correctly. (The only time this affects
what happens on the wire is when mmap'd files are read/written, since the
NFS VOP_GETPAGES()/VOP_PUTPAGES() implementations read/write 4096-byte
pages. This needs to be fixed someday by making the NFS paging VOPs do
scatter/gather I/O.)

At least that is how I believe it works, although I am not familiar with the
vm side of things.

rick

> > Mounting with nfsv3 gives the same results, regardless of UDP or TCP
> > protocol. NFSv2, however, seems to give a st_blksize of 128k, with an
> > iosize of 8192 bytes.
> >
> > In short: it seems that with FreeBSD 9.1 the rsize/wsize values aren't
> > passed along correctly. I tried to debug it by looking in the kernel
> > code, but unfortunately I got lost in the abstraction layers
> > (everything seems to set NFS_FABLKSIZE).
> >
> > Mounting the same host on a Linux machine gives the correct
> > st_blksize (32k).
> >
> > The disadvantage is of course that Apache etc. adhere to the 4k
> > st_blksize by only reading 4k chunks, so that NFS I/O slows down
> > substantially.
> 
> nfs still seems to ask for a blocksize of NFS_FABLKSIZE = 512.  Old
> versions of FreeBSD honored the leaf file system's idea of the best
> block size and gave this 512.  After many intermediate broken versions,
> vn_stat() now has a hack that uses PAGE_SIZE iff the leaf file system
> prefers a smaller size, so 512 becomes 4096 on x86.  4096 is not as bad
> as 512, but still too small for most purposes.  OTOH, 512 works quite
> well for nfs over local networks with low latency.  512 fits in a
> 1500-byte packet but 4096 doesn't, so latency can be better with small
> block sizes, and lower latency also gives higher throughput provided
> everything can keep up with the small blocks.
> 
> A workaround might be to use statfs() instead of stat().  st_blksize
> can vary within a file system in theory, but usually doesn't, and can't
> be trusted anyway.  struct statfs has fields f_bsize ("fragment" size)
> and f_iosize (optimal transfer size).  These seem to be set better by
> leaf file systems, and are certainly never frobbed by upper layers
> (except to translate to old statfs()).  nfs still seems to set f_bsize
> to NFS_FABLKSIZE, but it sets f_iosize to its i/o size.  ffs sets
> f_bsize to its fragment size (not so good: statfs() can't even
> represent ffs's 2 types of block size.  Neither can stat(), but
> st_blksize is initialized with the other one, so unportable code can
> determine both).  ffs sets f_iosize to a disk-specific size.  There are
> many bugs in the setting of the latter too, and it now almost always
> reduces to a hard-coded MAXPHYS that has nothing to do with the disks'
> preferred sizes.  Hard-coding MAXPHYS everywhere would be OK for
> throughput but not so good for latency.  To optimize for latency, there
> seems to be nothing better than using statfs()'s f_bsize, but we know
> that reduces to a hard-coded 512 for nfs and to the not-necessarily-best
> fragment size for ffs.
> 
> Bruce


