Date:      Wed, 27 Nov 2013 10:52:45 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        FreeBSD FS <freebsd-fs@freebsd.org>
Subject:   Re: RFC: NFS client patch to reduce synchronous writes
Message-ID:  <20131127085245.GW59496@kib.kiev.ua>
In-Reply-To: <731168702.21452440.1385509273449.JavaMail.root@uoguelph.ca>
References:  <1139579526.21452374.1385509250511.JavaMail.root@uoguelph.ca> <731168702.21452440.1385509273449.JavaMail.root@uoguelph.ca>

On Tue, Nov 26, 2013 at 06:41:13PM -0500, Rick Macklem wrote:
> Hi,
>
> The current NFS client does a synchronous write
> to the server when a non-contiguous write to the
> same buffer cache block occurs. This is done because
> there is a single dirty byte range recorded in the
> buf structure. This results in a lot of synchronous
> writes for software builds (I believe it is the loader
> that loves to write small non-contiguous chunks to
> its output file). Some users disable synchronous
> writing on the server to improve performance, but
> this puts them at risk of data loss when the server
> crashes.
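>
> For context, roughly the check involved, written here as a
> simplified stand-alone model (the real code lives in the client's
> write path and uses the b_dirtyoff/b_dirtyend fields of struct buf):
>
>     /*
>      * A buffer tracks ONE dirty byte range [dirtyoff, dirtyend).
>      * A write of [on, on + n) that neither overlaps nor abuts it
>      * currently forces a synchronous write RPC.
>      */
>     struct nfsbuf_model {
>             int dirtyoff;       /* start of dirty range */
>             int dirtyend;       /* end of dirty range; 0 == clean */
>     };
>
>     static int
>     must_flush_sync(const struct nfsbuf_model *bp, int on, int n)
>     {
>
>             return (bp->dirtyend > 0 &&
>                 (on > bp->dirtyend || on + n < bp->dirtyoff));
>     }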
>
> Long ago jhb@ emailed me a small patch that avoided
> the synchronous writes by simply making the dirty byte
> range a superset of the bytes written. The problem with
> this is that a rare (possibly non-existent) application
> that writes non-overlapping byte ranges to the same file
> from multiple clients concurrently could lose some of
> those writes: stale data within the superset of the byte
> range gets written back to the server over them.
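>
> A minimal sketch of that superset approach, reusing the
> nfsbuf_model from the sketch above (the real patch works on
> struct buf directly):
>
>     /*
>      * Instead of flushing, grow the single dirty range to cover
>      * both the old range and the new write.  The gap between
>      * them holds cached (possibly stale) data that will now be
>      * written back too -- the multi-client hazard noted above.
>      */
>     static void
>     coalesce_dirty_range(struct nfsbuf_model *bp, int on, int n)
>     {
>
>             if (bp->dirtyend > 0) {
>                     if (on < bp->dirtyoff)
>                             bp->dirtyoff = on;
>                     if (on + n > bp->dirtyend)
>                             bp->dirtyend = on + n;
>             } else {
>                     bp->dirtyoff = on;
>                     bp->dirtyend = on + n;
>             }
>     }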
>
> I created a patch that maintained a list of dirty byte
> ranges. It was complicated and I found that the list
> often had to be > 100 entries to avoid the synchronous
> writes.
>
> So, I think jhb@'s solution is preferable, although I've
> added a couple of tweaks (see the sketch after this list):
> - The synchronous writes (old/current algorithm) are still
>   used if there has been file locking done on the file.
>   (I think any app. that writes a file from multiple clients
>    will/should use file locking.)
> - The synchronous writes (old/current algorithm) are used
>   if a sysctl is set. This will avoid breakage for any app.
>   (if there is one) that writes a file from multiple clients
>   without doing file locking.
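>
> A sketch of how the two tweaks gate the coalescing; the sysctl
> variable and node flag are named here for illustration only:
>
>     /* Illustrative global switch; exposed via SYSCTL in a real patch. */
>     static int nfs_noncontig_coalesce = 1;
>
>     /* Illustrative per-file flag, set once advisory locking is used. */
>     #define NHASBEENLOCKED  0x01
>
>     static int
>     may_coalesce(int node_flags)
>     {
>
>             /*
>              * Keep the old synchronous-write behaviour if the
>              * admin disabled coalescing or the file has ever
>              * been advisory-locked (multi-client writers are
>              * assumed to lock).
>              */
>             return (nfs_noncontig_coalesce != 0 &&
>                 (node_flags & NHASBEENLOCKED) == 0);
>     }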
My feeling is that a global sysctl is too coarse-grained a control.
IMO the setting should be per-mount. It is not unreasonable to have
shared files on one export and perform private client operations on
another, e.g. mailboxes and /usr/obj. I understand that the mailbox
case should be covered by the advlock heuristic, but note that the
heuristic breaks if the advlock is acquired after some writes.

Also, since the feature might cause very-hard-to-diagnose corruption,
I think a facility to detect that dirty-range coalescing was done
would be helpful. It could be a counter printed by mount -v, or even
a warning printed once per mount (see the sketch below).
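
A sketch of both suggestions together, with invented names (a
per-mount knob instead of the global sysctl, plus a counter that
mount -v or a once-per-mount warning could report):

    #include <stdint.h>

    struct nfsmount_model {
            int      mnt_noncontigwr;   /* per-mount enable, from a mount option */
            uint64_t mnt_coalesced;     /* dirty ranges merged so far */
    };

    static int
    may_coalesce_permount(struct nfsmount_model *nmp)
    {

            if (nmp->mnt_noncontigwr == 0)
                    return (0);
            nmp->mnt_coalesced++;       /* cheap evidence for diagnosis */
            return (1);
    }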

>
> For testing on my very slow single core hardware, I see about
> a 10% improvement in kernel build times, along with far fewer
> I/O RPCs:
>              Read RPCs  Write RPCs
> old/current  50K        122K
> patched      39K         40K
> --> it reduced the Read RPC count by about 20% and cut the
>     Write RPC count to 1/3rd.
> I think jhb@ saw pretty good performance results with his patch.
>
> Anyhow, the patch is attached and can also be found here:
>   http://people.freebsd.org/~rmacklem/noncontig-write.patch
>
> I'd like to invite folks to comment/review/test this patch,
> since I think it is ready for head/current.
>
> Thanks, rick
> ps: Kostik, maybe you could look at it. In particular, I am
>     wondering if I zero'd out the buffer the correct way, via
>     vfs_bio_bzero_buf()?
Both vfs_bio_bzero_buf() and plain bzero() would work for the NFS
client, since it does not use unmapped buffers.  vfs_bio_bzero_buf()
is preferred, since it leaves one less place to change if the NFS
client is ever converted to unmapped i/o.

In fact, I do not understand why the zeroing is needed at all. Could
you please comment on it?

Moreover, the zeroing seems to overwrite part of the previously valid
content of the buffer, unless I mis-read the patch. This breaks
mmap/file consistency: the buffer pages may be mapped by a usermode
process, which could observe the 'out of thin air' zeroes before the
writing thread progresses to fill the zeroed part with new content.
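
For reference, a sketch of zeroing only the gap between the old
dirty end and a later write with vfs_bio_bzero_buf(); whether any
zeroing is safe at all, given the mmap concern above, is exactly
the open question:

    #include <sys/param.h>
    #include <sys/buf.h>

    /*
     * Zero the bytes between the existing dirty range and the new
     * write.  vfs_bio_bzero_buf(bp, base, size) also copes with
     * unmapped buffers, unlike plain bzero() on bp->b_data.  If
     * these pages are mapped by a usermode process, the zeroes are
     * visible to it before the writer fills them in.
     */
    static void
    zero_dirty_gap(struct buf *bp, int on)
    {

            if (bp->b_dirtyend > 0 && on > bp->b_dirtyend)
                    vfs_bio_bzero_buf(bp, bp->b_dirtyend,
                        on - bp->b_dirtyend);
    }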
