Date:      Sat, 5 Jul 2014 14:24:48 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Roger Pau Monné <roger.pau@citrix.com>
Cc:        freebsd-fs@freebsd.org, Stefan Parvu <sparvu@systemdatarecorder.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: Strange IO performance with UFS
Message-ID:  <20140705112448.GQ93733@kib.kiev.ua>
In-Reply-To: <53B7D4DF.40301@citrix.com>
References:  <53B691EA.3070108@citrix.com> <53B69C73.7090806@citrix.com> <20140705001938.54a3873dd698080d93d840e2@systemdatarecorder.org> <53B7C616.1000702@citrix.com> <20140705095831.GO93733@kib.kiev.ua> <53B7D4DF.40301@citrix.com>

On Sat, Jul 05, 2014 at 12:35:11PM +0200, Roger Pau Monné wrote:
> On 05/07/14 11:58, Konstantin Belousov wrote:
> > On Sat, Jul 05, 2014 at 11:32:06AM +0200, Roger Pau Monné wrote:
> >> kernel`g_io_request+0x384 kernel`g_part_start+0x2c3
> >> kernel`g_io_request+0x384 kernel`g_part_start+0x2c3
> >> kernel`g_io_request+0x384 kernel`ufs_strategy+0x8a
> >> kernel`VOP_STRATEGY_APV+0xf5 kernel`bufstrategy+0x46
> >> kernel`cluster_read+0x5e6 kernel`ffs_balloc_ufs2+0x1be2
> >> kernel`ffs_write+0x310 kernel`VOP_WRITE_APV+0x166
> >> kernel`vn_write+0x2eb kernel`vn_io_fault_doio+0x22
> >> kernel`vn_io_fault1+0x78 kernel`vn_io_fault+0x173
> >> kernel`dofilewrite+0x85 kernel`kern_writev+0x65
> >> kernel`sys_write+0x63
> >>
> >> This can also be seen by running iostat in parallel with the fio
> >> workload:
> >>
> >> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
> >> ada0     243.3 233.7 31053.3 29919.1   31  57.4 100
> >>
> >> This clearly shows that even when I was doing a sequential write
> >> (the fio workload shown above), the disk was actually reading
> >> more data than writing it, which makes no sense, and all the
> >> reads come from the path trace shown above.
> >
> > The backtrace above means that BA_CLRBUF was specified for
> > UFS_BALLOC(). In turn, this occurs when the write size is less
> > than the UFS block size. UFS has to read the block to ensure that
> > a partial write does not corrupt the rest of the buffer.
>
> Thanks for the clarification, that makes sense. I'm not opening the
> file with O_DIRECT, so shouldn't the write be cached in memory and
> flushed to disk when we have the full block? It's a sequential write,
> so the whole block is going to be rewritten very soon.
>
> >
> > You can get the block size for file with stat(2), st_blksize field
> > of the struct stat, or using statfs(2), field f_iosize of struct
> > statfs, or just looking at the dumpfs output for your filesystem,
> > the bsize value.  For modern UFS typical value is 32KB.
>
> Yes, block size is 32KB, checked with dumpfs. I've changed the block
> size in fio to 32k and then I get the expected results in iostat and fio:
>
>                         extended device statistics
> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
> ada0       1.0 658.2    31.1 84245.1   58 108.4 101
>                         extended device statistics
> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
> ada0       0.0 689.8     0.0 88291.4   54 112.1  99
>                         extended device statistics
> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
> ada0       1.0 593.3    30.6 75936.9   80 111.7  97
>
> write: io=10240MB, bw=81704KB/s, iops=2553, runt=128339msec
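
As an aside on the st_blksize lookup mentioned in the quoted text, here
is a minimal user-space sketch of reading that value with stat(2). The
helper name preferred_io_size() is made up for illustration and is not
part of any FreeBSD API. The statfs(2) f_iosize route is left out
because that field is not available on every system.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (illustration only): return the st_blksize
 * value that stat(2) reports for path, or -1 if stat(2) fails.
 */
static long
preferred_io_size(const char *path)
{
	struct stat sb;

	if (stat(path, &sb) != 0)
		return (-1);
	return ((long)sb.st_blksize);
}
```

Passing the returned value as the fio block size is what the bs=32k
change above amounts to on this filesystem.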

The current code in ffs_write() avoids the read-before-write only when
the write covers a complete block.  I think we can loosen the test
somewhat to also avoid the read when we are at EOF and the write
completely covers the valid portion of the last block.

This leaves the unwritten portion of the block filled with garbage. I
believe that this is not harmful, since the only way for usermode to
access that garbage is through mmap(2), and vnode_generic_getpages()
zeroes out the parts of the page which are after EOF.
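
To make the relaxed test easier to reason about, here is a user-space
model of the old and the proposed predicates. This is only a sketch: the
function names are hypothetical, plain integers stand in for the kernel
structures, and it assumes that blkoffset is the offset of the write
within its filesystem block and that i_size is the file size before the
write.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the existing check: any transfer smaller than
 * the block size requests a read-before-write (BA_CLRBUF).
 */
static bool
needs_read_old(int64_t bsize, int64_t xfersize)
{
	return (bsize > xfersize);
}

/*
 * Hypothetical model of the proposed check.  A short transfer skips
 * the read when it starts at a block boundary and reaches at least up
 * to the current end of file.
 */
static bool
needs_read_new(int64_t bsize, int64_t uio_offset, int64_t xfersize,
    int64_t i_size)
{
	/* Assumption: offset of the write inside its block. */
	int64_t blkoffset = uio_offset % bsize;

	return (bsize > xfersize &&
	    (blkoffset != 0 || uio_offset + xfersize < i_size));
}
```

With 32KB blocks, a 4KB write at offset 32768 of a 32768-byte file
needs a read under the old test but not under the new one. A 4KB write
at offset 0 of a 1MB file still needs the read under both.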

Try this, almost completely untested:

commit 30375741f5b15609e51cac5b242ecfe7d614e902
Author: Konstantin Belousov <kib@freebsd.org>
Date:   Sat Jul 5 14:19:39 2014 +0300

    Do not do read-before-write if the written area completely covers
    the valid portion of the block at EOF.

diff --git a/sys/ufs/ffs/ffs_vnops.c b/sys/ufs/ffs/ffs_vnops.c
index 423d811..b725932 100644
--- a/sys/ufs/ffs/ffs_vnops.c
+++ b/sys/ufs/ffs/ffs_vnops.c
@@ -729,10 +729,12 @@ ffs_write(ap)
 			vnode_pager_setsize(vp, uio->uio_offset + xfersize);
 
 		/*
-		 * We must perform a read-before-write if the transfer size
-		 * does not cover the entire buffer.
+		 * We must perform a read-before-write if the transfer
+		 * size does not cover the entire buffer or the valid
+		 * part of the last buffer for the file.
 		 */
-		if (fs->fs_bsize > xfersize)
+		if (fs->fs_bsize > xfersize && (blkoffset != 0 ||
+		    uio->uio_offset + xfersize < ip->i_size))
 			flags |= BA_CLRBUF;
 		else
 			flags &= ~BA_CLRBUF;
