Date: Sat, 5 Jul 2014 14:24:48 +0300
From: Konstantin Belousov
To: Roger Pau Monné
Subject: Re: Strange IO performance with UFS
Message-ID: <20140705112448.GQ93733@kib.kiev.ua>
In-Reply-To: <53B7D4DF.40301@citrix.com>
Cc: freebsd-fs@freebsd.org, Stefan Parvu, FreeBSD Hackers

On Sat, Jul 05, 2014 at 12:35:11PM +0200, Roger Pau Monné wrote:
> On 05/07/14 11:58, Konstantin Belousov wrote:
> > On Sat, Jul 05, 2014 at 11:32:06AM +0200, Roger Pau Monné
> > wrote:
> >> kernel`g_io_request+0x384
> >> kernel`g_part_start+0x2c3
> >> kernel`g_io_request+0x384
> >> kernel`g_part_start+0x2c3
> >> kernel`g_io_request+0x384
> >> kernel`ufs_strategy+0x8a
> >> kernel`VOP_STRATEGY_APV+0xf5
> >> kernel`bufstrategy+0x46
> >> kernel`cluster_read+0x5e6
> >> kernel`ffs_balloc_ufs2+0x1be2
> >> kernel`ffs_write+0x310
> >> kernel`VOP_WRITE_APV+0x166
> >> kernel`vn_write+0x2eb
> >> kernel`vn_io_fault_doio+0x22
> >> kernel`vn_io_fault1+0x78
> >> kernel`vn_io_fault+0x173
> >> kernel`dofilewrite+0x85
> >> kernel`kern_writev+0x65
> >> kernel`sys_write+0x63
> >>
> >> This can also be seen by running iostat in parallel with the fio
> >> workload:
> >>
> >> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
> >> ada0     243.3 233.7 31053.3 29919.1   31  57.4 100
> >>
> >> This clearly shows that even when I was doing a sequential write
> >> (the fio workload shown above), the disk was actually reading
> >> more data than writing it, which makes no sense, and all the
> >> reads come from the path trace shown above.
> >
> > The backtrace above means that the BA_CLRBUF was specified for
> > UFS_BALLOC(). In turns, this occurs when the write size is less
> > than the UFS block size. UFS has to read the block to ensure that
> > partial write does not corrupt the rest of the buffer.
>
> Thanks for the clarification, that makes sense. I'm not opening the
> file with O_DIRECT, so shouldn't the write be cached in memory and
> flushed to disk when we have the full block? It's a sequential write,
> so the whole block is going to be rewritten very soon.
>
> >
> > You can get the block size for file with stat(2), st_blksize field
> > of the struct stat, or using statfs(2), field f_iosize of struct
> > statfs, or just looking at the dumpfs output for your filesystem,
> > the bsize value. For modern UFS typical value is 32KB.
>
> Yes, block size is 32KB, checked with dumpfs.
> I've changed the block size in fio to 32k and then I get the
> expected results in iostat and fio:
>
>                         extended device statistics
> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
> ada0       1.0 658.2    31.1 84245.1   58 108.4 101
>                         extended device statistics
> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
> ada0       0.0 689.8     0.0 88291.4   54 112.1  99
>                         extended device statistics
> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
> ada0       1.0 593.3    30.6 75936.9   80 111.7  97
>
> write: io=10240MB, bw=81704KB/s, iops=2553, runt=128339msec

The current code in ffs_write() avoids the read before write only when
the write covers a complete block. I think we can loosen the test
somewhat to also avoid the read when we are at EOF and the write
completely covers the valid portion of the last block.

This leaves garbage in the unwritten portion of the block. I believe
that this is not harmful, since the only way for usermode to access
that garbage is through mmap(2), and vnode_generic_getpages() zeroes
out the parts of the page that lie after EOF.

Try this, almost completely untested:

commit 30375741f5b15609e51cac5b242ecfe7d614e902
Author: Konstantin Belousov
Date:   Sat Jul 5 14:19:39 2014 +0300

    Do not do read-before-write if the written area completely covers
    the valid portion of the block at EOF.

diff --git a/sys/ufs/ffs/ffs_vnops.c b/sys/ufs/ffs/ffs_vnops.c
index 423d811..b725932 100644
--- a/sys/ufs/ffs/ffs_vnops.c
+++ b/sys/ufs/ffs/ffs_vnops.c
@@ -729,10 +729,12 @@ ffs_write(ap)
 			vnode_pager_setsize(vp, uio->uio_offset + xfersize);
 
 		/*
-		 * We must perform a read-before-write if the transfer size
-		 * does not cover the entire buffer.
+		 * We must perform a read-before-write if the transfer
+		 * size does not cover the entire buffer or the valid
+		 * part of the last buffer for the file.
 		 */
-		if (fs->fs_bsize > xfersize)
+		if (fs->fs_bsize > xfersize && (blkoffset != 0 ||
+		    uio->uio_offset + xfersize < ip->i_size))
 			flags |= BA_CLRBUF;
 		else
 			flags &= ~BA_CLRBUF;
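
To make the changed condition easier to reason about, here is a minimal
userland sketch of the old and new tests, using plain integers instead
of the kernel structures. The helper names (block_offset,
needs_read_old, needs_read_new) are hypothetical and exist only for
illustration; they are not part of the FreeBSD source. The sketch
assumes a power-of-two block size and that file_size is the size
before the write, as ip->i_size is at that point in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative model only; none of these helpers exist in the kernel.
 *
 * bsize     - filesystem block size (fs->fs_bsize in the patch)
 * offset    - file offset of the write (uio->uio_offset)
 * xfersize  - number of bytes written into the current block
 * file_size - file size before the write (ip->i_size)
 */

/* Offset of the write inside its block, for a power-of-two bsize. */
static int64_t
block_offset(int64_t bsize, int64_t offset)
{
	assert(bsize > 0 && (bsize & (bsize - 1)) == 0);
	assert(offset >= 0);
	return (offset & (bsize - 1));
}

/* Old test: request the read unless the write covers the whole block. */
static bool
needs_read_old(int64_t bsize, int64_t xfersize)
{
	assert(xfersize > 0 && xfersize <= bsize);
	return (bsize > xfersize);
}

/*
 * New test: additionally skip the read when the write starts on a
 * block boundary and ends at or past the current end of file.
 */
static bool
needs_read_new(int64_t bsize, int64_t offset, int64_t xfersize,
    int64_t file_size)
{
	int64_t blkoffset = block_offset(bsize, offset);

	assert(file_size >= 0);
	assert(xfersize > 0 && blkoffset + xfersize <= bsize);
	return (bsize > xfersize &&
	    (blkoffset != 0 || offset + xfersize < file_size));
}
```

With a 32KB block size, needs_read_new(32768, 0, 4096, 0) is false
where needs_read_old(32768, 4096) is true, so the first 4KB write into
a fresh block at EOF no longer requests the read. A following 4KB
write at offset 4096 still returns true in this model, because
blkoffset is non-zero.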