From owner-freebsd-fs Tue Mar 21 16:24:19 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from ns0.netcraft.com (ns0.netcraft.com [195.188.192.4])
    by hub.freebsd.org (Postfix) with ESMTP
    id 2BC4A37BBEB; Tue, 21 Mar 2000 16:23:59 -0800 (PST)
    (envelope-from richard@netcraft.com)
Received: (from richard@localhost)
    by ns0.netcraft.com (8.8.8/8.8.8) id AAA28786;
    Wed, 22 Mar 2000 00:22:42 GMT
    (envelope-from richard)
From: Richard Wendland
Message-Id: <200003220022.AAA28786@ns0.netcraft.com>
Subject: FreeBSD random I/O performance issues
In-Reply-To: <38D6BBD7.DA4B950B@originative.co.uk> from Paul Richards
    at "Mar 21, 2000 00:01:27 am"
To: Paul Richards
Date: Wed, 22 Mar 2000 00:22:42 +0000 (GMT)
Cc: Alfred Perlstein, Poul-Henning Kamp, Matthew Dillon,
    current@FreeBSD.ORG, fs@FreeBSD.ORG
X-Mailer: ELM [version 2.4ME+ PL61 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Paul Richards said in "Re: patches for test / review":

> Richard, do you want to post a summary of your tests?

Well I'd best post the working draft of my report on the issues
I've seen, as I'm not going to have time to work on it in the near
future, and it raises serious performance issues that are best
looked at soon.  Note none of these detailed results are from
current, but Paul Richards has checked that these issues are still
present in current.

There are still issues to be explored so this report isn't in a
complete state, and not polished.  It's grown in 3 stages:

- initial Berkeley DB (random I/O) performance problem analysis
- side-issue of ATA outperforming SCSI systems at my synthetic
  benchmark
- interesting dramatic performance changes from changing seek
  multiple and I/O block size one byte from 8192

Note I've cc'd freebsd-fs, as this raises issues in the filesystem area.
I've also changed the subject since I think there are broader issues
here than the clustering algorithm, and this email is rather large
to drop into an ongoing discussion.

The benchmark program source code is available and easy to run;
the bottom of the report has links.

I don't have an explanation for the behaviour I have been measuring,
but I hope these quite extensive results will enable someone to
explain and perhaps suggest improvements.

    Richard.


Folks,

I appear to have found a serious performance problem with random
access file I/O in FreeBSD, and have a simple C benchmark program
which reproducibly demonstrates it.

In that the benchmark demonstrates very poor non-async performance,
this touches on the age-old sync/async filesystem argument, and
FreeBSD vs Linux debates.

I originally observed this problem with perl DB_File (Berkeley DB),
and with the help of truss have synthesised this benchmark as a
much simplified model of heavy Berkeley DB update behaviour.  Quite
probably other database-like software will have similar performance
issues.

This issue appears to be related to the traditional BSD behaviour
of immediately scheduling full disc block writes.  I think this
benchmark must be showing up a related bug.  But it is conceivable
that this is intended noasync behaviour, in which case the
implications need to be thought through.

The program does simple random I/O within a 64KB file, which should
I hope be fully cached so hardly any real I/O would be done.  Other
than mtime, this program makes no file meta-data or directory
changes; and the file remains the same size.

The file is used as eight 8KB blocks, and for each block in the order
0,5,2,7,4,1,6,3,0,... 10,000 lseek/read/lseek/write block updates
are done, much like updating 10,000 non-localised Berkeley DB file
records.

Using a tiny 64KB file is just to simplify and make a point.  My
original perl performance problems were with multi-megabyte files,
but still small enough to be fully cached.
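A minimal sketch of the access pattern just described, for readers
without the real source to hand.  It is not seekreadwrite.c itself
(that is linked at the bottom of the report); the function name
run_updates, the constants NBLOCKS and STRIDE, the step of 5 blocks
(inferred from the 0,5,2,7,... order) and the truncate-and-prefill
step are all assumptions made for illustration.  BLOCKSIZE and
WRITESIZE mirror the -D options used later in the report.

```c
/* Sketch only: NOT the real seekreadwrite.c. */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef BLOCKSIZE
#define BLOCKSIZE 8192      /* seek multiple */
#endif
#ifndef WRITESIZE
#define WRITESIZE 8192      /* bytes read and written per update */
#endif
#define NBLOCKS 8           /* 8 x 8KB = 64KB file (assumed name) */
#define STRIDE  5           /* visits 0,5,2,7,4,1,6,3,0,... (assumed) */

/* Do `count` lseek/read/lseek/write updates; return 0 on success. */
static int run_updates(const char *path, long count)
{
    static unsigned char buf[WRITESIZE];   /* zero-initialised */
    int block = 0;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    /* Fill the file once so the update loop never changes its size. */
    for (int b = 0; b < NBLOCKS; b++) {
        if (lseek(fd, (off_t)b * BLOCKSIZE, SEEK_SET) < 0 ||
            write(fd, buf, WRITESIZE) != WRITESIZE)
            goto fail;
    }

    for (long i = 0; i < count; i++) {
        off_t off = (off_t)block * BLOCKSIZE;

        if (lseek(fd, off, SEEK_SET) < 0 ||
            read(fd, buf, WRITESIZE) != WRITESIZE)
            goto fail;
        buf[0]++;                           /* "update" the record */
        if (lseek(fd, off, SEEK_SET) < 0 ||
            write(fd, buf, WRITESIZE) != WRITESIZE)
            goto fail;
        block = (block + STRIDE) % NBLOCKS; /* next non-adjacent block */
    }
    return close(fd);

fail:
    close(fd);
    return -1;
}
```

Called as run_updates("xxx", 10000) with both sizes left at 8192,
this would perform the 10,000 block updates on a 64KB file.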

I ran this on a large range of lightly loaded or idle machines,
which gave reproducible results.  Results and a summary of the
machines, which unless otherwise noted use SCSI 7200 RPM discs and
Adaptec controllers, are given in descending performance order below.

OS                               Elapse secs  system

FreeBSD 3.2-RELEASE, async mount          <1  (cheap ATA C433, 5400 RPM)
Linux 2.2.13                              <1  (Dell 1300, PIII 450MHz)
Linux 2.0.36                               3  (old ATA P200, 5400 RPM)
Linux 2.0.36, sync [meta-data] mount       3  (old ATA P200, 5400 RPM)
SunOS 5.5.1 (Solaris 2.5.1)                7  (old SS4/110, 5400 RPM)
FreeBSD 2.2.7-RELEASE+CAM, ccd stripe=5   15  (PII 450MHz, 512MB, 10k RPM)
FreeBSD 2.2.7-RELEASE+CAM                 21  (PII 400MHz, 512MB)
FreeBSD 2.1.6.1-RELEASE                   32  (old P100, 64MB)
FreeBSD 2.2.7-RELEASE+CAM, ccd stripe=2   39  (PII 400MHz, 512MB)
FreeBSD 3.4-STABLE, vinum stripe+mirr=4   41  (dual PIII 500MHz, 1GB)
FreeBSD 3.4-STABLE                        41  (dual PIII 500MHz, 1GB)
FreeBSD 2.1.6.1-RELEASE, ccd stripe=2     52  (old P100, 64MB)
FreeBSD 3.3-RELEASE, ccd stripe=2         53  (Dell 1300, PIII 450MHz)
FreeBSD 3.2-RELEASE                       55  (cheap ATA C433, 5400 RPM)
FreeBSD 3.2-RELEASE, noatime mount        55  (cheap ATA C433, 5400 RPM)
FreeBSD 3.2-RELEASE, noclusterr mount     55  (cheap ATA C433, 5400 RPM)
FreeBSD 3.2-RELEASE, noclusterw mount     58  (cheap ATA C433, 5400 RPM)
FreeBSD 3.3-RELEASE                       63  (Dell 1300, PIII 450MHz)
FreeBSD 3.3-RELEASE, softupdates          63  (Dell 1300, PIII 450MHz)
FreeBSD 3.2-RELEASE, sync mount          105  (cheap ATA C433, 5400 RPM)

I also have a range of results from an ATA (IDE) cheap deskside
Dell system running FreeBSD 3.3-RELEASE, with a range of wd(4)
flags.  This system exhibits much better performance than the SCSI
systems above at this benchmark, perhaps related to better DMA
ability.

ATA being faster than SCSI on this benchmark is a bit of a side-issue
to the thrust of this report, but the performance numbers may give
hints diagnosing the problem.

  Dell Dimension XPS T450 440BX
  IBM-DPTA-372730 (Deskstar 34GXP, 7200RPM, 2MB buffer)
  default mount options

  wd(4) flags                            Elapse secs

  0x0000                                          19
  0x00ff, multi-sector transfer mode              17
  0x8000, 32bit transfers                         13
  0x2000, bus-mastering DMA                        4
  0xa0ff, BM-DMA+32bit+multi-sector                4

Note that Linux performs about the same for [meta-data] sync &
async mounts, which is as I'd expect for this program.  But FreeBSD
performance is hugely affected by async, sync or default (meta-data
sync) filesystem mounts, with noclusterw unsurprisingly making it
somewhat worse.

One interesting observation is that for non sync, async or noclusterw
mounts ~8750 I/O operations are done, which is 7/8ths of the 10,000
writes.  If I change the program to use 16 blocks there are ~9375
I/O operations which is 15/16ths of the 10,000 writes.  Guessing,
this is as if writes are forced for all blocks but one.

With async filesystem mounts very little I/O occurs, and with
noclusterw there are ~10,000 operations matching the number of
writes.

With sync it's ~20,000 operations matching the total of reads &
writes.  This demonstrates another aspect of the bug, sync behaviour
should cause 10,000 operations; the reads aren't being cached.

A quick softupdates test suggests this makes no difference, as would
be expected.

Looking at mount output on FreeBSD 3 the substantial part of the
I/O is async in all cases other than sync mounts; as expected.

Another aspect of this issue is the effect of changing the seek
blocksize, and write blocksize, by 1 byte each way from 8192, thus
doing block unaligned I/O.  In some cases this changes the amount
of I/O recorded by getrusage to zero, and drops elapse time from
half a minute or so to less than 1 second.

Thanks to Paul Richards for noticing this.  I've not spent much time
researching this, so can only present my small set of measurements.
To do these tests you have to recompile my test program each time, eg

  gcc -O4 -DBLOCKSIZE=8191 -DWRITESIZE=8193 seekreadwrite.c

Sorry it's that crude.
These results are from a FreeBSD 2.2.7-RELEASE+CAM, ccd stripe=2
(PII 400MHz, 512MB) system, though exactly the same pattern is
apparent with 3.4-STABLE.  "****" indicate sub-second "zero I/O"
results.

BLOCKSIZE WRITESIZE  csh 'time' output

  8191      8191     0.0u 1.5s 0:34.10 4.6% 5+186k 0+7500io 0pf+0w
  8191      8192     0.0u 1.3s 0:31.52 4.5% 5+178k 0+7500io 0pf+0w
  8191      8193     0.0u 1.4s 0:32.63 4.4% 5+189k 0+7500io 0pf+0w

  8192      8191     0.0u 0.7s 0:01.97 37.5% 8+178k 0+0io 0pf+0w  ****
  8192      8192     0.0u 1.3s 0:39.30 3.4% 7+196k 0+8750io 0pf+0w
  8192      8193     0.0u 1.3s 0:40.09 3.4% 5+187k 0+8750io 0pf+0w

  8193      8191     0.0u 1.4s 0:46.22 3.2% 5+192k 0+8750io 0pf+0w
  8193      8192     0.0u 1.6s 0:40.48 4.0% 5+182k 0+8750io 0pf+0w
  8193      8193     0.0u 1.5s 0:40.57 3.8% 5+175k 0+8750io 0pf+0w

  8191      4095     0.0u 1.2s 0:33.79 3.6% 5+193k 0+7500io 0pf+0w
  8191      4096     0.0u 1.2s 0:34.00 3.8% 5+190k 0+7500io 0pf+0w
  8191      4097     0.0u 1.1s 0:33.58 3.6% 4+165k 0+7500io 0pf+0w

  8192      4095     0.0u 0.5s 0:00.76 75.0% 5+189k 0+0io 0pf+0w  ****
  8192      4096     0.0u 0.5s 0:00.58 100.0% 5+183k 0+0io 0pf+0w ****
  8192      4097     0.0u 0.5s 0:00.74 78.3% 5+181k 0+0io 0pf+0w  ****

  8193      4095     0.0u 0.6s 0:01.00 67.0% 5+177k 0+0io 0pf+0w  ****
  8193      4096     0.0u 0.6s 0:01.05 63.8% 5+179k 0+0io 0pf+0w  ****
  8193      4097     0.0u 0.6s 0:01.02 66.6% 5+183k 0+0io 0pf+0w  ****

Any views gratefully received.
A fix would be much better :-)

Test program source, including compile & run instructions, is
available at:

  http://www.netcraft.com/freebsd/random-IO/seekreadwrite.c

Detailed notes on the test system configurations are at:

  http://www.netcraft.com/freebsd/random-IO/results-notes.txt

Thanks,
    Richard
-
Richard Wendland                richard@netcraft.com


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs Tue Mar 21 16:59:42 2000
Date: Tue, 21 Mar 2000 16:59:25 -0800 (PST)
From: Matthew Dillon
Message-Id: <200003220059.QAA83848@apollo.backplane.com>
To: Richard Wendland
Cc: Paul Richards, Alfred Perlstein, Poul-Henning Kamp,
    current@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues
References: <200003220022.AAA28786@ns0.netcraft.com>

:Paul Richards said in "Re: patches for test / review":
:
:> Richard, do you want to post a summary of your tests?
:
:Well I'd best post the working draft of my report on the issues
:I've seen, as I'm not going to have time to work on it in the near
:future, and it raises serious performance issues that are best
:looked at soon.  Note none of these detailed results are from
:current, but Paul Richards has checked that these issues are still
:present in current.
:
: (lots of good stuff)

    Interesting.  The behavior is probably related closely to the
    write-behind methodology that UFS uses.
    A while back while fixing an O(N^2) degenerate condition in the
    buffer cache queueing code, DG and I had a long discussion of the
    write_behind behavior.  I added a sysctl to 4.x that changes the
    write_behind behavior:

        sysctl vfs.write_behind

        0       Turned off
        1       Normal (default)
        2       Backed off

    It would be interesting to see how the benchmark performs with
    write_behind turned off (set to 0).  Note that a setting of 2 is
    highly experimental and will probably suffer from the same
    problem(s) that normal mode suffers from.  (see below, I ran the
    benchmark)

    In general turning off write behind is *NOT* a good idea, because
    it saturates the buffer cache with dirty blocks and can lead to
    seriously degraded performance on a normal system due to write
    hogging.  On the flip side, this was all before I put in the new
    buffer cache flushing code so it is possible that 4.x will not
    degrade as seriously with write behind turned off.  I haven't run
    saturation tests recently with write_behind turned off.

    A secondary issue -- actually the reason *why* performance is so
    bad, is that the buffer cache nominally locks the underlying VM
    pages when issuing a write and this is almost certainly the cause
    of the program stalls.  When a program writes a piece of data (and
    I/O is started immediately), and then reads it back later on, the
    read operation may stall even though the data is in the cache due
    to the write not having yet completed.  The write operation might
    also stall if another nearby write is in progress (I'm not sure on
    that last point).

    Kirk has made significant improvements to stalls related to bitmap
    operations.  I'm not sure if softupdates must be turned on or not
    to get these improvements.  The data blocks can still stall,
    though, but part of the plan for later this year is to fix that
    too.

:The benchmark program source code is available, and easy to run,
:the bottom of the report has links.

    test3:/test/tmp# sysctl -w vfs.write_behind=0     (turned off)
    test3:/test/tmp# time ./seekreadwrite xxx 10000
    0.125u 0.807s 0:00.93 98.9% 5+181k 0+0io 0pf+0w

    test3:/test/tmp# sysctl -w vfs.write_behind=1     (normal)
    test3:/test/tmp# time ./seekreadwrite xxx 10000
    0.040u 1.709s 0:32.57 5.3% 4+174k 0+8750io 0pf+0w

:I also have a range of results from an ATA (IDE) cheap deskside
:Dell system running FreeBSD 3.3-RELEASE, with a range of wd(4)
:flags.  This system exhibits much better performance than the SCSI
:systems above at this benchmark, perhaps related to better DMA
:ability.
:
:ATA being faster than SCSI on this benchmark is a bit of a side-issue
:to the thrust of this report, but the performance numbers may give
:hints diagnosing the problem.

    IDE drives sometimes appear to be faster because they fake the
    write-completion response (they return the response prior to the
    write actually completing).  It could also simply be that the lack
    of any real mixed I/O (due to the file being so small) is a
    slightly faster operation on an IDE drive.  I wouldn't read much
    into it... where SCSI really shines is in more heavily loaded
    environments.

                    -Matt
                    Matthew Dillon

:Thanks,
:    Richard
:-
:Richard Wendland                richard@netcraft.com


From owner-freebsd-fs Tue Mar 21 18:45:31 2000
Message-ID: <38D833BC.A082DF09@originative.co.uk>
Date: Wed, 22 Mar 2000 02:45:16 +0000
From: Paul Richards
Organization: Originative Solutions Ltd
To: Richard Wendland
Cc: Alfred Perlstein, Poul-Henning Kamp, Matthew Dillon,
    current@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues
References: <200003220022.AAA28786@ns0.netcraft.com>

Richard Wendland wrote:

I spent a bit of time analysing these results when I first saw them. I
don't think it has anything to do with the cache, it has to do with how
we write out blocks.

> One interesting observation is that for non sync, async or noclusterw
> mounts ~8750 I/O operations are done, which is 7/8ths of the 10,000
> writes.  If I change the program to use 16 blocks there are ~9375
> I/O operations which is 15/16ths of the 10,000 writes.  Guessing,
> this is as if writes are forced for all blocks but one.

This is due to a quirk of the clustering algorithm. See below or my
previous email.

> With async filesystem mounts very little I/O occurs, and with
> noclusterw there are ~10,000 operations matching the number of
> writes.
>
> With sync it's ~20,000 operations matching the total of reads &
> writes.  This demonstrates another aspect of the bug, sync behaviour
> should cause 10,000 operations; the reads aren't being cached.

This isn't quite true. It's 20,000 *write* operations. I put this down
to the mtime update for each write doubling the number of actual write
operations. No read operations take place, the data *does* come out of
the cache. There's nothing wrong with reading as far as I can tell.

> Another aspect of this issue is the effect of changing the seek
> blocksize, and write blocksize, by 1 byte each way from 8192, thus
> doing block unaligned I/O.  In some cases this changes the amount
> of I/O recorded by getrusage to zero, and drops elapse time from
> half a minute or so to less than 1 second.
>
> Thanks to Paul Richards for noticing this.  I've not spent much time
> researching this, so can only present my small set of measurements.
> To do these tests you have to recompile my test program each time eg
>
>   gcc -O4 -DBLOCKSIZE=8191 -DWRITESIZE=8193 seekreadwrite.c

This is because if the filesystem block is full it is written
immediately, or rather the clustering code is called immediately. The
rationale is that a full block isn't likely to be written to again so
it might as well be pushed out to disk.

Richard's program deliberately writes full blocks, which is apparently
what db does, so it always forces a write to take place. Given the
behaviour of db it might be more sensible to remove this feature and
just mark full blocks dirty the same as other blocks since it's likely
that they will be written to again shortly if the db record is written
to frequently.

The clustering code has a bug in that an old cluster is not pushed out
if the block number is 0, because the code that would do so never gets
reached.

        if (lbn == 0)
                vp->v_lasta = vp->v_clen = vp->v_cstart = vp->v_lastw = 0;

        if (vp->v_clen == 0 || lbn != vp->v_lastw + 1 ||
            (bp->b_blkno != vp->v_lasta + btodb(lblocksize))) {
                maxclen = vp->v_mount->mnt_iosize_max / lblocksize - 1;
                if (vp->v_clen != 0) {
                        /*
                         * Next block is not sequential.
                         *
                         * If we are not writing at end of file, the process
                         * seeked to another point in the file since its last
                         * write, or we have reached our maximum cluster size,
                         * then push the previous cluster. Otherwise try
                         * reallocating to make it sequential.
                         */
        ............

In Richard's program the next block is never sequential so the previous
cluster is always pushed *except* that when the program seeks back to
block zero the "if (vp->v_clen != 0)" fails and a new cluster is
started without pushing out the previously started one. That dirty
block in the previous cluster then hangs around until it is flushed as
dirty blocks normally would be.

It is the combination of this clustering behaviour and the fact that
the program always writes full blocks that causes the 8750 writes in
Richard's results. Since the blocks are full filesystem blocks, rather
than being marked dirty they are immediately passed to the clustering
code. Because they are never in sequence, the clustering code always
starts a new cluster and flushes the previous one, except for 1 in
every 8 blocks: when block 0 is written the previous cluster is not
pushed out but hangs around. The end result is that 7/8 of the blocks
get written immediately, which is 8750/10000 writes.

When the write size drops below the filesystem block size then the
clustering code never gets called because the buffers are just marked
dirty and cached.

I think if we fixed the issue of writing out full blocks this behaviour
would stop but I also think the clustering code could do with a fix. It
should at least check to see if there is a cluster being built when the
blockno is 0 and push it out.
Possibly though it'd be better to not push out clusters of only one
block and just leave them in the cache.

>
> Sorry it's that crude.  These results are from a FreeBSD
> 2.2.7-RELEASE+CAM, ccd stripe=2 (PII 400MHz, 512MB) system,
> though exactly the same pattern is apparent with 3.4-STABLE.
> "****" indicate sub-second "zero I/O" results.
>
> BLOCKSIZE WRITESIZE  csh 'time' output
>
>   8191      8191     0.0u 1.5s 0:34.10 4.6% 5+186k 0+7500io 0pf+0w
>   8191      8192     0.0u 1.3s 0:31.52 4.5% 5+178k 0+7500io 0pf+0w
>   8191      8193     0.0u 1.4s 0:32.63 4.4% 5+189k 0+7500io 0pf+0w
>
>   8192      8191     0.0u 0.7s 0:01.97 37.5% 8+178k 0+0io 0pf+0w  ****
>   8192      8192     0.0u 1.3s 0:39.30 3.4% 7+196k 0+8750io 0pf+0w
>   8192      8193     0.0u 1.3s 0:40.09 3.4% 5+187k 0+8750io 0pf+0w
>
>   8193      8191     0.0u 1.4s 0:46.22 3.2% 5+192k 0+8750io 0pf+0w
>   8193      8192     0.0u 1.6s 0:40.48 4.0% 5+182k 0+8750io 0pf+0w
>   8193      8193     0.0u 1.5s 0:40.57 3.8% 5+175k 0+8750io 0pf+0w
>
>   8191      4095     0.0u 1.2s 0:33.79 3.6% 5+193k 0+7500io 0pf+0w
>   8191      4096     0.0u 1.2s 0:34.00 3.8% 5+190k 0+7500io 0pf+0w
>   8191      4097     0.0u 1.1s 0:33.58 3.6% 4+165k 0+7500io 0pf+0w
>
>   8192      4095     0.0u 0.5s 0:00.76 75.0% 5+189k 0+0io 0pf+0w  ****
>   8192      4096     0.0u 0.5s 0:00.58 100.0% 5+183k 0+0io 0pf+0w ****
>   8192      4097     0.0u 0.5s 0:00.74 78.3% 5+181k 0+0io 0pf+0w  ****
>
>   8193      4095     0.0u 0.6s 0:01.00 67.0% 5+177k 0+0io 0pf+0w  ****
>   8193      4096     0.0u 0.6s 0:01.05 63.8% 5+179k 0+0io 0pf+0w  ****
>   8193      4097     0.0u 0.6s 0:01.02 66.6% 5+183k 0+0io 0pf+0w  ****
>
> Any views gratefully received.  A fix would be much better :-)
>
> Test program source, including compile & run instructions, is
> available at:
>
>   http://www.netcraft.com/freebsd/random-IO/seekreadwrite.c
>
> Detailed notes on the test system configurations are at:
>
>   http://www.netcraft.com/freebsd/random-IO/results-notes.txt
>
> Thanks,
>     Richard
> -
> Richard Wendland                richard@netcraft.com


From owner-freebsd-fs Tue Mar 21 22:18: 4 2000
Date: Tue, 21 Mar 2000 22:17:52 -0800 (PST)
From: Matthew Dillon
Message-Id: <200003220617.WAA86154@apollo.backplane.com>
To: Paul Richards
Cc: Richard Wendland, Alfred Perlstein, Poul-Henning Kamp,
    current@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues
References: <200003220022.AAA28786@ns0.netcraft.com>
    <38D833BC.A082DF09@originative.co.uk>

:written immediately which is 8750/10000 writes.
:
:When the write size drops below the filesystem block size then the
:clustering code never gets called because the buffers are just marked
:dirty and cached.
:
:I think if we fixed the issue of writing out full blocks this behaviour
:would stop but I also think the clustering code could do with a fix. It
:should at least check to see if there is a cluster being built when the
:blockno is 0 and push it out. Possibly though it'd be better to not push
:out clusters of only one block and just leave them in the cache.

    Hmm.
    Your analysis is correct but I don't think it's worth fixing the
    block-is-0 case.  It may be worth revisiting the write-behind code
    to try to give it the ability to better discern random I/O from
    sequential I/O (e.g. perhaps it should ignore unaligned full
    blocks).

    It is perfectly ok for dirty blocks to remain in the buffer cache.
    In fact, it's *optimal* to leave them in the buffer cache as long
    as the buffer cache does not get saturated with them.  The buffer
    cache is perfectly capable of clustering delayed writes.  Also,
    the filesystem syncer comes along every 30 seconds or so anyway
    and flushes everything out.

    What the write-behind code tries to do is to prevent the buffer
    cache from being saturated with dirty buffers and to smooth out
    disk write I/O.  It makes the assumption that write-behind data is
    not typically accessed by the program immediately after being
    written -- an assumption that winds up being incorrect in the DBM
    case you tested and resulting in stalls due to the buffer / VM
    pages being locked during the write I/O.  The stalls are *not* due
    to the I/O itself but instead are due to side effects of the I/O
    being in-progress.  If a user program doesn't access any of the
    information it recently wrote the whole mechanism winds up
    operating asynchronously in the background.  If a user program
    does, then the write behind mechanism breaks down and you get a
    stall.

    The most common dirty-data case the filesystem has to deal with is
    appending to a file -- that is, doing piecemeal sequential writes.
    There are virtually no other cases which have the ability to
    saturate the buffer cache.  This is why the write-behind code only
    tries to handle the piecemeal-write-flush-full-blocks case.
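
The 7/8 figure agreed on above can be reproduced with a toy model of
the push rule as Paul described it: every full-block write reaches the
clustering code, a non-sequential write pushes the cluster under
construction, and a write to block 0 resets the state first so nothing
is pushed.  This is only a sketch, not the kernel's cluster_write();
count_pushes, have_cluster and the stride parameter are invented for
illustration, and the physical-contiguity test on b_blkno is left out.
The stride of 5 for the 16-block case is an assumption, since the
report gives the visiting order only for 8 blocks.

```c
/*
 * Toy model of the push decision discussed in this thread; it is NOT
 * the kernel code.  Returns how many immediate pushes result from
 * `nwrites` full-block writes that visit `nblocks` logical blocks in
 * the order 0, stride, 2*stride, ... (mod nblocks).
 */
static long count_pushes(int nblocks, int stride, long nwrites)
{
    long pushes = 0;
    int have_cluster = 0;   /* stands in for vp->v_clen != 0 */
    int lastw = 0;          /* stands in for vp->v_lastw */
    int lbn = 0;            /* logical block being written */

    for (long i = 0; i < nwrites; i++) {
        if (lbn == 0) {
            /* State is reset, so the old cluster is forgotten. */
            have_cluster = 0;
            lastw = 0;
        }

        if (!have_cluster || lbn != lastw + 1) {
            if (have_cluster)
                pushes++;       /* non-sequential: push previous cluster */
            have_cluster = 1;   /* start a new one-block cluster */
        }
        /* Otherwise the write is sequential and just extends the cluster. */

        lastw = lbn;
        lbn = (lbn + stride) % nblocks;
    }
    return pushes;
}
```

Under these assumptions count_pushes(8, 5, 10000) gives 8750 and
count_pushes(16, 5, 10000) gives 9375, matching the io counts in
Richard's report.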

                    -Matt
                    Matthew Dillon


From owner-freebsd-fs Wed Mar 22  7:46: 3 2000
From: Richard Wendland
Message-Id: <200003221544.PAA08760@ns0.netcraft.com>
Subject: Re: FreeBSD random I/O performance issues
In-Reply-To: <38D833BC.A082DF09@originative.co.uk> from Paul Richards
    at "Mar 22, 2000 02:45:16 am"
To: Paul Richards
Date: Wed, 22 Mar 2000 15:44:20 +0000 (GMT)
Cc: Richard Wendland, Alfred Perlstein, Poul-Henning Kamp,
    Matthew Dillon, current@FreeBSD.ORG, fs@FreeBSD.ORG

> > With sync it's ~20,000 operations matching the total of reads &
> > writes.  This demonstrates another aspect of the bug, sync behaviour
> > should cause 10,000 operations; the reads aren't being cached.
>
> This isn't quite true. It's 20,000 *write* operations. I put this down
> to the mtime update for each write doubling the number of actual write
> operations. No read operations take place, the data *does* come out of
> the cache. There's nothing wrong with reading as far as I can tell.

Yes, you're absolutely right, I should have looked at my own data
more closely.

If I change the test program to call fsync after write, and run on
a default mount filesystem I also see 20,000 I/O operations from
10,000 writes.  This probably impacts more real programs out there
than sync mounts.
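
For reference, the fsync variant mentioned above amounts to something
like the following.  This is a sketch under assumptions:
write_record_sync is a hypothetical helper, not code from the test
program, and it simply follows each write with fsync(2) on the same
descriptor.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write one record at byte offset `off` and ask
 * the kernel to flush the file before returning.
 * Returns 0 on success, -1 on any failure.
 */
static int write_record_sync(int fd, off_t off, const void *buf, size_t len)
{
    if (lseek(fd, off, SEEK_SET) < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;          /* error or short write */
    return fsync(fd);       /* forces the data (and inode) out */
}
```

Swapping a call like this in for the plain write in the update loop
is the kind of change described above.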

If this is mtime updates being done synchronously, that seems a
separate issue to the clustering/VM issue, and seems to me it should
be fixed.  It'll normally double the number of all writes won't it,
possibly forcing seeks between otherwise localised access.

Can anyone offer an alternative hypothesis to mtime updates being
done synchronously?

Looking at my logs for the sync filesystem test, mount output before
and after shows all ~20,000 operations are writes:

  mount
    /dev/wd0s2e on /var (local, synchronous, writes: sync 182 async 10)
  time ./seekreadwrite xxx 10000
    0.1u 7.8s 1:47.61 7.4% 5+179k 0+20000io 0pf+0w
  mount
    /dev/wd0s2e on /var (local, synchronous, writes: sync 20190 async 15)

But when using fsync on a default mount filesystem, 10000 writes are
sync and 10000 async:

  mount
    /dev/wd0s2e on /var (local, writes: sync 682 async 2764)
  time ./seekreadwrite xxx 10000
    0.0u 1.7s 0:54.34 3.3% 4+171k 0+20000io 0pf+0w
  mount
    /dev/wd0s2e on /var (local, writes: sync 10682 async 12777)

This is on the ATA machine that could run the test in 4 seconds
without fsync, 54 seconds with fsync, suggesting some head movements
may be being forced, though not 20000 of them, as that would imply
2.7ms per seek.
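
The per-operation figure is simple division; the helper below
(ms_per_op is a made-up name) just makes the arithmetic explicit
using the timings logged above.

```c
/* Average wall-clock milliseconds per I/O operation. */
static double ms_per_op(double elapsed_secs, long ops)
{
    return elapsed_secs * 1000.0 / (double)ops;
}
```

ms_per_op(54.34, 20000) comes to about 2.7 ms for the fsync run, and
ms_per_op(107.61, 20000) to about 5.4 ms for the sync-mount run
(1:47.61 elapsed).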

    Richard
--
Richard Wendland                richard@netcraft.com


From owner-freebsd-fs Wed Mar 22 12:38:46 2000
Message-Id: <200003222039.MAA00661@mass.cdrom.com>
To: Matthew Dillon
Cc: Paul Richards, Richard Wendland, Alfred Perlstein,
    Poul-Henning Kamp, current@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues
In-reply-to: Your message of "Tue, 21 Mar 2000 22:17:52 PST."
    <200003220617.WAA86154@apollo.backplane.com>
Date: Wed, 22 Mar 2000 12:39:42 -0800
From: Mike Smith

> effects of the I/O being in-progress.  If a user program doesn't access
> any of the information it recently wrote the whole mechanism winds up
> operating asynchronously in the background.  If a user program does,
> then the write behind mechanism breaks down and you get a stall.

What makes no sense is that it should be perfectly ok to _read_ this
information back.

--
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  msmith@freebsd.org
\\ and he'll hate you for a lifetime.             \\  msmith@cdrom.com


From owner-freebsd-fs Wed Mar 22 16:10:44 2000
Date: Wed, 22 Mar 2000 16:10:39 -0800 (PST)
From: Matthew Dillon
Message-Id: <200003230010.QAA94351@apollo.backplane.com>
To: Mike Smith
Cc: Paul Richards, Richard Wendland, Alfred Perlstein,
    Poul-Henning Kamp, current@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues
References: <200003222039.MAA00661@mass.cdrom.com>

:
:> effects of the I/O being in-progress.  If a user program doesn't access
:> any of the information it recently wrote the whole mechanism winds up
:> operating asynchronously in the background.  If a user program does,
:> then the write behind mechanism breaks down and you get a stall.
:
:What makes no sense is that it should be perfectly ok to _read_ this
:information back.

    When we separate out the read vs write access in the buffer cache
    API we *will* be able to read the information back while a write
    is in progress.  At the moment the buffer cache has no clue how a
    buffer is going to be used, which means the buffer is locked
    exclusively.

                    -Matt
                    Matthew Dillon


From owner-freebsd-fs Wed Mar 22 16:17:37 2000
Date: Thu, 23 Mar 2000 01:17:06 +0100 (CET)
From: FREENIX IS OVERRATED
Reply-To: FreeBSD-abusers@netscum.dk
To: Matthew Dillon
Cc: current@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues

On Wed, 2395 Sep 1993, Matthew Dillon wrote:

> It is perfectly ok for dirty blocks to remain in the buffer cache.  In
> fact, it's *optimal* to leave them in the buffer cache as long as the
> buffer cache does not get saturated with them.  The buffer cache is
> perfectly capable of clustering delayed writes.  Also, the filesystem
> syncer comes along every 30 seconds or so anyway and flushes everything
> out.
>
> What the write-behind code tries to do is to prevent the buffer cache
> from being saturated with dirty buffers and to smooth out disk write
> I/O.  It makes the assumption that write-behind data is not typically
> accessed by the program immediately after being written -- an assumption
> that winds up being incorrect in the DBM case you tested and resulting
> in stalls due to the buffer / VM pages being locked during the write I/O.
> The stalls are *not* due to the I/O itself but instead are due to side > effects of the I/O being in-progress. And that sounds a heck of a lot like what those of us who have been running INN news swervers with 1,1GB size text history files on 2.whazzit (now dead, may it rest in pieces widely-scattered) and later have seen. You should have forgotten that a couple months or so ago, I wrote to one of these lists to ask why I was getting only about 50-70% availability as my 1.5+MD5-based-dbz innd was stuck in ufslck2 during these every-30-seconds syncs. The .hash and .index files from this, which are comparable to the dbm (dbz) files being typically 125MB and 85MB or so, this under 3.4-STABLE. Well, I've meant to get around to trying 4.0 on it, and Real Soon Now I will, but I wanted to relate my experiences in turning traitor, a heretic who has left the fold, deserving to be ridden out of town on a rail and stuff, which sounds like a lot of fun. I tried NetBSD. NetBSD (at least the development now 1.4V version) has trickle syncing, which seems to work quite well when having to cope with these rather large database files, keeping a full 14 days of message IDs from a full news feed. Without really tuning anything, after a bit of time, the time needed to do history lookups drops to microseconds, and as long as a `sync' isn't needed, innd doesn't get stuck. Theoretically, a sync, where you are in fact seeking rather wildly over the disk to update these files, happens once a day at expiry. Depending on the speed of the drive (and I haven't optimized this at all, using a single drive for OS, logs, history, and part of spool, with a second drive for the rest of the spool, far from an ideal setup), this seems to mean only a few minutes of downtime. Actually building the new .index and .hash files goes quite a bit faster, like by a factor of 3 to 4, so clearly the update of these files during the `sync' could stand improved sorting. 
I wouldn't complain a bit if you were to steal mercilessly from the NetBSD k0deZ to incorporate trickle sync (if something comparable is not already in place) since that seems to make a world of difference for those of us using long-outdated INN code and who want to have bigger history file sizes than our shriveling Freenix members. (What kills me now is that I'm using a single drive to hold the news spool apart from a small overflow, so while time spent accessing this history database is way down, the time actually spent hopping around the disk to write (and read, for our sluggish peers) articles has skyrocketed. The box I'll try 4.0 on has a separate disk pack that is far faster under NetBSD. Test boxen, eh?) There. I've confessed. It feels really good. Now have at me. Naturally, since I haven't followed this discussion closely, you may be talking about something completely different, but I did want to mention generally improved (yet not totally perfect) performance with huge INN database files and NetBSD's trickle syncing. Now, go out and steal some k0deZ, okay? 
barry bouwsma, tele danMerika internet -- *** This was posted with the express permission of *** ****************************************************** ** HIS HIGHNESS KAAZMANN LORD AND MASTER OF USENET ** ****************************************************** ********* We are simple servants of his will ********* To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Mar 22 16:48:47 2000 Delivered-To: freebsd-fs@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 413E037B5BD; Wed, 22 Mar 2000 16:48:43 -0800 (PST) (envelope-from dillon@apollo.backplane.com) Received: (from dillon@localhost) by apollo.backplane.com (8.9.3/8.9.1) id QAA94830; Wed, 22 Mar 2000 16:48:40 -0800 (PST) (envelope-from dillon) Date: Wed, 22 Mar 2000 16:48:40 -0800 (PST) From: Matthew Dillon Message-Id: <200003230048.QAA94830@apollo.backplane.com> To: FREENIX IS OVERRATED Cc: current@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: FreeBSD random I/O performance issues References: Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org :> out. :> :> What the write-behind code tries to do is to prevent the buffer cache :> from being saturated with dirty buffers and to smooth out disk write :> I/O. It makes the assumption that write-behind data is not typically :> accessed by the program immediately after being written -- an assumption :> that winds up being incorrect in the DBM case you tested and resulting :> in stalls due to the buffer / VM pages being locked during the write I/O. :> The stalls are *not* due to the I/O itself but instead are due to side :> effects of the I/O being in-progress. : :And that sounds a heck of a lot like what those of us who have been :running INN news swervers with 1,1GB size text history files on 2.whazzit :(now dead, may it rest in pieces widely-scattered) and later have seen. 
:
:You should have forgotten that a couple months or so ago, I wrote to
:one of these lists to ask why I was getting only about 50-70%
:availability as my 1.5+MD5-based-dbz innd was stuck in ufslck2 during
:these every-30-seconds syncs. The .hash and .index files from this,
:which are comparable to the dbm (dbz) files being typically 125MB and
:85MB or so, this under 3.4-STABLE.
:
:Well, I've meant to get around to trying 4.0 on it, and Real Soon Now
:I will, but I wanted to relate my experiences in turning traitor, a
:heretic who has left the fold, deserving to be ridden out of town on
:a rail and stuff, which sounds like a lot of fun. I tried NetBSD.
:
:NetBSD (at least the development now 1.4V version) has trickle
:syncing, which seems to work quite well when having to cope with
:these rather large database files, keeping a full 14 days of message
:IDs from a full news feed.

Personally speaking I agree with you in regards to the syncer code. I don't have time to fix it, though I suspect it would not be difficult. Trickle syncing is an inherently easy thing to do. Kirk and I have both had serious trouble with the syncer daemon not being able to smooth out write I/Os due to it fsync'ing whole files all in one go. The buffer daemon does a much better job which is why the speedup_syncer stuff is being slowly deprecated in favor of bd_speedup().

For INN there are several things you can tune in 4.0. First and foremost you can try turning off the write-behind code, sysctl -w vfs.write_behind=0. Secondly you can mess around with the vfs.hidirtybuffers sysctl (generally lower it) in order to force out dirty pages earlier and thus reduce the number that fsync has to deal with. I believe that INN also messes around with shared/R+W mmap()'s - it may be possible to add MAP_NOSYNC to those maps to turn off the 30 second fsync on pages dirtied through the VM system (for those maps), though this may increase the amount of stale (unwritten) data after a crash.

:There.
I've confessed. It feels really good. Now have at me.
:
:Naturally, since I haven't followed this discussion closely, you may
:be talking about something completely different, but I did want to
:mention generally improved (yet not totally perfect) performance
:with huge INN database files and NetBSD's trickle syncing. Now,
:go out and steal some k0deZ, okay?
:
:
:barry bouwsma, tele danMerika internet

-Matt

From owner-freebsd-fs Wed Mar 22 23:35:56 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from muzak.iinet.net.au (muzak.iinet.net.au [203.59.24.237]) by hub.freebsd.org (Postfix) with ESMTP id 7822237B574; Wed, 22 Mar 2000 23:35:39 -0800 (PST) (envelope-from julian@elischer.org)
Received: from jules.elischer.org (reggae-09-79.nv.iinet.net.au [203.59.67.79]) by muzak.iinet.net.au (8.8.5/8.8.5) with SMTP id PAA30777; Thu, 23 Mar 2000 15:35:26 +0800
Message-ID: <38D9B306.2781E494@elischer.org>
Date: Wed, 22 Mar 2000 23:34:11 -0800
From: Julian Elischer
X-Mailer: Mozilla 3.04Gold (X11; I; FreeBSD 5.0-CURRENT i386)
MIME-Version: 1.0
To: Mike Smith
Cc: Matthew Dillon , Paul Richards , Richard Wendland , Alfred Perlstein , Poul-Henning Kamp , current@freebsd.org, fs@freebsd.org
Subject: Re: FreeBSD random I/O performance issues
References: <200003222039.MAA00661@mass.cdrom.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

This is one of the things that made us do so badly in the benchmarks against NT/Linux last year. OBVIOUSLY one should be able to re-read this information without affecting a pending write.

Mike Smith wrote:
>
> > effects of the I/O being in-progress. If a user program doesn't access
> > any of the information it recently wrote the whole mechanism winds up
> > operating asynchronously in the background.
If a user program does, > > then the write behind mechanism breaks down and you get a stall. > > What makes no sense is that it should be perfectly ok to _read_ this > information back. > -- __--_|\ Julian Elischer / \ julian@elischer.org ( OZ ) World tour 2000 ---> X_.---._/ presently in: Perth v To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Mar 23 1:51: 4 2000 Delivered-To: freebsd-fs@freebsd.org Received: from trinity.skynet.be (trinity.skynet.be [195.238.2.38]) by hub.freebsd.org (Postfix) with ESMTP id A8A2A37C3E6; Thu, 23 Mar 2000 01:50:59 -0800 (PST) (envelope-from blk@skynet.be) Received: from [195.238.1.121] (brad.techos.skynet.be [195.238.1.121]) by trinity.skynet.be (Postfix) with ESMTP id 6DBEC1814B; Thu, 23 Mar 2000 10:50:34 +0100 (MET) Mime-Version: 1.0 X-Sender: blk@pop.skynet.be Message-Id: In-Reply-To: References: Date: Thu, 23 Mar 2000 10:34:42 +0100 To: FreeBSD-abusers@netscum.dk, Matthew Dillon From: Brad Knowles Subject: Re: FreeBSD random I/O performance issues Cc: current@FreeBSD.ORG, fs@FreeBSD.ORG Content-Type: text/plain; charset="us-ascii" ; format="flowed" Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org At 1:17 AM +0100 2000/3/23, FREENIX IS OVERRATED wrote: > Without really tuning anything, after a bit of time, the time needed > to do history lookups drops to microseconds, and as long as a `sync' > isn't needed, innd doesn't get stuck. Theoretically, a sync, where > you are in fact seeking rather wildly over the disk to update these > files, happens once a day at expiry. Depending on the speed of the > drive (and I haven't optimized this at all, using a single drive for > OS, logs, history, and part of spool, with a second drive for the rest > of the spool, far from an ideal setup), this seems to mean only a > few minutes of downtime. 
Actually building the new .index and .hash
> files goes quite a bit faster, like by a factor of 3 to 4, so clearly
> the update of these files during the `sync' could stand improved sorting.

There are those of us running Diablo that solve this sort of problem on our main news peering servers by having the entire history file stored on a memory-based filesystem, so that we can sustain 1000-2000 history lookups per second.

Obviously, this solution is not scalable to news spool servers, because you can't afford to lose the history file for a month's worth of news, but the current mmap() based solution for the indexes of the history database seems to cause many more disk accesses than I would like to see.

Perhaps this would be a good application for md?

--
These are my opinions -- not to be taken as official Skynet policy
======================================================================
Brad Knowles,                                || Belgacom Skynet SA/NV
Systems Architect, Mail/News/FTP/Proxy Admin || Rue Colonel Bourg, 124
Phone/Fax: +32-2-706.13.11/12.49             || B-1140 Brussels
http://www.skynet.be                         || Belgium

From owner-freebsd-fs Fri Mar 24 19:11:57 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from io.dreamscape.com (io.dreamscape.com [206.64.128.6]) by hub.freebsd.org (Postfix) with ESMTP id 2309637B6F6 for ; Fri, 24 Mar 2000 19:11:38 -0800 (PST) (envelope-from krentel@dreamscape.com)
Received: from dreamscape.com (sA20-p50.dreamscape.com [209.217.200.242]) by io.dreamscape.com (8.9.3/8.8.4) with ESMTP id WAA25887; Fri, 24 Mar 2000 22:10:45 -0500 (EST)
X-Dreamscape-Track-A: sA20-p50.dreamscape.com [209.217.200.242]
X-Dreamscape-Track-B: Fri, 24 Mar 2000 22:10:45 -0500 (EST)
Received: (from krentel@localhost) by dreamscape.com (8.9.3/8.9.3) id WAA00537; Fri, 24 Mar 2000 22:10:50 -0500 (EST) (envelope-from krentel)
Date: Fri, 24 Mar 2000 22:10:50 -0500 (EST)
From: "Mark W.
Krentel"
Message-Id: <200003250310.WAA00537@dreamscape.com>
To: freebsd-fs@FreeBSD.ORG
Subject: ext2fs optional features
Cc: kwc@world.std.com
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

This question was asked in -stable a couple days ago, but it really belongs in -fs.

Recently, some changes were made to the ext2fs support that prohibit R/W mounts for some newer ext2fs partitions with optional features. I've seen this with Red Hat 6.1 and Slackware 7. Red Hat 6.0 seems to use an older format. This is what Linux's tune2fs reports:

# tune2fs -l /dev/sdb2
tune2fs 1.15, 18-Jul-1999 for EXT2 FS 0.5b, 95/08/09
Filesystem volume name:
Last mounted on:
Filesystem UUID:          38a27662-0012-11d4-8f7a-ead76bc87798
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      sparse_super
Filesystem state:         not clean
Errors behavior:          Continue
Filesystem OS type:       Linux
...

And this is what appears in the logs:

Mar 24 21:36:47 blue /kernel: WARNING: R/W mount of dev 0x3040a denied due to unsupported optional features

What are the optional features? What does "sparse_super" do? Does Linux actually use these features, or are they for future use?

Is it possible to support R/W mounts with these features?

I remember 3.4-release let me mount the same filesystem R/W. Was I unknowingly corrupting the filesystem, or running some risk of a panic?

I noticed that tune2fs also reported:

Block size:               4096
Fragment size:            4096

Does Linux really not support fragments?? I was stunned.

Much thanks for any answers.
--Mark Krentel

From owner-freebsd-fs Sat Mar 25 1:48:20 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from mail.cs.tu-berlin.de (mail.cs.tu-berlin.de [130.149.17.13]) by hub.freebsd.org (Postfix) with ESMTP id 3F5F537B6D1 for ; Sat, 25 Mar 2000 01:48:11 -0800 (PST) (envelope-from loewis@cs.tu-berlin.de)
Received: from rubel.cs.tu-berlin.de (loewis@rubel.cs.tu-berlin.de [130.149.20.46]) by mail.cs.tu-berlin.de (8.9.3/8.9.3) with ESMTP id KAA00572; Sat, 25 Mar 2000 10:45:01 +0100 (MET)
Received: (from loewis@localhost) by rubel.cs.tu-berlin.de (8.9.3/8.9.3) id KAA29526; Sat, 25 Mar 2000 10:44:56 +0100 (MET)
Date: Sat, 25 Mar 2000 10:44:56 +0100 (MET)
Message-Id: <200003250944.KAA29526@rubel.cs.tu-berlin.de>
X-Authentication-Warning: rubel.cs.tu-berlin.de: loewis set sender to loewis@cs.tu-berlin.de using -f
From: "Martin v.Loewis"
To: krentel@dreamscape.com
Cc: freebsd-fs@FreeBSD.ORG, kwc@world.std.com
In-reply-to: <200003250310.WAA00537@dreamscape.com> (krentel@dreamscape.com)
Subject: Re: ext2fs optional features
References: <200003250310.WAA00537@dreamscape.com>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

    What are the optional features? What does "sparse_super" do?
    Does Linux actually use these features, or are they for future use?

Ext2 has three feature sets: compatible features, r/o compatible features, and incompatible features. If an ext2 implementation sees a volume that has a feature it does not recognize, it should act accordingly: If the feature is compatible, go ahead and mount the volume. If the feature is r/o compatible, refuse to mount r/w. If the feature is incompatible, refuse to mount at all.
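The three rules can be sketched in a few lines of C. This is only an illustration, not the code of FreeBSD's or Linux's ext2 driver: the function name `ext2_mount_decision`, the `mount_decision` enum and the idea of passing the "supported" masks as arguments are all assumptions made for the example. The feature words are plain bit masks, using the bit values listed in this message.

```c
#include <stdint.h>

/* Outcome of checking a volume's feature words against what we support. */
enum mount_decision { MOUNT_RW, MOUNT_RO_ONLY, MOUNT_REFUSE };

/*
 * Hypothetical helper (not taken from any kernel).
 * ro_compat / incompat:           feature bits found on the volume
 * ro_compat_supp / incompat_supp: feature bits this implementation knows
 * Unknown "compatible" features never restrict a mount, so they are
 * not even passed in.
 */
static enum mount_decision
ext2_mount_decision(uint32_t ro_compat, uint32_t incompat,
    uint32_t ro_compat_supp, uint32_t incompat_supp)
{
	/* Any unknown incompatible feature: do not mount at all. */
	if (incompat & ~incompat_supp)
		return MOUNT_REFUSE;
	/* Any unknown r/o compatible feature: read-only is the best we can do. */
	if (ro_compat & ~ro_compat_supp)
		return MOUNT_RO_ONLY;
	return MOUNT_RW;
}
```

With sparse_super (0x0001 in the r/o compatible word) set on the volume and an implementation that supports no optional features, `ext2_mount_decision(0x0001, 0, 0, 0)` yields `MOUNT_RO_ONLY`, which is the situation behind the "R/W mount ... denied" log message quoted earlier in the thread.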
Currently (e2fstools 1.18), the following features are defined

#define EXT2_FEATURE_COMPAT_DIR_PREALLOC        0x0001
#define EXT2_FEATURE_COMPAT_IMAGIC_INODES       0x0002
#define EXT3_FEATURE_COMPAT_HAS_JOURNAL         0x0004

#define EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER     0x0001
#define EXT2_FEATURE_RO_COMPAT_LARGE_FILE       0x0002
#define EXT2_FEATURE_RO_COMPAT_BTREE_DIR        0x0004

#define EXT2_FEATURE_INCOMPAT_COMPRESSION       0x0001
#define EXT2_FEATURE_INCOMPAT_FILETYPE          0x0002
#define EXT3_FEATURE_INCOMPAT_RECOVER           0x0004

The sparse_super option means that not every block group has a super block, but only those that are powers of 3, 5, or 7, and block group 0. The feature is ro-compatible, since an implementation can mount the file system when it finds a valid super block; it is not compatible, since the implementation will overwrite data when it attempts to write-back the super blocks into groups where none belong.

Of the features above, Linux 2.3.99pre2 supports the following ones:

#define EXT2_FEATURE_COMPAT_SUPP        0
#define EXT2_FEATURE_INCOMPAT_SUPP      EXT2_FEATURE_INCOMPAT_FILETYPE
#define EXT2_FEATURE_RO_COMPAT_SUPP     (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
                                         EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
                                         EXT2_FEATURE_RO_COMPAT_BTREE_DIR)

Whether these features are activated on a certain installation primarily depends on the default settings that the distributor (RedHat, Debian, ...) has selected.
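For illustration, the sparse_super placement rule described in this message (group 0, plus groups whose number is a power of 3, 5, or 7) might be coded roughly as follows. The names `is_power_of` and `group_has_superblock` are invented for this sketch and are not the identifiers used in any real ext2 implementation. The sketch also assumes that group 1 counts as a power (3^0 = 1), which is my understanding of the usual behaviour but is not stated in the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if n == base^k for some k >= 0 (so n == 1 always qualifies). */
static bool
is_power_of(uint32_t n, uint32_t base)
{
	uint64_t p = 1;		/* 64-bit so the multiply cannot overflow */

	while (p < n)
		p *= base;
	return p == n;
}

/*
 * Hypothetical helper: does block group `group` carry a super block copy?
 * Without sparse_super every group has one; with it, only group 0 and
 * groups that are powers of 3, 5 or 7 (assumption: 1 == 3^0 is included).
 */
static bool
group_has_superblock(uint32_t group, bool sparse_super)
{
	if (!sparse_super || group == 0)
		return true;
	return is_power_of(group, 3) || is_power_of(group, 5) ||
	    is_power_of(group, 7);
}
```

Under these assumptions the first groups holding a copy are 0, 1, 3, 5, 7, 9, 25, 27 and 49. An implementation unaware of the feature would also write super blocks into groups such as 2 or 4, which is the overwrite hazard that makes the feature only r/o compatible.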
Regards, Martin To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Mar 25 2:27:58 2000 Delivered-To: freebsd-fs@freebsd.org Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16]) by hub.freebsd.org (Postfix) with ESMTP id 96E9037B61A for ; Sat, 25 Mar 2000 02:27:54 -0800 (PST) (envelope-from bde@zeta.org.au) Received: from bde.zeta.org.au (bde.zeta.org.au [203.2.228.102]) by mailman.zeta.org.au (8.8.7/8.8.7) with ESMTP id VAA14626; Sat, 25 Mar 2000 21:35:40 +1100 Date: Sat, 25 Mar 2000 21:27:28 +1100 (EST) From: Bruce Evans X-Sender: bde@alphplex.bde.org To: "Mark W. Krentel" Cc: freebsd-fs@FreeBSD.ORG, kwc@world.std.com Subject: Re: ext2fs optional features In-Reply-To: <200003250310.WAA00537@dreamscape.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Fri, 24 Mar 2000, Mark W. Krentel wrote: > ... > And this is what appears in the logs: > > Mar 24 21:36:47 blue /kernel: WARNING: R/W mount of dev 0x3040a > denied due to unsupported optional features > > What are the optional features? What does "sparse_super" do? They are extensions that modify the filesystem format. I don't know exactly what "sparse_super" does. FreeBSD's ext2fs knows even less. > Does Linux actually use these features, or are they for future use? Linux has supported the ext2fs "filetype" and "sparse_super" features for several years. Otherwise, they wouldn't be the default for the current version of mkfs.ext2fs. > Is it possible to support R/W mounts with these features? Everything is possible in software :-). > I remember 3.4-release let me mount the same filesystem R/W. Was I That was a bug in 3.4 :-). > unknowingly corrupting the filesystem, or running some risk of a panic? The "filetype" extension caused panics. I don't know what the "sparse_super" extension caused. 
> I noticed that tune2fs also reported:
>
> Block size:               4096
> Fragment size:            4096
>
> Does Linux really not support fragments?? I was stunned.

Fragments are a dubious feature. They were more useful when 100MB disks were large.

Bruce

From owner-freebsd-fs Sat Mar 25 15:38:13 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from io.dreamscape.com (io.dreamscape.com [206.64.128.6]) by hub.freebsd.org (Postfix) with ESMTP id 9E37037B533 for ; Sat, 25 Mar 2000 15:38:10 -0800 (PST) (envelope-from krentel@dreamscape.com)
Received: from dreamscape.com (sA19-p23.dreamscape.com [209.217.200.86]) by io.dreamscape.com (8.9.3/8.8.4) with ESMTP id SAA15662; Sat, 25 Mar 2000 18:37:20 -0500 (EST)
X-Dreamscape-Track-A: sA19-p23.dreamscape.com [209.217.200.86]
X-Dreamscape-Track-B: Sat, 25 Mar 2000 18:37:20 -0500 (EST)
Received: (from krentel@localhost) by dreamscape.com (8.9.3/8.9.3) id SAA05240; Sat, 25 Mar 2000 18:37:23 -0500 (EST) (envelope-from krentel)
Date: Sat, 25 Mar 2000 18:37:23 -0500 (EST)
From: "Mark W. Krentel"
Message-Id: <200003252337.SAA05240@dreamscape.com>
To: freebsd-fs@FreeBSD.ORG
Subject: Re: ext2fs optional features
Cc: bde@zeta.org.au, kwc@world.std.com, loewis@cs.tu-berlin.de
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > Is it possible to support R/W mounts with these features?
> Everything is possible in software :-).

I guess I was really asking if some FreeBSD developer was working on supporting some of these features so that the mounts can be R/W legitimately. I'd offer to help, but it would only slow you down. :-)

> Currently (e2fstools 1.18), the following features are defined

What is e2fstools? Is this a Linux package?

Lastly, does anyone know what will happen with ext3fs? Will FreeBSD be able to read or write it?
--Mark Krentel