From owner-freebsd-geom@FreeBSD.ORG Mon Mar 4 11:06:42 2013
Date: Mon, 4 Mar 2013 11:06:41 GMT
From: FreeBSD bugmaster
To: freebsd-geom@FreeBSD.org
Subject: Current problem reports assigned to freebsd-geom@FreeBSD.org
Message-Id: <201303041106.r24B6f4O038740@freefall.freebsd.org>

Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp. Description
--------------------------------------------------------------------------------
o kern/171865  geom  [geom] [patch] g_wither_washer() keeping a core busy
o kern/170038  geom  [geom] geom_mirror always starts degraded after reboot
o kern/169539  geom  [geom] [patch] fix ability to run gmirror on MSI MegaR
a bin/169077   geom  bsdinstall(8) does not use partition labels in /etc/fs
f kern/165745  geom  [geom] geom_multipath page fault on removed drive
o kern/165428  geom  [glabel][patch] Add xfs support to glabel
o kern/164254  geom  [geom] gjournal not stopping on GPT partitions
o kern/164252  geom  [geom] gjournal overflow
o kern/164143  geom  [geom] Partition table not recognized after upgrade R8
a kern/163020  geom  [geli] [patch] enable the Camellia-XTS on GEOM ELI
o kern/162690  geom  [geom] gpart label changes only take effect after a re
o kern/162010  geom  [geli] panic: Provider's error should be set (error=0)
o kern/161979  geom  [geom] glabel doesn't update after newfs, and glabel s
o kern/161752  geom  [geom] glabel(8) doesn't get gpt label change
o bin/161677   geom  gpart(8) Probably bug in gptboot
o kern/160409  geom  [geli] failed to attach provider
f kern/159595  geom  [geom] [panic] panic on gmirror unload in vbox [regres
f kern/159414  geom  [isp] isp(4)+gmultipath(8) : removing active fiber pat
p kern/158398  geom  [headers] [patch] includes
o kern/158197  geom  [geom] geom_cache with size>1000 leads to panics
o kern/157879  geom  [libgeom] [regression] ABI change without version bump
o kern/157863  geom  [geli] kbdmux prevents geli passwords from being enter
o kern/157739  geom  [geom] GPT labels with geom_multipath
o kern/157724  geom  [geom] gpart(8) 'add' command must preserve gap for sc
o kern/157723  geom  [geom] GEOM should not process 'c' (raw) partitions fo
o kern/157108  geom  [gjournal] dumpon(8) fails on gjournal providers
o kern/155994  geom  [geom] Long "Suspend time" when reading large files fr
o kern/154226  geom  [geom] GEOM label does not change when you modify them
o kern/150858  geom  [geom] [geom_label] [patch] glabel(8) is not compatibl
o kern/150626  geom  [geom] [gjournal] gjournal(8) destroys label
o kern/150555  geom  [geom] gjournal unusable on GPT partitions
o kern/150334  geom  [geom] [udf] [patch] geom label does not support UDF
o kern/149762  geom  volume labels with rogue characters
o bin/149215   geom  [panic] [geom_part] gpart(8): Delete linux's slice via
o kern/147667  geom  [gmirror] Booting with one component of a gmirror, the
o kern/145818  geom  [geom] geom_stat_open showing cached information for n
o kern/145042  geom  [geom] System stops booting after printing message "GE
o kern/143455  geom  gstripe(8) in RELENG_8 (31st Jan 2010) broken
o kern/142563  geom  [geom] [hang] ioctl freeze in zpool
o kern/141740  geom  [geom] gjournal(8): g_journal_destroy concurrent error
o kern/140352  geom  [geom] gjournal + glabel not working
o kern/135898  geom  [geom] Severe filesystem corruption - large files or l
o kern/134113  geom  [geli] Problem setting secondary GELI key
o kern/133931  geom  [geli] [request] intentionally wrong password to destr
o bin/132845   geom  [geom] [patch] ggated(8) does not close files opened a
o bin/131415   geom  [geli] keystrokes are unregulary sent to Geli when typ
o kern/131353  geom  [geom] gjournal(8) kernel lock
o kern/129674  geom  [geom] gjournal root did not mount on boot
o kern/129645  geom  gjournal(8): GEOM_JOURNAL causes system to fail to boo
o kern/129245  geom  [geom] gcache is more suitable for suffix based provid
o kern/127420  geom  [geom] [gjournal] [panic] Journal overflow on gmirrore
o kern/124973  geom  [gjournal] [patch] boot order affects geom_journal con
o kern/124969  geom  gvinum(8): gvinum raid5 plex does not detect missing s
o kern/123962  geom  [panic] [gjournal] gjournal (455Gb data, 8Gb journal),
o kern/123122  geom  [geom] GEOM / gjournal kernel lock
o kern/122738  geom  [geom] gmirror list "losts consumers" after gmirror de
o kern/122067  geom  [geom] [panic] Geom crashed during boot
o kern/121364  geom  [gmirror] Removing all providers create a "zombie" mir
o kern/120091  geom  [geom] [geli] [gjournal] geli does not prompt for pass
o kern/115856  geom  [geli] ZFS thought it was degraded when it should have
o kern/115547  geom  [geom] [patch] [request] let GEOM Eli get password fro
o kern/113837  geom  [geom] unable to access 1024 sector size storage
o kern/113419  geom  [geom] geom fox multipathing not failing back
o kern/107707  geom  [geom] [patch] [request] add new class geom_xbox360 to
o kern/94632   geom  [geom] Kernel output resets input while GELI asks for
o kern/90582   geom  [geom] [panic] Restore cause panic string (ffs_blkfree
o bin/90093    geom  fdisk(8) incapable of altering in-core geometry
o kern/87544   geom  [gbde] mmaping large files on a gbde filesystem deadlo
o bin/86388    geom  [geom] [geom_part] periodic(8) daily should backup gpa
o kern/84556   geom  [geom] [panic] GBDE-encrypted swap causes panic at shu
o kern/79251   geom  [2TB] newfs fails on 2.6TB gbde device
o kern/79035   geom  [vinum] gvinum unable to create a striped set of mirro
o bin/78131    geom  gbde(8) "destroy" not working.

73 problems total.

From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 07:15:23 2013
Date: Tue, 5 Mar 2013 23:15:13 -0800 (PST)
From: Don Lewis
To: lev@FreeBSD.org
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-Id: <201303060715.r267FDHS015118@gw.catspoiler.org>
In-Reply-To: <612776324.20130301152756@serebryakov.spb.ru>

On 1 Mar, Lev Serebryakov wrote:
> Hello, Ivan.
> You wrote 28 February 2013, 21:01:46:
>
>>> One time Kirk said that delayed writes are OK for SU as long as the
>>> bottom layer doesn't lie about operation completeness.  geom_raid5
>>> can delay writes (in the hope that the next writes will combine
>>> nicely and avoid a read-calculate-write cycle), but it never marks a
>>> BIO complete until it really is completed (the layers below
>>> geom_raid5 have returned completion).  So every BIO in the wait
>>> queue is "in flight" from the GEOM/VFS point of view.  Maybe that is
>>> fatal for the journal :(
> IV> It shouldn't be - it could be a bug.
>
> I understand that it proves nothing, but I've tried to reproduce the
> "previous crash corrupts the FS in a journal-undetectable way" theory
> by killing a virtual system during massive writing to a
> geom_raid5-based FS (on virtual drives, unfortunately).  I've done 15
> tries (as it is manual testing, it takes about 1-1.5 hours total), and
> every time the FS was OK after a double fsck (first with the journal
> and then without it).  Of course, there was MASSIVE loss of data, as
> the timeout and cache size in geom_raid5 were set very high (sometimes
> the FS came up empty after unpacking 50% of an SVN mirror seed, crash
> and check), but the FS was consistent every time!

Did you have any power failures that took down the system sometime
before this panic occurred?  By default FreeBSD enables write caching
on ATA drives:

  kern.cam.ada.write_cache: 1
  kern.cam.ada.0.write_cache: -1    (-1 => use the system default value)

That means that the drive will immediately acknowledge writes and is
free to reorder them as it pleases.

When UFS+SU allocates a new inode, it first clears the available bit in
the bitmap and writes the bitmap block to disk before it writes the new
inode contents to disk.  When a file is deleted, the inode is zeroed on
disk before the available bit is set in the bitmap and the bitmap block
is written.  That means that if an inode is marked as available in the
bitmap, it should be zero.  The panic that you experienced happened
when the system was attempting to allocate an inode for a new file:
when it peeked at an inode that was marked as available, it found that
the inode was non-zero.

What might have happened is that sometime in the past the system was in
the process of creating a new file when a power failure occurred.  It
found an available inode, marked it as unavailable in the bitmap, and
wrote the bitmap block to the drive.  Because write caching was
enabled, the bitmap block was cached in the drive's write cache, and
the drive said that the write was complete.  After getting this
response, UFS+SU wrote the new inode contents to the drive, which was
also cached.  The drive then committed the inode contents to the
platters.  At this point the power failed, losing the contents of the
drive's write cache before the bitmap block was updated.  When the
system was powered up again, fsck just replayed the journal because you
were using SU+J, and didn't detect the inconsistency between the bitmap
and the actual inode contents (which would require a full fsck).  This
damage could remain latent for quite some time, and wouldn't be found
until the filesystem tried to allocate the inode in question.
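The write-cache sysctls quoted above can be read, and changed, programmatically
as well as with sysctl(8).  A minimal sketch using sysctlbyname(3); the ada0
unit number is an assumption, adjust it for the drive in question:

/*
 * Minimal sketch: read the global and per-drive ATA write-cache sysctls
 * mentioned above with sysctlbyname(3).  The "ada0" unit number is an
 * assumption; changing the value requires root privileges.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	int global_wc, drive_wc;
	size_t len;

	len = sizeof(global_wc);
	if (sysctlbyname("kern.cam.ada.write_cache", &global_wc, &len,
	    NULL, 0) == -1) {
		perror("kern.cam.ada.write_cache");
		return (1);
	}
	len = sizeof(drive_wc);
	if (sysctlbyname("kern.cam.ada.0.write_cache", &drive_wc, &len,
	    NULL, 0) == -1) {
		perror("kern.cam.ada.0.write_cache");
		return (1);
	}
	printf("global: %d, ada0: %d (-1 = use the global default)\n",
	    global_wc, drive_wc);

	/*
	 * To disable write caching on ada0 (as root):
	 *
	 *	int off = 0;
	 *	sysctlbyname("kern.cam.ada.0.write_cache", NULL, NULL,
	 *	    &off, sizeof(off));
	 */
	return (0);
}

Depending on the driver, a change may only take effect when the drive is
re-probed or after a reboot, so treat this purely as an illustration of the
knobs being discussed.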
From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 07:32:55 2013
Date: Wed, 6 Mar 2013 11:32:50 +0400
From: Lev Serebryakov
To: Don Lewis
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-ID: <1644513757.20130306113250@serebryakov.spb.ru>
In-Reply-To: <201303060715.r267FDHS015118@gw.catspoiler.org>

Hello, Don.
You wrote 6 March 2013, 11:15:13:

DL> Did you have any power failures that took down the system sometime
DL> before this panic occurred?  By default FreeBSD enables write caching on

I had another panic caused by my own clumsy hands... But I don't
remember any power failures, as I have a UPS and it works (I check it
every month).

DL> That means that the drive will immediately acknowledge writes and is
DL> free to reorder them as it pleases.

DL> When UFS+SU allocates a new inode, it first clears the available bit in
DL> the bitmap and writes the bitmap block to disk before it writes the new
DL> inode contents to disk.  When a file is deleted, the inode is zeroed on
DL> disk before the available bit is set in the bitmap and the bitmap block
DL> is written.  That means that if an inode is marked as available in the
DL> bitmap, it should be zero.  The panic that you experienced happened
DL> when the system was attempting to allocate an inode for a new file:
DL> when it peeked at an inode that was marked as available, it found that
DL> the inode was non-zero.

DL> What might have happened is that sometime in the past the system was in
>[SKIPPED]
DL> tried to allocate the inode in question.

This scenario looks plausible, but it raises another question: would
barriers protect against it?  It doesn't look like it, as a barrier
write is currently issued only when a new inode BLOCK is allocated.
And that leads to my other question: why not mark such vital writes
with a flag that forces the driver to treat them as "uncacheable" (and
the same for fsync()-induced writes)?  Again, not BIO_FLUSH, which
flushes the whole cache, but a per-BIO flag.  I was told by mav@ (the
ahci driver author) that ATA has such a capability, and I'm sure that
SCSI/SAS drives have one too.
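To make the BIO_FLUSH-versus-per-request-flag distinction concrete, here is a
rough GEOM-level sketch.  g_alloc_bio(), g_io_request() and g_io_flush() are
real kernel interfaces, but the BIO_UNCACHEABLE flag below is purely
hypothetical (roughly ATA FUA semantics) and exists only to illustrate the
proposal:

/*
 * Sketch only: BIO_FLUSH / g_io_flush() is the existing whole-cache
 * flush; the BIO_UNCACHEABLE bit below is a hypothetical per-request
 * "do not cache this write" marker and is NOT part of FreeBSD.
 */
#include <sys/param.h>
#include <sys/bio.h>
#include <geom/geom.h>

#define	BIO_UNCACHEABLE	0x80	/* hypothetical bio_flags bit */

static void
write_vital_block(struct g_consumer *cp, off_t offset, void *data,
    off_t length, void (*done)(struct bio *))
{
	struct bio *bp;

	bp = g_alloc_bio();
	bp->bio_cmd = BIO_WRITE;
	bp->bio_offset = offset;
	bp->bio_data = data;
	bp->bio_length = length;
	bp->bio_done = done;
	bp->bio_flags |= BIO_UNCACHEABLE;	/* the proposed per-BIO flag */
	g_io_request(bp, cp);

	/*
	 * Today the only generic alternative is to flush the drive's
	 * whole write cache once the write has completed:
	 *
	 *	g_io_flush(cp);
	 */
}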
-- 
// Black Lion AKA Lev Serebryakov

From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 08:15:28 2013
Date: Wed, 6 Mar 2013 00:15:18 -0800 (PST)
From: Don Lewis
To: lev@FreeBSD.org
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-Id: <201303060815.r268FIl5015220@gw.catspoiler.org>
In-Reply-To: <1644513757.20130306113250@serebryakov.spb.ru>

On 6 Mar, Lev Serebryakov wrote:
> Hello, Don.
> You wrote 6 March 2013, 11:15:13:
>
> DL> Did you have any power failures that took down the system sometime
> DL> before this panic occurred?  By default FreeBSD enables write caching on
>
> I had another panic caused by my own clumsy hands... But I don't
> remember any power failures, as I have a UPS and it works (I check it
> every month).
>
> DL> That means that the drive will immediately acknowledge writes and is
> DL> free to reorder them as it pleases.
>
> DL> When UFS+SU allocates a new inode, it first clears the available bit in
> DL> the bitmap and writes the bitmap block to disk before it writes the new
> DL> inode contents to disk.  When a file is deleted, the inode is zeroed on
> DL> disk before the available bit is set in the bitmap and the bitmap block
> DL> is written.  That means that if an inode is marked as available in the
> DL> bitmap, it should be zero.  The panic that you experienced happened
> DL> when the system was attempting to allocate an inode for a new file:
> DL> when it peeked at an inode that was marked as available, it found that
> DL> the inode was non-zero.
>
> DL> What might have happened is that sometime in the past the system was in
>>[SKIPPED]
> DL> tried to allocate the inode in question.
>
> This scenario looks plausible, but it raises another question: would
> barriers protect against it?  It doesn't look like it, as a barrier
> write is currently issued only when a new inode BLOCK is allocated.
> And that leads to my other question: why not mark such vital writes
> with a flag that forces the driver to treat them as "uncacheable" (and
> the same for fsync()-induced writes)?  Again, not BIO_FLUSH, which
> flushes the whole cache, but a per-BIO flag.  I was told by mav@ (the
> ahci driver author) that ATA has such a capability, and I'm sure that
> SCSI/SAS drives have one too.

In the existing implementation, barriers wouldn't help since they
aren't used in nearly enough places.  UFS+SU currently expects the
drive to tell it when the data actually hits the platter so that it can
control the write ordering.  In theory, barriers could be used instead,
but performance would be terrible if they got turned into cache
flushes.

With NCQ or TCQ, the drive can have a sizeable number of writes
internally queued and is free to reorder them as it pleases even with
write caching disabled, but if write caching is disabled it has to
delay the notification of their completion until the data is on the
platters so that UFS+SU can enforce the proper dependency ordering.

Many years ago, when UFS+SU was fairly new, I experimented with
enabling and disabling write caching on a SCSI drive with TCQ.
Performance was about the same either way.  I always disabled write
caching on my SCSI drives after that, because that is what UFS+SU
expects so that it can avoid inconsistencies in the case of a power
failure.

I don't know enough about ATA to say whether it supports marking
individual writes as uncacheable.  To support consistency on a drive
with write caching enabled, UFS+SU would have to mark many of its
writes as uncacheable.  Even if this works, calls to fsync() would have
to be turned into cache flushes to force the file data (assuming that
it was written with a cacheable write) to be written to the platters,
and only return to the userland program after the data is written.  If
drive write caching is off, then UFS+SU keeps track of the outstanding
writes and an fsync() call won't return until the drive notifies UFS+SU
that the data blocks for that file are actually written.  In this case,
the fsync() call doesn't need to get propagated down to the drive.

From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 08:41:44 2013
Date: Wed, 6 Mar 2013 12:41:39 +0400
From: Lev Serebryakov
To: Don Lewis
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-ID: <1198028260.20130306124139@serebryakov.spb.ru>
In-Reply-To: <201303060815.r268FIl5015220@gw.catspoiler.org>

Hello, Don.
You wrote 6 March 2013, 12:15:18:

>> This scenario looks plausible, but it raises another question: would
>> barriers protect against it?  It doesn't look like it, as a barrier
>> write is currently issued only when a new inode BLOCK is allocated.
>> And that leads to my other question: why not mark such vital writes
>> with a flag that forces the driver to treat them as "uncacheable" (and
>> the same for fsync()-induced writes)?  Again, not BIO_FLUSH, which
>> flushes the whole cache, but a per-BIO flag.  I was told by mav@ (the
>> ahci driver author) that ATA has such a capability, and I'm sure that
>> SCSI/SAS drives have one too.

DL> In the existing implementation, barriers wouldn't help since they
DL> aren't used in nearly enough places.  UFS+SU currently expects the
DL> drive to tell it when the data actually hits the platter so that it can
DL> control the write ordering.  In theory, barriers could be used instead,
DL> but performance would be terrible if they got turned into cache
DL> flushes.

Yep!  So we need stream (file/vnode/inode)-related barriers, or a
simple per-request (bp/bio) flag.

DL> With NCQ or TCQ, the drive can have a sizeable number of writes
DL> internally queued and is free to reorder them as it pleases even with
DL> write caching disabled, but if write caching is disabled it has to
DL> delay the notification of their completion until the data is on the
DL> platters so that UFS+SU can enforce the proper dependency ordering.

But, again, performance would be terrible :( I've checked it.  On very
sparse multi-threaded patterns (multiple torrent downloads on a fast
channel in my simple home case, and I think things could be worse for a
big file server in an organization) and "simple" SATA drives it is
significantly worse in my experience :(

DL> I don't know enough about ATA to say whether it supports marking
DL> individual writes as uncacheable.  To support consistency on a drive
DL> with write caching enabled, UFS+SU would have to mark many of its
DL> writes as uncacheable.  Even if this works, calls to fsync() would have
DL> to be

I don't see this as a big problem.  I did some experiments about a year
and a half ago, adding a counter everywhere the UFS/FFS code writes
metadata, and it came to about 1% of the writes on a busy file system
(torrents, csup update, buildworld, all on one big FS).

DL> turned into cache flushes to force the file data (assuming that
DL> it was written with a cacheable write) to be written to the platters,
DL> and only return to the userland program after the data is written.  If
DL> drive write caching is off, then UFS+SU keeps track of the outstanding
DL> writes and an fsync() call won't return until the drive notifies UFS+SU
DL> that the data blocks for that file are actually written.  In this case,
DL> the fsync() call doesn't need to get propagated down to the drive.

I see.  But then we should turn off the disk cache by default, and
write a whitepaper about this situation.  I don't know what is really
better for commodity SATA drives.  And I'm not sure that I understand
the UFS/FFS code well enough to do a proper experiment by adding such a
flag to our whole storage stack :(

And there is a second problem: SSDs.  I know nothing about their
caching strategies, and SSDs have very big RAM buffers compared to
commodity HDDs (something like 512MiB vs 64MiB).
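For reference, the userland side of the fsync(2) contract being discussed is
tiny; a minimal sketch (the file name is arbitrary), showing the point at
which the caller is supposed to be able to rely on the data being durable:

/*
 * Minimal userland sketch of the fsync(2) contract discussed above:
 * fsync() should not return until the written data is stable.  Whether
 * "stable" really means "on the platters" depends on the drive's
 * write-cache setting, which is the whole point of this thread.
 */
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	const char record[] = "critical record that must survive a crash\n";
	int fd;

	fd = open("/tmp/fsync-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd == -1)
		err(1, "open");
	if (write(fd, record, sizeof(record) - 1) == -1)
		err(1, "write");
	if (fsync(fd) == -1)	/* block until the kernel says it is stable */
		err(1, "fsync");
	if (close(fd) == -1)
		err(1, "close");
	return (0);
}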
-- 
// Black Lion AKA Lev Serebryakov

From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 10:01:18 2013
Date: Wed, 6 Mar 2013 02:01:08 -0800 (PST)
From: Don Lewis
To: lev@FreeBSD.org
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-Id: <201303061001.r26A18n3015414@gw.catspoiler.org>
In-Reply-To: <1198028260.20130306124139@serebryakov.spb.ru>

On 6 Mar, Lev Serebryakov wrote:
> DL> With NCQ or TCQ, the drive can have a sizeable number of writes
> DL> internally queued and is free to reorder them as it pleases even with
> DL> write caching disabled, but if write caching is disabled it has to
> DL> delay the notification of their completion until the data is on the
> DL> platters so that UFS+SU can enforce the proper dependency ordering.
>
> But, again, performance would be terrible :( I've checked it.  On very
> sparse multi-threaded patterns (multiple torrent downloads on a fast
> channel in my simple home case, and I think things could be worse for a
> big file server in an organization) and "simple" SATA drives it is
> significantly worse in my experience :(

I'm surprised that a typical drive would have enough onboard cache for
write caching to help significantly in that situation.  Is the torrent
software doing a lot of fsync() calls?  Those would essentially turn
into NOPs if write caching is enabled, but would stall the thread until
the data hits the platter if write caching is disabled.

One limitation of NCQ is that it only supports 32 simultaneous
commands.  With write caching enabled, you might be able to stuff more
writes into the drive's onboard memory so that it can do a better job
of optimizing the ordering and increase its number of I/Os per second,
though I wouldn't expect miracles.  A SAS drive and controller with TCQ
would support more simultaneous commands and might also perform better.

Creating a file by writing it in random order is fairly expensive.
Each time a new block is written by the application, UFS+SU has to
first find a free block by searching the block bitmaps, mark that block
as allocated, wait for that write of the bitmap block to complete,
write the data to that block, wait for that to complete, and then write
the block pointer to the inode or an indirect block.  Because of the
random write ordering, there is probably not enough locality to
coalesce multiple updates to the bitmap and indirect blocks into one
write before the syncer interval expires.  These operations all happen
in the background after the write() call, but once you hit the
I/O-per-second limit of the drive, eventually enough backlog builds up
to stall the application.  Also, if another update needs to be done to
a block that the syncer has queued for writing, that may also cause a
stall until the write completes.

If you hack the torrent software to create and pre-zero each file
before it starts downloading it, then each bitmap and indirect block
will probably only get written once during that operation and won't get
written again during the actual download, and zeroing the data blocks
will be sequential and fast.  During the download, the only writes will
be to the data blocks, so you might see something like a 3x performance
improvement.

From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 12:53:44 2013
Date: Wed, 6 Mar 2013 16:53:37 +0400
From: Lev Serebryakov
To: Don Lewis
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-ID: <1402477662.20130306165337@serebryakov.spb.ru>
In-Reply-To: <201303061001.r26A18n3015414@gw.catspoiler.org>

Hello, Don.
You wrote 6 March 2013, 14:01:08:

>> DL> With NCQ or TCQ, the drive can have a sizeable number of writes
>> DL> internally queued and is free to reorder them as it pleases even with
>> DL> write caching disabled, but if write caching is disabled it has to
>> DL> delay the notification of their completion until the data is on the
>> DL> platters so that UFS+SU can enforce the proper dependency ordering.
>>
>> But, again, performance would be terrible :( I've checked it.  On very
>> sparse multi-threaded patterns (multiple torrent downloads on a fast
>> channel in my simple home case, and I think things could be worse for a
>> big file server in an organization) and "simple" SATA drives it is
>> significantly worse in my experience :(

DL> I'm surprised that a typical drive would have enough onboard cache for
DL> write caching to help significantly in that situation.  Is the torrent

It is 5x64MiB in my case, oh, effectively 4x64MiB :)  Really, I could
repeat the experiment with some predictable and repeatable benchmark.
What in our ports could be used for a massively-parallel (16+ files),
random (with blocks like 64KiB and file sizes like 2+GiB) but
"repeatable" benchmark?

DL> software doing a lot of fsync() calls?  Those would essentially turn

Nope.  It tries to avoid fsync(), of course.

DL> Creating a file by writing it in random order is fairly expensive.
DL> Each time a new block is written by the application, UFS+SU has to
DL> first find a free block by searching the block bitmaps, mark that block
DL> as allocated, wait for that write of the bitmap block to complete,
DL> write the data to that block, wait for that to complete, and then write
DL> the block pointer to the inode or an indirect block.  Because of the
DL> random write ordering, there is probably not enough locality to
DL> coalesce multiple updates to the bitmap and indirect blocks into one
DL> write before the syncer interval expires.  These operations all happen
DL> in the background after the write() call, but once you hit the
DL> I/O-per-second limit of the drive, eventually enough backlog builds up
DL> to stall the application.  Also, if another update needs to be done to
DL> a block that the syncer has queued for writing, that may also cause a
DL> stall until the write completes.  If you hack the torrent software to
DL> create and pre-zero each file before it starts downloading it, then
DL> each bitmap and indirect block will probably only get written once
DL> during that operation and won't get written again during the actual
DL> download, and zeroing the data blocks will be sequential and fast.
DL> During the download, the only writes will be to the data blocks, so you
DL> might see something like a 3x performance improvement.

My client (transmission, from ports) is configured to do "real
preallocation" (not the sparse kind), but it doesn't help much.  It is
surely limited by disk I/O :(

But anyway, a torrent client is a bad benchmark if we start to speak
about real experiments to decide what could be improved in the FFS/GEOM
stack, as it is not very repeatable.

-- 
// Black Lion AKA Lev Serebryakov

From owner-freebsd-geom@FreeBSD.ORG Sat Mar 9 03:04:07 2013
Date: Fri, 8 Mar 2013 19:03:52 -0800 (PST)
From: Don Lewis
To: lev@FreeBSD.org
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-Id: <201303090303.r2933qqJ032330@gw.catspoiler.org>
In-Reply-To: <1402477662.20130306165337@serebryakov.spb.ru>

On 6 Mar, Lev Serebryakov wrote:
> Hello, Don.
> You wrote 6 March 2013, 14:01:08:
>
>>> DL> With NCQ or TCQ, the drive can have a sizeable number of writes
>>> DL> internally queued and is free to reorder them as it pleases even with
>>> DL> write caching disabled, but if write caching is disabled it has to
>>> DL> delay the notification of their completion until the data is on the
>>> DL> platters so that UFS+SU can enforce the proper dependency ordering.
>>>
>>> But, again, performance would be terrible :( I've checked it.  On very
>>> sparse multi-threaded patterns (multiple torrent downloads on a fast
>>> channel in my simple home case, and I think things could be worse for a
>>> big file server in an organization) and "simple" SATA drives it is
>>> significantly worse in my experience :(
>
> DL> I'm surprised that a typical drive would have enough onboard cache for
> DL> write caching to help significantly in that situation.  Is the torrent
>
> It is 5x64MiB in my case, oh, effectively 4x64MiB :)  Really, I could
> repeat the experiment with some predictable and repeatable benchmark.
> What in our ports could be used for a massively-parallel (16+ files),
> random (with blocks like 64KiB and file sizes like 2+GiB) but
> "repeatable" benchmark?

I don't happen to know of any benchmark software in ports for this, but
I haven't really looked.

> DL> software doing a lot of fsync() calls?  Those would essentially turn
>
> Nope.  It tries to avoid fsync(), of course.
>
> DL> Creating a file by writing it in random order is fairly expensive.
> DL> Each time a new block is written by the application, UFS+SU has to
> DL> first find a free block by searching the block bitmaps, mark that block
> DL> as allocated, wait for that write of the bitmap block to complete,
> DL> write the data to that block, wait for that to complete, and then write
> DL> the block pointer to the inode or an indirect block.  Because of the
> DL> random write ordering, there is probably not enough locality to
> DL> coalesce multiple updates to the bitmap and indirect blocks into one
> DL> write before the syncer interval expires.  These operations all happen
> DL> in the background after the write() call, but once you hit the
> DL> I/O-per-second limit of the drive, eventually enough backlog builds up
> DL> to stall the application.  Also, if another update needs to be done to
> DL> a block that the syncer has queued for writing, that may also cause a
> DL> stall until the write completes.  If you hack the torrent software to
> DL> create and pre-zero each file before it starts downloading it, then
> DL> each bitmap and indirect block will probably only get written once
> DL> during that operation and won't get written again during the actual
> DL> download, and zeroing the data blocks will be sequential and fast.
> DL> During the download, the only writes will be to the data blocks, so you
> DL> might see something like a 3x performance improvement.
>
> My client (transmission, from ports) is configured to do "real
> preallocation" (not the sparse kind), but it doesn't help much.  It is
> surely limited by disk I/O :(
>
> But anyway, a torrent client is a bad benchmark if we start to speak
> about real experiments to decide what could be improved in the FFS/GEOM
> stack, as it is not very repeatable.

I seem to recall you mentioning that the raid5 geom layer is doing a
lot of caching, presumably to coalesce writes.  If this causes the
responses to writes to be delayed too much, then the geom layer could
end up starved for writes because the vfs.hirunningspace limit will be
reached.  If this happens, you'll see threads waiting on wdrain.  You
could also monitor vfs.runningbufspace to see how close it is getting
to the limit.  If this is the problem, you might want to try cranking
up the value of vfs.hirunningspace to see if it helps.

One thing that doesn't seem to fit this theory is that if the raid5
layer is doing a lot of caching to try to do write coalescing, then I
wouldn't expect that the extra write-completion latency caused by
turning off write caching in the drives would make much of a
difference.

Another possibility is that you might be running into the 32-command
NCQ limit when write caching is off.  With write caching on, you can
probably shove a lot more write commands into the drive before being
blocked.  That might help the drive get slightly higher IOPS, but I
wouldn't expect a big difference.  It could also be that when you hit
the limit, you end up blocking read commands from getting sent to the
drives, which causes whatever is depending on the data to stall.  The
gstat command allows the queue length and the number of reads and
writes to be monitored, but I don't know of a way to monitor the number
of read and write commands that the drive has in its internal queue.

Something else to look at is what problems the delayed write-completion
notifications from the drives might cause in the raid5 layer itself.
Could they be preventing the raid5 layer from sending other I/O
commands to the drives?  Between the time a write command has been sent
to a drive and the time the drive reports the completion of the write,
what happens if something wants to touch that buffer?

What size writes does the application typically do?  What is the UFS
blocksize?  What is the raid5 stripe size?  With this access pattern,
you may get poor results if the stripe size is much greater than the
block and write sizes.

From owner-freebsd-geom@FreeBSD.ORG Sat Mar 9 12:08:28 2013
Date: Sat, 9 Mar 2013 16:08:17 +0400
From: Lev Serebryakov
To: Don Lewis
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-ID: <1809201254.20130309160817@serebryakov.spb.ru>
In-Reply-To: <201303090303.r2933qqJ032330@gw.catspoiler.org>

Hello, Don.
You wrote 9 March 2013, 7:03:52:

>> But anyway, a torrent client is a bad benchmark if we start to speak
>> about real experiments to decide what could be improved in the FFS/GEOM
>> stack, as it is not very repeatable.

DL> I seem to recall you mentioning that the raid5 geom layer is doing a
DL> lot of caching, presumably to coalesce writes.  If this causes the
DL> responses to writes to be delayed too much, then the geom layer could
DL> end up starved for writes because the vfs.hirunningspace limit will be
DL> reached.  If this happens, you'll see threads waiting on wdrain.  You
DL> could also monitor vfs.runningbufspace to see how close it is getting
DL> to the limit.  If this is the problem, you might want to try cranking up

Strangely enough, vfs.runningbufspace is always zero, even under load.
My geom_raid5 is configured to delay writes for up to 15 seconds...

DL> Something else to look at is what problems the delayed write-completion
DL> notifications from the drives might cause in the raid5 layer itself.
DL> Could they be preventing the raid5 layer from sending other I/O
DL> commands to the drives?  Between the time a write command has been sent

Nope, it should not.  I'm not 100% sure, as I picked these sources up
from the original author and they are rather cryptic, but I could not
see any throttling in them.

DL> to a drive and the time the drive reports the completion of the write,
DL> what happens if something wants to touch that buffer?

DL> What size writes does the application typically do?  What is the UFS

64K writes, 32K blocksize, 128K stripe size... Now I'm analyzing traces
from this device to understand the exact write patterns.

DL> blocksize?  What is the raid5 stripe size?  With this access pattern,
DL> you may get poor results if the stripe size is much greater than the
DL> block and write sizes.

-- 
// Black Lion AKA Lev Serebryakov
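A quick back-of-the-envelope check of those numbers, treating the reported
128K as the per-drive stripe unit and assuming five drives (taken from the
earlier "5x64MiB" remark, so an assumption): any write that covers less than
a full data stripe forces the RAID5 layer to read before it can recompute
parity, which is the read-calculate-write cycle mentioned at the start of the
thread.  A small illustrative calculation:

/*
 * Back-of-the-envelope illustration of the stripe-size question above.
 * Numbers from this thread: 64KiB application writes, 32KiB UFS blocks,
 * a 128KiB raid5 stripe; the 5-drive count is an assumption.  Any write
 * that covers less than a full data stripe forces a read before the new
 * parity can be computed.
 */
#include <stdio.h>

#define	KiB		1024UL
#define	NDRIVES		5		/* assumption */
#define	STRIPE_UNIT	(128 * KiB)	/* treated as the per-drive stripe */
#define	WRITE_SIZE	(64 * KiB)	/* typical application write */

int
main(void)
{
	unsigned long full_stripe = STRIPE_UNIT * (NDRIVES - 1);

	printf("full data stripe: %lu KiB\n", full_stripe / KiB);
	printf("a 64 KiB write covers %.1f%% of one stripe unit and %.1f%% "
	    "of a full stripe,\n", 100.0 * WRITE_SIZE / STRIPE_UNIT,
	    100.0 * WRITE_SIZE / full_stripe);
	printf("so each one needs a read-modify-write unless the raid5 "
	    "cache merges\nneighboring writes into a full stripe first.\n");
	return (0);
}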