From owner-freebsd-geom@FreeBSD.ORG Mon Mar 4 11:06:42 2013
Date: Mon, 4 Mar 2013 11:06:41 GMT
From: FreeBSD bugmaster
To: freebsd-geom@FreeBSD.org
Subject: Current problem reports assigned to freebsd-geom@FreeBSD.org
Message-Id: <201303041106.r24B6f4O038740@freefall.freebsd.org>

Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp. Description
--------------------------------------------------------------------------------
o kern/171865  geom  [geom] [patch] g_wither_washer() keeping a core busy
o kern/170038  geom  [geom] geom_mirror always starts degraded after reboot
o kern/169539  geom  [geom] [patch] fix ability to run gmirror on MSI MegaR
a bin/169077   geom  bsdinstall(8) does not use partition labels in /etc/fs
f kern/165745  geom  [geom] geom_multipath page fault on removed drive
o kern/165428  geom  [glabel][patch] Add xfs support to glabel
o kern/164254  geom  [geom] gjournal not stopping on GPT partitions
o kern/164252  geom  [geom] gjournal overflow
o kern/164143  geom  [geom] Partition table not recognized after upgrade R8
a kern/163020  geom  [geli] [patch] enable the Camellia-XTS on GEOM ELI
o kern/162690  geom  [geom] gpart label changes only take effect after a re
o kern/162010  geom  [geli] panic: Provider's error should be set (error=0)
o kern/161979  geom  [geom] glabel doesn't update after newfs, and glabel s
o kern/161752  geom  [geom] glabel(8) doesn't get gpt label change
o bin/161677   geom  gpart(8) Probably bug in gptboot
o kern/160409  geom  [geli] failed to attach provider
f kern/159595  geom  [geom] [panic] panic on gmirror unload in vbox [regres
f kern/159414  geom  [isp] isp(4)+gmultipath(8) : removing active fiber pat
p kern/158398  geom  [headers] [patch] includes
o kern/158197  geom  [geom] geom_cache with size>1000 leads to panics
o kern/157879  geom  [libgeom] [regression] ABI change without version bump
o kern/157863  geom  [geli] kbdmux prevents geli passwords from being enter
o kern/157739  geom  [geom] GPT labels with geom_multipath
o kern/157724  geom  [geom] gpart(8) 'add' command must preserve gap for sc
o kern/157723  geom  [geom] GEOM should not process 'c' (raw) partitions fo
o kern/157108  geom  [gjournal] dumpon(8) fails on gjournal providers
o kern/155994  geom  [geom] Long "Suspend time" when reading large files fr
o kern/154226  geom  [geom] GEOM label does not change when you modify them
o kern/150858  geom  [geom] [geom_label] [patch] glabel(8) is not compatibl
o kern/150626  geom  [geom] [gjournal] gjournal(8) destroys label
o kern/150555  geom  [geom] gjournal unusable on GPT partitions
o kern/150334  geom  [geom] [udf] [patch] geom label does not support UDF
o kern/149762  geom  volume labels with rogue characters
o bin/149215   geom  [panic] [geom_part] gpart(8): Delete linux's slice via
o kern/147667  geom  [gmirror] Booting with one component of a gmirror, the
o kern/145818  geom  [geom] geom_stat_open showing cached information for n
o kern/145042  geom  [geom] System stops booting after printing message "GE
o kern/143455  geom  gstripe(8) in RELENG_8 (31st Jan 2010) broken
o kern/142563  geom  [geom] [hang] ioctl freeze in zpool
o kern/141740  geom  [geom] gjournal(8): g_journal_destroy concurrent error
o kern/140352  geom  [geom] gjournal + glabel not working
o kern/135898  geom  [geom] Severe filesystem corruption - large files or l
o kern/134113  geom  [geli] Problem setting secondary GELI key
o kern/133931  geom  [geli] [request] intentionally wrong password to destr
o bin/132845   geom  [geom] [patch] ggated(8) does not close files opened a
o bin/131415   geom  [geli] keystrokes are unregulary sent to Geli when typ
o kern/131353  geom  [geom] gjournal(8) kernel lock
o kern/129674  geom  [geom] gjournal root did not mount on boot
o kern/129645  geom  gjournal(8): GEOM_JOURNAL causes system to fail to boo
o kern/129245  geom  [geom] gcache is more suitable for suffix based provid
o kern/127420  geom  [geom] [gjournal] [panic] Journal overflow on gmirrore
o kern/124973  geom  [gjournal] [patch] boot order affects geom_journal con
o kern/124969  geom  gvinum(8): gvinum raid5 plex does not detect missing s
o kern/123962  geom  [panic] [gjournal] gjournal (455Gb data, 8Gb journal),
o kern/123122  geom  [geom] GEOM / gjournal kernel lock
o kern/122738  geom  [geom] gmirror list "losts consumers" after gmirror de
o kern/122067  geom  [geom] [panic] Geom crashed during boot
o kern/121364  geom  [gmirror] Removing all providers create a "zombie" mir
o kern/120091  geom  [geom] [geli] [gjournal] geli does not prompt for pass
o kern/115856  geom  [geli] ZFS thought it was degraded when it should have
o kern/115547  geom  [geom] [patch] [request] let GEOM Eli get password fro
o kern/113837  geom  [geom] unable to access 1024 sector size storage
o kern/113419  geom  [geom] geom fox multipathing not failing back
o kern/107707  geom  [geom] [patch] [request] add new class geom_xbox360 to
o kern/94632   geom  [geom] Kernel output resets input while GELI asks for
o kern/90582   geom  [geom] [panic] Restore cause panic string (ffs_blkfree
o bin/90093    geom  fdisk(8) incapable of altering in-core geometry
o kern/87544   geom  [gbde] mmaping large files on a gbde filesystem deadlo
o bin/86388    geom  [geom] [geom_part] periodic(8) daily should backup gpa
o kern/84556   geom  [geom] [panic] GBDE-encrypted swap causes panic at shu
o kern/79251   geom  [2TB] newfs fails on 2.6TB gbde device
o kern/79035   geom  [vinum] gvinum unable to create a striped set of mirro
o bin/78131    geom  gbde(8) "destroy" not working.

73 problems total.

From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 07:15:23 2013
Date: Tue, 5 Mar 2013 23:15:13 -0800 (PST)
From: Don Lewis
To: lev@FreeBSD.org
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-Id: <201303060715.r267FDHS015118@gw.catspoiler.org>
In-Reply-To: <612776324.20130301152756@serebryakov.spb.ru>

On 1 Mar, Lev Serebryakov wrote:
> Hello, Ivan.
> You wrote 28 February 2013, 21:01:46:
>
>>> One time Kirk said that delayed writes are OK for SU as long as the
>>> bottom layer doesn't lie about operation completeness.  geom_raid5
>>> can delay writes (in the hope that the next writes will combine
>>> nicely and avoid a read-calculate-write cycle), but it never marks a
>>> BIO complete until it really is completed (the layers below
>>> geom_raid5 have returned completion).  So every BIO in the wait
>>> queue is "in flight" from the GEOM/VFS point of view.  Maybe that is
>>> fatal for the journal :(
> IV> It shouldn't be - it could be a bug.
>
> I understand that it proves nothing, but I've tried to reproduce the
> "previous crash corrupts the FS in a journal-undetectable way" theory
> by killing a virtual system during massive writing to a
> geom_raid5-based FS (on virtual drives, unfortunately).  I've done 15
> tries (as it is manual testing, it takes about 1-1.5 hours total), and
> every time the FS was OK after a double fsck (first with the journal
> and then without it).  Of course, there was MASSIVE loss of data, as
> the timeout and cache size in geom_raid5 were set very high (sometimes
> the FS came up empty after unpacking 50% of an SVN mirror seed, crash
> and check), but the FS was consistent every time!

Did you have any power failures that took down the system sometime
before this panic occurred?  By default FreeBSD enables write caching
on ATA drives:

  kern.cam.ada.write_cache: 1
  kern.cam.ada.0.write_cache: -1    (-1 => use the system default value)

That means that the drive will immediately acknowledge writes and is
free to reorder them as it pleases.

When UFS+SU allocates a new inode, it first clears the available bit in
the bitmap and writes the bitmap block to disk before it writes the new
inode contents to disk.  When a file is deleted, the inode is zeroed on
disk before the available bit is set in the bitmap and the bitmap block
is written.  That means that if an inode is marked as available in the
bitmap, it should be zero.  The panic that you experienced happened
when the system was attempting to allocate an inode for a new file:
when it peeked at an inode that was marked as available, it found that
the inode was non-zero.

What might have happened is that sometime in the past the system was in
the process of creating a new file when a power failure occurred.  It
found an available inode, marked it as unavailable in the bitmap, and
wrote the bitmap block to the drive.  Because write caching was
enabled, the bitmap block was cached in the drive's write cache, and
the drive said that the write was complete.  After getting this
response, UFS+SU wrote the new inode contents to the drive, which was
also cached.  The drive then committed the inode contents to the
platters.  At this point the power failed, losing the contents of the
drive's write cache before the bitmap block was updated.  When the
system was powered up again, fsck just replayed the journal because you
were using SU+J, and didn't detect the inconsistency between the bitmap
and the actual inode contents (which would require a full fsck).  This
damage could remain latent for quite some time, and wouldn't be found
until the filesystem tried to allocate the inode in question.
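The write-cache sysctls quoted above can be read, and changed, programmatically
as well as with sysctl(8).  A minimal sketch using sysctlbyname(3); the ada0
unit number is an assumption, adjust it for the drive in question:

/*
 * Minimal sketch: read the global and per-drive ATA write-cache sysctls
 * mentioned above with sysctlbyname(3).  The "ada0" unit number is an
 * assumption; changing the value requires root privileges.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	int global_wc, drive_wc;
	size_t len;

	len = sizeof(global_wc);
	if (sysctlbyname("kern.cam.ada.write_cache", &global_wc, &len,
	    NULL, 0) == -1) {
		perror("kern.cam.ada.write_cache");
		return (1);
	}
	len = sizeof(drive_wc);
	if (sysctlbyname("kern.cam.ada.0.write_cache", &drive_wc, &len,
	    NULL, 0) == -1) {
		perror("kern.cam.ada.0.write_cache");
		return (1);
	}
	printf("global: %d, ada0: %d (-1 = use the global default)\n",
	    global_wc, drive_wc);

	/*
	 * To disable write caching on ada0 (as root):
	 *
	 *	int off = 0;
	 *	sysctlbyname("kern.cam.ada.0.write_cache", NULL, NULL,
	 *	    &off, sizeof(off));
	 */
	return (0);
}

Depending on the driver, a change may only take effect when the drive is
re-probed or after a reboot, so treat this purely as an illustration of the
knobs being discussed.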
From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 07:32:55 2013
Date: Wed, 6 Mar 2013 11:32:50 +0400
From: Lev Serebryakov
To: Don Lewis
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-ID: <1644513757.20130306113250@serebryakov.spb.ru>
In-Reply-To: <201303060715.r267FDHS015118@gw.catspoiler.org>

Hello, Don.
You wrote 6 March 2013, 11:15:13:

DL> Did you have any power failures that took down the system sometime
DL> before this panic occurred?  By default FreeBSD enables write caching on

I had another panic caused by my own clumsy hands... But I don't
remember any power failures, as I have a UPS and it works (I check it
every month).

DL> That means that the drive will immediately acknowledge writes and is
DL> free to reorder them as it pleases.

DL> When UFS+SU allocates a new inode, it first clears the available bit in
DL> the bitmap and writes the bitmap block to disk before it writes the new
DL> inode contents to disk.  When a file is deleted, the inode is zeroed on
DL> disk before the available bit is set in the bitmap and the bitmap block
DL> is written.  That means that if an inode is marked as available in the
DL> bitmap, it should be zero.  The panic that you experienced happened
DL> when the system was attempting to allocate an inode for a new file:
DL> when it peeked at an inode that was marked as available, it found that
DL> the inode was non-zero.

DL> What might have happened is that sometime in the past the system was in
>[SKIPPED]
DL> tried to allocate the inode in question.

This scenario looks plausible, but it raises another question: would
barriers protect against it?  It doesn't look like it, as a barrier
write is currently issued only when a new inode BLOCK is allocated.
And that leads to my other question: why not mark such vital writes
with a flag that forces the driver to treat them as "uncacheable" (and
the same for fsync()-induced writes)?  Again, not BIO_FLUSH, which
flushes the whole cache, but a per-BIO flag.  I was told by mav@ (the
ahci driver author) that ATA has such a capability, and I'm sure that
SCSI/SAS drives have one too.
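To make the BIO_FLUSH-versus-per-request-flag distinction concrete, here is a
rough GEOM-level sketch.  g_alloc_bio(), g_io_request() and g_io_flush() are
real kernel interfaces, but the BIO_UNCACHEABLE flag below is purely
hypothetical (roughly ATA FUA semantics) and exists only to illustrate the
proposal:

/*
 * Sketch only: BIO_FLUSH / g_io_flush() is the existing whole-cache
 * flush; the BIO_UNCACHEABLE bit below is a hypothetical per-request
 * "do not cache this write" marker and is NOT part of FreeBSD.
 */
#include <sys/param.h>
#include <sys/bio.h>
#include <geom/geom.h>

#define	BIO_UNCACHEABLE	0x80	/* hypothetical bio_flags bit */

static void
write_vital_block(struct g_consumer *cp, off_t offset, void *data,
    off_t length, void (*done)(struct bio *))
{
	struct bio *bp;

	bp = g_alloc_bio();
	bp->bio_cmd = BIO_WRITE;
	bp->bio_offset = offset;
	bp->bio_data = data;
	bp->bio_length = length;
	bp->bio_done = done;
	bp->bio_flags |= BIO_UNCACHEABLE;	/* the proposed per-BIO flag */
	g_io_request(bp, cp);

	/*
	 * Today the only generic alternative is to flush the drive's
	 * whole write cache once the write has completed:
	 *
	 *	g_io_flush(cp);
	 */
}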
-- 
// Black Lion AKA Lev Serebryakov

From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 08:15:28 2013
Date: Wed, 6 Mar 2013 00:15:18 -0800 (PST)
From: Don Lewis
To: lev@FreeBSD.org
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-Id: <201303060815.r268FIl5015220@gw.catspoiler.org>
In-Reply-To: <1644513757.20130306113250@serebryakov.spb.ru>

On 6 Mar, Lev Serebryakov wrote:
> Hello, Don.
> You wrote 6 March 2013, 11:15:13:
>
> DL> Did you have any power failures that took down the system sometime
> DL> before this panic occurred?  By default FreeBSD enables write caching on
>
> I had another panic caused by my own clumsy hands... But I don't
> remember any power failures, as I have a UPS and it works (I check it
> every month).
>
> DL> That means that the drive will immediately acknowledge writes and is
> DL> free to reorder them as it pleases.
>
> DL> When UFS+SU allocates a new inode, it first clears the available bit in
> DL> the bitmap and writes the bitmap block to disk before it writes the new
> DL> inode contents to disk.  When a file is deleted, the inode is zeroed on
> DL> disk before the available bit is set in the bitmap and the bitmap block
> DL> is written.  That means that if an inode is marked as available in the
> DL> bitmap, it should be zero.  The panic that you experienced happened
> DL> when the system was attempting to allocate an inode for a new file:
> DL> when it peeked at an inode that was marked as available, it found that
> DL> the inode was non-zero.
>
> DL> What might have happened is that sometime in the past the system was in
>>[SKIPPED]
> DL> tried to allocate the inode in question.
>
> This scenario looks plausible, but it raises another question: would
> barriers protect against it?  It doesn't look like it, as a barrier
> write is currently issued only when a new inode BLOCK is allocated.
> And that leads to my other question: why not mark such vital writes
> with a flag that forces the driver to treat them as "uncacheable" (and
> the same for fsync()-induced writes)?  Again, not BIO_FLUSH, which
> flushes the whole cache, but a per-BIO flag.  I was told by mav@ (the
> ahci driver author) that ATA has such a capability, and I'm sure that
> SCSI/SAS drives have one too.

In the existing implementation, barriers wouldn't help since they
aren't used in nearly enough places.  UFS+SU currently expects the
drive to tell it when the data actually hits the platter so that it can
control the write ordering.  In theory, barriers could be used instead,
but performance would be terrible if they got turned into cache
flushes.

With NCQ or TCQ, the drive can have a sizeable number of writes
internally queued and is free to reorder them as it pleases even with
write caching disabled, but if write caching is disabled it has to
delay the notification of their completion until the data is on the
platters so that UFS+SU can enforce the proper dependency ordering.

Many years ago, when UFS+SU was fairly new, I experimented with
enabling and disabling write caching on a SCSI drive with TCQ.
Performance was about the same either way.  I always disabled write
caching on my SCSI drives after that, because that is what UFS+SU
expects so that it can avoid inconsistencies in the case of a power
failure.

I don't know enough about ATA to say whether it supports marking
individual writes as uncacheable.  To support consistency on a drive
with write caching enabled, UFS+SU would have to mark many of its
writes as uncacheable.  Even if this works, calls to fsync() would have
to be turned into cache flushes to force the file data (assuming that
it was written with a cacheable write) to be written to the platters,
and only return to the userland program after the data is written.  If
drive write caching is off, then UFS+SU keeps track of the outstanding
writes and an fsync() call won't return until the drive notifies UFS+SU
that the data blocks for that file are actually written.  In this case,
the fsync() call doesn't need to get propagated down to the drive.

From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 08:41:44 2013
Date: Wed, 6 Mar 2013 12:41:39 +0400
From: Lev Serebryakov
To: Don Lewis
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-ID: <1198028260.20130306124139@serebryakov.spb.ru>
In-Reply-To: <201303060815.r268FIl5015220@gw.catspoiler.org>

Hello, Don.
You wrote 6 March 2013, 12:15:18:

>> This scenario looks plausible, but it raises another question: would
>> barriers protect against it?  It doesn't look like it, as a barrier
>> write is currently issued only when a new inode BLOCK is allocated.
>> And that leads to my other question: why not mark such vital writes
>> with a flag that forces the driver to treat them as "uncacheable" (and
>> the same for fsync()-induced writes)?  Again, not BIO_FLUSH, which
>> flushes the whole cache, but a per-BIO flag.  I was told by mav@ (the
>> ahci driver author) that ATA has such a capability, and I'm sure that
>> SCSI/SAS drives have one too.

DL> In the existing implementation, barriers wouldn't help since they
DL> aren't used in nearly enough places.  UFS+SU currently expects the
DL> drive to tell it when the data actually hits the platter so that it can
DL> control the write ordering.  In theory, barriers could be used instead,
DL> but performance would be terrible if they got turned into cache
DL> flushes.

Yep!  So we need stream (file/vnode/inode)-related barriers, or a
simple per-request (bp/bio) flag.

DL> With NCQ or TCQ, the drive can have a sizeable number of writes
DL> internally queued and is free to reorder them as it pleases even with
DL> write caching disabled, but if write caching is disabled it has to
DL> delay the notification of their completion until the data is on the
DL> platters so that UFS+SU can enforce the proper dependency ordering.

But, again, performance would be terrible :( I've checked it.  On very
sparse multi-threaded patterns (multiple torrent downloads on a fast
channel in my simple home case, and I think things could be worse for a
big file server in an organization) and "simple" SATA drives it is
significantly worse in my experience :(

DL> I don't know enough about ATA to say whether it supports marking
DL> individual writes as uncacheable.  To support consistency on a drive
DL> with write caching enabled, UFS+SU would have to mark many of its
DL> writes as uncacheable.  Even if this works, calls to fsync() would have
DL> to be

I don't see this as a big problem.  I did some experiments about a year
and a half ago, adding a counter everywhere the UFS/FFS code writes
metadata, and it came to about 1% of the writes on a busy file system
(torrents, csup update, buildworld, all on one big FS).

DL> turned into cache flushes to force the file data (assuming that
DL> it was written with a cacheable write) to be written to the platters,
DL> and only return to the userland program after the data is written.  If
DL> drive write caching is off, then UFS+SU keeps track of the outstanding
DL> writes and an fsync() call won't return until the drive notifies UFS+SU
DL> that the data blocks for that file are actually written.  In this case,
DL> the fsync() call doesn't need to get propagated down to the drive.

I see.  But then we should turn off the disk cache by default, and
write a whitepaper about this situation.  I don't know what is really
better for commodity SATA drives.  And I'm not sure that I understand
the UFS/FFS code well enough to do a proper experiment by adding such a
flag to our whole storage stack :(

And there is a second problem: SSDs.  I know nothing about their
caching strategies, and SSDs have very big RAM buffers compared to
commodity HDDs (something like 512MiB vs 64MiB).
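For reference, the userland side of the fsync(2) contract being discussed is
tiny; a minimal sketch (the file name is arbitrary), showing the point at
which the caller is supposed to be able to rely on the data being durable:

/*
 * Minimal userland sketch of the fsync(2) contract discussed above:
 * fsync() should not return until the written data is stable.  Whether
 * "stable" really means "on the platters" depends on the drive's
 * write-cache setting, which is the whole point of this thread.
 */
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	const char record[] = "critical record that must survive a crash\n";
	int fd;

	fd = open("/tmp/fsync-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd == -1)
		err(1, "open");
	if (write(fd, record, sizeof(record) - 1) == -1)
		err(1, "write");
	if (fsync(fd) == -1)	/* block until the kernel says it is stable */
		err(1, "fsync");
	if (close(fd) == -1)
		err(1, "close");
	return (0);
}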
-- 
// Black Lion AKA Lev Serebryakov

From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 10:01:18 2013
Date: Wed, 6 Mar 2013 02:01:08 -0800 (PST)
From: Don Lewis
To: lev@FreeBSD.org
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-Id: <201303061001.r26A18n3015414@gw.catspoiler.org>
In-Reply-To: <1198028260.20130306124139@serebryakov.spb.ru>

On 6 Mar, Lev Serebryakov wrote:
> DL> With NCQ or TCQ, the drive can have a sizeable number of writes
> DL> internally queued and is free to reorder them as it pleases even with
> DL> write caching disabled, but if write caching is disabled it has to
> DL> delay the notification of their completion until the data is on the
> DL> platters so that UFS+SU can enforce the proper dependency ordering.
>
> But, again, performance would be terrible :( I've checked it.  On very
> sparse multi-threaded patterns (multiple torrent downloads on a fast
> channel in my simple home case, and I think things could be worse for a
> big file server in an organization) and "simple" SATA drives it is
> significantly worse in my experience :(

I'm surprised that a typical drive would have enough onboard cache for
write caching to help significantly in that situation.  Is the torrent
software doing a lot of fsync() calls?  Those would essentially turn
into NOPs if write caching is enabled, but would stall the thread until
the data hits the platter if write caching is disabled.

One limitation of NCQ is that it only supports 32 simultaneous
commands.  With write caching enabled, you might be able to stuff more
writes into the drive's onboard memory so that it can do a better job
of optimizing the ordering and increase its number of I/Os per second,
though I wouldn't expect miracles.  A SAS drive and controller with TCQ
would support more simultaneous commands and might also perform better.

Creating a file by writing it in random order is fairly expensive.
Each time a new block is written by the application, UFS+SU has to
first find a free block by searching the block bitmaps, mark that block
as allocated, wait for that write of the bitmap block to complete,
write the data to that block, wait for that to complete, and then write
the block pointer to the inode or an indirect block.  Because of the
random write ordering, there is probably not enough locality to
coalesce multiple updates to the bitmap and indirect blocks into one
write before the syncer interval expires.  These operations all happen
in the background after the write() call, but once you hit the
I/O-per-second limit of the drive, eventually enough backlog builds up
to stall the application.  Also, if another update needs to be done to
a block that the syncer has queued for writing, that may also cause a
stall until the write completes.

If you hack the torrent software to create and pre-zero each file
before it starts downloading it, then each bitmap and indirect block
will probably only get written once during that operation and won't get
written again during the actual download, and zeroing the data blocks
will be sequential and fast.  During the download, the only writes will
be to the data blocks, so you might see something like a 3x performance
improvement.

From owner-freebsd-geom@FreeBSD.ORG Wed Mar 6 12:53:44 2013
Date: Wed, 6 Mar 2013 16:53:37 +0400
From: Lev Serebryakov
To: Don Lewis
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-ID: <1402477662.20130306165337@serebryakov.spb.ru>
In-Reply-To: <201303061001.r26A18n3015414@gw.catspoiler.org>

Hello, Don.
You wrote 6 March 2013, 14:01:08:

>> DL> With NCQ or TCQ, the drive can have a sizeable number of writes
>> DL> internally queued and is free to reorder them as it pleases even with
>> DL> write caching disabled, but if write caching is disabled it has to
>> DL> delay the notification of their completion until the data is on the
>> DL> platters so that UFS+SU can enforce the proper dependency ordering.
>>
>> But, again, performance would be terrible :( I've checked it.  On very
>> sparse multi-threaded patterns (multiple torrent downloads on a fast
>> channel in my simple home case, and I think things could be worse for a
>> big file server in an organization) and "simple" SATA drives it is
>> significantly worse in my experience :(

DL> I'm surprised that a typical drive would have enough onboard cache for
DL> write caching to help significantly in that situation.  Is the torrent

It is 5x64MiB in my case, oh, effectively 4x64MiB :)  Really, I could
repeat the experiment with some predictable and repeatable benchmark.
What in our ports could be used for a massively-parallel (16+ files),
random (with blocks like 64KiB and file sizes like 2+GiB) but
"repeatable" benchmark?

DL> software doing a lot of fsync() calls?  Those would essentially turn

Nope.  It tries to avoid fsync(), of course.

DL> Creating a file by writing it in random order is fairly expensive.
DL> Each time a new block is written by the application, UFS+SU has to
DL> first find a free block by searching the block bitmaps, mark that block
DL> as allocated, wait for that write of the bitmap block to complete,
DL> write the data to that block, wait for that to complete, and then write
DL> the block pointer to the inode or an indirect block.  Because of the
DL> random write ordering, there is probably not enough locality to
DL> coalesce multiple updates to the bitmap and indirect blocks into one
DL> write before the syncer interval expires.  These operations all happen
DL> in the background after the write() call, but once you hit the
DL> I/O-per-second limit of the drive, eventually enough backlog builds up
DL> to stall the application.  Also, if another update needs to be done to
DL> a block that the syncer has queued for writing, that may also cause a
DL> stall until the write completes.  If you hack the torrent software to
DL> create and pre-zero each file before it starts downloading it, then
DL> each bitmap and indirect block will probably only get written once
DL> during that operation and won't get written again during the actual
DL> download, and zeroing the data blocks will be sequential and fast.
DL> During the download, the only writes will be to the data blocks, so you
DL> might see something like a 3x performance improvement.

My client (transmission, from ports) is configured to do "real
preallocation" (not the sparse kind), but it doesn't help much.  It is
surely limited by disk I/O :(

But anyway, a torrent client is a bad benchmark if we start to speak
about real experiments to decide what could be improved in the FFS/GEOM
stack, as it is not very repeatable.

-- 
// Black Lion AKA Lev Serebryakov

From owner-freebsd-geom@FreeBSD.ORG Sat Mar 9 03:04:07 2013
Date: Fri, 8 Mar 2013 19:03:52 -0800 (PST)
From: Don Lewis
To: lev@FreeBSD.org
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-Id: <201303090303.r2933qqJ032330@gw.catspoiler.org>
In-Reply-To: <1402477662.20130306165337@serebryakov.spb.ru>

On 6 Mar, Lev Serebryakov wrote:
> Hello, Don.
> You wrote 6 March 2013, 14:01:08:
>
>>> DL> With NCQ or TCQ, the drive can have a sizeable number of writes
>>> DL> internally queued and is free to reorder them as it pleases even with
>>> DL> write caching disabled, but if write caching is disabled it has to
>>> DL> delay the notification of their completion until the data is on the
>>> DL> platters so that UFS+SU can enforce the proper dependency ordering.
>>>
>>> But, again, performance would be terrible :( I've checked it.  On very
>>> sparse multi-threaded patterns (multiple torrent downloads on a fast
>>> channel in my simple home case, and I think things could be worse for a
>>> big file server in an organization) and "simple" SATA drives it is
>>> significantly worse in my experience :(
>
> DL> I'm surprised that a typical drive would have enough onboard cache for
> DL> write caching to help significantly in that situation.  Is the torrent
>
> It is 5x64MiB in my case, oh, effectively 4x64MiB :)  Really, I could
> repeat the experiment with some predictable and repeatable benchmark.
> What in our ports could be used for a massively-parallel (16+ files),
> random (with blocks like 64KiB and file sizes like 2+GiB) but
> "repeatable" benchmark?

I don't happen to know of any benchmark software in ports for this, but
I haven't really looked.

> DL> software doing a lot of fsync() calls?  Those would essentially turn
>
> Nope.  It tries to avoid fsync(), of course.
>
> DL> Creating a file by writing it in random order is fairly expensive.
> DL> Each time a new block is written by the application, UFS+SU has to
> DL> first find a free block by searching the block bitmaps, mark that block
> DL> as allocated, wait for that write of the bitmap block to complete,
> DL> write the data to that block, wait for that to complete, and then write
> DL> the block pointer to the inode or an indirect block.  Because of the
> DL> random write ordering, there is probably not enough locality to
> DL> coalesce multiple updates to the bitmap and indirect blocks into one
> DL> write before the syncer interval expires.  These operations all happen
> DL> in the background after the write() call, but once you hit the
> DL> I/O-per-second limit of the drive, eventually enough backlog builds up
> DL> to stall the application.  Also, if another update needs to be done to
> DL> a block that the syncer has queued for writing, that may also cause a
> DL> stall until the write completes.  If you hack the torrent software to
> DL> create and pre-zero each file before it starts downloading it, then
> DL> each bitmap and indirect block will probably only get written once
> DL> during that operation and won't get written again during the actual
> DL> download, and zeroing the data blocks will be sequential and fast.
> DL> During the download, the only writes will be to the data blocks, so you
> DL> might see something like a 3x performance improvement.
>
> My client (transmission, from ports) is configured to do "real
> preallocation" (not the sparse kind), but it doesn't help much.  It is
> surely limited by disk I/O :(
>
> But anyway, a torrent client is a bad benchmark if we start to speak
> about real experiments to decide what could be improved in the FFS/GEOM
> stack, as it is not very repeatable.

I seem to recall you mentioning that the raid5 geom layer is doing a
lot of caching, presumably to coalesce writes.  If this causes the
responses to writes to be delayed too much, then the geom layer could
end up starved for writes because the vfs.hirunningspace limit will be
reached.  If this happens, you'll see threads waiting on wdrain.  You
could also monitor vfs.runningbufspace to see how close it is getting
to the limit.  If this is the problem, you might want to try cranking
up the value of vfs.hirunningspace to see if it helps.

One thing that doesn't seem to fit this theory is that if the raid5
layer is doing a lot of caching to try to do write coalescing, then I
wouldn't expect that the extra write-completion latency caused by
turning off write caching in the drives would make much of a
difference.

Another possibility is that you might be running into the 32-command
NCQ limit when write caching is off.  With write caching on, you can
probably shove a lot more write commands into the drive before being
blocked.  That might help the drive get slightly higher IOPS, but I
wouldn't expect a big difference.  It could also be that when you hit
the limit, you end up blocking read commands from getting sent to the
drives, which causes whatever is depending on the data to stall.  The
gstat command allows the queue length and the number of reads and
writes to be monitored, but I don't know of a way to monitor the number
of read and write commands that the drive has in its internal queue.

Something else to look at is what problems the delayed write-completion
notifications from the drives might cause in the raid5 layer itself.
Could they be preventing the raid5 layer from sending other I/O
commands to the drives?  Between the time a write command has been sent
to a drive and the time the drive reports the completion of the write,
what happens if something wants to touch that buffer?

What size writes does the application typically do?  What is the UFS
blocksize?  What is the raid5 stripe size?  With this access pattern,
you may get poor results if the stripe size is much greater than the
block and write sizes.

From owner-freebsd-geom@FreeBSD.ORG Sat Mar 9 12:08:28 2013
Date: Sat, 9 Mar 2013 16:08:17 +0400
From: Lev Serebryakov
To: Don Lewis
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-ID: <1809201254.20130309160817@serebryakov.spb.ru>
In-Reply-To: <201303090303.r2933qqJ032330@gw.catspoiler.org>

Hello, Don.
You wrote 9 March 2013, 7:03:52:

>> But anyway, a torrent client is a bad benchmark if we start to speak
>> about real experiments to decide what could be improved in the FFS/GEOM
>> stack, as it is not very repeatable.

DL> I seem to recall you mentioning that the raid5 geom layer is doing a
DL> lot of caching, presumably to coalesce writes.  If this causes the
DL> responses to writes to be delayed too much, then the geom layer could
DL> end up starved for writes because the vfs.hirunningspace limit will be
DL> reached.  If this happens, you'll see threads waiting on wdrain.  You
DL> could also monitor vfs.runningbufspace to see how close it is getting
DL> to the limit.  If this is the problem, you might want to try cranking up

Strangely enough, vfs.runningbufspace is always zero, even under load.
My geom_raid5 is configured to delay writes for up to 15 seconds...

DL> Something else to look at is what problems the delayed write-completion
DL> notifications from the drives might cause in the raid5 layer itself.
DL> Could they be preventing the raid5 layer from sending other I/O
DL> commands to the drives?  Between the time a write command has been sent

Nope, it should not.  I'm not 100% sure, as I picked these sources up
from the original author and they are rather cryptic, but I could not
see any throttling in them.

DL> to a drive and the time the drive reports the completion of the write,
DL> what happens if something wants to touch that buffer?

DL> What size writes does the application typically do?  What is the UFS

64K writes, 32K blocksize, 128K stripe size... Now I'm analyzing traces
from this device to understand the exact write patterns.

DL> blocksize?  What is the raid5 stripe size?  With this access pattern,
DL> you may get poor results if the stripe size is much greater than the
DL> block and write sizes.

-- 
// Black Lion AKA Lev Serebryakov
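A quick back-of-the-envelope check of those numbers, treating the reported
128K as the per-drive stripe unit and assuming five drives (taken from the
earlier "5x64MiB" remark, so an assumption): any write that covers less than
a full data stripe forces the RAID5 layer to read before it can recompute
parity, which is the read-calculate-write cycle mentioned at the start of the
thread.  A small illustrative calculation:

/*
 * Back-of-the-envelope illustration of the stripe-size question above.
 * Numbers from this thread: 64KiB application writes, 32KiB UFS blocks,
 * a 128KiB raid5 stripe; the 5-drive count is an assumption.  Any write
 * that covers less than a full data stripe forces a read before the new
 * parity can be computed.
 */
#include <stdio.h>

#define	KiB		1024UL
#define	NDRIVES		5		/* assumption */
#define	STRIPE_UNIT	(128 * KiB)	/* treated as the per-drive stripe */
#define	WRITE_SIZE	(64 * KiB)	/* typical application write */

int
main(void)
{
	unsigned long full_stripe = STRIPE_UNIT * (NDRIVES - 1);

	printf("full data stripe: %lu KiB\n", full_stripe / KiB);
	printf("a 64 KiB write covers %.1f%% of one stripe unit and %.1f%% "
	    "of a full stripe,\n", 100.0 * WRITE_SIZE / STRIPE_UNIT,
	    100.0 * WRITE_SIZE / full_stripe);
	printf("so each one needs a read-modify-write unless the raid5 "
	    "cache merges\nneighboring writes into a full stripe first.\n");
	return (0);
}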