Date:      Sat, 27 Sep 2008 04:03:29 -0700
From:      Jeremy Chadwick <koitsu@FreeBSD.org>
To:        Derek Kuliński <takeda@takeda.tk>
Cc:        freebsd-stable@FreeBSD.org, Clint Olsen <clint.olsen@gmail.com>
Subject:   Re: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY
Message-ID:  <20080927110329.GA50142@icarus.home.lan>
In-Reply-To: <588787159.20080927003750@takeda.tk>
References:  <20080921213426.GA13923@0lsen.net> <20080921215203.GC9494@icarus.home.lan> <20080921215930.GA25826@0lsen.net> <20080921220720.GA9847@icarus.home.lan> <249873145.20080926213341@takeda.tk> <20080927051413.GA42700@icarus.home.lan> <765067435.20080926223557@takeda.tk> <20080927064417.GA43638@icarus.home.lan> <588787159.20080927003750@takeda.tk>

On Sat, Sep 27, 2008 at 12:37:50AM -0700, Derek Kuliński wrote:
> Friday, September 26, 2008, 11:44:17 PM, you wrote:
> 
> >> As far as I know (at least ideally, when write caching is disabled)
> 
> > Re: write caching: wheelies and burn-outs in empty parking lots
> > detected.
> 
> > Let's be realistic.  We're talking about ATA and SATA hard disks, hooked
> > up to on-board controllers -- these are the majority of users.  Those
> > with ATA/SATA RAID controllers (not on-board RAID either; most/all of
> > those do not let you disable drive write caching) *might* have a RAID
> > BIOS menu item for disabling said feature.
> 
> > FreeBSD atacontrol does not let you toggle such features (although "cap"
> > will show you if the feature is available and if it's enabled or not).
> 
> > Users using SCSI will most definitely have the ability to disable
> > said feature (either via SCSI BIOS or via camcontrol).  But the majority
> > of users are not using SCSI disks, because the majority of users are not
> > going to spend hundreds of dollars on a controller followed by hundreds
> > of dollars for a small (~74GB) disk.
> 
> > Regardless of all of this, end-users should, in no way shape or form,
> > be expected to go to great lengths to disable their disk's write cache.
> > They will not, I can assure you.  Thus, we must assume: write caching
> > on a disk will be enabled, period.  If a filesystem is engineered with
> > that fact ignored, then the filesystem is either 1) worthless, or 2)
> > serves a very niche purpose and should not be the default filesystem.
> 
> > Do we agree?
> 
> Yes, but...
> 
> In the link you sent me, someone mentioned that write caching
> always creates problems, regardless of OS or filesystem.
> 
> There's more below.
> 
> >> the data should always be consistent, and all fsck is supposed to be
> >> doing is freeing unreferenced blocks that were allocated.
> > fsck does a heck of a lot more than that, and there's no guarantee
> > that's all fsck is going to do on a UFS2+SU filesystem.  I'm under the
> > impression it does a lot more than just looking for unref'd blocks.
> 
> Yes, fsck does a lot more than that. But the whole point of soft
> updates is to reduce fsck's work to freeing blocks that were allocated
> but never referenced.
> 
> Anyway, maybe my information is wrong, though the funny thing is that
> Soft Updates was covered in one of my lectures on Operating Systems.
> 
> Apparently the goal of Soft Updates is to always enforce these rules
> in a very efficient manner, by reordering the writes:
> 1. Never point to a data structure before initializing it
> 2. Never reuse a structure before nullifying pointers to it
> 3. Never reset the last pointer to a live structure before setting a new one
> 4. Always mark free-block bitmap entries as used before making the
>    directory entry point to them
> 
> The problem comes with disks that, for performance reasons, cache the
> data and then write it back to the disk in a different order.
> I think that's why disabling the write cache is recommended.
> If a disk reorders the writes, it renders soft updates
> useless.
> 
> But if the write order is preserved, all data always remains
> consistent; the only things that might appear are blocks that were
> marked as used but had nothing pointing to them yet.
> 
> So (in the ideal situation, when nothing interferes) all fsck needs to
> do is scan the filesystem and deallocate those blocks.
> 
> > The system is already up and the filesystems mounted.  If the error in
> > question is of such severity that it would impact a user's ability to
> > reliably use the filesystem, how do you expect constant screaming on
> > the console will help?  A user won't know what it means; there is
> > already evidence of this happening (re: mysterious ATA DMA errors which
> > still cannot be figured out[6]).
> 
> > IMHO, a dirty filesystem should not be mounted until it's been fully
> > analysed/scanned by fsck.  So again, people are putting faith into
> > UFS2+SU despite actual evidence proving that it doesn't handle all
> > scenarios.
> 
> Yes, I think the background fsck should be disabled by default, with a
> possibility to enable it if the user is sure that nothing will
> interfere with soft updates.
> 
> > The problem here is that when it was created, it was sort of an
> > "experiment".  Now, when someone installs FreeBSD, UFS2 is the default
> > filesystem used, and SU are enabled on every filesystem except the root
> > fs.  Thus, we have now put ourselves into a situation where said
> > feature ***must*** be reliable in all cases.
> 
> I think in the worst case it is just as reliable as if it weren't
> enabled (the only danger is the background fsck).
> 
> > You're also forgetting a huge focus of SU -- snapshots[1].  However, there
> > are more than enough facts on the table at this point concluding that
> > snapshots are causing more problems[7] than previously expected.  And
> > there's further evidence filesystem snapshots shouldn't even be used in
> > this way[8].
> 
> There's not much to argue about there.
> 
> >> Also, if I remember correctly, PJD said that gjournal is performing
> >> much better with small files, while softupdates is faster with big
> >> ones.
> 
> > Okay, so now we want to talk about benchmarks.  The benchmarks you're
> > talking about are in two places[2][3].
> 
> > The benchmarks pjd@ provided were very basic/simple, which I feel is
> > good, because the tests were realistic (common tasks people will do).
> > The benchmarks mckusick@ provided for UFS2+SU were based on SCSI
> > disks, which is... interesting to say the least.
> 
> > Bruce Evans responded with some more data[4].
> 
> > I particularly enjoy this quote in his benchmark: "I never found the
> > exact cause of the slower readback ...", followed by (plausible)
> > speculations as to why that is.
> 
> > I'm sorry that I sound like such a hard-ass on this matter, but there is
> > a glaring fact that people seem to be overlooking intentionally:
> 
> > Filesystems have to be reliable; data integrity is focus #1, and cannot
> > be sacrificed.  Users and administrators *expect* a filesystem to be
> > reliable.  No one is going to keep using a filesystem if it has
> > disadvantages which can result in data loss or "waste of administrative
> > time" (which I believe is what's occurring here).
> 
> > Users *will* switch to another operating system that has filesystems
> > which were not engineered/invented with these features in mind.  Or,
> > they can switch to another filesystem assuming the OS offers one which
> > performs equally as good/well and is guaranteed to be reliable --
> > and that's assuming the user wants to spend the time to reformat and
> > reinstall just to get that.
> 
> I wasn't trying to argue about that. Perhaps my assumption is wrong,
> but I believe that the known problems with Soft Updates, in the
> worst case, leave the system as reliable as it was without them.
> 
> > In the case of "bit rot" (e.g. drive cache going bad silently, bad
> > cables, or other forms of low-level data corruption), a filesystem is
> > likely not to be able to cope with this (but see below).
> 
> > A common rebuttal here would be: "so use UFS2 without soft updates".
> > Excellent advice!  I might consider it myself!  But the problem is that
> > we cannot expect users to do that.  Why?  Because the defaults chosen
> > during sysinstall are to use SU for all filesystems except root.  If SU
> > is not reliable (or is "reliable in most cases" -- same thing if you ask
> > me), then it should not be enabled by default.  I think we (FreeBSD)
> > might have been a bit hasty in deciding to choose that as a default.
> 
> > Next: a system locking up (or a kernel panic) should result in a dirty
> > filesystem.  That filesystem should be *fully recoverable* from that
> > kind of error, with no risk of data loss (but see below).
> 
> > (There is the obvious case where a file is written to the disk, and the
> > disk has not completed writing the data from its internal cache to the
> > disk itself (re: write caching); if power is lost, the disk may not have
> > finished writing the cache to disk.  In this case, the file is going to
> > be sparse -- there is absolutely nothing that can be done about this
> > with any filesystem, including ZFS (to my knowledge).  This situation
> > is acceptable; nature of the beast.)
> 
> > The filesystem should be fully analysed and any errors repaired (either
> > with user interaction or automatically -- I'm sure it depends on the
> > kind of error) **before** the filesystem is mounted.
> 
> > This is where SU gets in the way.  The filesystem is mounted and the
> > system is brought up + online 60 seconds before the fsck starts.  The
> > assumption made is that the errors in question will be fully recoverable
> > by an automatic fsck, which as this thread proves, is not always the
> > case.
> 
> That's why I think background fsck should be disabled by default.
> Though I still don't think that soft updates hurt anything (except
> perhaps performance).
> 
> > ZFS is the first filesystem, to my knowledge, which provides 1) a
> > reliable filesystem, 2) detection of filesystem problems in real-time or
> > during scrubbing, 3) repair of problems in real-time (assuming raidz1 or
> > raidz2 are used), and 4) does not need fsck.  This makes ZFS powerful.
> 
> > "So use ZFS!"  A good piece of advice -- however, I've already had
> > reports from users that they will not consider ZFS for FreeBSD at this
> > time.  Why?  Because ZFS on FreeBSD can panic the system easily due to
> > kmem exhaustion.  Proper tuning can alleviate this problem, but users do
> > not want to have to "tune" their system to get stability (and I feel
> > this is a very legitimate argument).
> 
> > Additionally, FreeBSD doesn't offer ZFS as a filesystem during
> > installation.  PC-BSD does, AFAIK.  So on FreeBSD, you have to go
> > through a bunch of rigmarole[5] to get it to work (and doing this
> > after-the-fact is a real pain in the rear -- believe me, I did it this
> > weekend.)
> 
> > So until both of these ZFS-oriented issues can be dealt with, some
> > users aren't considering it.
> 
> > This is the reality of the situation.  I don't think what users and
> > administrators want is unreasonable; they may be rough demands, but
> > that's how things are in this day and age.
> 
> > Have I provided enough evidence?  :-)
> 
> Yes, but as far as I understand it's not as bad as you think :)
> I could be wrong though.
> 
> I 100% agree on disabling background fsck, but I don't think soft
> updates are making the system any less reliable than it would be
> without it.

With regards to all you've said:

Thank you for these insights.  Everything you and Erik have said has
been quite educational, and I greatly appreciate it.  Always good to
learn from people who know more!  :-)
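
To make the ordering argument quoted above concrete, here is a toy model
(not UFS code; the three write names and the classification are
assumptions made up for illustration).  It crashes a hypothetical
"create a file" operation after every possible number of completed
writes and checks what is left on disk.  With the order soft updates
would issue, a crash leaves at worst an allocated-but-unreferenced
block; if the drive reorders the writes, a directory entry can end up
pointing at something that was never allocated or initialized.

```python
from itertools import permutations

# Hypothetical three-step file creation, in the order soft updates would issue it.
ORDERED_WRITES = ["mark_block_used", "init_inode", "add_dir_entry"]


def state_after(writes):
    """Return the on-disk state once the given writes have reached the platter."""
    return {
        "bitmap_used": "mark_block_used" in writes,
        "inode_initialized": "init_inode" in writes,
        "dir_entry_present": "add_dir_entry" in writes,
    }


def classify(state):
    """Classify a post-crash state as 'consistent', 'leaked', or 'corrupt'."""
    referenced = state["dir_entry_present"]
    if referenced and not (state["bitmap_used"] and state["inode_initialized"]):
        return "corrupt"  # directory points at an unallocated or uninitialized object
    if state["bitmap_used"] and not referenced:
        return "leaked"   # allocated but unreferenced: fsck only has to free it
    return "consistent"


def crash_outcomes(write_order):
    """Outcome of a crash after each prefix (0..n writes) of write_order."""
    return [
        classify(state_after(write_order[:n]))
        for n in range(len(write_order) + 1)
    ]


# Crash points when the disk honours the issued order.
ordered = crash_outcomes(ORDERED_WRITES)

# Crash points for every order a reordering write cache could choose.
reordered = {
    order: crash_outcomes(list(order))
    for order in permutations(ORDERED_WRITES)
}
unsafe_orders = [order for order, outcomes in reordered.items() if "corrupt" in outcomes]

if __name__ == "__main__":
    print("issued order:", ordered)
    for order in unsafe_orders:
        print("unsafe if disk writes:", " -> ".join(order))
```

The model is deliberately tiny: it ignores file data, rules 2 and 3,
and everything a real fsck checks.  It only illustrates why preserved
write order limits the damage to leaked blocks.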

I believe we're in overall agreement with regards to background_fsck
(should be disabled by default).  I'd file a PR for this sort of thing,
but it almost seems like something that should go to the (private)
developers list for discussion first.
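
In the meantime, anyone who wants this behaviour locally can set it
themselves.  A minimal sketch, assuming a stock /etc/defaults/rc.conf
where background_fsck is "YES" (and background_fsck_delay is "60",
which would match the 60 seconds mentioned earlier); check your own
release's defaults before relying on this:

```sh
# /etc/rc.conf
# Run fsck in the foreground at boot, before filesystems are mounted,
# instead of in the background after the system is already up.
background_fsck="NO"
```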

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



