From: Matthias Buelow
To: Lowell Gilbert
Cc: freebsd-stable@freebsd.org, freebsd-questions@freebsd.org
Subject: Re: dangerous situation with shutdown process
Date: Fri, 15 Jul 2005 01:01:12 +0200
Message-Id: <200507142301.j6EN1CmC037942@drjekyll.mkbuelow.net>
In-Reply-To: Message from Lowell Gilbert of "14 Jul 2005 18:09:07 EDT." <447jftrqf0.fsf@be-well.ilk.org>

Lowell Gilbert writes:
>Jon Dama writes:
>> however, journaling fairs no better, and request barriers do nothing to
>> solve the problem.
>
>I had assumed that the sequence of operations in a journal would be
>idempotent.
>Is that a reasonable design criterion?  [If it is, then
>it would make up for the fact that you can't build a reliable
>transaction gate.  That is, you would just have to go back far enough
>that you *know* all of the needed journal is within the range you will
>replay.  But even then, the journal would need to be on a separate
>medium, one that doesn't have the "lying to you about transaction
>completion" problem.]

No, it needn't.  It is sufficient that the journal entries for a block
of updates that are to follow are on disk before the updates are made.
That's all.  This can be achieved by inserting a write barrier request
in between the journal writes and the actual data/metadata writes.
The block driver will, when it sees the barrier, a) write out all
requests in its queue that it got before the barrier, and b) flush
the cache so that they will not get intermixed by the drive with the
following data writes.

What could happen now when the power goes away at an inopportune
moment?  [Note that I'm only talking about filesystem integrity, not
general data loss.]

* If power goes away before the journal is written, nothing happens.

* If the journal is partially written, and power goes away, it will
  be partially replayed at boot but the filesystem will be consistent.

* If power goes away when the journal is fully written, but no
  metadata updates have been performed, they will be performed at
  boot and everything is as if the full request had completed before
  power went out.

* If power goes away when the journal is fully written, and parts of
  the metadata updates have been written, those updates will be
  performed twice (once more at reboot) but that won't matter since
  these operations are idempotent.  The remaining metadata updates
  are then performed once, at reboot.

So where is the need for the journal to be on a separate medium?
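
The four power-loss cases above can be sketched as a toy model.  This
is only an illustration, not FreeBSD code: the disk is a Python dict,
each journal entry is assumed to be an absolute "block N now holds V"
assignment (which is what makes replay idempotent), and the names
apply_entry, replay and crash_and_recover are made up for the example.

```python
def apply_entry(meta, entry):
    """Apply one journal entry: an absolute assignment, so applying it
    twice leaves the same state as applying it once."""
    block, value = entry
    meta[block] = value


def replay(journal, meta):
    """Boot-time recovery: re-apply every journal entry found on disk."""
    for entry in journal:
        apply_entry(meta, entry)
    return meta


def crash_and_recover(updates, journal_written, meta_written):
    """Simulate power loss after `journal_written` journal entries and
    `meta_written` in-place updates reached the disk, then replay."""
    # The write barrier: no in-place update may reach the disk before
    # the whole journal for this block of updates is on disk.
    if meta_written > 0 and journal_written < len(updates):
        raise ValueError("barrier violated: update written before journal")
    journal = updates[:journal_written]
    meta = {}
    for entry in updates[:meta_written]:
        apply_entry(meta, entry)
    return replay(journal, meta)


# Hypothetical block numbers and contents, for illustration only.
updates = [(7, "size=4096"), (3, "bitmap=1011")]
full = dict(updates)

states = {
    "no journal": crash_and_recover(updates, 0, 0),
    "partial journal": crash_and_recover(updates, 1, 0),
    "journal only": crash_and_recover(updates, 2, 0),
    "journal + partial updates": crash_and_recover(updates, 2, 1),
}
for case, state in states.items():
    print(case, state)
```

In this model the last two cases both end in the full state, and the
only combination the model rejects is the one the barrier forbids: an
in-place update reaching the disk ahead of its journal.
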
The only thing that matters is that no metadata updates will be
written before the journal has been written, and flushing the disk
cache at a barrier will ensure this.  Note that the disk doesn't even
have to flush the cache when it receives that command, it only has to
ensure that it'll perform all requests before the flush in front of
those that come afterwards.

>I have no idea what "designed to be used with the write-back cache
>enabled" could affect the operating life of the disk.

If you disable the write cache, you get much higher wear and tear due
to much more seeking.  If I observe a 5x performance degradation when
the cache is disabled, for sequential writes (i.e., no cache
overwriting effects), I would think that I also have a factor >1 of
increased seeking operations in the drive, otherwise the performance
degradation cannot be explained.  [Besides, the disk gets really loud
when the cache is disabled.]

mkb.