From owner-freebsd-stable@FreeBSD.ORG  Thu Jul 14 23:01:04 2005
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
X-Original-To: freebsd-stable@freebsd.org
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 3AEE316A41C;
	Thu, 14 Jul 2005 23:01:04 +0000 (GMT)
	(envelope-from mkb@mkbuelow.net)
Received: from luzifer.incubus.de (incubus.de [80.237.207.83])
	by mx1.FreeBSD.org (Postfix) with ESMTP id B86DF43D46;
	Thu, 14 Jul 2005 23:01:03 +0000 (GMT)
	(envelope-from mkb@mkbuelow.net)
Received: from drjekyll.mkbuelow.net (p54AA90CE.dip0.t-ipconnect.de
	[84.170.144.206])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by luzifer.incubus.de (Postfix) with ESMTP id D55782EADA;
	Fri, 15 Jul 2005 01:03:54 +0200 (CEST)
Received: from drjekyll.mkbuelow.net (mkb@localhost.mkbuelow.net [127.0.0.1])
	by drjekyll.mkbuelow.net (8.13.3/8.13.3) with ESMTP id
	j6EN1CmC037942; Fri, 15 Jul 2005 01:01:12 +0200 (CEST)
	(envelope-from mkb@drjekyll.mkbuelow.net)
Message-Id: <200507142301.j6EN1CmC037942@drjekyll.mkbuelow.net>
From: Matthias Buelow <mkb@incubus.de>
To: Lowell Gilbert <freebsd-stable-local@be-well.ilk.org>
In-Reply-To: Message from Lowell Gilbert
	<freebsd-stable-local@be-well.ilk.org> 
	of "14 Jul 2005 18:09:07 EDT." <447jftrqf0.fsf@be-well.ilk.org> 
X-Mailer: MH-E 7.84; nmh 1.0.4; XEmacs 21.4 (patch 17)
Date: Fri, 15 Jul 2005 01:01:12 +0200
Sender: mkb@mkbuelow.net
Cc: freebsd-stable@freebsd.org, freebsd-questions@freebsd.org
Subject: Re: dangerous situation with shutdown process 
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Jul 2005 23:01:04 -0000

Lowell Gilbert <freebsd-stable-local@be-well.ilk.org> writes:

>Jon Dama <jd@ugcs.caltech.edu> writes:
>> however, journaling fares no better, and request barriers do nothing to
>> solve the problem.
>
>I had assumed that the sequence of operations in a journal would be
>idempotent.  Is that a reasonable design criterion?  [If it is, then
>it would make up for the fact that you can't build a reliable
>transaction gate.  That is, you would just have to go back far enough
>that you *know* all of the needed journal is within the range you will
>replay.  But even then, the journal would need to be on a separate
>medium, one that doesn't have the "lying to you about transaction
>completion" problem.]

No, it needn't. It is sufficient that the journal entries for a block of
updates that are to follow are on disk before the updates are made.
That's all. This can be achieved by inserting a write barrier request in
between the journal writes and the actual data/metadata writes. The
block driver will, when it sees the barrier, a) write out all requests
in its queue that it got before the barrier, and b) flush the cache so
that they will not get intermixed by the drive with the following data
writes.
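
To illustrate the ordering guarantee, here is a minimal sketch in Python.
It is a toy model, not driver code: the request names, the BARRIER marker
and the drive_order() function are all made up for this example. It
assumes the drive may reorder requests freely, except across a barrier.

```python
import random

BARRIER = "BARRIER"  # hypothetical marker, stands in for a barrier request


def drive_order(queue, rng):
    """Return one order in which a reordering drive might commit `queue`.

    Requests may be shuffled freely within a segment, but no request
    is allowed to cross a BARRIER marker.
    """
    committed, segment = [], []
    for req in queue:
        if req == BARRIER:
            rng.shuffle(segment)       # drive reorders within the segment
            committed.extend(segment)  # everything before the barrier goes first
            segment = []
        else:
            segment.append(req)
    rng.shuffle(segment)
    committed.extend(segment)
    return committed


def journal_first(order):
    """True if every journal write lands before every metadata write."""
    last_journal = max(i for i, r in enumerate(order) if r.startswith("journal"))
    first_meta = min(i for i, r in enumerate(order) if r.startswith("meta"))
    return last_journal < first_meta


queue = ["journal:1", "journal:2", "journal:commit", BARRIER,
         "meta:inode", "meta:bitmap", "meta:dir"]

# With the barrier in place, no reordering puts metadata ahead of the journal.
for seed in range(100):
    assert journal_first(drive_order(queue, random.Random(seed)))
```

Without the BARRIER entry the same queue is one big segment, and the
drive is free to commit a metadata write before the journal is complete.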

What could happen now when the power goes away at an inopportune moment?
[Note that I'm only talking about filesystem integrity, not general
data loss.]

* If power goes away before the journal is written, nothing happens.
* If the journal is partially written, and power goes away, it will
  be partially replayed at boot but the filesystem will be consistent.
* If power goes away, when the journal is fully written, but no
  metadata updates have been performed, they will be performed at
  boot and everything is as if the full request has completed before
  power went out.
* If power goes away when the journal is fully written, and parts of
  the metadata updates have been written, those updates will be performed
  twice (once more at reboot) but that won't matter since these operations
  are idempotent. The remaining metadata updates are then performed
  once, at reboot.
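
The four cases above can be sketched as a small simulation. This is a
toy model under stated assumptions, not how any real filesystem lays out
its journal: the disk is a dict, each journal entry is an absolute
"set block to value" write (which is what makes replay idempotent), and
crash_and_recover() is a name invented for this example.

```python
def replay(disk, journal):
    """Apply every journalled update; absolute writes make this idempotent."""
    for block, value in journal:
        disk[block] = value
    return disk


def crash_and_recover(initial, updates, journal_written, metadata_written):
    """Simulate power loss, then journal replay at boot.

    journal_written:  how many journal entries reached the disk.
    metadata_written: how many in-place updates reached the disk; these
                      only count once the journal is complete, which is
                      what the barrier guarantees.
    """
    journal = updates[:journal_written]
    disk = dict(initial)
    if journal_written == len(updates):
        for block, value in updates[:metadata_written]:
            disk[block] = value
    return replay(disk, journal)


initial = {"inode": 0, "bitmap": 0, "dir": 0}
updates = [("inode", 1), ("bitmap", 1), ("dir", 1)]
done = {"inode": 1, "bitmap": 1, "dir": 1}

# Power lost before any journal entry was written: nothing happens.
assert crash_and_recover(initial, updates, 0, 0) == initial

# Journal complete, any number of metadata updates already on disk:
# replay redoes some of them, and the end state is always the same.
for written in range(len(updates) + 1):
    assert crash_and_recover(initial, updates, len(updates), written) == done
```

A partially written journal replays only the entries that made it to
disk; the sketch does not model what makes that prefix consistent.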

So where is the need for the journal to be on a separate medium?
The only thing that matters is that no metadata updates will be written
before the journal has been written, and flushing the disk cache at a
barrier will ensure this. Note that the disk doesn't even have to flush
the cache when it receives that command; it only has to ensure that
it performs all requests received before the flush ahead of those that
come afterwards.

>I have no idea how "designed to be used with the write-back cache
>enabled" could affect the operating life of the disk.  

If you disable the write cache, you get much more wear and tear due
to much more seeking.
If I observe a 5x performance degradation for sequential writes when
the cache is disabled (i.e., with no cache overwriting effects), I would
think that the drive is also doing more seek operations by some factor
>1; otherwise the performance degradation cannot be explained.
[Besides, the disk gets really loud when the cache is disabled.]

mkb.