From owner-freebsd-stable@FreeBSD.ORG  Sat Apr  2 08:23:18 2011
In-Reply-To: <201104020335.p323Zp8Q018666@apollo.backplane.com>
References: <87d3l6p5xv.fsf@cosmos.claresco.hr>
	<AANLkTi=kEyz-mKLzdV8LAf91ZhMTP8gLKs=3Eu5WD8mh@mail.gmail.com>
	<874o6ip0ak.fsf@cosmos.claresco.hr>
	<7b15d37d28f8ddac9eb81e4390231c96.HRCIM@webmail.1command.com>
	<AANLkTi=KEwmm1hM6Z=r_SWUAn9KhUrkTVzfF6VmqQauW@mail.gmail.com>
	<14c23d4bf5b47a7790cff65e70c66151.HRCIM@webmail.1command.com>
	<AANLkTi=6pqRwJ96Lg=603cYg_f8QUXkg8aXtbjbYpFrV@mail.gmail.com>
	<201104020335.p323Zp8Q018666@apollo.backplane.com>
Date: Sat, 2 Apr 2011 09:57:02 +0200
Message-ID: <BANLkTik9aN7TZ_pSZ1b=nMeXO-mW-fYuUA@mail.gmail.com>
From: Olivier Smedts <olivier@gid0.org>
To: freebsd-stable@freebsd.org
Subject: Re: Constant rebooting after power loss

2011/4/2 Matthew Dillon <dillon@apollo.backplane.com>:
>     The core of the issue here comes down to two things:
>
>     First, a power loss to the drive will cause the drive's dirty write cache
>     to be lost; that data will not make it to disk.  Nor do you really want
>     to turn off write caching on the physical drive.  Well, you CAN turn it
>     off, but if you do performance will become so bad that there's no point.
>     So turning off the write caching is really a non-starter.
>
>     The solution to this first item is for the OS/filesystem to issue a
>     disk flush command to the drive at appropriate times.  If I recall
>     correctly, the ZFS implementation in FreeBSD *DOES* do this for
>     transaction groups, which guarantees that a prior transaction group is
>     fully synced before a new one starts running (HAMMER in DragonFly also
>     does this).
>     (Just getting an 'ack' from the write transaction over the SATA bus only
>     means the data made it to the drive's cache, not that it made it to
>     the platter).

Amen !
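
From userland, the nearest equivalent to that flush is fsync(2). A minimal sketch in Python, assuming a POSIX system; durable_write is a made-up helper name, not anything from FreeBSD or ZFS, and whether the data really reaches the platter still depends on the filesystem and the drive honoring the flush:

```python
import os


def durable_write(path, data):
    """Write data to path and ask the OS to push it to stable storage.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Returning from write() only means the kernel has the data;
        # fsync() asks for it to be pushed down to the device.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Also sync the containing directory so the new entry itself is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A call such as durable_write("/tmp/example.dat", b"payload") returns only after both fsync calls have returned.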

>     I'm not sure about UFS vis-a-vis the recent UFS logging features...
>     it might be an option but I don't know if it is a default.  Perhaps
>     someone can comment on that.
>
>     One last note here.  Many modern drives have very large RAM caches.
>     OCZ's SSDs have something like 256MB write caches and many modern HDs
>     now come with 32MB and 64MB caches.  Aged drives with lots of relocated
>     sectors and bit errors can also take a very long time to perform writes
>     on certain sectors.  So these large caches take time to drain and one
>     can't really assume that an acknowledged write to disk will actually
>     make it to the disk under adverse circumstances any more.  All sorts
>     of bad things can happen.
>
>     Finally, the drives don't order their writes to the platter (you can
>     set a bit to tell them to, but like many similar bits in the past there
>     is no real guarantee that the drives will honor it).  So if two
>     transactions do not have a disk flush command in between them it is
>     possible for data from the second transaction to commit to the platter
>     before all the data from the first transaction commits to the platter.
>     Or worse, for the non-transactional data to update out of order relative
>     to the transactional data which was supposed to commit first.
>
>     Hence IMHO the OS/filesystem must use the disk flush command in such
>     situations for good reliability.
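
To illustrate the ordering point, here is a rough sketch of a flush acting as a barrier between the data of a transaction and its commit marker. The record layout, MARKER, commit and replay are all invented for this example and are not how ZFS, HAMMER or UFS lay things out; os.fsync stands in for the disk flush command and only helps if the layers below honor it:

```python
import os
import struct
import zlib

MARKER = b"CMT!"                # hypothetical commit marker
HEADER = struct.Struct("<II")   # payload length, CRC32 of payload


def commit(fd, payload):
    """Append one transaction to a log opened with O_APPEND.

    Assumes small writes that complete in a single os.write call.
    """
    os.write(fd, HEADER.pack(len(payload), zlib.crc32(payload)) + payload)
    os.fsync(fd)                # barrier: data must be down before the marker
    os.write(fd, MARKER)
    os.fsync(fd)                # barrier: marker down before the next transaction


def replay(path):
    """Return the payloads of all fully committed transactions, in order."""
    with open(path, "rb") as f:
        blob = f.read()
    committed = []
    pos = 0
    while pos + HEADER.size <= len(blob):
        length, crc = HEADER.unpack_from(blob, pos)
        start = pos + HEADER.size
        end = start + length
        if blob[end:end + len(MARKER)] != MARKER:
            break               # torn or uncommitted record: stop here
        payload = blob[start:end]
        if zlib.crc32(payload) != crc:
            break               # marker present but data is damaged
        committed.append(payload)
        pos = end + len(MARKER)
    return committed
```

Without the first fsync the marker could reach the platter before the data it vouches for, which is exactly the reordering described above.
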
>
>     --
>
>     The second problem is that a physical loss of power to the drive can
>     cause the drive to physically lose one or more sectors, and can even
>     effectively destroy the drive (even with the fancy auto-park)... if the
>     drive happens to be in the middle of a track write-back when power is
>     lost it is possible to lose far more than a single sector, including
>     sectors unrelated to recent filesystem operations.
>
>     The only solution to #2 is to make sure your machines (or at least the
>     drives if they happen to be in external enclosures) are connected to
>     a UPS and that the machines are communicating with the UPS via
>     something like the "apcupsd" port.  AND also that you test to make
>     sure the machines properly shut themselves down when AC is lost before
>     the UPS itself runs out of battery time.  After all, a UPS won't help
>     if the machines don't at least idle their drives before power is lost!!!
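
The kind of check a UPS monitoring daemon has to make can be sketched as below. This is only an illustration: should_shut_down and its thresholds are hypothetical and are not apcupsd's actual logic or configuration names.

```python
def should_shut_down(on_battery, charge_percent, runtime_minutes,
                     min_charge_percent=20.0, min_runtime_minutes=5.0):
    """Decide whether to start a clean shutdown.

    Hypothetical policy: while on battery, shut down as soon as either the
    remaining charge or the estimated runtime drops to its threshold, so the
    drives are idle before the UPS runs dry.
    """
    if not on_battery:
        return False
    return (charge_percent <= min_charge_percent
            or runtime_minutes <= min_runtime_minutes)
```

Whatever the thresholds, the point about testing stands: pull the AC plug on purpose and check that the shutdown completes while the battery still has charge.
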
>
>     I learned this lesson the hard way about 3 years ago.  I had something
>     like a dozen drives in two RAID arrays doing heavy write activity and
>     lost physical power and several of the drives were totally destroyed,
>     with thousands of sector errors.  Not just one or two... thousands.
>
>     (It is unclear how SSDs react to physical loss of power during heavy
>     writing activity.  Theoretically while they will certainly lose their
>     write cache they shouldn't wind up with any read errors).
>
>                                         -Matt
>


-- 
Olivier Smedts                                                 _
                                        ASCII ribbon campaign ( )
e-mail: olivier@gid0.org        - against HTML email & vCards  X
www: http://www.gid0.org    - against proprietary attachments / \

  "There are only 10 kinds of people in the world:
  those who understand binary,
  and those who don't."