From: Olivier Smedts
To: freebsd-stable@freebsd.org
Date: Sat, 2 Apr 2011 09:57:02 +0200
Subject: Re: Constant rebooting after power loss
In-Reply-To: <201104020335.p323Zp8Q018666@apollo.backplane.com>

2011/4/2 Matthew Dillon:
>    The core of the issue here comes down to two things:
>
>    First, a power loss to the drive will cause the drive's dirty write
>    cache to be lost; that data will not make it to disk.  Nor do you
>    really want to turn off write caching on the physical drive.
>    Well, you CAN turn it off, but if you do, performance will become so
>    bad that there's no point.  So turning off the write caching is
>    really a non-starter.
>
>    The solution to this first item is for the OS/filesystem to issue a
>    disk flush command to the drive at appropriate times.  If I recall,
>    the ZFS implementation in FreeBSD *DOES* do this for transaction
>    groups, which guarantees that a prior transaction group is fully
>    synced before a new one starts running (HAMMER in DragonFly also
>    does this).  (Just getting an 'ack' from the write transaction over
>    the SATA bus only means the data made it to the drive's cache, not
>    that it made it to the platter).

Amen!

>    I'm not sure about UFS vis-a-vis the recent UFS logging features...
>    it might be an option but I don't know if it is a default.  Perhaps
>    someone can comment on that.
>
>    One last note here.  Many modern drives have very large RAM caches.
>    OCZ's SSDs have something like 256MB write caches and many modern
>    HDs now come with 32MB and 64MB caches.  Aged drives with lots of
>    relocated sectors and bit errors can also take a very long time to
>    perform writes on certain sectors.  So these large caches take time
>    to drain and one can't really assume that an acknowledged write to
>    disk will actually make it to the disk under adverse circumstances
>    any more.  All sorts of bad things can happen.
>
>    Finally, the drives don't order their writes to the platter (you can
>    set a bit to tell them to, but like many similar bits in the past
>    there is no real guarantee that the drives will honor it).
>    So if two transactions do not have a disk flush command in between
>    them, it is possible for data from the second transaction to commit
>    to the platter before all the data from the first transaction
>    commits to the platter.  Or worse, for the non-transactional data to
>    update out of order relative to the transactional data which was
>    supposed to commit first.
>
>    Hence IMHO the OS/filesystem must use the disk flush command in such
>    situations for good reliability.
>
>    --
>
>    The second problem is that a physical loss of power to the drive can
>    cause the drive to physically lose one or more sectors, and can even
>    effectively destroy the drive (even with the fancy auto-park)... if
>    the drive happens to be in the middle of a track write-back when
>    power is lost it is possible to lose far more than a single sector,
>    including sectors unrelated to recent filesystem operations.
>
>    The only solution to #2 is to make sure your machines (or at least
>    the drives if they happen to be in external enclosures) are
>    connected to a UPS and that the machines are communicating with the
>    UPS via something like the "apcupsd" port.  AND also that you test
>    to make sure the machines properly shut themselves down when AC is
>    lost before the UPS itself runs out of battery time.  After all, a
>    UPS won't help if the machines don't at least idle their drives
>    before power is lost!!!
>
>    I learned this lesson the hard way about 3 years ago.  I had
>    something like a dozen drives in two RAID arrays doing heavy write
>    activity and lost physical power and several of the drives were
>    totally destroyed, with thousands of sector errors.  Not just one or
>    two... thousands.
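
For what it's worth, the "flush between transactions" rule quoted above
can be sketched from user space.  This is only an illustration in Python,
not how ZFS or HAMMER implement it: commit_in_order is a hypothetical
helper, and the sketch assumes that fsync() on the filesystem in use
really ends up as a flush command to the drive, which is not a given
everywhere.

```python
import os
import tempfile


def commit_in_order(path, transactions):
    """Append each transaction and flush before the next one may start.

    Hypothetical helper for illustration only. Returns the number of
    flushes issued (one per transaction).
    """
    flushes = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for payload in transactions:
            view = memoryview(payload)
            # os.write may write fewer bytes than requested, so loop.
            while len(view) > 0:
                written = os.write(fd, view)
                view = view[written:]
            # Barrier: do not begin the next transaction until the OS
            # reports this one as synced. Assumption: the filesystem
            # turns this into a drive cache flush.
            os.fsync(fd)
            flushes += 1
    finally:
        os.close(fd)
    return flushes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "txn.log")
        print(commit_in_order(log_path, [b"txn-1\n", b"txn-2\n"]))
```

Without the fsync() between the two writes, nothing in this code would
stop the second record from reaching stable storage before the first.
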
>
>    (It is unclear how SSDs react to physical loss of power during heavy
>    writing activity.  Theoretically, while they will certainly lose
>    their write cache, they shouldn't wind up with any read errors).
>
>                                                -Matt

-- 
Olivier Smedts                                                 _
                                        ASCII ribbon campaign ( )
e-mail: olivier@gid0.org        - against HTML email & vCards  X
www: http://www.gid0.org    - against proprietary attachments / \

  "There are only 10 kinds of people in the world:
  those who understand binary,
  and those who don't."