From: "Valeri Galtsev"
Reply-To: galtsev@kicp.uchicago.edu
To: "Lowell Gilbert"
Cc: "Valeri Galtsev", "Shamim Shahriar", "Kevin P. Neal", freebsd-questions@freebsd.org
Date: Tue, 19 Apr 2016 11:52:20 -0500 (CDT)
Subject: Re: Raid 1+0

On Tue, April 19, 2016 11:16 am, Lowell Gilbert wrote:
> "Valeri Galtsev" writes:
>
>> Somebody with better knowledge of probability theory will correct me if
>> I'm wrong some place.
>
> Well, you are assuming that the probabilities of two drives failing
> are entirely independent of each other. The person to whom you are
> responding asserted that this is not the case. Neither of you
> presented any evidence directly to that point.

Correct, we didn't hear proof one way or the other. I, however, cannot
think of any physical mechanism by which the failure of one drive would
lead to the failure of another. That is why I assume the events are
(pretty much) independent. What can cause a drive failure?

1. Purely mechanical reasons: a head broke off, a large spot on the
   platter surface deteriorated, a new big scratch appeared on the
   platter, a dirt particle left inside the drive by the manufacturer
   came unstuck and started flying around,... None of these will
   suddenly affect other drives sitting in the same enclosure.

2. Electrical problems: the drive electronics burned out,... I safely
   assume that other drives, even ones sitting on the same power lines,
   are unlikely to be affected. It usually is a small brave piece of
   semiconductor that burns out and saves everybody else on the same
   power lines, because the short circuit behind it becomes
   disconnected. Burning away a piece of semiconductor doesn't require
   an awful amount of power, and there is plenty where it comes from to
   feed many drives.

So far I don't see any physical scenario by which the failure of one
drive can change the probability of failure of another drive. To prove
that the events are not independent, one needs some physical mechanism
responsible for that. Without one, the events are independent in my
opinion (exactly as I have observed in my server room for over a decade
and a half). But if someone observes different things (or even observed
them once) in their server room, I really would like to know all the
details. If they prove me wrong, I will learn something and change my
hardware policies to make our equipment more reliable.
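The independence assumption above can be put in rough numbers. What
follows is only a minimal sketch, not a measurement: it assumes a
constant hazard rate (exponential drive lifetime), and the 3% annual
failure rate, the 24-hour rebuild window, and the 10x "hazard
multiplier" are all hypothetical values chosen for illustration. The
function name and parameters are likewise made up for this example.

```python
import math

HOURS_PER_YEAR = 24 * 365


def p_fail_within(hours, annual_failure_rate, hazard_multiplier=1.0):
    """Probability that one drive fails within `hours`.

    Assumes a constant hazard rate (exponential lifetime) chosen so that
    the probability of failing within one year equals annual_failure_rate.
    hazard_multiplier > 1 models a hypothetical mechanism that makes the
    surviving drive more likely to fail after its partner has died;
    hazard_multiplier == 1 is the independent case.
    """
    rate_per_hour = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    return 1.0 - math.exp(-hazard_multiplier * rate_per_hour * hours)


if __name__ == "__main__":
    afr = 0.03            # assumed annual failure rate, illustration only
    rebuild_hours = 24.0  # assumed time to replace and resync a mirror

    independent = p_fail_within(rebuild_hours, afr)
    coupled = p_fail_within(rebuild_hours, afr, hazard_multiplier=10.0)
    print(f"partner fails during rebuild, independent: {independent:.2e}")
    print(f"partner fails during rebuild, 10x hazard:  {coupled:.2e}")
```

Under independence, the fact that one drive has just failed tells us
nothing about its mirror partner, so the partner's risk during the
rebuild window is just its ordinary risk over that many hours. Any
claim of dependence amounts to claiming a multiplier noticeably above 1,
which is exactly what would need a physical mechanism behind it.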
I have been keen to get this information whenever I came across such
stories, but so far all the "multiple failure" stories boiled down to
one failure that had happened long before, and a second failure that
merely triggered the discovery of the older one rather than causing it.
So "double failure" stories with all the details are something that
would be great to study closely. And if someone can suggest (even
purely theoretically) a physical mechanism by which one drive failure
can induce (or grossly increase the probability of) another drive
failure, it would be really great to hear.

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++