Date: Fri, 27 Aug 2004 15:36:05 -0400
From: Ken Smith
To: Pavel Merdine
Cc: freebsd-stable@freebsd.org
Subject: Re: ffs_alloc panic patch
Message-ID: <20040827193605.GC28442@electra.cse.Buffalo.EDU>
In-Reply-To: <1076237332.20040827215245@kaluga.ru>

On Fri, Aug 27, 2004 at 09:52:45PM +0400, Pavel Merdine wrote:

> Panic is VERY undesirable situation. And I'm in doubt why those people
> who wrote ffs like panics so devotedly:
>
> # grep -c "panic" ffs_alloc.c ffs_softdep.c
> ffs_alloc.c:37
> ffs_softdep.c:108
>
> I think such things are not acceptable in production environment. Why
> those functions cannot just return a failure state and leave system
> working?

Actually it's checks like this and calls to panic that make the system
acceptable in a production environment.
A couple of examples:

- Suppose the code is checking a reference counter, and that counter
  has become zero for a file object that the kernel believes is still
  in use.  This should never happen; it indicates a programming bug
  somewhere else.  Furthermore, the other piece of the system where
  the bug lies may have started to use pieces of that file object for
  *other* purposes.  If you just continue on pretending nothing
  happened you wind up with filesystem corruption: what's on the disk
  is not necessarily correct.  Better to have the machine crash and
  reboot than write bad data to the disk.

- Suppose you have a disk drive that's dying, so what you write to it
  isn't necessarily what you read back.  Again, much better to panic
  the machine than to continue on pretending nothing is wrong.  You
  would not likely have the ffs code panic the machine over bad data
  inside of files in this sort of situation.  But if the machine reads
  bad data from the on-disk data structures that keep track of which
  files are inside of which directories, who owns those files, what
  the permissions are, and what disk blocks the files are actually
  sitting on (this is generally known as "metadata"), then the ffs
  code will typically panic the machine.

- Suppose you as a sys-admin make a mistake, and somehow manage to set
  up two disk partitions that partially overlap (don't laugh, I've
  seen it happen...).  Here you again wind up in a situation where the
  filesystem data structures on the disk can become corrupted.
  Typically at some point the ffs code will recognize that the
  metadata is incorrect, and again a panic is better than trying to
  carry on pretending nothing is wrong.

None of these things should happen.  But they *can* happen, and not all
of them are "system bugs" - the second example is out of anyone's
control and the third example is "pilot error".  The consequence of not
panicking in these situations is corrupted data on the disks.
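
To make the first example concrete, here is a rough userspace sketch
of that kind of check.  It is only an illustration, not code from
ffs_alloc.c or ffs_softdep.c: struct file_obj, obj_is_consistent(),
toy_panic() and obj_release() are all made-up names, and toy_panic()
merely stands in for the kernel's panic by printing a message and
calling abort().

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel file object (not a real FreeBSD type). */
struct file_obj {
    int refcount;   /* number of holders the kernel knows about */
    int in_use;     /* nonzero while the kernel believes the object is live */
};

/* Returns 1 if the invariant holds: an in-use object has a positive count. */
static int obj_is_consistent(const struct file_obj *fo)
{
    return !(fo->in_use && fo->refcount <= 0);
}

/* Userspace stand-in for panic: report and stop instead of carrying on. */
static void toy_panic(const char *msg)
{
    fprintf(stderr, "panic: %s\n", msg);
    abort();
}

/* Drop one reference; refuse to continue if the bookkeeping is already wrong. */
static void obj_release(struct file_obj *fo)
{
    if (!obj_is_consistent(fo))
        toy_panic("obj_release: in-use object with zero refcount");

    fo->refcount--;
    if (fo->refcount == 0)
        fo->in_use = 0;     /* last reference gone, object may be reused */
}
```

The point of stopping in obj_release() rather than returning an error
is that the caller cannot repair the situation: if the count is wrong,
some other code may already be reusing the object, and anything written
out afterwards is suspect.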
-- 
                                                Ken Smith
- From there to here, from here to      |       kensmith@cse.buffalo.edu
  there, funny things are everywhere.   |
                      - Theodore Geisel |