From: John Baldwin
To: freebsd-fs@freebsd.org
Date: Fri, 7 Aug 2009 08:29:54 -0400
Subject: Re: UFS Filesystem issues, and the loss of my hair...
Message-Id: <200908070829.54571.jhb@freebsd.org>
In-Reply-To: <8E9591D8BCB72D4C8DE0884D9A2932DC35BD34C3@ITS-HCWNEM03.ds.Vanderbilt.edu>
References: <8E9591D8BCB72D4C8DE0884D9A2932DC35BD34C3@ITS-HCWNEM03.ds.Vanderbilt.edu>

On Thursday 06 August 2009 9:51:04 am Hearn, Trevor wrote:
> First off, let me state that I love FreeBSD. I've used it for years, and have not had any major problems with it... Until now.
>
> As you can tell, I work for a major university. I set up a large storage array to hold data for a project they have here. No great shakes, just some standard files and such. The fun started when I started loading users onto the system, and they started using it... Isn't that always the case? Now, I get ufs_dirbad errors, and the system hard locks. This isn't the worst thing that could happen, but when you're talking about file partitions the size that I am using, the fsck takes FOREVER. Somewhere on the order of 1.5 hours. During that time, I am bringing the individual shares/partitions online, but the users suffer. I've asked about this before, in a different forum, but got no usable information that I could see. So, here goes...
>
> The system is as such. A Dell 2950 1U server, with a Qlogic Fibre Channel card. It is connected to two Promise Array chassis, 610 series, each with 16 drives. Each chassis is running RAID 6, which gives me about 12.73 TB of storage per chassis. From there, the logical drives are sliced up into smaller partitions. At most, I have a 3.6 TB partition. The smallest is a 100 GB partition.
>
> Filesystem      Size    Used   Avail  Capacity  Mounted on
> /dev/mfid0s1a   197G     10G    170G        6%  /
> devfs           1.0K    1.0K      0B      100%  /dev
> /dev/da0p1      1.8T    1.5T    130G       92%  /slice1
> /dev/da0p5      2.7T    1.8T    661G       74%  /slice2
> /dev/da0p9      250G     21G    209G        9%  /slice3
> /dev/da1p3      103G     12G     83G       12%  /slice4
> /dev/da1p4      205G     54G    135G       29%  /slice5
> /dev/da1p5      103G    7.3G     87G        8%  /slice6
> /dev/da1p6      103G     22G     72G       23%  /slice7
> etc...
>
> I had to use GPT to set up the partitions, and they are using UFS2 for the filesystem. Now... If that's not fun enough... I have TWO of these creatures, which rsync every 4 hours. The secondary system is across campus, and sits idle 99% of the time. Every 4 hours, in a stepped schedule, the primary array syncs to the secondary array. If the primary goes down, I fsck, and any files that are fried, I bring back across from the secondary and replace them. This has worked OK for a while, but now I am getting kernel panics on a regular basis. I've been told to migrate to a different filesystem, but my options are ZFS and using GJOURNAL with UFS, from what I can tell. I need something repeatable, simple, and I need something robust. I have NO idea why I keep getting errors like this, but I imagine it's a cascading effect of other hangs that have caused more corruption.
>
> I'd buy a fella, or gal, a cup of coffee and a pop-tart if they could help a brother out. I have checked out this link:
> http://phaq.phunsites.net/2007/07/01/ufs_dirbad-panic-with-mangled-entries-in-ufs/
> and decided that I need to give this a shot after hours, but being the kinda guy I am, I need to make sure I am covering all of my bases.

Are you seeing ufs_dirbad panics? Specifically, can you capture the messages on the console when the machine panics?

-- 
John Baldwin
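
On capturing the panic messages asked about above, here is a minimal sketch. It assumes a stock FreeBSD install with a swap device large enough to hold a kernel dump; nothing in the thread says how this machine is actually configured. The two rc.conf variables are standard FreeBSD settings, and the values shown are illustrative.

```shell
# /etc/rc.conf fragment (assumed setup, not taken from the thread).
# On panic, write a kernel crash dump to the configured swap device.
dumpdev="AUTO"
# Directory where savecore(8) stores the dump on the next boot.
dumpdir="/var/crash"
```

After the next panic and reboot, the dump saved under /var/crash can be opened with kgdb to recover the panic message and a backtrace. If a serial console is attached, the same messages can also be logged from another machine as they are printed.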