Date: Fri, 21 May 2010 04:04:13 -0400 (EDT)
From: Charles Sprickman <spork@bway.net>
To: freebsd-stable@freebsd.org
Subject: 7.2 filesystem corruption
Message-ID: <alpine.OSX.2.00.1005210322440.12593@hotlap.local>
Hello all,

Not sure where to go with this post. I've tried -fs and -scsi previously while trying to track down some panics in the softdep code. Perhaps the more general audience here can shove me in the right direction.

I have a box (Dell PE 2970) running FreeBSD 7.2/amd64, with 6 GB of ECC RAM and a Dell-branded LSI RAID controller (mpt driver). It's a mail server with the active mail server running in a jail and a test version of same running in another jail (qmail/vpopmail/courier on old, postfix/pfadmin/dovecot on new).

Before going into production it passed a few weeks of heavy stress testing, where I was putting much more load on it using an imap/pop/smtp test suite, with only one panic (which happened during a fairly intense mstone run). I figured I was somewhat on the bleeding edge with 7.x 64-bit at that time, so I was not overly concerned, since I've run into softdep panics before.

Since then, however, there have been a few panics in "ufsdirhash_lookup". When this happens, the box reboots, does a background fsck and does not complain about anything. I decided background fsck was probably not a good idea, so I disabled it and manually fsck'd on all subsequent panics. The pattern is similar to this example:

** /dev/mfid0s1g
** Last Mounted on /spool
** Phase 1 - Check Blocks and Sizes
UNKNOWN FILE TYPE I=147718184
UNEXPECTED SOFT UPDATE INCONSISTENCY

CLEAR? yes

PARTIALLY ALLOCATED INODE I=147718185
UNEXPECTED SOFT UPDATE INCONSISTENCY

And in phase 2, lots of this:

UNALLOCATED I=152688468 OWNER=root MODE=0
SIZE=0 MTIME=Dec 31 19:00 1969
NAME=/jails/mailbak.blah.net/home/vpopmail/domains/blah.net/A/spec/Maildir/new/1233549930.73014.blah.bway.net
UNEXPECTED SOFT UPDATE INCONSISTENCY

REMOVE? yes

And in Phase 4, lots of this:

** Phase 4 - Check Reference Counts
UNREF FILE I=147623979 OWNER=88 MODE=100600
SIZE=0 MTIME=Feb 7 00:19 2010
CLEAR? yes

In the manual runs, I tend to run fsck about 3 or 4 times, since even though the filesystem gets marked "clean", another run finds more errors. Once I get two clean runs in a row, I let the box boot.

Regardless of how "clean" the fs is, I have consistently seen messages like this in my serial console log:

g_vfs_done():mfid0s1g[READ(offset=2456998070156636160, length=16384)]error = 5
g_vfs_done():mfid0s1g[READ(offset=2456998070156636160, length=16384)]error = 5

On the last run, I also turned off soft updates for good measure. Now I occasionally get these errors (the second line is reproduced exactly as the console logged it, interleaved output and all):

g_vfs_done():mfid0s1g[READ(offset=5335388948596480000, length=16384)]error = 5
bad block 838May 18 00:29:14 8bigmail kernel: 3pid 24481 (rm), 0uid 0 inumber 1571657736 on /spoo6l: bad block 76548920427, ino 151657736

In addition, there are some files that now have bizarre flags set, such as "schg", "sappnd", "opaque", etc. Some can be changed; others give a "bad file descriptor" error. I fear the fs is getting more scrambled.

I started to think that I'm probably dealing with two things: some bug in 64-bit UFS2, plus a perpetually dirty filesystem that causes the box to panic, which causes more corruption, and so on.

I do have the option of trying to schedule a huge maintenance window and dumping the fs, newfs'ing it, and then restoring it, but it's a tough sell and for various reasons I can't put a ton of time into this (anyone that knows me, hit me up offlist for a fun story). I'm also quite concerned that fsck is finding and fixing things, but the fs is still obviously not quite "right". In short, how can I ensure this won't happen again a week after a dump/restore?

So that's the story. Here are my questions:

-Is there any interest in tracking down the nature of the initial panic/corruption? I know I'm a release behind, but digging through the PR database, nothing stuck out among the softdep, mpt, or dirhash bugs fixed in 7.3 that looked similar to what I'm seeing.

-Where is the most likely place to look for a problem here? The mpt driver? The megacli utility and the BIOS utility both claim the array is in great shape. The only fs that ever shows the "g_vfs_done" errors with the nonsensical offsets is the partition where the jails reside. Or is it a ufsdirhash thing? I saw some interesting bug reports, but nothing that quite matched. UFS2/SU itself?

-If I do dump/restore (or pull from backups), should I stick to 7.2 or go to 7.3 while I'm working on the box? Or gamble on 8.0 (where, oddly enough, I've seen far fewer odd things of late)?

For reference, here are a few other queries regarding this issue:

http://marc.info/?l=freebsd-stable&m=125901173424554&w=2
http://old.nabble.com/7.2-p4:-panic:-ufsdirhash_lookup:-bad-offset-in-hash-array-td27715632.html

I still have some core dumps sitting here as well.

Any input would be appreciated. I do have more info available, but this message is already about twice as long as I'd like it to be. Hit me up with any questions.

Thanks,

Charles
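
For anyone sifting through a similar console log, here is a minimal sketch of how the "nonsensical offsets" in the g_vfs_done lines above could be flagged automatically. The regular expression is modelled only on the two log lines quoted in this message, and the device size is a made-up placeholder (the real size of mfid0s1g is not given here), so treat the function name, the constant, and the threshold as assumptions. The quoted offsets are in the exbibyte range, so any realistic size would flag them.

```python
import re

# Pattern modelled on the lines quoted in the message, e.g.
# g_vfs_done():mfid0s1g[READ(offset=2456998070156636160, length=16384)]error = 5
G_VFS_DONE = re.compile(
    r"g_vfs_done\(\):(?P<dev>\w+)\[(?P<op>READ|WRITE)"
    r"\(offset=(?P<offset>-?\d+), length=(?P<length>\d+)\)\]\s*error = (?P<errno>\d+)"
)


def impossible_requests(log_lines, device_bytes):
    """Return (dev, op, offset, length) for requests that fall outside the device."""
    hits = []
    for line in log_lines:
        m = G_VFS_DONE.search(line)
        if m is None:
            continue  # not a g_vfs_done line
        offset = int(m["offset"])
        length = int(m["length"])
        if offset < 0 or offset + length > device_bytes:
            hits.append((m["dev"], m["op"], offset, length))
    return hits


# ASSUMPTION: 2 TiB is a placeholder, not the real partition size.
ASSUMED_DEVICE_BYTES = 2 * 2**40

sample = [
    "g_vfs_done():mfid0s1g[READ(offset=2456998070156636160, length=16384)]error = 5",
    "g_vfs_done():mfid0s1g[READ(offset=5335388948596480000, length=16384)]error = 5",
]

for dev, op, offset, length in impossible_requests(sample, ASSUMED_DEVICE_BYTES):
    print(f"{dev}: {op} of {length} bytes at offset {offset} "
          f"({offset / 2**60:.2f} EiB) is past the end of the device")
```

A read request that far past the end of the partition most likely means the filesystem handed down a garbage block pointer rather than the disk failing at a real location, which is consistent with the corrupted inodes fsck keeps turning up.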