Date:      Mon, 10 Aug 2009 14:31:23 -0500
From:      "Hearn, Trevor" <trevor.hearn@Vanderbilt.Edu>
To:        John Baldwin <jhb@freebsd.org>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   RE: UFS Filesystem issues, and the loss of my hair...
Message-ID:  <8E9591D8BCB72D4C8DE0884D9A2932DC35BD34CA@ITS-HCWNEM03.ds.Vanderbilt.edu>
In-Reply-To: <200908070829.54571.jhb@freebsd.org>
References:  <8E9591D8BCB72D4C8DE0884D9A2932DC35BD34C3@ITS-HCWNEM03.ds.Vanderbilt.edu>, <200908070829.54571.jhb@freebsd.org>

To the FreeBSD-FS group at large...

Well, I've spent a lot of time looking this one over... I set up a share on a webserver to put up redacted images of the errors I am getting. They are here:

http://www.trevorhearn.com/Array/IMG_0056.jpg
http://www.trevorhearn.com/Array/IMG_0061.jpg
http://www.trevorhearn.com/Array/IMG_0063.jpg
http://www.trevorhearn.com/Array/IMG_0065.jpg
http://www.trevorhearn.com/Array/IMG_0067.jpg
http://www.trevorhearn.com/Array/IMG_0069.jpg

So, while I am in a meeting about the array, oddly, I have this come rolling across the screen of the terminal session I am in...

Aug 10 10:53:43 XXXX kernel: g_vfs_done():da1p7[READ(offset=-6419569950008350720, length=16384)]error = 5
Aug 10 10:53:43 XXXX last message repeated 20 times
Aug 10 10:53:43 XXXX kernel: g_vfs_done():da1p7[READ(offset=-6419569950008350720, length=1638d)]error = 5
Aug 10 10:53:43 XXXX kernel: g_vfs_done():da1p7[READ(offset=-6419569950008350720, length=16384)]error = 5
Aug 10 10:53:43 XXXX last message repeated 18 times
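
As a rough aid for reading lines like these, here is a minimal Python sketch that pulls the device, offset, length and error number out of a g_vfs_done() message. The regular expression is an assumption built only from the lines quoted above, not from any documented kernel log format, and the function name parse_g_vfs_done is made up for illustration. It also prints the offset reinterpreted as an unsigned 64-bit value, since a negative byte offset cannot be a real position on the disk.

```python
import re

# Pattern inferred from the log lines in this message only (assumption);
# other FreeBSD versions may format g_vfs_done() output differently.
G_VFS_DONE = re.compile(
    r"g_vfs_done\(\):(?P<dev>[^\[]+)\[(?P<op>[A-Z]+)\("
    r"offset=(?P<offset>-?\d+), length=(?P<length>\d+)\)\]"
    r"error = (?P<error>\d+)"
)


def parse_g_vfs_done(line):
    """Return the fields of one g_vfs_done() line, or None if it does not match."""
    match = G_VFS_DONE.search(line)
    if match is None:
        return None
    offset = int(match.group("offset"))
    return {
        "device": match.group("dev"),
        "op": match.group("op"),
        "offset": offset,
        # Same bit pattern viewed as an unsigned 64-bit number, in hex.
        "offset_u64_hex": format(offset & 0xFFFFFFFFFFFFFFFF, "#018x"),
        "length": int(match.group("length")),
        "error": int(match.group("error")),
    }


sample = ("Aug 10 10:53:43 XXXX kernel: g_vfs_done():da1p7"
          "[READ(offset=-6419569950008350720, length=16384)]error = 5")
print(parse_g_vfs_done(sample))
```

A line whose length field is not a plain decimal number simply returns None, which makes the odd entries easy to pick out of a long log.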

When I say it was rolling across the screen, I mean it did it for about 5 minutes... I was waiting for the hard-lock to happen, but the process that was touching the file(s) went to 99.02% CPU, and has stayed there the remainder of the day...

  PID USERNAME     THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1351 xxxxxxxx        1  -8    0 10928K  4656K CPU1   0   2:10 99.02% smbd

This happened earlier in the morning, when we were seeing only moderate usage:

Aug 10 09:54:18 PRSA kernel: pid 1776 (smbd), uid 1194 inumber 107797529 on /xxxxxxxxxx: bad block
Aug 10 09:54:18 PRSA kernel: bad block 165436921330628865, ino 107797529

The bad block number is WAAAY outside of what is used on the machine. So...
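
To put a number on "way outside", here is a back-of-the-envelope Python sketch. It assumes the largest partition is the 3.6 TB one mentioned in the quoted message below (read here as TiB), and it tries a few candidate block units because the message does not say which unit the kernel is counting in. The unit sizes and the helper name max_block_number are assumptions for illustration only.

```python
TIB = 2 ** 40

# Value copied from the "bad block" kernel message.
bad_block = 165436921330628865

# Largest partition per the quoted message ("3.6tb"); treating it as TiB is an assumption.
largest_partition_bytes = int(3.6 * TIB)


def max_block_number(partition_bytes, unit_bytes):
    """Highest block number that fits in the partition for a given unit size."""
    return partition_bytes // unit_bytes - 1


# Candidate units (assumed): 512-byte sectors, 2 KiB fragments, 16 KiB blocks.
for unit in (512, 2048, 16384):
    limit = max_block_number(largest_partition_bytes, unit)
    print("unit", unit, "max block", limit, "reported/max ratio", bad_block // limit)
```

Even with the smallest unit, the reported number is millions of times larger than any block the partition could contain, which points at a corrupted block pointer rather than a real location on the array.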

Everything that I have found relating to these problems is people asking, 'How do I fix this', and NONE of those threads so far has offered a fix. 'error = 5' is EIO, an input/output error reported for a device. That being said, I either have a problem with the controller in my Promise Array, which I am learning is possible, or I have an issue with a driver in FreeBSD, and just happen to have a circumstance where it will appear. There does not seem to be a rhyme or reason to what is taking place. How does a set of array controllers throw a bad block error? I mean, with a standard drive, I can see it... but an array controller? Some other things that I have found...
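
For anyone wanting to double-check the errno mapping, a small Python sketch using the standard errno and os modules follows. The helper name describe_errno is made up for illustration, and the message text printed by os.strerror depends on the operating system.

```python
import errno
import os


def describe_errno(code):
    """Return the symbolic name and the OS message for an errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


name, message = describe_errno(5)
print(name, "-", message)
```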

The link below tells about using 'find / -type d -exec stat {} \;' to run thru the filesystem and find the corrupted files. I did this earlier this morning, and found none. I went back thru several of the inodes that are showing in the pictures above, and only found one in existence. I battened down the hatches, and hit that directory. I was able to cp all of the info in that directory to another directory without a single problem. With all that I have been reading, this should have caused all manner of hell. I ran fsck on all directories, and got the server back online...

Back online? Yes. It hard-locked at 3:09 AM Sunday morning. Odd, since it has done that MANY times at 3:09 AM. I have Nagios watching the server, and it always seems to do so at the same time. I looked at cron jobs, and found that it runs PERIODIC DAILY at 3:01 AM. My Nagios box checks every 5 minutes, with three retries at one-minute intervals if a service is not available. SO, somewhere in the list of things that the server does in the PERIODIC DAILY job, there is something that makes the server fault. Tonight, I will be going thru the jobs, running them one by one, seeing exactly which one causes the fault. I have seen others speak of it going down at 3:00 AM-ish, so I think this might be a bit of a clue.
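
One way to run the daily jobs one at a time is sketched below in Python. It is only a sketch under assumptions: it treats every executable file in a directory as a job, runs them in name order, and syncs a "starting" line to a log before each one, so that after a hard lock the last line in the log names the job that was running. The function names and the log path are made up for illustration. /etc/periodic/daily is where FreeBSD keeps the base system's daily scripts, but ports may add more elsewhere, and running the scripts directly may not behave exactly as under periodic(8).

```python
import os
import subprocess


def list_scripts(directory):
    """Return the executable regular files in `directory`, sorted by name."""
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            paths.append(path)
    return paths


def write_synced(log, text):
    """Append a line and force it to disk so it survives a hard lock."""
    log.write(text + "\n")
    log.flush()
    os.fsync(log.fileno())


def run_one_by_one(directory, log_path):
    """Run each script in turn and return a list of (path, exit status)."""
    results = []
    with open(log_path, "a") as log:
        for path in list_scripts(directory):
            write_synced(log, "starting " + path)
            status = subprocess.run([path]).returncode
            write_synced(log, "finished " + path + " exit=" + str(status))
            results.append((path, status))
    return results
```

A call might look like run_one_by_one("/etc/periodic/daily", "/var/tmp/periodic-trace.log"). The log is best kept on the system disk rather than on the array, so that it stays readable if the array is what locks up.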

At this point, I am purchasing another 2 port fibre channel card, with hopes of installing it in a spare 1U server I have, to migrate to Ubuntu, or similar. I'd like to test it out with Ubuntu, but I do not know at this point if it will see the array partitions correctly, nor if it will allow me to access the UFS partitions that are there. Worst case, I will backup, and re-format the chassis themselves. I would hope that this would not be necessary, but I am almost at my wit's end.

Has ANYONE got any ideas, other than the ones presented? I'm keen to see if there is a fix, because I love FreeBSD, but I can't be an evangelist for it when it is giving me so much grief. Thanks for listening, I'll be here all week. :)

-Trevor




________________________________________
From: John Baldwin [jhb@freebsd.org]
Sent: Friday, August 07, 2009 7:29 AM
To: freebsd-fs@freebsd.org
Cc: Hearn, Trevor
Subject: Re: UFS Filesystem issues, and the loss of my hair...

On Thursday 06 August 2009 9:51:04 am Hearn, Trevor wrote:
> First off, let me state that I love FreeBSD. I've used it for years, and have not had any major problems with it... Until now.
>
> As you can tell, I work for a major university. I setup a large storage array to hold data for a project they have here. No great shakes, just some standard files and such. The fun started when I started loading users onto the system, and they started using it... Isn't that always the case? Now, I get ufs_dirbad errors, and the system hard locks. This isn't the worst thing that could happen, but when you're talking about file partitions the size that I am using, the fsck takes FOREVER. Somewhere on the order of 1.5 hours. During that time, I am bringing the individual shares/partitions online, but the users suffer. I've asked about this before, in a different forum, but got no usable information that I could see. So, here goes...
>
> The system is as such. A dell 2950 1U server, with a Qlogic Fibre Channel card. It is connected to two Promise Array chassis, 610 series, each with 16 drives. Each chassis is running RAID 6, which gives me about 12.73tb of storage per chassis. From there, the logical drives are sliced up into smaller partitions. At most, I have a 3.6tb partition. The smallest is a 100gig partition.
>
> Filesystem       Size    Used   Avail Capacity  Mounted on
> /dev/mfid0s1a    197G     10G    170G     6%    /
> devfs            1.0K    1.0K      0B   100%    /dev
> /dev/da0p1       1.8T    1.5T    130G    92%    /slice1
> /dev/da0p5       2.7T    1.8T    661G    74%    /slice2
> /dev/da0p9       250G     21G    209G     9%    /slice3
> /dev/da1p3       103G     12G     83G    12%    /slice4
> /dev/da1p4       205G     54G    135G    29%    /slice5
> /dev/da1p5       103G    7.3G     87G     8%    /slice6
> /dev/da1p6       103G     22G     72G    23%    /slice7
> etc...
>
> I had to use GPT to setup the partitions, and they are using UFS2 for the filesystem. Now... If that's not fun enough... I have TWO of these creatures, which RSYNC every 4 hours. The secondary system is across campus, and sits idle 99% of the time. Every 4 hours, in a stepped schedule, the primary array syncs to the secondary array. If the primary goes down, I FSCK, and any files that are fried, I bring back across from the secondary and replace them. This has worked OK for a while, but now I am getting Kernel Panics on a regular basis. I've been told to migrate to a different filesystem, but my options are ZFS and using GJOURNAL with UFS, from what I can tell. I need something repeatable, simple, and I need something robust. I have NO idea why I keep getting errors like this, but I imagine it's a cascading effect of other hangs that have caused more corruption.
>
> I'd buy a fella, or gal, a cup of coffee and a pop-tart if they could help a brother out. I have checked out this link:
> http://phaq.phunsites.net/2007/07/01/ufs_dirbad-panic-with-mangled-entries-in-ufs/
> and decided that I need to give this a shot after hours, but being the kinda guy I am, I need to make sure I am covering all of my bases.

Are you seeing ufs_dirbad panics?  Specifically, can you capture the messages on the console when the machine panics?

--
John Baldwin


