Date:      Thu, 10 Jun 2010 11:17:03 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Robin Sommer <robin@icir.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: File system trouble with ICH9 controller
Message-ID:  <20100610181703.GA80162@icarus.home.lan>
In-Reply-To: <20100610162918.GA23022@icir.org>
References:  <20100610162918.GA23022@icir.org>

On Thu, Jun 10, 2010 at 09:29:19AM -0700, Robin Sommer wrote:
> I'm running 8.0-RELEASE-p2 (amd64) on a larger number of Supermicro
> SBI-7425C-T3 blades. Each of the blades has 2 x 500GB disks striped
> into a single volume via the on-board ICH9 RAID controller. 
>
> However, after running fine for a while (days), the blades crash
> eventually with file system problems such as the one below.
> Initially I thought that must be a bad disk, but by now 5 different
> blades have shown similar problems so I'm suspecting some OS issue. 
> 
> Has anybody seen something similar before? Could this be an
> incompatibility with the RAID controller (I haven't found much
> recent on Google but there are a number of older threads indicating
> that it might not be well supported. Not sure though whether those
> still apply).  
>
> Jun  9 10:00:02 <user.crit> blade19 kernel: ar0s1a[WRITE(offset=704187858944, length=114688)]error = 5
> Jun  9 10:00:02 <user.crit> blade19 kernel: g_vfs_done():ar0s1a[WRITE(offset=704188219392, length=131072)]error = 5
> Jun  9 10:00:02 <user.crit> blade19 kernel: g_vfs_done():ar0s1a[WRITE(offset=704188891136, length=114688)]error = 5
> Jun  9 10:00:02 <user.crit> blade19 kernel: g_vfs_done():ar0s1a[WRITE(offset=704189382656, length=114688)]error = 5
> Jun  9 10:00:02 <user.crit> blade19 kernel: g_vfs_done():ar0s1a[WRITE(offset=704189743104, length=131072)]
> Jun  9 10:00:02 <user.crit> blade19 kernel: error = 5

You're using Intel MatrixRAID.  Please stop[1]; you're living
dangerously.

The messages your kernel is spitting out could indicate a lot of
different things.  Tracking it down will take time.  So let's start with
this:

1) Provide output from "gpart show ar0s1".  I'm curious about something
(likely a red herring, but I want to see).

2) Install sysutils/smartmontools and run "smartctl -a /dev/adXX" on
each of the disks which make up the RAID array.  I believe FreeBSD can
see the disks associated with the array (meaning you should have a few
adXX disks, in addition to an ar0 entry).  I can help you decode the
output, to see if any of the disks have actual problems that indicate
they could be going bad.

3) Remove use of MatrixRAID.  Alternatives include ccd, gstripe, gvinum,
or ZFS.  I would recommend ZFS if you run RELENG_8 instead of -RELEASE,
the system is amd64, and it has at least 4GB RAM.  Remove use of
MatrixRAID first, then see if the problem goes away.

4) If the problem still happens after this, there should be developers
who can help diagnose the problem.  Keeping MatrixRAID out of the
picture helps greatly.
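
While collecting the above, it can help to summarise the g_vfs_done
lines quoted earlier.  Below is a minimal Python sketch, not part of
any FreeBSD tool: the function name parse_geom_errors and the regular
expression are assumptions based only on the log format shown in this
thread.  It pulls out the device, operation, offset and length, and maps
the numeric error to its errno name.

```python
import errno
import re

# Assumed format, taken from the quoted log lines:
#   ar0s1a[WRITE(offset=704187858944, length=114688)]error = 5
LINE_RE = re.compile(
    r"(?P<dev>\w+)\[(?P<op>READ|WRITE)"
    r"\(offset=(?P<offset>\d+), length=(?P<length>\d+)\)\]"
    r"\s*(?:error = (?P<err>\d+))?"
)


def parse_geom_errors(lines):
    """Return one dict per log line that matches the assumed format."""
    records = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a GEOM I/O error line
        offset = int(m.group("offset"))
        err = m.group("err")
        records.append({
            "dev": m.group("dev"),
            "op": m.group("op"),
            "offset": offset,
            "length": int(m.group("length")),
            # Offset in GiB, relative to the start of the named device.
            "offset_gib": offset / 2**30,
            # Symbolic errno name, or None if the error number was
            # printed on a separate line (as happens once in the log).
            "errno": errno.errorcode.get(int(err)) if err else None,
        })
    return records


if __name__ == "__main__":
    sample = [
        "Jun  9 10:00:02 <user.crit> blade19 kernel: "
        "ar0s1a[WRITE(offset=704187858944, length=114688)]error = 5",
        "Jun  9 10:00:02 <user.crit> blade19 kernel: "
        "g_vfs_done():ar0s1a[WRITE(offset=704188219392, length=131072)]"
        "error = 5",
    ]
    for rec in parse_geom_errors(sample):
        print(rec)
```

Error 5 is EIO, a generic I/O error, so on its own it does not say
whether the disk, the controller, or the driver is at fault.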

More details: you can treat these as opinions, but they're based on
personal experience (I've dealt with MatrixRAID many a time).  The
problem is not with the ICH9 itself: most of our systems are Supermicro
(not blades, but that doesn't matter) and use ICH9 with AHCI (both with
and without ahci.ko).  Intel ICHxx and ESBx controllers are heavily
tested on FreeBSD, both by users and developers.


[1]: http://en.wikipedia.org/wiki/Intel_Matrix_RAID

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



