FreeBSD Mail Archives

Date:      Mon, 23 Jul 2007 21:42:08 -0700
From:      Jeremy Chadwick <koitsu@FreeBSD.org>
To:        Bill Swingle <unfurl@dub.net>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: problems with Hitachi 1TB SATA drives
Message-ID:  <20070724044208.GA79101@eos.sc1.parodius.com>
In-Reply-To: <46A56695.1000001@dub.net>
References:  <46A54B6F.9010100@dub.net> <200707241128.19418.doconnor@gsoft.com.au> <46A56695.1000001@dub.net>

On Mon, Jul 23, 2007 at 07:40:21PM -0700, Bill Swingle wrote:
>  Doh, I knew I forgot something in my original email.
>  Here's the full dmesg: http://dub.net/rum.dub.net.dmesg

Actually you did include this in your original Email.  I think Daniel
overlooked it.  :-)

After looking at your dmesg and your claim, I got confused because your
initial statement included the use of a 3ware card.  A verbose
description of your configuration:

* ad0: 43979MB <IBM DTLA-307045 TX6OA50C> at ata0-master UDMA100
  -- hooked to:
     atapci0: <Intel ICH5 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 31.1 on pci0
     ata0: <ATA channel 0> on atapci0
     ata1: <ATA channel 1> on atapci0

* ad4: 953869MB <Hitachi HDS721010KLA330 GKAOA70F> at ata2-master SATA150
* ad6: 953869MB <Hitachi HDS721010KLA330 GKAOA70F> at ata3-master SATA150
  -- both hooked to:
     atapci1: <Intel ICH5 SATA150 controller> port 0xec00-0xec07,0xe800-0xe803,0xe400-0xe407,0xe000-0xe003,0xdc00-0xdc0f irq 18 at device 31.2 on pci0
     ata2: <ATA channel 0> on atapci1
     ata3: <ATA channel 1> on atapci1

* twed0: <Unit 0, RAID5, Normal> on twe0
  twed0: 583440MB (1194885120 sectors)
  -- hooked to:
     twe0: <3ware Storage Controller. Driver version 1.50.01.002> port 0xb800-0xb80f mem 0xfeaffc00-0xfeaffc0f,0xfe000000-0xfe7fffff irq 17 at device 2.0 on pci3
     twe0: [GIANT-LOCKED]
     twe0: 4 ports, Firmware FE7X 1.05.00.063, BIOS BE7X 1.08.00.048

I have to assume that atapci0 is actually using IRQ 14 even though it's
not shown (weird...).  Additionally, your ICH5 SATA controller is sharing
an IRQ with a couple of other devices on the PCI bus; this isn't a
problem by itself, but I'm noting it here in case this turns out to be
some weird interrupt problem:

em0: <Intel(R) PRO/1000 Network Connection Version - 6.2.9> port 0xac00-0xac1f mem 0xfd9e0000-0xfd9fffff irq 18 at device 1.0 on pci2
uhci2: <Intel 82801EB (ICH5) USB controller USB-C> port 0xd400-0xd41f irq 18 at device 29.2 on pci0
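
For anyone who wants to repeat the IRQ check against their own dmesg, here
is a minimal sketch in Python.  It assumes probe lines shaped like the ones
quoted in this message; the PROBE pattern and the shared_irqs() helper are
my own illustration, not part of any FreeBSD tool, and other drivers may
print their probe lines differently.

```python
import re
from collections import defaultdict

# Probe lines copied verbatim from the dmesg output quoted in this thread.
DMESG = """\
atapci0: <Intel ICH5 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 31.1 on pci0
atapci1: <Intel ICH5 SATA150 controller> port 0xec00-0xec07,0xe800-0xe803,0xe400-0xe407,0xe000-0xe003,0xdc00-0xdc0f irq 18 at device 31.2 on pci0
twe0: <3ware Storage Controller. Driver version 1.50.01.002> port 0xb800-0xb80f mem 0xfeaffc00-0xfeaffc0f,0xfe000000-0xfe7fffff irq 17 at device 2.0 on pci3
em0: <Intel(R) PRO/1000 Network Connection Version - 6.2.9> port 0xac00-0xac1f mem 0xfd9e0000-0xfd9fffff irq 18 at device 1.0 on pci2
uhci2: <Intel 82801EB (ICH5) USB controller USB-C> port 0xd400-0xd41f irq 18 at device 29.2 on pci0
"""

# Hypothetical pattern, written only to fit the lines above:
# "<name>: <description> ... irq N at device ..."
PROBE = re.compile(r"^(\w+): <.*> .*\birq (\d+) at device")


def shared_irqs(dmesg_text):
    """Return {irq: [device, ...]} for IRQs claimed by more than one device."""
    by_irq = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = PROBE.match(line)
        if match:  # lines without an "irq N" field (e.g. atapci0) are skipped
            by_irq[int(match.group(2))].append(match.group(1))
    return {irq: devs for irq, devs in by_irq.items() if len(devs) > 1}


if __name__ == "__main__":
    for irq, devs in sorted(shared_irqs(DMESG).items()):
        print(f"irq {irq}: {', '.join(devs)}")
```

On the lines above this reports only IRQ 18, shared by atapci1, em0 and
uhci2.  atapci0 drops out because its probe line carries no irq field.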

On to this:

> Jul 21 00:21:45 rum kernel: ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54194911
> Jul 21 00:22:20 rum kernel: ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=107260543
> Jul 21 00:22:57 rum kernel: ad4: FAILURE - device detached
> Jul 21 00:22:57 rum kernel: subdisk4: detached
> Jul 21 00:22:57 rum kernel: ad4: detached
> Jul 21 00:24:19 rum kernel: ad6: FAILURE - device detached
> Jul 21 00:24:19 rum kernel: subdisk6: detached
> Jul 21 00:24:19 rum kernel: ad6: detached
>
> ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=1456106111
> ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1456106111
> ad4: FAILURE - WRITE_DMA48 timed out LBA=1456106111
> ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54194911
> ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=461407775
> ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=461407775
> ad4: FAILURE - WRITE_DMA48 timed out LBA=461407775

But then:

> When trying to newfs them both eventually failed with DMA READ or
> WRITE timeouts.

Now I'm confused.  :-)  I only see evidence of a failure on ad4.  The
ad6 disk disconnecting from the bus could be caused by the controller
getting wedged while waiting for certain transactions sent to ad4
(which are failing).  I've seen this scenario happen many times.  The
panic you got is probably also induced by the same issue.
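
To make the "only ad4" observation easy to check, here is a rough Python
sketch that tallies the quoted kernel messages per device.  The EVENT
pattern, the three category names and the tally() helper are assumptions of
mine for illustration; the log text is copied from the messages quoted
above with the "> " prefix removed.

```python
import re
from collections import Counter

# Kernel messages quoted earlier in this message, quote prefix stripped.
LOG = """\
Jul 21 00:21:45 rum kernel: ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54194911
Jul 21 00:22:20 rum kernel: ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=107260543
Jul 21 00:22:57 rum kernel: ad4: FAILURE - device detached
Jul 21 00:22:57 rum kernel: subdisk4: detached
Jul 21 00:22:57 rum kernel: ad4: detached
Jul 21 00:24:19 rum kernel: ad6: FAILURE - device detached
Jul 21 00:24:19 rum kernel: subdisk6: detached
Jul 21 00:24:19 rum kernel: ad6: detached

ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=1456106111
ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1456106111
ad4: FAILURE - WRITE_DMA48 timed out LBA=1456106111
ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54194911
ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=461407775
ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=461407775
ad4: FAILURE - WRITE_DMA48 timed out LBA=461407775
"""

# Hypothetical pattern: "adN: TIMEOUT - ..." or "adN: FAILURE - ...".
EVENT = re.compile(r"\b(ad\d+): (TIMEOUT|FAILURE) - (.*)")


def tally(log_text):
    """Count (device, kind) pairs; kind is timeout, io_failure or detached."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT.search(line)
        if not match:
            continue  # bare "detached" notices and blank lines
        device, level, detail = match.groups()
        if level == "TIMEOUT":
            kind = "timeout"
        elif "timed out" in detail:
            kind = "io_failure"
        else:
            kind = "detached"
        counts[(device, kind)] += 1
    return counts


if __name__ == "__main__":
    for (device, kind), n in sorted(tally(LOG).items()):
        print(f"{device} {kind}: {n}")
```

Every timeout and timed-out write in the quoted log belongs to ad4; the
only entry for ad6 is the detach.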

Does the WRITE_DMA/DMA48 problem happen for you when newfs'ing a
slice on ad6?

> I've read that bad SATA cables could cause this, the cables I'm using
> are brand new but are probably pretty cheap.

For testing purposes, swap them out with some other cables.  It may not
be the cables at all, so keep the originals around.  You might also try
using canned air to blow out any dust around the SATA connector ends on
the cables, drives, and motherboard.

Remaining questions I have:

Q: Is your ICH5 controller actually ICH5R and you've turned on some
Intel RAID option in the BIOS?  Maybe turning it on but leaving the
disks in a JBOD fashion (not defining an array)?  The reason I ask is
that you said you're going to use the Hitachi drives as "a pair of 1TB
synchronised drives", which implies RAID-1, yet I don't see use of
gmirror or ccd or anything else.  :-)

Q: What make and model of motherboard is this?  Looks like an Intel.

Q: If an Intel, have you gone looking at Intel's site for BIOS updates
for that board?  Intel is the one company who is thorough about
documenting BIOS changes in their Release Notes.  It would not surprise
me if this turned out to be some kind of weird BIOS bug.

Q: Some motherboards let you toggle a "compatibility" mode for the SATA
controller in the BIOS.  You might want to flip that setting to see what
happens (if it's currently set to compatibility mode, try the other
mode, and vice versa).

Q: Have you searched Google for issues others have reported (such as in
Linux) with the HDS721010KLA330 or similar (differently-sized) models?

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |
