Date:      Mon, 9 Dec 2002 00:31:38 +0100
From:      Francesco Casadei <fcasadei@inwind.it>
To:        freebsd-questions@freebsd.org
Subject:   Re: ATA errors
Message-ID:  <20021208233138.GA2252@goku.kasby>
In-Reply-To: <20021207073513.GB34099@dru.dn.ua>
References:  <005401c29d44$b0e24130$c00c460a@pro.tl.thomcorp.net> <3DF0DF91.1050002@pantherdragon.org> <20021207073513.GB34099@dru.dn.ua>


On Sat, Dec 07, 2002 at 09:35:13AM +0200, Vladislav V. Zhuk wrote:
[snip]
>
> I don't think like you.
> I check my hardware and I consider that problem in new ATA driver.
> Under FreeBSD 4.1.1 my hardware work excellent.
> After 4.5 release I get more troubles with IDE devices.
> Some bugs was fixed and now (under 4.7s) I have no problem
> with IDE HDD (even softupdates work).
>
> After reboot my system work excellent 2-5 days, than I get
> "read timeout" problem with my CDROM and all system hang.
>
> I wrote about that troubles with ATA, but not get answer...
>
> Who have problem with ATA driver - write here about this
> and show /var/run/dmesg. Maybe we discover some dependences
> where trouble appeared....
>
> --
> Vladislav V. Zhuk (06267)3-60-03  admin@dru.dn.ua  2:465/197@FidoNet.org
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-questions" in the body of the message
>
> end of the original message

I had a lot of problems with tagged queuing enabled on IBM drives.

I have a server with a Promise FastTrak TX2 ATA RAID controller and 2
IBM 40G drives attached to it. I have another IBM drive (identical to the
other two) attached to the mainboard's ATA controller.

# atacontrol list
ATA channel 0:
    Master:  ad0 <IC35L040AVER07-0/ER4OA44A> ATA/ATAPI rev 5
    Slave:       no device present
ATA channel 1:
    Master: acd0 <LG CD-ROM CRD-8521B/1.03> ATA/ATAPI rev 0
    Slave:       no device present
ATA channel 2:
    Master:  ad4 <IC35L040AVER07-0/ER4OA44A> ATA/ATAPI rev 5
    Slave:       no device present
ATA channel 3:
    Master:  ad6 <IC35L040AVER07-0/ER4OA44A> ATA/ATAPI rev 5
    Slave:       no device present


The filesystem layout is:

# mount
/dev/ar0s1a on / (ufs, local, soft-updates)
/dev/ar0s1f on /usr (ufs, local, soft-updates)
/dev/ar0s1d on /var (ufs, local, noatime, soft-updates)
/dev/ar0s1e on /var/tmp (ufs, local, soft-updates)
/dev/ar0s1g on /db (ufs, local, soft-updates)
/dev/ar0s1h on /home (ufs, local, noatime, soft-updates)
/dev/ad0s1a on /backup (ufs, local, soft-updates)
procfs on /proc (procfs, local)

The hw.ata sysctl tunables are set as follows:

# sysctl -a | grep 'hw\.ata'
hw.ata.ata_dma: 1
hw.ata.wc: 1
hw.ata.tags: 1
hw.ata.atapi_dma: 0

The server ran without problems from October 2001 until the summer of 2002,
when an MFC broke tagged queuing support. I had to set hw.ata.tags to 0 to
avoid kernel panics and keep the system up and running. Eventually the TQ
support was (apparently) fixed and I re-enabled it. The system ran fine only
for a short time, though, because the drive on the second channel of the
Promise controller began to fall back to PIO mode.
I don't think it's a hardware problem, because I booted the system from the
live-system CD of the FreeBSD distribution set and ran dd on the faulty
drive: no error was reported.
I rebuilt the array using the Promise utility and rebooted the system, which
then ran in UDMA100 mode for a couple of weeks. Then the problem appeared
again:

Dec  4 05:40:02 zeus /kernel: ad6: SERVICE timeout tag=0 s=51 e=04
Dec  4 05:40:02 zeus /kernel: ad6: invalidating queued requests
Dec  4 05:40:02 zeus /kernel: ad6: no request for tag=0
Dec  4 05:40:02 zeus /kernel: ad6: invalidating queued requests
Dec  4 05:40:12 zeus /kernel: ad6: READ command timeout tag=0 serv=1 - resetting
Dec  4 05:40:22 zeus /kernel: ad6: invalidating queued requests
Dec  4 05:40:22 zeus /kernel: ata3: resetting devices .. ad6: invalidating queued requests
Dec  4 05:40:22 zeus /kernel: done
Dec  4 05:40:22 zeus /kernel: ad6: READ command timeout tag=0 serv=1 - resetting
Dec  4 05:40:22 zeus /kernel: ad6: invalidating queued requests
Dec  4 05:40:22 zeus /kernel: ata3: resetting devices .. ad6: invalidating queued requests
Dec  4 05:40:22 zeus /kernel: done
Dec  4 05:40:22 zeus /kernel: ad6: no request for tag=0
Dec  4 05:40:22 zeus /kernel: ad6: invalidating queued requests
Dec  4 05:40:32 zeus /kernel: ad6: READ command timeout tag=0 serv=1 - resetting
Dec  4 05:40:52 zeus /kernel: ad6: invalidating queued requests
Dec  4 05:40:52 zeus /kernel: ata3: resetting devices .. ad6: invalidating queued requests
Dec  4 05:40:52 zeus /kernel: done
Dec  4 05:40:52 zeus /kernel: ad6: timeout waiting for READY
Dec  4 05:40:52 zeus /kernel: ad6: invalidating queued requests
Dec  4 05:40:52 zeus /kernel: ad6: timeout sending command=00 s=d0 e=04
Dec  4 05:40:52 zeus /kernel: ad6: flush queue failed
Dec  4 05:40:52 zeus /kernel: - resetting
Dec  4 05:40:52 zeus /kernel: ata3: resetting devices .. ad6: invalidating queued requests
Dec  4 05:40:52 zeus /kernel: done
Dec  4 05:40:52 zeus /kernel: ad6: READ command timeout tag=0 serv=1 - resetting
Dec  4 05:40:52 zeus /kernel: ad6: invalidating queued requests
Dec  4 05:40:52 zeus /kernel: ad6: trying fallback to PIO mode
Dec  4 05:40:52 zeus /kernel: ata3: resetting devices .. ad6: invalidating queued requests
Dec  4 05:40:52 zeus /kernel: done
Dec  4 05:40:52 zeus /kernel: ad6: WRITE command timeout tag=0 serv=0 - resetting
Dec  4 05:40:52 zeus /kernel: ad6: invalidating queued requests
Dec  4 05:40:52 zeus /kernel: ata3: resetting devices .. ad6: invalidating queued requests
Dec  4 05:40:52 zeus /kernel: done

(The most recent error report is shown)
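
For anyone comparing several such reports, kernel messages of this kind can
be tallied with a small script. What follows is only a minimal sketch in
Python, assuming syslog lines of the form "Mon DD HH:MM:SS host /kernel: msg";
the function name summarize and the event labels are hypothetical and not
part of FreeBSD or the ata driver.

```python
import re
from collections import Counter

# Hypothetical helper (not part of FreeBSD): classify syslog lines that
# look like "Dec  4 05:40:02 zeus /kernel: ad6: SERVICE timeout ...".
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+/kernel:\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Return (Counter of event kinds, timestamp of first PIO fallback or None)."""
    counts = Counter()
    fallback_at = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that is not a kernel syslog line
        msg = m.group("msg")
        if "timeout" in msg:
            counts["timeout"] += 1
        if "resetting devices" in msg:
            counts["reset"] += 1
        if "fallback to PIO" in msg:
            counts["pio_fallback"] += 1
            if fallback_at is None:
                fallback_at = m.group("stamp")
    return counts, fallback_at


sample = [
    "Dec  4 05:40:02 zeus /kernel: ad6: SERVICE timeout tag=0 s=51 e=04",
    "Dec  4 05:40:22 zeus /kernel: ata3: resetting devices .. ad6: invalidating queued requests",
    "Dec  4 05:40:52 zeus /kernel: ad6: trying fallback to PIO mode",
]
counts, fallback_at = summarize(sample)
print(dict(counts), fallback_at)
```

Fed the whole log instead of the three sample lines, the same function gives
the number of timeouts and resets that preceded the fallback.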

# atacontrol mode 3
Master = PIO4
Slave  = ???

# atacontrol mode 3 udma100 xxx
Master = UDMA100
Slave  = ???

# atacontrol mode 3
Master = UDMA100
Slave  = ???

If I run an I/O-intensive command, the drive falls back to PIO mode 4:

# find /usr/ports/ -name nonexistent

# atacontrol mode 3
Master = PIO4
Slave  = ???
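
A check like this could be automated so that the fallback is noticed before
the next reboot. The following is a rough sketch only, assuming output of the
form "Master = PIO4" / "Slave  = ???"; parse_modes and fell_back_to_pio are
hypothetical helper names, and the sketch works on captured text rather than
invoking atacontrol itself.

```python
def parse_modes(output):
    """Parse 'Master = X' / 'Slave = Y' lines into a dict keyed by role."""
    modes = {}
    for line in output.splitlines():
        if "=" not in line:
            continue  # ignore blank or unrelated lines
        role, _, mode = line.partition("=")
        modes[role.strip().lower()] = mode.strip()
    return modes


def fell_back_to_pio(output):
    """True if the master device reports a PIO transfer mode."""
    return parse_modes(output).get("master", "").upper().startswith("PIO")


# Captured text in the assumed format (trailing spaces are tolerated).
before = "Master = UDMA100 \nSlave  = ???\n"
after = "Master = PIO4 \nSlave  = ???\n"
print(fell_back_to_pio(before), fell_back_to_pio(after))
```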


If I reboot the system, the Promise utility tells me that the array has a
critical status. If I rebuild the array and reboot the system, then
everything is fine for another 1-4 weeks before the problem appears again!

Note that the problem always appears before the completion of the backup
activity.
From the daily run output before the drive failure:

Last dump(s) done (Dump '>' file systems):
> /dev/ar0s1a   (     /) Last dump: Level 0, Date Tue Dec  3 05:30
  /dev/ar0s1d   (  /var) Last dump: Level 0, Date Tue Dec  3 05:30
  /dev/ar0s1e   (/var/tmp) Last dump: Level 0, Date Tue Dec  3 05:30
  /dev/ar0s1f   (  /usr) Last dump: Level 0, Date Tue Dec  3 05:30
  /dev/ar0s1g   (   /db) Last dump: Level 0, Date Tue Dec  3 05:40
  /dev/ar0s1h   ( /home) Last dump: Level 0, Date Tue Dec  3 05:39

On December 4th at 05:40:02 the timeout problem appears:
Last dump(s) done (Dump '>' file systems):
> /dev/ar0s1a   (     /) Last dump: Level 0, Date Wed Dec  4 05:30
  /dev/ar0s1d   (  /var) Last dump: Level 0, Date Wed Dec  4 05:30
  /dev/ar0s1e   (/var/tmp) Last dump: Level 0, Date Wed Dec  4 05:30
  /dev/ar0s1f   (  /usr) Last dump: Level 0, Date Wed Dec  4 05:30
  /dev/ar0s1g   (   /db) Last dump: Level 0, Date Wed Dec  4 05:41
  /dev/ar0s1h   ( /home) Last dump: Level 0, Date Wed Dec  4 05:39

Note that the backup of /dev/ar0s1g took 1 minute longer than usual (with
exactly the same load, not shown).

After the problem appeared:

Last dump(s) done (Dump '>' file systems):
> /dev/ar0s1a   (     /) Last dump: Level 0, Date Thu Dec  5 05:30
  /dev/ar0s1d   (  /var) Last dump: Level 0, Date Thu Dec  5 05:30
  /dev/ar0s1e   (/var/tmp) Last dump: Level 0, Date Thu Dec  5 05:30
  /dev/ar0s1f   (  /usr) Last dump: Level 0, Date Thu Dec  5 05:30
  /dev/ar0s1g   (   /db) Last dump: Level 0, Date Thu Dec  5 05:47
  /dev/ar0s1h   ( /home) Last dump: Level 0, Date Thu Dec  5 05:44

Obviously the system is slower, but it works.

I'm tired of rebooting and rebuilding the array each time. Can anybody help
me solve this problem?

	Francesco Casadei

P.S. Sorry for the long post, but I'm sure the information I gave you will
help you to diagnose the problem.

-- 
You can download my public key from http://digilander.libero.it/fcasadei/
or retrieve it from a keyserver (pgpkeys.mit.edu, wwwkeys.pgp.net, ...)

Key fingerprint is: 1671 9A23 ACB4 520A E7EE  00B0 7EC3 375F 164E B17B


