Date:      Mon, 24 Oct 2005 11:32:48 -0700
From:      "Vinod Kashyap" <vkashyap@amcc.com>
To:        "Dan Rue" <drue@therub.org>
Cc:        freebsd-stable@FreeBSD.org
Subject:   RE: twa kernel panic under heavy IO
Message-ID:  <2B3B2AA816369A4E87D7BE63EC9D2F26D89149@SDCEXCHANGE01.ad.amcc.com>

> -----Original Message-----
> From: Dan Rue [mailto:drue@therub.org]
> Sent: Monday, October 24, 2005 11:23 AM
> To: Vinod Kashyap
> Cc: freebsd-stable@FreeBSD.org
> Subject: Re: twa kernel panic under heavy IO
>
> On Mon, Oct 24, 2005 at 11:07:28AM -0700, Vinod Kashyap wrote:
> > > > After going around with 3ware web support, this issue has been
> > > > concluded, but not resolved.  I tried my 3ware 9500 on FreeBSD 5.3,
> > > > 5.4, and 5-STABLE.  With all of these versions of OS and driver (I
> > > > never changed the driver version manually), I received hard lock ups
> > > > and reboots (though, interestingly, no kernel panics).
> > > >
> > > > 3ware had me check and troubleshoot a number of possibilities, until
> > > > they finally decided it was a hardware problem and issued me a
> > > > replacement card.  However, in the meantime, I upgraded to FreeBSD
> > > > 6.0RC1 and the machine is now working flawlessly.  I returned the
> > > > replacement card unused.
> > > >
> > > > I can only conclude that this means that there is a large
> > > > (timing?) bug in the twa driver in FreeBSD 5.3/5.4/5-STABLE (as
> > > > opposed to an isolated hardware problem with my setup).
> > > >
> > > > I have pasted the full conversation with 3ware on my website for
> > > > those interested here:
> > > > http://therub.org/9500.txt (sorry for the poor formatting)
> > > >
> > > > At one point, I received the following error message just before the
> > > > machine locked up:
> > > >
> > > > >Oct 12 11:36:13 leopard kernel: initiate_write_filepage: already started
> > > >
> > > > I grepped for that error message in the FreeBSD kernel source, and
> > > > found it in sys/ufs/ffs/ffs_softdep.c on line 3580.  What makes it
> > > > really interesting is the comment above where the error is thrown:
> > > >
> > > if (pagedep->pd_state & IOSTARTED) {
> > >         /*
> > >          * This can only happen if there is a driver that does not
> > >          * understand chaining. Here biodone will reissue the call
> > >          * to strategy for the incomplete buffers.
> > >          */
> > >         printf("initiate_write_filepage: already started\n");
> > >         return;
> > > }
> > >=20
> > > I know this is a 3ware issue.  I am posting this=20
> resolution response=20
> > > here in hopes that it may help someone else that hits=20
> this bug - and=20
> > > with the hope that publically it will get the attention=20
> of the 3ware=20
> > > freebsd driver team/individual.
> > >=20
> >=20
> > The error messages you are seeing are consistent with bad hardware.
> > The hardware is becoming unavailable for the driver to talk to it.
> > This other message "initiate_write_filepage..." is=20
> different but did=20
> > you see the machine hang after this message got printed?  I don't=20
> > think it's related to the hang.
> >=20
>=20
> The initiate_write_filepage message occurred right before the hang.
> Here's the full log from that time:
>
> Oct  6 17:00:32 leopard kernel: twa0: ERROR: (0x16: 0x1301): Missing expected status bit(s): status reg = 0x15025bb0; Missing bits: [MC_RDY,]
> Oct  6 17:00:33 leopard last message repeated 399 times
> Oct  6 17:00:36 leopard kernel: ected status bit(s): status reg = 0x15025bb2; Missing bits: [MC_RDY,]
> Oct  6 17:00:36 leopard kernel: twa0: ERROR: (0x16: 0x1301): Missing expected status bit(s): status reg = 0x15025bb2; Missing bits: [MC_RDY,]
> Oct  6 17:00:36 leopard last message repeated 296 times
> Oct  6 17:01:37 leopard kernel: initiate_write_filepage: already started
> Oct  6 17:01:37 leopard last message repeated 83 times
> Oct  6 17:01:37 leopard kernel: twa0: ERROR: (0x05: 0x210b): Request timed out!: request = 0xc23fb0a0
> Oct  6 17:01:37 leopard kernel: twa0: INFO: (0x16: 0x1108): Resetting controller...:
> Oct  6 17:01:37 leopard kernel: twa0: INFO: (0x04: 0x005e): Cache synchronized after power fail: unit=0
> Oct  6 17:01:37 leopard kernel: twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
> Oct  6 17:01:37 leopard kernel: twa0: INFO: (0x16: 0x1107): Controller reset done!:
>

Ok, that message is preceded by those same messages that indicate
that the hardware became unavailable.  So, that message seems to
have been the result of the same hardware issue I mentioned.
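
For readers following the log, a "Missing expected status bit(s)" line
means a status-register read came back without a bit the driver was
waiting for.  The following is only a minimal sketch of that kind of
check, not the twa driver's actual code.  The mask value 0x2000 for
MC_RDY and the names HYP_STATUS_MC_RDY, missing_status_bits() and
report_mc_rdy() are assumptions made for illustration; none of them
appear in this thread, so verify the bit against the driver headers.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * ASSUMPTION: MC_RDY ("microcontroller ready") is bit 13 of the status
 * register.  This value is not stated in the thread; it is hypothetical
 * here and should be checked against the driver's register definitions.
 */
#define HYP_STATUS_MC_RDY 0x00002000u

/* Return the bits of `expected` that are NOT set in `status_reg`. */
static uint32_t
missing_status_bits(uint32_t status_reg, uint32_t expected)
{
	return (~status_reg) & expected;
}

/*
 * Print whether MC_RDY is present in a status register value.
 * Returns 1 if the (assumed) MC_RDY bit is missing, 0 otherwise.
 */
static int
report_mc_rdy(uint32_t status_reg)
{
	uint32_t missing = missing_status_bits(status_reg, HYP_STATUS_MC_RDY);

	printf("status reg = 0x%08x; MC_RDY %s\n",
	    (unsigned)status_reg, missing ? "missing" : "present");
	return missing != 0;
}
```

Under that assumed mask, both register values from the log (0x15025bb0
and 0x15025bb2) have the bit clear, which would match the "Missing
bits: [MC_RDY,]" text.  The check only shows that the bit was not
observed; it says nothing about why the controller stopped reporting
ready.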
>
> If it's a hardware problem, why would it run fine on 6.0?
> The hang was very easy to trigger, and I've put the 6.0
> machine through the gauntlet trying to recreate the problem.
>
That's a valid question.  It could be only a matter of time...

> Thanks for looking into this (again) for me, Dan
>


