Date:      Mon, 6 Dec 1999 23:21:15 +0100 (MET)
From:      Gerard Roudier <groudier@club-internet.fr>
To:        Mike Smith <msmith@FreeBSD.ORG>
Cc:        Ed Hall <edhall@screech.weirdnoise.com>, freebsd-hackers@FreeBSD.ORG
Subject:   Re: PCI DMA lockups in 3.2 (3.3 maybe?) 
Message-ID:  <Pine.LNX.3.95.991206224054.405B-100000@localhost>
In-Reply-To: <199912060110.RAA09520@mass.cdrom.com>



I have some remarks about the issue. I do not claim it is not a software
problem, but ...

1) Given how much the ncr and sym drivers differ nowadays, I would be
surprised if the problem were due to a driver software bug.
One difference could be that recent drivers may use the optimized PCI
transactions (Memory Write and Invalidate, Memory Read Multiple).

2) In order to investigate a possible hardware problem, we need to know
the actual revisions of the PCI chips used in the system and to have
access to the corresponding errata listings. I can look into the ones I
have (basically SYMBIOS chips), and into the specification updates for
the 440BX that are available from Intel's site, but I do not have
anything about the network board (nor do I know this board).

3) I do not see what led to the conclusion that the kernel stack was
clobbered by something involving the ncr/symbios chips, but maybe a
clear diagnosis exists.

4) Do all the paths (PCI, memory, ...) have parity enabled, and do the
corresponding parts perform parity checking?

5) Did you try using normal I/O instead of MMIO for the SYMBIOS chip
and the network chip, if the code allows it?
MMIO may confuse drivers that are not aware of posted writes. For
example, a PCI device driver that uses MMIO to write to some I/O
register to ack something, and then assumes the chip knows about it, is
just wrong, since the transaction can be posted (a read, a dummy one if
needed, must be performed before making such an assumption). This also
acts as a barrier for drivers that are not careful about actual
instruction and memory ordering.
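
To illustrate the posted-write point in item 5, here is a minimal sketch
in C. It is a toy model, not code from the ncr, sym or fxp drivers:
struct fake_dev, the one-entry posted buffer and all function names are
hypothetical, and the register is assumed to be write-1-to-clear. It
only shows why an ack written through MMIO should be followed by a read
before the driver assumes the chip has seen it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a chip behind a bridge with a one-entry posted-write
 * buffer. Every name here is hypothetical. */
struct fake_dev {
    uint32_t istat;         /* interrupt status as the chip sees it */
    bool     posted_valid;  /* a write is still sitting in the bridge */
    uint32_t posted_value;  /* value of that pending write */
};

/* MMIO write: returns to the CPU at once, but in this model the data
 * only reaches the bridge buffer, not the chip. */
static void mmio_write_istat(struct fake_dev *d, uint32_t v)
{
    assert(!d->posted_valid);   /* toy model holds a single entry */
    d->posted_value = v;
    d->posted_valid = true;
}

/* MMIO read: a read may not pass posted writes, so the model drains
 * the buffer to the chip before returning the register value. */
static uint32_t mmio_read_istat(struct fake_dev *d)
{
    if (d->posted_valid) {
        d->istat &= ~d->posted_value;   /* assumed write-1-to-clear */
        d->posted_valid = false;
    }
    return d->istat;
}

/* Model-only helper: has the chip itself seen the ack for this bit? */
static bool chip_saw_ack(const struct fake_dev *d, uint32_t bit)
{
    return (d->istat & bit) == 0;
}

/* Wrong: writes the ack and returns, assuming the chip knows. */
static void ack_unsafe(struct fake_dev *d, uint32_t bit)
{
    mmio_write_istat(d, bit);
}

/* Better: a dummy read forces the posted write out to the chip. */
static void ack_flushed(struct fake_dev *d, uint32_t bit)
{
    mmio_write_istat(d, bit);
    (void)mmio_read_istat(d);
}
```

In this model, after ack_unsafe() the chip still sees the pending bit,
while after ack_flushed() the bit is cleared before the call returns.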

Just my 0.02 euros.

Gérard.

On Sun, 5 Dec 1999, Mike Smith wrote:

> > On a recent project I encountered two show-stopping bugs with
> > 3.3-release that did not exist in 2.2.8-release:
> >
> > 1) Random crashes in FXP interrupt or low-level IP code.  Something is
> >    clobbering the kernel stack--possibly the NCR driver, since using an
> >    Adaptec made the problem stop, as did a backport of the CAM driver
> >    Peter Wemm tried.  This was on an N440BX, which is becoming quite
> >    common in server applications.  Other installations are apparently
> >    seeing the same problem on this hardware.
>
> So far the problem appears to require a combination of the 440BX
> chipset, an Intel EtherExpress and the 'fxp' driver, and an
> NCR/Symbios/LSI SCSI adapter and either the 'ncr' or 'sym' driver.
> We've tried on a number of occasions to diagnose this problem, but
> there have been many issues that have prevented its resolution.
> These have included lack of interest on the driver developers' parts,
> lack of access to or cooperation from people complaining of the bug,
> and an inability to reproduce it in a useful fashion.  It's been an
> eye-opening exercise and we're trying to learn what we can from it,
> as well as actually fix it for good.
>
> > 2) A hard loop in the pagedaemon.  This was especially egregious, since
> >    it meant the system had to be rebooted from the console--and since
> >    the application could elicit the problem within a few minutes.
> >    Disabling the use of mmap() for file update in the application
> >    prevented the problem.  After spending a day trying to cook up a
> >    test program that elicited the same behavior that the application
> >    did, I gave up for lack of time.  But there have been other reports
> >    of late that sound like this problem, mostly in high VM/RAM
> >    situations.
> >
> > That's two serious bugs that exist in 3.3-release but not in
> > 2.2.8-release.  Looking back through the archives, I can see that
> > I'm not the only one who has experienced them.  I came away from the
> > experience with the feeling that the FreeBSD project has some
> > serious Q/A problems... and I can assure you, I'm not alone in this
> > feeling.
>
> Neither are we.  But, since FreeBSD is a volunteer-developed project,
> and since you admit above that you have contributed to the lack of QA,
> I'm not entirely sure what your point is.  We need this feedback in a
> timely fashion in order to do something with it.  3 months after a
> release is not "timely" by any stretch of the imagination, and without
> that sort of assistance, I have no idea what you think we can do to
> improve the situation.
>
> Yes, we want to improve our QA.  But when customers come up months
> after the fact and complain about something that we could never
> possibly have either known or even guessed about during the
> development process, the best we can do is try to fix the problem then
> and there.  If you want to improve that situation, you can; in your
> position you have plenty of opportunities to make a major contribution
> to the overall quality of FreeBSD releases.  OTOH, if you choose not
> to do so, it's mere honesty to observe that you need to take a share
> of the blame for the current situation.
>
> ps: The N440BX is actually being phased out, however there are very
>     large numbers of them still in production, yes.
> --
> \\ Give a man a fish, and you feed him for a day. \\  Mike Smith
> \\ Tell him he should learn how to fish himself,  \\  msmith@freebsd.org
> \\ and he'll hate you for a lifetime.             \\  msmith@cdrom.com



To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message



