From owner-freebsd-hackers  Thu Feb 25 11:35:47 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from arjun.niksun.com (gw.niksun.com [206.20.52.122])
	by hub.freebsd.org (Postfix) with ESMTP id 6A7DA14DA3
	for <freebsd-hackers@freebsd.org>; Thu, 25 Feb 1999 11:35:41 -0800 (PST)
	(envelope-from ath@niksun.com)
Received: from stiegl.niksun.com (stiegl.niksun.com [10.0.0.44])
	by arjun.niksun.com (8.8.8/8.8.8) with ESMTP id OAA26572
	for <freebsd-hackers@freebsd.org>; Thu, 25 Feb 1999 14:35:25 -0500 (EST)
Received: from stiegl.niksun.com (localhost.niksun.com [127.0.0.1])
	by stiegl.niksun.com (8.8.8/8.8.7) with ESMTP id OAA06361
	for <freebsd-hackers@freebsd.org>; Thu, 25 Feb 1999 14:35:23 -0500 (EST)
	(envelope-from ath@stiegl.niksun.com)
Message-Id: <199902251935.OAA06361@stiegl.niksun.com>
From: Andrew Heybey <ath@niksun.com>
To: freebsd-hackers@freebsd.org
Subject: Advice wanted on tracking down bug (or hw problem?) in 3.1R
Mime-Version: 1.0 (generated by tm-edit 7.108)
Content-Type: text/plain; charset=US-ASCII
Date: Thu, 25 Feb 1999 14:35:23 -0500
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

I have just submitted PR kern/10243, but I thought I would ask for
some advice on hackers as well.

The bug is that under certain loads, read(2) can return corrupted data
(i.e., data that is not in the file on disk).  The instances I have seen
are relatively small amounts (8-64 bytes) of corrupt data at the end
of a 4k page.  The corrupt data comes from a file previously read or
from another position in the current file.  I have also seen this
problem in 3.0-RELEASE but not in 2.2.8-RELEASE.
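
To make the symptom concrete, here is a minimal sketch of how the
corrupt runs could be located, assuming a known-good copy of the file
contents is available for comparison.  The function name corrupt_runs
and the PAGE_SIZE constant are made up for illustration; this is not
one of the programs in the tar file.

```python
PAGE_SIZE = 4096  # assumption: 4k pages, as in the description above


def corrupt_runs(expected: bytes, actual: bytes):
    """Return (offset, length, offset_in_page) for each run of differing bytes.

    `expected` is the known-good data, `actual` is what read(2) returned.
    Only the common prefix of the two buffers is compared.
    """
    runs = []
    start = None
    n = min(len(expected), len(actual))
    for i in range(n):
        if expected[i] != actual[i]:
            if start is None:
                start = i  # a new run of mismatching bytes begins here
        elif start is not None:
            runs.append((start, i - start, start % PAGE_SIZE))
            start = None
    if start is not None:
        # the buffers still differed at the end of the compared range
        runs.append((start, n - start, start % PAGE_SIZE))
    return runs
```

A run whose offset_in_page plus length equals PAGE_SIZE ends exactly on
a page boundary, which is the pattern described above.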

The load under which I see this bug is several programs reading data
from disk combined with a very high network interrupt rate (about 45k
pkts/sec on an fxp interface in promiscuous mode with a bpf listener).
The PR has a longer description of exactly what I am doing to
reproduce the bug.  I put a tar file containing a small set of
programs that I use to generate this load at
http://www.niksun.com/~ath/fbsd_bug.tgz if anyone wants to try to
reproduce this.
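
For anyone who wants a rough idea of the disk-reading half of the load
without fetching the tar file, the following is a sketch only, and not
the code in fbsd_bug.tgz.  It rereads a file several times and reports
the passes on which the contents did not match a reference digest.
The names file_digest and reread_check, the chunk size, and the choice
of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Read the whole file in chunks and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def reread_check(path, expected_digest, passes):
    """Return the pass numbers (from 0) whose digest differed from the reference."""
    return [i for i in range(passes) if file_digest(path) != expected_digest]
```

The reference digest would have to be computed while the machine is
idle (or on a system known to be good), and several copies of the
checker would be run at once while the network load is applied.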

It looks to me like a missing splfoo() call somewhere, but I'm not
sure where to start looking.  CAM? VM? UFS? BPF (though it seems
unlikely that BPF would reach out and mess with data from another
process)?  The bug is extremely load sensitive, so it is difficult to
reproduce the same way every time.  I can almost always make it happen
within 5-10 minutes of testing, but not in exactly the same way.

I have reproduced the bug on two different machines, so I don't think
that the hardware is defective (though the machines have substantially
the same kind of hardware, so it could be a hardware bug of some kind).

I would sure appreciate it if someone with a larger collection of
clues than I would take a look at this or give me some advice as to
where I should start looking.

thanks,
andrew

