From owner-freebsd-stable  Wed Jan  3 16:24:37 2001
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from wantadilla.lemis.com (wantadilla.lemis.com [192.109.197.80])
	by hub.freebsd.org (Postfix) with ESMTP
	id 6B64C37B400; Wed,  3 Jan 2001 16:24:31 -0800 (PST)
Received: by wantadilla.lemis.com (Postfix, from userid 1004)
	id 086CE6A90D; Thu,  4 Jan 2001 10:54:29 +1030 (CST)
Date: Thu, 4 Jan 2001 10:54:28 +1030
From: Greg Lehey <grog@lemis.com>
To: Daniel Lang <dl@leo.org>
Cc: Andy Newman <andy@silverbrook.com.au>,
	Roman Shterenzon <roman@jamus.xpert.com>,
	freebsd-gnats-submit@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: kern/21148: multiple crashes while using vinum
Message-ID: <20010104105428.D4336@wantadilla.lemis.com>
References: <200101012239.f01MdiH40906@freefall.freebsd.org> <20010103145232.B10169@atrbg11.informatik.tu-muenchen.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <20010103145232.B10169@atrbg11.informatik.tu-muenchen.de>; from dl@leo.org on Wed, Jan 03, 2001 at 02:52:35PM +0000
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-418-838-708
WWW-Home-Page: http://www.lemis.com/~grog
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF  13 24 52 F8 6D A4 95 EF
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Wednesday,  3 January 2001 at 14:52:35 +0000, Daniel Lang wrote:
> Dear Greg, Andy, Roman,
>
> grog@FreeBSD.org wrote on Mon, Jan 01, 2001 at 11:41:19PM +0000:
>> Synopsis: multiple crashes while using vinum
> [..]
>> State-Changed-Why:
>> No feedback from submitter.
>>
>> http://www.freebsd.org/cgi/query-pr.cgi?pr=21148
>
> Well, I've sent you stack-traces, with (and alas as well without)
> debugging symbols, I am perfectly aware of your instruction page
> about debugging vinum, and not an ignorant moron, who complains
> without reading. Unfortunately you don't seem to trust me
> or other people in this matter.

As my closing message says, the reason I closed the PR was:

>> No feedback from submitter.

I sent you a message on 10 September 2000 asking for additional
information.  I received none.  There's no reason to get all upset
now, or make claims about my intentions.  This was just a dead PR, and
you've made it clear, both before and now, that you have no intention
of following up on it.  This is not a question of "ignorant morons" or
"trust".

> The reason is, that _some code_ writes into unallocated memory, in
> my case overwriting a data-structure of an ata-request with a few
> zero bytes, causing the panic. The stack trace allows me to trace
> the problem back to this point, but not further. I later experienced
> a similar problem on a scsi-only system.

Yes, this looks very much like the other issues.  But you must
understand that there's nothing I can do without further information.

> The reason why I filed this PR under 'vinum' is that it only
> occurred on boxes using vinum, and was perfectly reproducible via
> simple operations like a 'find /vinum/file/system -print' on a larger
> and moderately filled vinum-filesystem.  Perfectly reproducible means:
> each night, periodic daily caused the panic (traceable to the find
> call in /etc/security, finding files with setuid bits).
>
> As far as I know, the only way to trace this writing into
> unallocated or otherwise-allocated memory (or this buffer overrun)
> would be to set a watchpoint on the overwritten data structure
> within the kernel debugger.

The trouble with that is that this only happens when the system is
very active, and there are thousands of potential buffer headers which
could be trashed.  I do have a trace facility within Vinum, but even
with that it's difficult to figure out what's going on.

> My stack-traces showed that this memory region stays the same on the
> same machine with the same kernel (although I can't tell how
> reliable this is).

If you mean that the same part of the buffer header gets smashed every
time, yes, that is reliably reproducible.  In other words, the
corruption occurs at random, but when it occurs it is always in the
same place.  It may mean that Vinum is doing it, but as far as I can
tell it's always 6 words being zeroed out, and I don't do that
anywhere in Vinum.  The other possibility, which I consider more
likely, is that the data structures accidentally get freed and used by
some other driver (or, possibly, that some other driver freed them
first and then continued using them).  This would explain the observed
correlation with the fxp driver.
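
To illustrate the second possibility, here is a minimal user-space
sketch of how a stale pointer to a freed object can show up as six
zeroed words.  It is not Vinum, fxp, or kernel allocator code: the toy
pool, the slot size, the offset of the zeroed words, and every name in
it are hypothetical and chosen only for the demonstration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_WORDS 16   /* hypothetical object size, in machine words */
#define NSLOTS     4    /* hypothetical number of objects in the pool */

/* Toy fixed-size pool standing in for a kernel allocator. */
static uintptr_t pool[NSLOTS][SLOT_WORDS];
static int in_use[NSLOTS];

/* Hand out the first free slot, or NULL if the pool is exhausted. */
static void *pool_alloc(void)
{
    for (int i = 0; i < NSLOTS; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            return pool[i];
        }
    }
    return NULL;
}

/* Mark a slot free.  The memory itself is left untouched. */
static void pool_free(void *p)
{
    for (int i = 0; i < NSLOTS; i++) {
        if (p == (void *)pool[i]) {
            assert(in_use[i]);      /* catch a double free */
            in_use[i] = 0;
            return;
        }
    }
    assert(!"pointer does not belong to the pool");
}

/* Count how many words of a slot are zero. */
static int count_zero_words(const uintptr_t *w)
{
    int n = 0;
    for (int i = 0; i < SLOT_WORDS; i++)
        if (w[i] == 0)
            n++;
    return n;
}

/* Returns the number of zeroed words the first owner sees afterwards. */
int demo_stale_pointer(void)
{
    /* "Driver A" allocates a header and fills it with live data. */
    uintptr_t *hdr = pool_alloc();
    assert(hdr != NULL);
    for (int i = 0; i < SLOT_WORDS; i++)
        hdr[i] = 0xdeadbeef;

    /* The header is freed too early, but A keeps its pointer. */
    pool_free(hdr);

    /* "Driver B" allocates next and is handed the very same slot. */
    uintptr_t *other = pool_alloc();
    assert(other == hdr);

    /* B clears six words of what it believes is its own structure. */
    memset(other + 2, 0, 6 * sizeof(uintptr_t));

    /* A looks at its header again and finds six words trashed. */
    int zeroed = count_zero_words(hdr);

    pool_free(other);
    return zeroed;
}
```

In this sketch neither "driver" contains code that deliberately zeroes
the other's data, which is why reading the source of either one alone
does not reveal the culprit.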

> My experiences with kernel code and kernel-debugging with
> ddb are very limited. So is my time (I know this applies
> to anyone). Therefore I ceased spending time to set up
> remote-gdb sessions and sending you stack traces trying to be
> helpful, since you obviously didn't seem to be interested.
>
> I further decided not to use vinum any more. We spent some
> cash on a few hardware RAIDs, and the boxes have run smoothly
> ever since.
>
> I am just writing this to state:
>  a) I did respond to your requests, trying to be as helpful as
>     I could.

Well, I sent you a message on 10 September 2000, asking for additional
information.  You didn't send it to me.

>     You could blame me for not knowing, or not being willing to
>     learn, how to set up a ddb/gdb session using watchpoints and
>     waiting for the next crash in an environment that should be
>     productive (and now is).

No, I wouldn't do that.

>  b) I still believe, that there is a problem somewhere in the
>     vinum code (probably within raid5 routines, since a mirror
>     setup worked fine).

Correct.  I have no doubt about it.  But some bugs are difficult to
find, and I need help.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers
