From owner-freebsd-stable  Wed May 12  2: 8:59 1999
Delivered-To: freebsd-stable@freebsd.org
Received: from lazlo.internal.steam.com (lazlo.steam.com [199.108.84.37])
	by hub.freebsd.org (Postfix) with ESMTP
	id 9DD4614DD7; Wed, 12 May 1999 02:08:55 -0700 (PDT)
	(envelope-from cliff@steam.com)
Received: from lazlo.internal.steam.com (cliff@lazlo.internal.steam.com [192.168.32.2])
	by lazlo.internal.steam.com (8.9.3/8.9.3) with ESMTP id CAA02414;
	Wed, 12 May 1999 02:09:01 -0700 (PDT)
Date: Wed, 12 May 1999 02:09:00 -0700 (PDT)
From: Cliff Skolnick <cliff@steam.com>
X-Sender: cliff@lazlo.internal.steam.com
To: David Greenman <dg@root.com>
Cc: Mike Tancsa <mike@sentex.net>, freebsd-stable@FreeBSD.ORG,
	luoqi@FreeBSD.ORG, Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: vm_fault deadlock and PR 8416 ... NOT fixed! 
In-Reply-To: <199905120755.AAA01361@implode.root.com>
Message-ID: <Pine.BSF.4.10.9905120057500.614-100000@lazlo.internal.steam.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Wed, 12 May 1999, David Greenman wrote:

> >Well a few minutes ago my system went into deadlock - and this is with the
> >kern_lock.c dated 5/11.  This patch is different than the one in 8416 that
> >solved my problem before.  I'd say the problem is still there.
> 
>    Time is very short for getting this fixed before the release deadline. I
> think Luoqi's patch that was in the PR was susceptible to a priority inversion
> problem and has risks associated with using it. The fix that Matt Dillon
> made for -current that I back-ported to -stable was an attempt to fix the
> problem while minimizing the side effects. If it doesn't fix the problem
> then we'll proceed with plan B which is probably to just go with Luoqi's
> fix or to possibly troubleshoot Matt's fix (but as I said, time is short).

I'll do whatever I can.  I've been looking at the code and am thinking
about some debugging strategies to figure out what is happening.
Unfortunately this code is complex with regard to potential interactions, so
I am going through it quite slowly.

I really can't afford to back out the patch at this point, but I can add any
code to my kernel to gather debugging info for people working on this.  It
would probably be better for you to tell me what you want to know than for
me to guess.  I have lots of disk space, so super verbose stuff is OK.

My system is a PII 350 with 128MB of memory and 512MB swap, 3 SCSI buses (a
2940 and a 3940), an Intel EtherExpress 10/100, and a zynx 4 port 10/100
card.  This problem started at the beginning of this month after an
installworld and kernel update.  This system had been running 3.1 since
3.1-release and 2.2.x since mid summer without a single crash, well except
for that bad SCSI cable in September which was my fault.

The problem does not occur until there is some paging activity.  It also
usually seems to happen when users with large mailboxes are writing to their
mail spools.  Usually there are 2-6 people reading mail on the machine at
any given time, 15 active users in total.  20% of the users use PINE
accessing the spool directly, 20% use pine but access the spool via IMAP,
and 60% use IMAP only from other systems (at times also with a shell open on
the system but not reading mail).  It seems the first two pine categories
are much more likely to cause the problem, but those users also are the ones
with bigger spool files.

The last deadlock happened while I was running screen and pine and compiling
egcs.  There was one other user logged into the system with an idle shell
open; that user was also reading mail via IMAP from netscape.  This is not
really a heavy load for this system, but with 128MB it will probably start
to use swap at this point.

The same machine runs:

  samba, active mounts - no transfers in progress
  apache, no requests being served
  DNS, constant traffic, but not high volume
  sendmail, was attempting a few local and about 10 remote deliveries at
    the time.
  IMAP had two connections active, my pine session and the netscape session
    on a remote machine
  No X server is ever run on this machine; ssh, telnet and an occasional
    console login are how this machine is used.
  Routing, it's the router between a few very low use LANs and the
    firewall router.

> 
> >Once again my server is useless, deadlocked.  No panic, responding to pings,
> >no ability to do disk I/O or any VM related stuff.
> >
> >An unhappy freebsd user once again,
> 
>    Is this really necessary? It sure doesn't help the debugging process.

This was not a flame, but an expression of my frustration.  Necessary, no.
True, yes.  I make my living helping other people with their systems and
figuring out solutions.  I've recommended FreeBSD a hell of a lot; it is my
#1 choice due to its stability.  I used to use Linux, but stopped
recommending Linux for anything except a firewall almost a year ago after
bad luck with most any user level process running for a long period of time,
especially named.

When I install FreeBSD for or at a client site and it crashes, it is I who
looks bad.  I know there will be more mail in my mailbox about people that
can't get access to their mail and web sites that are down.  I think it is
perfectly reasonable to be unhappy, and to express that in a limited manner.

Anyways, it's late and I hope this post made sense.  My writing skills
decrease rapidly when I'm falling asleep at the keyboard.

Cliff

--
Cliff Skolnick          | "They that can give up essential liberty to obtain
Steam Tunnel Operations |  a little temporary safety deserve neither liberty
cliff@steam.com         |  nor safety."
http://www.steam.com/   |                   -- Benjamin Franklin, 1759

