From: Daniela <dgw@liwest.at>
To: Robert Watson <rwatson@freebsd.org>
Date: Mon, 9 Jun 2003 01:21:04 +0000
User-Agent: KMail/1.5.1
References: <Pine.NEB.3.96L.1030605154904.54608C-100000@fledge.watson.org>
In-Reply-To: <Pine.NEB.3.96L.1030605154904.54608C-100000@fledge.watson.org>
Message-Id: <200306090121.04733.dgw@liwest.at>
cc: stable@freebsd.org
Subject: Re: Server overloaded? Or is it a bug?

On Thursday 05 June 2003 20:19, Robert Watson wrote:
> Sockets are used only for locally terminated connections, and come out of
> a separate memory pool from packet buffers (well, it's a little more
> complicated than that, but that's enough to get the picture).  The reason
> I wondered about this was that one of the classes of possible memory
> starvation is to reach the allocation limit on sockets.  We allocate the
> socket (and TCP state) a couple of packets into the TCP setup, so if the
> TCP setup got partway completed and then there was no further response,
> we'd have a possible explanation.
>
> Since the connection completes, it's probably safe to assume the TCP state
> and socket were fully allocated, and the socket was returned by the kernel
> to the application, or at least, the kernel got pretty much to the point
> of returning it to the application.

I'm almost sure that the socket was returned. It hung right after I pressed
Enter at the SSH password prompt. If I understand it correctly, the connection
must be established to get to this point.
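
A small sketch of why reaching the password prompt matters, assuming a plain
loopback TCP listener (the function name is made up for illustration and has
nothing to do with sshd itself). The kernel can finish the TCP handshake
before the application ever calls accept(), so a completed connect alone
would not prove the socket was handed over. A password prompt does, because
the daemon must have accepted the connection and exchanged data by then.

```python
import socket

def connect_without_accept():
    """Return True if a client can connect while the server never accepts."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a port
        server.listen(1)                # handshakes are completed by the kernel
        port = server.getsockname()[1]
        client.settimeout(2.0)
        try:
            client.connect(("127.0.0.1", port))
            return True                 # established, yet accept() never ran
        except OSError:
            return False
    finally:
        client.close()
        server.close()

if __name__ == "__main__":
    print(connect_without_accept())
```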

> Try using "slogin -v" or "ssh -v" on the client, and paste the results
> into an e-mail in response to this one.  The SSH daemon does a lot of work
> to set up a new connection -- it forks a process or two, does name
> lookups, allocates pseudo-terminals, invokes PAM, and all kinds of other
> things.  There are failure modes for each of these, and a bit more detail
> might let us track it down.  Particularly useful might be the results of
> "slogin -v" both when the machine is operating normally, and when it's
> hosed.  This will let us figure out about when during the process
> something failed, and what it might have been doing.

I couldn't try ssh -v. I was on a Windoze machine where I only had an awful
graphical SSH client.
I guess it hung when it tried to fork or read the password file.

> > >     If you can get partway through the banner but hang later, that
> > > might be the result of a file system deadlock of some sort.
> >
> > This is also possible, but what could have caused it? My file I/O is not
> > really heavy.
>
> Deadlock is a bit of a misnomer for what I have in mind.  There are two
> classes of things that look like deadlocks: lock order problems, and lock
> leaks.
>
> Lock order problems are real deadlocks, where you grab locks in the wrong
> order -- they tend to occur under high load, since race windows open up
> improving the chances of a problem, as well as increasing the probability
> of it occurring due to a high number of operations.  Common activities that
> increase the chance of a lock order reversal in FreeBSD's VFS include
> simultaneous use of chroot(), quotas, and vnode-backed vn/md devices.
> Quotas and vnodes both violate the lock order (although in ways that
> hardly ever manifest in practice), and chroot() tends to create less common
> lock acquisition orders for applications when running in kernel.  Nullfs is
> also a common cause of problems.  I think most of these are unlikely to be
> the problem in your environment, especially given that you don't have a
> massively high load with tens of thousands of simultaneous processes all
> installing world in chroot()'s on vn-backed file systems with quotas.

I'm not using any of these.
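
For anyone following along, here is a minimal sketch of the lock order
problem described above, written as user-space Python rather than kernel
code. The lock names and both demo functions are hypothetical, and a timeout
stands in for what would be a permanent hang in the kernel.

```python
import threading

def demo_lock_order_reversal(timeout=0.2):
    """Two threads take the same two locks in opposite order."""
    lock_a = threading.Lock()   # hypothetical stand-ins for two kernel locks
    lock_b = threading.Lock()
    barrier = threading.Barrier(2)
    results = {}

    def worker(name, first, second):
        with first:
            barrier.wait()      # both threads now hold their first lock
            # A real deadlock would block forever; the timeout lets us report it.
            got = second.acquire(timeout=timeout)
            results[name] = got
            if got:
                second.release()
            barrier.wait()      # keep holding 'first' until both have tried

    threads = [
        threading.Thread(target=worker, args=("t1", lock_a, lock_b)),
        threading.Thread(target=worker, args=("t2", lock_b, lock_a)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def demo_consistent_order():
    """Same two locks, but every thread takes lock_a before lock_b."""
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    results = {}

    def worker(name):
        with lock_a:
            with lock_b:
                results[name] = True

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("t1", "t2")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(demo_lock_order_reversal())
    print(demo_consistent_order())
```

In the first function neither thread can get its second lock, which is the
reversal; in the second, a single agreed order means both always finish.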

> The second class of problems relates to lock leaks, which occur in unusual
> failure modes.  The implementation neglects to release a lock under some
> scenario, and the result is that no other process can ever acquire the
> lock.  These are relatively rare, but once in a while we bump into one,
> and it's a bit of a pain to debug.  The symptoms are very similar to a
> deadlock, since gradually processes stack up trying to acquire the lock
> while holding other locks, and typically this results in a "race to root",
> in which sets of processes hold pairs of locks down the file hierarchy,
> and eventually the root vnode lock can't be grabbed, so all processes
> doing name lookups from the root hang.  (Ouch).  NFS can also trigger
> races to roots: if an NFS server hangs, NFS client processes may be
> holding a vnode lock when the NFS server ceases to respond.  If processes
> hold multiple locks at a time (such as during lookup), this can also
> result in a race to the root.  There are some changes to -CURRENT
> submitted by Jeff Roberson, which greatly reduce the chances of this
> happening.  Since you're not using NFS, I believe, it's unlikely to relate
> to this.

I have an NFS server (at least I'm trying to set one up).
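
To check that I understand the lock leak case, here is a rough user-space
sketch in Python. The names (vnode_lock, leaky_lookup, safe_lookup, can_lock)
are all made up for illustration and are not FreeBSD kernel interfaces. The
leaky version forgets to release the lock on its error path, so everyone who
comes later waits on it.

```python
import threading

vnode_lock = threading.Lock()   # hypothetical stand-in for a vnode lock

def leaky_lookup(fail):
    """Acquire the lock, but forget to release it on the error path."""
    vnode_lock.acquire()
    if fail:
        return "error"          # BUG: returns with the lock still held
    vnode_lock.release()
    return "ok"

def safe_lookup(fail):
    """Same logic, but the lock is released on every path."""
    with vnode_lock:
        return "error" if fail else "ok"

def can_lock(timeout=0.1):
    """Report whether another caller could still get the lock."""
    got = vnode_lock.acquire(timeout=timeout)
    if got:
        vnode_lock.release()
    return got

if __name__ == "__main__":
    print(leaky_lookup(False), can_lock())   # normal path: lock is free again
    print(leaky_lookup(True), can_lock())    # error path: lock has leaked
    vnode_lock.release()                     # clean up the leaked lock
    print(safe_lookup(True), can_lock())     # error path without the leak
```

If callers hold other locks while waiting on the leaked one, the waiting
spreads to those locks too, which is how I understand the "race to root".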

> Hmm.  That sucks; a serial console is one of the single most useful
> debugging tools available, since it allows you to track the state of the
> system while the GUI is running.  Are you sure you can't? :-)  It can be
> an old IBM XT with a NULL modem cable...

I really have nothing I could use to set up a serial console.

> > I already have debug symbols everywhere. I have already rebooted, and I'm
> > now looking for application core dumps (however, I don't think an
> > application crashed). Maybe I can reproduce it, I still know everything
> > I did.
>
> I think we'll find that it's either a kernel problem, or an X problem
> triggering a kernel problem, so we're unlikely to find useful core dumps
> from applications.  A system core might be useful, but hard to get without
> a serial console.

If the kernel panicked, I should have got a core dump, so we know it did not 
(maybe this information helps).

Could this possibly be a DoS attack? I already had one, and the symptoms were
similar. But this time I had almost no internet traffic (or the attacker had
already stopped by the time I looked).

> Ok, so at the end of this all, here were my pieces of advice on debugging
> it, if you can reproduce it:
>
> (1) Compare "slogin -v" to the system in the before and after scenarios,
>     that may tell us a lot about what's broken.
>
> (2) Despite the fact that you can't set up a serial console, set up a
>     serial console.
>
> :-)
>
> Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> robert@fledge.watson.org      Network Associates Laboratories