From owner-freebsd-current  Sat Nov  2  3:31:42 2002
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id D629037B401
	for <current@freebsd.org>; Sat,  2 Nov 2002 03:31:38 -0800 (PST)
Received: from flamingo.mail.pas.earthlink.net (flamingo.mail.pas.earthlink.net [207.217.120.232])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 3B18843E97
	for <current@freebsd.org>; Sat,  2 Nov 2002 03:31:38 -0800 (PST)
	(envelope-from tlambert2@mindspring.com)
Received: from pool0019.cvx21-bradley.dialup.earthlink.net ([209.179.192.19] helo=mindspring.com)
	by flamingo.mail.pas.earthlink.net with esmtp (Exim 3.33 #1)
	id 187wV2-0000LE-00; Sat, 02 Nov 2002 03:31:32 -0800
Message-ID: <3DC3B701.58AA03ED@mindspring.com>
Date: Sat, 02 Nov 2002 03:29:05 -0800
From: Terry Lambert <tlambert2@mindspring.com>
X-Mailer: Mozilla 4.79 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Michal Mertl <mime@traveller.cz>
Cc: Bill Fenner <fenner@research.att.com>, current@freebsd.org
Subject: Re: crash with network load (in tcp syncache ?)
References: <Pine.BSF.4.41.0211020937210.87031-100000@prg.traveller.cz>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-current.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-current>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-current>
X-Loop: FreeBSD.ORG

Michal Mertl wrote:
> Do I read you correctly that Bill's patch is probably better than yours
> (I tested both, both fix the problem)?

That's a hard question, and it takes a longer answer.  8-(.

They fix the problem in different ways.  The problem is actually
a secondary effect.  There are several ways to trigger it.  Mine
fixes it by initializing the socket to a valid value on the list,
and Bill's fixes it by initializing it to a valid value off the
list.

Mine will fail under load when the protocol attach fails; the way
it works is that the protocol attach succeeds before the soreserve()
fails, so it's possible to undo the attach, which happens in the
sotryfree().  It's a good fix because it ups the reference count,
and destroys the socket normally (in the caller) on failure.

Bill's won't fail when the protocol attach fails, but it will fail
under other conditions.  For example, if you were to up the amount
of physical RAM in your box, Bill's might start failing, or if you
upped the mbuf allocations by manually tuning them larger, Bill's
would definitely fail when you ran out of mbuf clusters, but not
mbufs.  Both of these failures require you to hit the cookie code
(the SYN-cache load getting too high).

Both of them are poor workarounds for the real problem: some of
the code that the SYN-cache code calls, in order to delay the
allocation of resources until a matching ACK arrives, was never
written to be callable at NETISR, and the allocation occurs in
the wrong order.

Bill's fix is marginally better, because it will handle one more
case than mine (but I believe it will actually leak sockets on the
failure case, when you are at resource starvation).

Both of them are "voodoo": they rely on causing a different side
effect of a side effect.  As voodoo goes, Bill's is marginally
less invisible than mine, so I've suggested that Mark Murray
commit Bill's, instead of mine, but without reading the code,
just seeing either patch, no one would know what the heck the
patch was intended to do, or why it was needed at all... both
of them look like you are gratuitously moving code around for no
good reason.  8-).


> If you still believe there's a problem (bug) I may trigger with some
> setting please tell me. I don't know how to make syncookies kick in - I
> set net.inet.tcp.cachelimit to 100 but it doesn't seem to make a
> difference but I don't know what am I doing :-). I imagine syncache
> doesn't grow much when I'm connecting from single IP and connections are quickly
> established. I'll be able to do some tests on monday - this is a computer
> at work.

The problem is that you've tuned your kernel for more committed
memory than you actually have available... you are overcommitting
socket receive buffers (actually, 16K sockets at the current default
would need a full 4G of physical RAM, if there weren't overcommit).
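
To make the arithmetic concrete, here is a rough sketch of the
worst-case commitment.  The 256K per-socket figure is an assumption:
it is only what the "16K sockets need 4G" numbers above imply when
divided out, not a value read from any kernel source, and the
function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case bytes committed if every socket fills the buffer space
 * reserved for it.  Hypothetical helper, not a kernel function.
 */
static uint64_t
committed_bytes(uint64_t nsockets, uint64_t bytes_per_socket)
{
	/* Refuse inputs whose product would overflow 64 bits. */
	assert(bytes_per_socket == 0 ||
	    nsockets <= UINT64_MAX / bytes_per_socket);
	return nsockets * bytes_per_socket;
}

/* Returns 1 if the worst case exceeds physical memory, else 0. */
static int
is_overcommitted(uint64_t nsockets, uint64_t bytes_per_socket,
    uint64_t physmem)
{
	return committed_bytes(nsockets, bytes_per_socket) > physmem;
}
```

With 16K sockets at an assumed 256K each, committed_bytes() comes to
exactly 4G, so any box with less physical RAM than that is relying on
overcommit.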

The real fix would be to make the code insensitive to allocation
failures at all points in the process.  Like I said before, it would
require passing the address of the 'so' pointer to one of the underlying
functions, so that all the initialization could be done in one place
(the attach routine would be best).  This would change the protocol
interface for all the protocols, so it's a hard change to sell.
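
As a rough illustration of that idea (a userland toy, not the kernel
interface): the attach routine below takes the address of the socket
pointer, does every allocation itself, and on any failure releases
whatever it got and leaves the caller's pointer NULL.  All of the
toy_* names, the struct layouts, and the failure-injection argument
are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins; NOT the kernel's struct socket or struct inpcb. */
struct toy_pcb { int unused; };
struct toy_socket {
	struct toy_pcb *pcb;
	char *rcvbuf;
	size_t rcvbuf_size;
};

/* Hypothetical knob to force a failure at a chosen step. */
enum toy_fail { FAIL_NONE, FAIL_SOCKET, FAIL_PCB, FAIL_RESERVE };

static void *
toy_alloc(size_t n, int fail)
{
	return fail ? NULL : calloc(1, n);
}

/*
 * All initialization happens here.  On success, *sop is a fully set
 * up socket and 0 is returned.  On failure, everything allocated so
 * far is freed, *sop is NULL, and -1 is returned.
 */
static int
toy_attach(struct toy_socket **sop, size_t bufsize, enum toy_fail inject)
{
	struct toy_socket *so;

	assert(sop != NULL);
	*sop = NULL;
	so = toy_alloc(sizeof(*so), inject == FAIL_SOCKET);
	if (so == NULL)
		return -1;
	so->pcb = toy_alloc(sizeof(*so->pcb), inject == FAIL_PCB);
	if (so->pcb == NULL)
		goto fail;
	so->rcvbuf = toy_alloc(bufsize, inject == FAIL_RESERVE);
	if (so->rcvbuf == NULL)
		goto fail;
	so->rcvbuf_size = bufsize;
	*sop = so;
	return 0;
fail:
	free(so->rcvbuf);	/* free(NULL) is a no-op */
	free(so->pcb);
	free(so);
	return -1;
}

/* Releases a socket built by toy_attach() and clears the pointer. */
static void
toy_detach(struct toy_socket **sop)
{
	assert(sop != NULL);
	if (*sop == NULL)
		return;
	free((*sop)->rcvbuf);
	free((*sop)->pcb);
	free(*sop);
	*sop = NULL;
}
```

The point of the shape is that the caller never sees a half-built
socket: it either gets a complete one or a NULL pointer and an error,
no matter which allocation failed.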


If you want to cause your kernel to freak, even with Bill's patch,
in your kernel config file, increase the number of mbufs, but not
the number of mbuf clusters (manually tune up the number of mbufs).
This is a boundary failure, and it's possible to cause it to happen
anyway, just by adding RAM, now that Matt Dillon's auto-tuning code
has gone in (the ratio of increase for more RAM is not 1:1 for these
resources).

If you want to see it die slowly, run it at high load; you should
see from "vmstat -m" that, for every allocation failure on an
incoming connection, you leak a SYN cache entry and an associated
socket structure.  Eventually, your box will lock up, but you may
have to run a week or more to get it to do that, unless you have a
gigabit NIC, and can keep it saturated with connect requests (then
it should lock up in about 36 hours).  With my patch, instead of
locking up, it panics again (I guess that's a plus, if you prefer
it to reboot and start working again, and don't have a status
monitor daemon on another box that can force the reboot).

If you want it to panic with my patch, tune the number of maxfiles
way, way up.  When the in_pcballoc() fails in tcp_attach, then it
will blow up (say around 40,000 connections or so).  If you try this,
remember that the sysctl for maxfiles is useless for networking
connections: you have to tune in the boot loader, or in the kernel
config for the tuning to have the correct effect on the number of
network connections.

Actually, if you look at the tcp_attach() code in tcp_usrreq.c,
you'll see that it also calls soreserve(); probably, the soreserve()
in the sonewconn() code is plain bogus, but I didn't want to remove
it in the 0/0 case for high/low watermarks on the socket (e.g. the
buffer size being set as "inherited" from the listen socket), where
sonewconn() calls it anyway (this is one of Bill's leaks; actually,
it doesn't belong to Bill, it belongs to the FreeBSD code, and
Bill's fix just keeps the box from panicking for long enough that
the leak can happen).


What it basically boils down to is that you could avoid the panic
without any code changes, just by setting some better administrative
limits, so that you never hit the case where the panic happens (the
panic happens when you attempt to allocate more resources than you
have, and it's just as good to refrain from attempting an allocation
when there is no memory left, as it is to fix the code exploding when
an allocation fails).
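
A minimal sketch of what "refrain from attempting the allocation"
looks like, again as a userland toy: a budget with an administrative
cap that is checked before anything is allocated.  The names and the
idea of a single counter are assumptions for illustration; the real
limits are spread over several tunables.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical budget; the invariant is in_use <= limit. */
struct toy_budget {
	size_t limit;	/* cap, set below what memory can actually back */
	size_t in_use;
};

/*
 * Returns 1 and charges the budget if the request fits; returns 0
 * and changes nothing if it does not, so no allocation is attempted.
 */
static int
toy_admit(struct toy_budget *b, size_t want)
{
	assert(b != NULL && b->in_use <= b->limit);
	if (want > b->limit - b->in_use)
		return 0;
	b->in_use += want;
	return 1;
}

/* Gives back a previously admitted amount. */
static void
toy_release(struct toy_budget *b, size_t amount)
{
	assert(b != NULL && amount <= b->in_use);
	b->in_use -= amount;
}
```

A request that is refused here costs nothing and leaves nothing to
unwind, which is why a sane cap sidesteps the failure paths entirely.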


> FWIW netstat -m during the benchmark run shows (I read it that it doesn't
> have problem - even just before the crash):

[ ... ]

The actual number that's most useful is the number of refused
allocation requests, which was non-zero in your original post...
I seem to remember 380, but I didn't pay attention, and it was
a while ago -- that it was non-zero is what's important (it's
zero in this one, for other reasons, having to do with my patch;
the stats don't count pcballoc failures, which is what my patch
causes to happen, instead).

In any case, you are attempting to use more resources than are
available.  Basically, you're hitting what's
called "livelock".  The best way to handle it is actually to
reduce the processing latency, so that your resources are not
tied up in doing processing.  You can't really do that, with the
SYN-cache code running at NETISR, instead of as a result of the
interrupt or poll event.  This is why NETISR (or interrupt threads)
are actually not a very good idea, when you are hitting high load
conditions.

If you are worried, run with both patches; they overlap a little
(around your original problem), but neither of them fixes the
underlying problem, which can squish out through the cracks in
about 7 places.

If you want a perfect fix, someone is going to have to step in and
refactor the code, so that it does not depend on the context in which
the operation is taking place (process vs. NETISR vs. interrupt).

-- Terry
