From: Terry Lambert
Date: Sat, 02 Nov 2002 03:29:05 -0800
To: Michal Mertl
Cc: Bill Fenner, current@freebsd.org
Subject: Re: crash with network load (in tcp syncache ?)
Message-ID: <3DC3B701.58AA03ED@mindspring.com>

Michal Mertl wrote:
> Do I read you correctly that Bill's patch is probably better than yours
> (I tested both, both fix the problem)?

That's a hard question, and it takes a longer answer. 8-(.

They fix the problem different ways. The problem is actually a secondary effect. There are several ways to trigger it. Mine fixes it by initializing the socket to a valid value on the list, and Bill's fixes it by initializing it to a valid value off the list.

Mine will fail under load when the protocol attach fails; the way it works is that the protocol attach succeeds before the soreserve() fails, so it's possible to undo the attach, which happens in the sotryfree().
It's a good fix because it ups the reference count, and destroys the socket normally (in the caller) on failure.

Bill's won't fail when the protocol attach fails, but it will fail under other conditions. For example, if you were to up the amount of physical RAM in your box, Bill's might start failing, or if you upped the mbuf allocations by manually tuning them larger, Bill's would definitely fail when you ran out of mbuf clusters, but not mbufs.

Both of these failures require you to hit the cookie code (the SYN-cache load getting too high).

Both of them are poor workarounds for a problem, which is really that some of the code that's being called by the SYN-cache code to do delayed allocation of resources until a matching ACK, was never written to be callable at NETISR, and the allocation occurs in the wrong order.

Bill's fix is marginally better, because it will handle one more case than mine (but I believe it will actually leak sockets on the failure case, when you are at resource starvation).

Both of them are "voodoo": they rely on causing a different side effect of a side effect. As voodoo goes, Bill's is marginally less invisible than mine, so I've suggested that Mark Murray commit Bill's, instead of mine, but without reading the code, just seeing either patch, no one would know what the heck the patch was intended to do, or why it was needed at all... both of them look like you are gratuitously moving code around for no good reason. 8-).

> If you still believe there's a problem (bug) I may trigger with some
> setting please tell me. I don't know how to make syncookies kick in - I
> set net.inet.tcp.cachelimit to 100 but it doesn't seem to make a
> difference but I don't know what I am doing :-). I imagine syncache
> doesn't grow much when I'm connecting from single IP and connections are quickly
> established. I'll be able to do some tests on monday - this is a computer
> at work.

The problem is that you've tuned your kernel for more committed memory than you actually have available... you are overcommitting socket receive buffers (actually, 16K sockets at the current default would need a full 4G of physical RAM, if there weren't overcommit).

The real fix would be to make the code insensitive to allocation failures at all points in the process. Like I said before, it would require passing the address of the 'so' pointer to one of the underlying functions, so that all the initialization could be done in one place (the attach routine would be best). This would change the protocol interface for all the protocols, so it's a hard change to sell.

If you want to cause your kernel to freak, even with Bill's patch, in your kernel config file, increase the number of mbufs, but not the number of mbuf clusters (manually tune up the number of mbufs). This is a boundary failure, and it's possible to cause it to happen anyway, just by adding RAM, now that Matt Dillon's auto-tuning code has gone in (the ratio of increase for more RAM is not 1:1 for these resources).

If you want to see it die slowly, run it at high load; you should see from "vmstat -m" that, for every allocation failure on an incoming connection, you leak a SYN cache entry and an associated socket structure. Eventually, your box will lock up, but you may have to run a week or more to get it to do that, unless you have a gigabit NIC, and can keep it saturated with connect requests (then it should lock up in about 36 hours).

With my patch, instead of locking up, it panic's again (I guess that's a plus, if you prefer it to reboot and start working again, and don't have a status monitor daemon on another box that can force the reboot).

If you want it to panic with my patch, tune the number of maxfiles way, way up. When the in_pcballoc() fails in tcp_attach, then it will blow up (say around 40,000 connections or so).
If you try this, remember that the sysctl for maxfiles is useless for networking connections: you have to tune in the boot loader, or in the kernel config for the tuning to have the correct effect on the number of network connections.

Actually, if you look at the tcp_attach() code in tcp_usrreq.c, you'll see that it also calls soreserve(); probably, the soreserve() in the sonewconn() code is plain bogus, but I didn't want to remove it in the 0/0 case for high/low watermarks on the socket (e.g. set of the buffer size as "inherited" from the listen socket), where the sonewconn() calls it anyway (this is one of Bill's leaks; actually, it doesn't belong to Bill, it belongs to the FreeBSD code, and Bill's fix keeps it from panic'ing enough to happen).

What it basically boils down to is that you could avoid the panic without any code changes, just by setting some better administrative limits, so that you never hit the case where the panic happens (the panic happens when you attempt to allocate more resources than you have, and it's just as good to refrain from attempting an allocation when there is no memory left, as it is to fix the code exploding when an allocation fails).

> FWIW netstat -m during the benchmark run shows (I read it that it doesn't
> have problem - even just before the crash):

[ ... ]

The actual number that's most useful is the number of refused allocation requests, which was non-zero in your original post... I seem to remember 380, but I didn't pay attention, and it was a while ago -- that it was non-zero is what's important (it's zero in this one, for other reasons, having to do with my patch; the stats don't count pcballoc failures, which is what my patch causes to happen, instead).

In any case, the number of resources attempted to be used is in excess of those available. Basically, you're hitting what's called "livelock".
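
To put numbers on the administrative-limits point: the message's own figures (16K sockets needing a full 4G) imply a worst-case reservation of about 256 KiB per socket. That per-socket figure is inferred from those two numbers, not checked against the kernel source, and the helper names below are hypothetical. A minimal sketch of the arithmetic in C:

```c
#include <assert.h>
#include <stdint.h>

#define KIB UINT64_C(1024)
#define GIB (KIB * KIB * KIB)

/*
 * Worst-case bytes committed if every socket fills its reservation.
 * per_socket_bytes is an assumed figure supplied by the caller.
 */
static uint64_t committed_bytes(uint64_t nsockets, uint64_t per_socket_bytes)
{
    return nsockets * per_socket_bytes;
}

/*
 * Largest socket count whose worst case still fits in budget_bytes,
 * i.e. a limit at which the allocation is never attempted without
 * memory to back it.
 */
static uint64_t max_sockets(uint64_t budget_bytes, uint64_t per_socket_bytes)
{
    assert(per_socket_bytes > 0);
    return budget_bytes / per_socket_bytes;
}
```

Under these assumptions, committed_bytes(16 * KIB, 256 * KIB) comes to 4 * GIB, and a box that can spare 1 GiB for socket buffers (a made-up budget) would want a limit of max_sockets(1 * GIB, 256 * KIB), which is 4096 sockets. How that limit maps onto real tunables is not covered here.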
The best way to handle livelock is actually to reduce the processing latency, so that your resources are not tied up in doing processing. You can't really do that, with the SYN-cache code running at NETISR, instead of as a result of the interrupt or poll event. This is why NETISR (or interrupt threads) are actually not a very good idea, when you are hitting high load conditions.

If you are worried, run with both patches; they overlap a little (around your original problem), but neither of them fixes the underlying problem, which can squish out the cracks in about 7 places.

If you want a perfect fix, someone is going to have to step in and refactor the code, with the idea of not having a context in which the operation is taking place (process vs. NETISR vs. interrupt).

-- Terry
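
For illustration only, here is a user-space sketch of the "real fix" idea from the message: pass the address of the socket pointer into a single attach routine, do every allocation there, and unwind in reverse order on any failure, so the caller sees either a complete socket or NULL. All names (toy_sock, toy_attach, the fault-injection counters) are hypothetical stand-ins; this is not the FreeBSD protocol interface and not either patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are FreeBSD kernel types. */
struct toy_pcb { int unused; };
struct toy_buf { size_t hiwat; };
struct toy_sock {
    struct toy_pcb *pcb;
    struct toy_buf *rcv;
};

/* Fault injection: fail the Nth allocation (1-based); 0 means never fail. */
static int fail_at;
static int alloc_count;
static int live_objects;    /* allocations not yet freed */

static void *toy_alloc(size_t size)
{
    void *p;

    alloc_count++;
    if (fail_at != 0 && alloc_count == fail_at)
        return NULL;
    p = calloc(1, size);
    if (p != NULL)
        live_objects++;
    return p;
}

static void toy_free(void *p)
{
    if (p != NULL) {
        live_objects--;
        free(p);
    }
}

/*
 * Allocate and initialize everything in one place. On success *sop points
 * at a fully built socket; on failure *sop is NULL and nothing is leaked.
 */
static int toy_attach(struct toy_sock **sop, size_t rcv_hiwat)
{
    struct toy_sock *so;

    assert(sop != NULL);
    *sop = NULL;

    so = toy_alloc(sizeof(*so));
    if (so == NULL)
        goto fail;
    so->pcb = toy_alloc(sizeof(*so->pcb));
    if (so->pcb == NULL)
        goto fail_sock;
    so->rcv = toy_alloc(sizeof(*so->rcv));
    if (so->rcv == NULL)
        goto fail_pcb;
    so->rcv->hiwat = rcv_hiwat;

    *sop = so;
    return 0;

fail_pcb:
    toy_free(so->pcb);
fail_sock:
    toy_free(so);
fail:
    return -1;
}

/* Tear down in reverse order of construction and clear the caller's pointer. */
static void toy_detach(struct toy_sock **sop)
{
    struct toy_sock *so = *sop;

    if (so == NULL)
        return;
    toy_free(so->rcv);
    toy_free(so->pcb);
    toy_free(so);
    *sop = NULL;
}
```

Setting fail_at to 1, 2, or 3 (and resetting alloc_count to 0) before calling toy_attach() exercises each failure point; in every case the caller's pointer should come back NULL and live_objects should return to 0. Because the routine owns the caller's pointer, no half-initialized socket is ever visible to the caller.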