From: Daniela
To: Robert Watson
Cc: stable@freebsd.org
Date: Mon, 9 Jun 2003 01:21:04 +0000
Subject: Re: Server overloaded? Or is it a bug?
Message-Id: <200306090121.04733.dgw@liwest.at>

On Thursday 05 June 2003 20:19, Robert Watson wrote:
> Sockets are used only for locally terminated connections, and come out of
> a separate memory pool from packet buffers (well, it's a little more
> complicated than that, but that's enough to get the picture). The reason
> I wondered about this was that one of the classes of possible memory
> starvation is to reach the allocation limit on sockets. We allocate the
> socket (and TCP state) a couple of packets into the TCP setup, so if the
> TCP setup got partway completed and then there was no further response,
> we'd have a possible explanation.
>
> Since the connection completes, it's probably safe to assume the TCP state
> and socket were fully allocated, and the socket was returned by the kernel
> to the application, or at least, the kernel got pretty much to the point
> of returning it to the application.

I'm almost sure that the socket was returned. It hung right after I pressed
Enter at the SSH password prompt. If I understand correctly, the connection
must already be established to get to that point.

> Try using "slogin -v" or "ssh -v" on the client, and paste the results
> into an e-mail in response to this one. The SSH daemon does a lot of work
> to set up a new connection -- it forks a process or two, does name
> lookups, allocates pseudo-terminals, invokes PAM, and all kinds of other
> things. There are failure modes for each of these, and a bit more detail
> might let us track it down. Particularly useful might be the results of
> "slogin -v" both when the machine is operating normally, and when it's
> hosed. This will let us figure out about when during the process
> something failed, and what it might have been doing.

I couldn't try ssh -v. I was on a Windoze machine where I only had an awful
graphical SSH client. My guess is that it hung when it tried to fork or read
the password file.

> > > If you can get partway through the banner but hang later, that
> > > might be the result of a file system deadlock of some sort.
> >
> > This is also possible, but what could have caused it? My file I/O is not
> > really heavy.
>
> Deadlock is a bit of a misnomer for what I have in mind. There are two
> classes of things that look like deadlocks: lock order problems, and lock
> leaks.
>
> Lock order problems are real deadlocks, where you grab locks in the wrong
> order -- they tend to occur under high load, since race windows open up
> improving the chances of a problem, as well as increasing the probability
> of it occurring due to a high number of operations.
> Common activities that increase the chance of a lock order reversal in
> FreeBSD's VFS include simultaneous use of chroot(), quotas, and
> vnode-backed vn/md devices.
> Quotas and vnodes both violate the lock order (although in ways that
> hardly ever manifest in practice), and chroot() tends to create less common
> lock acquisition orders for applications when running in kernel. Nullfs is
> also a common cause of problems. I think most of these are unlikely to be
> the problem in your environment, especially given that you don't have a
> massively high load with tens of thousands of simultaneous processes all
> installing world in chroot()'s on vn-backed file systems with quotas.

I'm not using any of these.

> The second class of problems relates to lock leaks, which occur in unusual
> failure modes. The implementation neglects to release a lock under some
> scenario, and the result is that no other process can ever acquire the
> lock. These are relatively rare, but once in a while we bump into one,
> and it's a bit of a pain to debug. The symptoms are very similar to a
> deadlock, since gradually processes stack up trying to acquire the lock
> while holding other locks, and typically this results in a "race to root",
> in which sets of processes hold pairs of locks down the file hierarchy,
> and eventually the root vnode lock can't be grabbed, so all processes
> doing name lookups from the root hang. (Ouch). NFS can also trigger
> races to roots: if an NFS server hangs, NFS client processes may be
> holding a vnode lock when the NFS server ceases to respond. If processes
> hold multiple locks at a time (such as during lookup), this can also
> result in a race to the root. There are some changes to -CURRENT
> submitted by Jeff Roberson, which greatly reduce the chances of this
> happening. Since you're not using NFS, I believe, it's unlikely to relate
> to this.

I have an NFS server (at least I'm trying to set one up).

> Hmm.
> That sucks; a serial console is one of the single most useful
> debugging tools available, since it allows you to track the state of the
> system while the GUI is running. Are you sure you can't? :-) It can be
> an old IBM XT with a NULL modem cable...

I really have nothing I could use to set up a serial console.

> > I already have debug symbols everywhere. I have already rebooted, and I'm
> > now looking for application core dumps (however, I don't think an
> > application crashed). Maybe I can reproduce it, I still know everything
> > I did.
>
> I think we'll find that it's either a kernel problem, or an X problem
> triggering a kernel problem, so we're unlikely to find useful core dumps
> from applications. A system core might be useful, but hard to get without
> a serial console.

If the kernel had panicked, I should have gotten a core dump, so we know it
did not (maybe this information helps).
Could this possibly be a DoS attack? I've already had one, and the symptoms
were similar. But this time I had almost no internet traffic (or the attacker
had already stopped when I looked).

> Ok, so at the end of this all, here were my pieces of advice on debugging
> it, if you can reproduce it:
>
> (1) Compare "slogin -v" to the system in the before and after scenarios,
>     that may tell us a lot about what's broken.
>
> (2) Despite the fact that you can't set up a serial console, set up a
>     serial console.
>
> :-)
>
> Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> robert@fledge.watson.org      Network Associates Laboratories
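
The lock order problem described in the quoted text (two code paths taking the same pair of locks in opposite orders) can be illustrated with a small toy check. This is only a sketch under stated assumptions: it is not FreeBSD code, the function name `find_order_reversals` is hypothetical, and the lock names are made up for illustration. It takes the order in which each code path acquires its locks and reports any pair that was seen in both orders.

```python
# Hypothetical sketch, not FreeBSD code: lock names and the function
# name are invented purely to illustrate a lock order reversal.

def find_order_reversals(acquisition_sequences):
    """Return the set of lock pairs that were taken in both orders.

    Each sequence lists the locks one code path acquires, in order,
    while still holding the earlier ones.
    """
    seen = set()       # (held, acquired) pairs observed so far
    reversals = set()  # pairs observed in both orders
    for sequence in acquisition_sequences:
        for i, held in enumerate(sequence):
            for acquired in sequence[i + 1:]:
                if (acquired, held) in seen:
                    # Some other path took these two the other way round.
                    reversals.add(tuple(sorted((held, acquired))))
                seen.add((held, acquired))
    return reversals


# Both paths agree on the order, so no reversal is reported.
consistent = [
    ["root_vnode", "dir_vnode"],
    ["root_vnode", "dir_vnode", "file_vnode"],
]

# The second path takes the same two locks in the opposite order,
# which is the precondition for a real deadlock under load.
reversed_paths = [
    ["root_vnode", "dir_vnode"],
    ["dir_vnode", "root_vnode"],
]

print(find_order_reversals(consistent))
print(find_order_reversals(reversed_paths))
```

A reported pair only means a deadlock is possible, not that one happened. That matches the point in the quoted text that such problems tend to show up only under high load, when the race window is actually hit.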