Date: Tue, 10 Jun 2003 15:25:46 +0000
From: Daniela <dgw@liwest.at>
To: Robert Watson <rwatson@freebsd.org>
Cc: stable@freebsd.org
Subject: Re: Server overloaded? Or is it a bug?
Message-ID: <200306101525.46714.dgw@liwest.at>
In-Reply-To: <Pine.NEB.3.96L.1030605154904.54608C-100000@fledge.watson.org>
References: <Pine.NEB.3.96L.1030605154904.54608C-100000@fledge.watson.org>

On Thursday 05 June 2003 20:19, Robert Watson wrote:
> So this tells us that interrupt delivery appears to be working fine for
> your NIC, that the network stack isn't completely hosed, and can allocate
> packet buffers (mbufs), so isn't memory-starved at that level of the
> system.

> Sockets are used only for locally terminated connections, and come out of
> a separate memory pool from packet buffers (well, it's a little more
> complicated than that, but that's enough to get the picture). The reason
> I wondered about this was that one of the classes of possible memory
> starvation is to reach the allocation limit on sockets. We allocate the
> socket (and TCP state) a couple of packets into the TCP setup, so if the
> TCP setup got partway completed and then there was no further response,
> we'd have a possible explanation.
>
> Since the connection completes, it's probably safe to assume the TCP state
> and socket were fully allocated, and the socket was returned by the kernel
> to the application, or at least, the kernel got pretty much to the point
> of returning it to the application.

> Try using "slogin -v" or "ssh -v" on the client, and paste the results
> into an e-mail in response to this one. The SSH daemon does a lot of work
> to set up a new connection -- it forks a process or two, does name
> lookups, allocates pseudo-terminals, invokes PAM, and all kinds of other
> things. There are failure modes for each of these, and a bit more detail
> might let us track it down. Particularly useful might be the results of
> "slogin -v" both when the machine is operating normally, and when it's
> hosed. This will let us figure out about when during the process
> something failed, and what it might have been doing.
>
> > > If you can get partway through the banner but hang later, that
> > > might be the result of a file system deadlock of some sort.
> >
> > This is also possible, but what could have caused it? My file I/O is not
> > really heavy.
>
> Deadlock is a bit of a misnomer for what I have in mind. There are two
> classes of things that look like deadlocks: lock order problems, and lock
> leaks.
...
> So the VFS deadlock is somewhat of a shot in the dark, but it has pretty
> easy to identify symptoms, especially if you can get to a debugger.
> They're also fairly easy to analyze.
...
> I think we'll find that it's either a kernel problem, or an X problem
> triggering a kernel problem, so we're unlikely to find useful core dumps
> from applications. A system core might be useful, but hard to get without
> a serial console.
>
> Ok, so at the end of this all, here were my pieces of advice on debugging
> it, if you can reproduce it:
>
> (1) Compare "slogin -v" to the system in the before and after scenarios,
> that may tell us a lot about what's broken.
>
> (2) Despite the fact that you can't set up a serial console, set up a
> serial console.
...

Some strange things have happened over the past few days, all of them
related to processes:

(1) I have some zombies I cannot kill:

# ps ax
...
53410 pn Z 0:00.00 (kate)
...
# kill -9 53410
53410: No such process

The same thing happens with make.

(2) When I invoke the KDE System Guard, the process list won't show up.

(3) My processes receive a lot of signals (10 and 11), about 30 times a day.

(4) Kate crashed when I wanted to save a document, and then every time I
opened it. So I tried gdb kate:

(gdb) run
Starting program: /usr/local/bin/kate
Deprecated bfd_read called at /usr/src/gnu/usr.bin/binutils/gdb/../../../../contrib/gdb/gdb/dbxread.c line 2627 in elfstab_build_psymtabs
Deprecated bfd_read called at /usr/src/gnu/usr.bin/binutils/gdb/../../../../contrib/gdb/gdb/dbxread.c line 933 in fill_symbuf
ERROR: Communication problem with kate, it probably crashed.

Program exited with code 0377.

As I have never had problems like these before, I guess they are a side
effect of the crash. Do we have a chance to debug this, or should I rebuild
my system?
And, most important, could this be a new kernel bug? If so, I would really
like to debug it.

Daniela