Date:      Fri, 27 Jun 2008 09:14:56 +0100 (BST)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        Ali Niknam <freebsd-net@transip.nl>
Cc:        net@freebsd.org
Subject:   Re: FreeBSD 7.0: sockets stuck in CLOSED state...
Message-ID:  <20080627090939.M78484@fledge.watson.org>
In-Reply-To: <20080626081831.V96707@fledge.watson.org>
References:  <486283B0.3060805@transip.nl> <20080625195523.N29013@fledge.watson.org> <4862BCF5.4070900@transip.nl> <20080626081831.V96707@fledge.watson.org>

On Thu, 26 Jun 2008, Robert Watson wrote:

> I think the first logical step is to wait for the application to get into 
> that state again, and then run procstat or fstat to dump the file descriptor 
> array for the process.  Presumably in the normal steady state, you expect to 
> see a few IPC sockets (syslog, etc), a TCP listen socket, and some number of 
> in-progress TCP sessions.  The question, of course, is whether you see a lot 
> more file descriptors than that, and in particular, ones that match the 
> CLOSED entries in netstat.  If you find that there are lots of open file 
> descriptors and they match up approximately with netstat, then it's an 
> application bug that just manifests a bit differently in 7.x than in 6.x. 
> On the other hand, if you see only a small number of open file descriptors, 
> then we may be looking at something quite a bit more complicated.

Just a public followup for those following the thread: Ali has sent me netstat 
and sockstat/fstat data.  It looks like each of the TCP connections that 
appears in the netstat output in the CLOSED state also appears in sockstat 
with a file descriptor.

This suggests a bug in which file descriptors are occasionally leaked, 
perhaps early in their life cycle as there's a bit of data in the input 
buffer.  However, it's still unclear whether it's an application bug 
(occasionally missing a close() on an accepted file descriptor) or a kernel 
bug (accept() or close() misbehaving such that the application doesn't know 
the file descriptor is open, or has tried to close it but not succeeded). 
Ali mentioned in his 
e-mail that he was seeing EBADF on occasion from close(), which could mean a 
bug is causing the wrong file descriptor number to be passed in.  If there's a 
kernel bug involved, then you could imagine it being along the lines of 
"accept(2) returns a file descriptor but also sets an error, so the 
application simply sees the error but the file descriptor remains installed in 
the process's file descriptor table", leading to the appearance of a leak.
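
One way to make the EBADF case visible is to have the application report 
every close() failure instead of ignoring the return value.  A minimal 
sketch in Python, assuming nothing about Ali's application; the helper name 
checked_close is hypothetical:

```python
import errno
import os
import sys


def checked_close(fd, log=sys.stderr):
    """Close fd and report any failure instead of silently ignoring it.

    Returns 0 on success, or the errno value if close() failed.
    """
    try:
        os.close(fd)
    except OSError as e:
        # EBADF here means fd was not an open descriptor in this process,
        # e.g. the wrong number was passed in or it was already closed.
        name = errno.errorcode.get(e.errno, "unknown")
        print(f"close({fd}) failed: {name}", file=log)
        return e.errno
    return 0
```

Closing the same descriptor twice with this helper returns 0 the first time 
and errno.EBADF the second, which is the symptom described above.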

I've asked Ali to do a bit more debugging and tracing of the application to 
see if we can reach any conclusions about this.  In particular, if he traces 
to a file all file descriptor numbers returned by accept(2), then we can later 
compare that file with the leaked descriptors present in netstat/sockstat and 
decide whether the application *should* have known they were open or not.
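
The tracing could be sketched roughly as follows, in Python for brevity.  
The names traced_accept and classify_leaks are hypothetical, and the leaked 
descriptor numbers are assumed to have been copied by hand from the sockstat 
output.  Descriptor numbers are reused after close(), so a match only shows 
that the application saw that number at some point.

```python
def traced_accept(listener, trace):
    """Call accept() on listener and append the new descriptor number to trace."""
    conn, addr = listener.accept()
    trace.write(f"{conn.fileno()}\n")
    trace.flush()  # push the entry out immediately so a crash does not lose it
    return conn, addr


def classify_leaks(trace_lines, leaked_fds):
    """Split leaked descriptors into (seen by accept, never seen by accept)."""
    seen = {int(line) for line in trace_lines if line.strip()}
    known = sorted(fd for fd in leaked_fds if fd in seen)
    unknown = sorted(fd for fd in leaked_fds if fd not in seen)
    return known, unknown
```

Descriptors in the first list are ones the application was told about, which 
points at a missing close(); descriptors in the second list were never 
returned to the application, which would point at the kernel.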

I also spotted a bug in the netstat/sockstat output, unrelated, in which the 
port number of the inpcb is cleared when the connection closes, meaning that 
netstat shows '*' as the port number.  Clearing it isn't really necessary, 
and it does lead to potentially confusing output.

Robert N M Watson
Computer Laboratory
University of Cambridge


