Date:      Wed, 29 Mar 2006 12:05:51 +0000 (GMT)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        current@FreeBSD.org
Cc:        Randall Stewart <rrs@cisco.com>
Subject:   Re: REMINDER: Re: HEADS UP: network stack and socket hackery over the next few weeks
Message-ID:  <20060329115028.C19236@fledge.watson.org>
In-Reply-To: <20060329100839.V19236@fledge.watson.org>
References:  <20060317141627.W2181@fledge.watson.org> <20060329100839.V19236@fledge.watson.org>

On Wed, 29 Mar 2006, Robert Watson wrote:

> As a reminder, April 1 is now three days away.  On April 1, I will be 
> committing an extensive set of socket and netinet changes which will likely 
> render the network stack broken.  I say this with some confidence because I 
> have tested the changes fairly extensively, as have a number of other 
> developers, and they appear to mostly work.  Therefore, they will be broken 
> :-).  I will be posting updated versions of these patches shortly, but 
> unless we run into show-stopper serious instability with them, rather than 
> nits, I will commit them (in their updated form) on April 1 shortly after 
> the netatm build is disabled.
>
> I will post another HEADS UP as the changes go into the tree, and will be 
> monitoring things closely to try and get any bugs that might turn up fixed 
> as quickly as possible.  As an FYI, I will be travelling the weeks of April 
> 6 - April 21, but will be online frequently, and working for several days in 
> the Bay Area during the trip.  Please report bugs relating to this work to 
> current@.

An updated version of the patch is now available for download at:

   http://www.watson.org/~robert/freebsd/netperf/20060329-rwatson_sockref.diff

Earlier versions of the patch may be found in the same directory in similarly 
named files.  The working branch maintaining these changes may be found in 
Perforce at:

   //depot/user/rwatson/sockref/...

As a high level recap, the following classes of changes appear in this patch:

- The socket code no longer relies on reading so_pcb as a hint regarding
   protocol behavior and shutdown.  This eliminates a number of races, and
   means that only the protocol is responsible for reading/maintaining the
   field, and can synchronize it as desired.

- All protocols have been converted to maintain the invariant that so_pcb will
   be non-NULL and point to a valid PCB at all times while the socket is valid.
   Depending on the protocol, this change either removed a number of crashes
   and races, or eliminated heavy-weight locking to maintain the validity of
   so_pcb during use by the socket layer.

- In some cases, this required significant rewriting of state management --
   specifically, for IPX/SPX and TCP/IP.  SPX and TCP now maintain DROPPED
   flags on their inpcb's to reflect the state previously identified through a
   NULL so_pcb pointer.

- Protocols can now explicitly request that a socket not be freed on last
   consumer reference, using the SS_PROTOREF flag, in order that they can
   continue to access the socket buffer until it is no longer required; for
   example, TCP after socket close() but before the final ACKs from the remote
   endpoint for sent data.  sotryfree() is eliminated.  TCP has gained an inpcb
   flag to reflect this condition.

- Improved documentation of kernel socket API calls, which will be followed
   with man pages once things are hammered out a bit more.

- fgetsock() and fputsock() are deprecated, with long-term plans to eliminate
   the use of soref() and sorele() for consumer use.  Consumers now receive a
   reference to a socket using socreate(), and release it using soclose(), in
   order to avoid use of sockets after close.  Consumer reference counts, such
   as file descriptor reference counts, should be used in preference, as this
   offers cleaner behavior at the socket layer, and also avoids additional
   mutex operations.  Some consumers still remain, but have been annotated.

- pru_abort and pru_detach are no longer allowed to fail.  Garbage collection
   of the socket after these, assuming SS_PROTOREF isn't set, is unconditional,
   and not a property of the error value returned.

- Protocols now only call sofree() if they have claimed SS_PROTOREF.  They
   no longer attempt to spontaneously free sockets in numerous situations in
   the hope of not leaking them, since socket teardown is now well-defined.
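
To make the teardown rules above a little more concrete, here is a rough
userspace sketch of the intended life cycle.  It is not the kernel code and
is not taken from the patch: every toy_*/TOY_* name, the struct layouts, and
the flag values are invented for illustration, and all locking is left out.
It only tries to show that detach cannot fail, that close frees the socket
unless the protocol has claimed it, that the protocol frees only what it has
claimed, and that so_pcb stays non-NULL throughout.

```c
/*
 * Userspace toy model of the socket teardown rules listed above.
 * NOT FreeBSD kernel code: every toy_* / TOY_* name, struct layout, and
 * flag value here is invented for illustration, and there is no locking.
 */
#include <assert.h>
#include <stdlib.h>

#define TOY_SS_PROTOREF  0x01  /* protocol has claimed the socket */
#define TOY_PCB_DROPPED  0x01  /* stands in for the old "so_pcb == NULL" test */

struct toy_pcb {
	int flags;
	int unacked;              /* bytes sent but not yet ACKed */
};

struct toy_socket {
	int state;
	struct toy_pcb *pcb;      /* invariant: non-NULL while the socket is valid */
};

static int toy_freed;             /* counts sockets released, for inspection */

/* Consumer obtains its reference at creation time; the PCB is attached here. */
static struct toy_socket *
toy_socreate(int unacked)
{
	struct toy_socket *so = calloc(1, sizeof(*so));
	struct toy_pcb *pcb = calloc(1, sizeof(*pcb));

	if (so == NULL || pcb == NULL) {
		free(so);
		free(pcb);
		return (NULL);
	}
	pcb->unacked = unacked;
	so->pcb = pcb;
	return (so);
}

/* Final release: PCB and socket go away together, so pcb is never NULL. */
static void
toy_sofree(struct toy_socket *so)
{
	free(so->pcb);
	free(so);
	toy_freed++;
}

/* Detach returns void: it cannot fail, it can only claim the socket. */
static void
toy_pru_detach(struct toy_socket *so)
{
	if (so->pcb->unacked > 0)
		so->state |= TOY_SS_PROTOREF;
}

/* Consumer close: free unconditionally unless the protocol claimed it. */
static void
toy_soclose(struct toy_socket *so)
{
	toy_pru_detach(so);
	if ((so->state & TOY_SS_PROTOREF) == 0)
		toy_sofree(so);
}

/* Protocol side: the last ACK arrives, so drop the claim and free. */
static void
toy_final_ack(struct toy_socket *so)
{
	so->pcb->unacked = 0;
	so->pcb->flags |= TOY_PCB_DROPPED;
	if (so->state & TOY_SS_PROTOREF) {
		so->state &= ~TOY_SS_PROTOREF;
		toy_sofree(so);
	}
}

/* Walk both paths; returns the number of sockets freed. */
static int
toy_demo(void)
{
	struct toy_socket *idle = toy_socreate(0);
	struct toy_socket *busy = toy_socreate(100);

	assert(idle != NULL && busy != NULL);
	toy_freed = 0;
	toy_soclose(idle);            /* nothing pending: freed at close */
	assert(toy_freed == 1);
	toy_soclose(busy);            /* data pending: protocol keeps it */
	assert(toy_freed == 1 && busy->pcb != NULL);
	toy_final_ack(busy);          /* protocol releases its claim */
	return (toy_freed);
}
```

Calling toy_demo() from a small main() walks one socket through each path.
The real code differs in many ways (locking, the inpcb flag TCP uses to track
the claim, and the actual socket buffer handling), so treat this only as a
picture of who is allowed to free the socket and when.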

The following protocols are updated, tested, and believed to work in the new 
world order:

   uipc_usrreq
   net (raw, routing)
   netinet
   netinet6
   netipx
   netatalk

The following protocols are updated for the new world order, but not tested:

   netnatm
   ng_socket
   netipsec
   netinet6/ipsec
   netkey

The following protocols are not updated for the new world order, but the 
maintainer is aware of these changes and plans to update the protocol in the 
immediate future:

   ng_btsocket

The following protocols are not updated for the new world order, and do not 
have a maintainer:

   netatm

I will commit the changes to make netatm compile, but am pretty sure there 
will be socket reference problems.  Please see posts on arch@ on this topic 
for more information.

As with all significant kernel changes, these changes likely include 
significant bugs, which you, the -current user, will have the opportunity to 
help me find.  I will attempt to respond as quickly as I can, although 
debugging complex network stack issues can, of course, be tricky and take a 
bit.  Hopefully these changes will, in the long term, improve both the 
stability and performance of the FreeBSD stack, by sanitizing and sanifying 
otherwise obscure and often broken behavior, and eliminating several subtle 
types of race conditions that may have been responsible for occasional network 
instability reported in RELENG_5 and RELENG_6 (and in some cases, RELENG_4). 
I do expect the ride to initially be bumpy though.

Robert N M Watson


