Date: Wed, 29 Mar 2006 12:05:51 +0000 (GMT)
From: Robert Watson
To: current@FreeBSD.org
Cc: Randall Stewart
Subject: Re: REMINDER: Re: HEADS UP: network stack and socket hackery over the next few weeks
Message-ID: <20060329115028.C19236@fledge.watson.org>
In-Reply-To: <20060329100839.V19236@fledge.watson.org>

On Wed, 29 Mar 2006, Robert Watson wrote:

> As a reminder, April 1 is now three days away.  On April 1, I will be
> committing an extensive set of socket and netinet changes which will likely
> render the network stack broken.  I say this with some confidence because I
> have tested the changes fairly extensively, as have a number of other
> developers, and they appear to mostly work.  Therefore, they will be broken
> :-).
> I will be posting updated versions of these patches shortly, but
> unless we run into show-stopper serious instability with them, rather than
> nits, I will commit them (in their updated form) on April 1 shortly after
> the netatm build is disabled.
>
> I will post another HEADS UP as the changes go into the tree, and will be
> monitoring things closely to try and get any bugs that might turn up fixed
> as quickly as possible.  As an FYI, I will be travelling the weeks of April
> 6 - April 21, but will be online frequently, and working for several days in
> the Bay Area during the trip.  Please report bugs relating to this work to
> current@.

An updated version of the patch is now available for download at:

  http://www.watson.org/~robert/freebsd/netperf/20060329-rwatson_sockref.diff

Earlier versions of the patch may be found in the same directory in similarly
named files.  The working branch maintaining these changes may be found in
Perforce at:

  //depot/user/rwatson/sockref/...

As a high level recap, the following classes of changes appear in this patch:

- The socket code no longer relies on reading so_pcb as a hint regarding
  protocol behavior and shutdown.  This eliminates a number of races, and
  means that only the protocol is responsible for reading/maintaining the
  field, and can synchronize it as desired.

- All protocols have been converted to maintain the invariant that so_pcb
  will be non-NULL and point to a valid PCB at all times while the socket is
  valid.  Depending on the protocol, this change either removed a number of
  crashes and races, or eliminated heavy-weight locking to maintain the
  validity of so_pcb during use by the socket layer.

- In some cases, this required significant rewriting of state management --
  specifically, for IPX/SPX and TCP/IP.  SPX and TCP now maintain DROPPED
  flags on their inpcb's to reflect the state previously identified through a
  NULL so_pcb pointer.

- Protocols can now explicitly request that a socket not be freed on last
  consumer reference, using the SS_PROTOREF flag, so that they can continue
  to access the socket buffer until it is no longer required.  For example,
  TCP does this after socket close() but before the final ACKs from the
  remote endpoint for sent data.  sotryfree() is eliminated.  TCP has gained
  an inpcb flag to reflect this condition.

- Improved documentation of kernel socket API calls, which will be followed
  with man pages once things are hammered out a bit more.

- fgetsock() and fputsock() are deprecated, with long-term plans to eliminate
  the use of soref() and sorele() for consumer use.  Consumers now receive a
  reference to a socket using socreate(), and release it using soclose(), in
  order to avoid use of sockets after close.  Consumer reference counts, such
  as file descriptor reference counts, should be used in preference, as this
  offers cleaner behavior at the socket layer, and also avoids additional
  mutex operations.  Some consumers still remain, but have been annotated.

- pru_abort and pru_detach are no longer allowed to fail.  Garbage collection
  of the socket after these, assuming SS_PROTOREF isn't set, is
  unconditional, and not a property of the error value returned.

- Protocols now only call sofree() if they have claimed SS_PROTOREF.  They
  don't attempt to spontaneously free sockets in numerous situations in the
  hopes of not leaking them, since socket teardown is now well-defined.
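
To make the teardown rules above a little more concrete, here is a minimal
userspace sketch in C.  It is not code from the patch: every toy_ name is
hypothetical, the structure is not the kernel's struct socket, and locking is
left out entirely.  It only models the rule that a socket goes away once the
consumer has closed it and no protocol claim (the SS_PROTOREF idea) remains.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; these names are illustrative, not the kernel's. */
#define TOY_SS_PROTOREF 0x01    /* protocol still needs the socket */

struct toy_socket {
    int  so_state;   /* holds TOY_SS_PROTOREF while the protocol has a claim */
    bool so_closed;  /* consumer has released its reference */
    bool so_freed;   /* stands in for actually releasing the memory */
};

/* Free only when neither the consumer nor the protocol holds a claim. */
static void toy_sofree(struct toy_socket *so)
{
    assert(!so->so_freed);      /* a second free would be a bug */
    if (!so->so_closed || (so->so_state & TOY_SS_PROTOREF))
        return;
    so->so_freed = true;
}

/* Consumer side: the reference obtained at creation is dropped here. */
static void toy_soclose(struct toy_socket *so)
{
    so->so_closed = true;
    toy_sofree(so);
}

/* Protocol side: claim the socket, e.g. while sent data awaits ACKs. */
static void toy_proto_claim(struct toy_socket *so)
{
    so->so_state |= TOY_SS_PROTOREF;
}

/* Protocol side: drop the claim, then ask for the socket to be freed. */
static void toy_proto_release(struct toy_socket *so)
{
    so->so_state &= ~TOY_SS_PROTOREF;
    toy_sofree(so);
}
```

In this model, a socket with no protocol claim is freed as soon as
toy_soclose() runs, while a socket on which toy_proto_claim() was called
survives the close and is freed only by the later toy_proto_release().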

The following protocols are updated, tested, and believed to work in the new
world order:

  uipc_usrreq
  net (raw, routing)
  netinet
  netinet6
  netipx
  netatalk

The following protocols are updated for the new world order, but not tested:

  netnatm
  ng_socket
  netipsec
  netinet6/ipsec
  netkey

The following protocols are not updated for the new world order, but the
maintainer is aware of these changes and plans to update the protocol in the
immediate future:

  ng_btsocket

The following protocols are not updated for the new world order, and do not
have a maintainer:

  netatm

I will commit the changes to make netatm compile, but am pretty sure there
will be socket reference problems.  Please see posts on arch@ on this topic
for more information.

As with all significant kernel changes, these changes likely include
significant bugs, which you, the -current user, will have the opportunity to
help me find.  I will attempt to respond as quickly as I can, although
debugging complex network stack issues can, of course, be tricky and take a
bit.  Hopefully these changes will, in the long term, improve both the
stability and performance of the FreeBSD stack, by sanitizing and sanifying
otherwise obscure and often broken behavior, and eliminating several subtle
types of race conditions that may have been responsible for occasional network
instability reported in RELENG_5 and RELENG_6 (and in some cases, RELENG_4).
I do expect the ride to initially be bumpy though.

Robert N M Watson