From owner-freebsd-net@FreeBSD.ORG Fri Jan 31 06:18:34 2014
Date: Fri, 31 Jan 2014 01:18:31 -0500 (EST)
Message-Id: <201401310618.s0V6IVJv027167@hergotha.csail.mit.edu>
From: wollman@freebsd.org
To: j.david.lists@gmail.com
Cc: freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
References: <87942875.478893.1391121843834.JavaMail.root@uoguelph.ca>
List-Id: Networking and TCP/IP with FreeBSD

J David writes:

>The process of TCP segmentation, whether offloaded or not, is
>performed on a single TCP packet.  It operates by reusing that
>packet's header over and over for each segment with slight
>modifications.  Consequently the maximum size that can be offloaded
>is the maximum size that can be segmented: one packet.

This is almost entirely wrong in its description of the non-offload
case.  A segment is a PDU at the transport layer.  In normal
operation, TCP figures out how much it can send, constructs a header,
and copies an mbuf chain referencing one segment's worth of data out
of the socket's transmit buffer.  tcp_output() repeats this process
(possibly using the same mbuf cluster multiple times, if it's larger
than the receiver's or the path's maximum segment size) until it
either runs out of stuff to send, or runs out of transmit window to
send into.

THAT IS WHY TSO IS A WIN: as you describe, the packet headers are
mostly identical, and (if the transmit window allows) it's much
cheaper to build the header and do the DMA setup once, then let the
NIC take over from there, rather than having to DMA a different (but
nearly identical) header for every individual segment.
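To make the shape of that loop concrete, here is a toy userland
sketch (not kernel code; the constants, struct, and variable names
are all invented for illustration): one nearly identical header is
built per MSS-sized chunk, and the loop stops when either the data or
the peer's window runs out.

/*
 * Toy userland sketch of tcp_output()'s non-offload loop.  Each pass
 * takes at most one MSS of data from a pretend socket buffer, stamps
 * a header whose only interesting change is the sequence number, and
 * stops when the data or the peer's receive window is exhausted.
 */
#include <stdint.h>
#include <stdio.h>

#define MSS	1448			/* assumed maximum segment size */

struct toy_hdr {
	uint32_t seq;			/* the field that actually varies */
	uint16_t len;
};

int
main(void)
{
	size_t avail = 65536;		/* pretend NFS queued a 64k write */
	size_t wnd = 48 * 1024;		/* peer's advertised window */
	uint32_t snd_nxt = 1000;	/* next sequence number to send */
	size_t off = 0, sent = 0;
	int nseg = 0;

	while (off < avail && sent < wnd) {
		size_t len = avail - off;
		if (len > MSS)
			len = MSS;
		if (len > wnd - sent)
			len = wnd - sent;

		/* kernel: shallow-copy one segment's worth of clusters,
		 * prepend headers, and hand the result to IP */
		struct toy_hdr th = { .seq = snd_nxt, .len = (uint16_t)len };
		(void)th;

		snd_nxt += (uint32_t)len;
		off += len;
		sent += len;
		nseg++;
	}
	printf("%d nearly identical headers built for %zu bytes;\n",
	    nseg, sent);
	printf("with TSO the host builds one and the NIC does the rest\n");
	return (0);
}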
>NFS is not sending packets to the TCP stack, it is sending stream
>data.  With TCP_NODELAY it should be possible to engineer a one send
>= one packet correlation, but that's true if and only if that send
>is less than the max packet size.

Yes and no.  NFS constructs a chain of mbufs and calls the socket's
sosend() routine.  This ultimately results in a call to tcp_output(),
and in the normal case where there is no data awaiting transmission,
that mbuf chain will be shallow-copied (bumping all the mbuf cluster
reference counts) up to the limit of what the transmit window allows,
and Ethernet, IP, and TCP headers will be prepended (possibly in a
separate mbuf).  The whole mess is then passed on to the hardware for
offload, if it fits.  RPC responses will only get smushed together if
tcp_output() wasn't able to schedule the transmit immediately, and if
the network is working properly, that will only happen if there's
more than one client-side-receive-window's worth of data to be
transmitted.

This shallow-copy behavior, by the way, is why the drivers need
m_defrag() rather than m_collapse() (see the sketch below):
M_WRITABLE is never true for clusters coming out of tcp_output(),
because the reference count will never be less than 2 (one for the
socket buffer and at least one for the interface's transmit queue,
depending on how many segments include some data from the cluster).
But it's also part of why having a "gigantic" cluster (e.g., 128k)
would be a big win for NFS.

-GAWollman
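A minimal sketch of that driver-side pattern, assuming a typical
busdma transmit path; the sc->tx_tag, txb->map, segs/nsegs names and
the surrounding encap routine are placeholders, not any particular
driver's code:

/*
 * If the chain maps to more DMA segments than the hardware can take,
 * m_defrag() copies the data into fresh clusters, which works even
 * though M_WRITABLE is false on the originals; m_collapse() would try
 * to compact in place and cannot, because those clusters are still
 * referenced by the socket buffer (refcount >= 2).
 */
error = bus_dmamap_load_mbuf_sg(sc->tx_tag, txb->map, m_head,
    segs, &nsegs, BUS_DMA_NOWAIT);
if (error == EFBIG) {
	struct mbuf *m;

	m = m_defrag(m_head, M_NOWAIT);
	if (m == NULL) {
		m_freem(m_head);
		return (ENOBUFS);	/* caller drops the packet */
	}
	m_head = m;
	error = bus_dmamap_load_mbuf_sg(sc->tx_tag, txb->map, m_head,
	    segs, &nsegs, BUS_DMA_NOWAIT);
}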