Date: Wed, 29 Jan 2014 22:56:29 -0500 (EST)
From: Rick Macklem
To: FreeBSD Net
Subject: 64K NFS I/O generates a 34-mbuf list for TCP which breaks TSO
List-Id: Networking and TCP/IP with FreeBSD

For some time, I've been seeing reports of NFS-related issues that get
resolved by the user either disabling TSO or reducing the rsize/wsize
to 32K.
I now think I know why this is happening, although the evidence is just
coming in. (I have no hardware/software that does TSO, so I never see
these problems during testing.)

A 64K NFS read reply, readdir reply, or write request results in the
krpc handing the TCP socket an mbuf list with 34 entries via sosend().
Now, I am really rusty w.r.t. TCP, but it looks like this will result
in a TCP/IP header + 34 data mbufs being handed to the network device
driver, if if_hw_tsomax has the default setting of 65535 (the maximum
IP datagram size). At a glance, many drivers use a scatter/gather list
of around 32 elements for transmission. If the mbuf list doesn't fit in
this scatter/gather list (which looks to me like it will be the case),
then the driver calls either m_defrag() or m_collapse() to try to fix
the problem.

This seems like a serious problem to me:
1 - If m_collapse()/m_defrag() fails, the transmit doesn't happen and
    things wedge until a TCP retransmit timeout gets things going
    again. It looks like m_defrag() is less likely to fail, but it
    generates a lot of overhead; m_collapse() is less overhead, but
    seems less likely to succeed. (Since m_defrag() is called with
    M_NOWAIT, it can fail in that extreme case. I'm not sure if it
    will fail otherwise?)

So, how to fix this?
1 - Change NFS to use 4K clusters for these 64K reads/writes, reducing
    the mbuf list from 34 -> 18. Preliminary patches for this are
    being tested.
    --> However, this seems to be more of a work-around than a fix.
2 - As soon as a driver needs to call m_defrag() or m_collapse()
    because the length of the TSO transmit mbuf list is too long,
    reduce if_hw_tsomax by a significant amount to try to get
    tcp_output() to generate shorter mbuf lists. Not great, but at
    least better than calling m_defrag()/m_collapse() over and over
    and over again.
    --> As a starting point, instrumenting the device drivers so that
        counts of # of calls to m_defrag()/m_collapse(), and counts of
        failed calls, would help to confirm how serious this problem
        is.
3 - ??? Any ideas from folks familiar with TSO and these drivers?

rick

ps: Until this gets resolved, please tell anyone with serious NFS
performance/reliability issues to try either disabling TSO or doing
client mounts with "-o rsize=32768,wsize=32768". I'm not sure how many
believe me when I tell them, but at least I now have a theory as to why
it can help a lot.