Date: Wed, 29 Jan 2014 22:56:29 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: FreeBSD Net <freebsd-net@freebsd.org>
Subject: 64K NFS I/O generates a 34mbuf list for TCP which breaks TSO
Message-ID: <24918548.18766184.1391054189890.JavaMail.root@uoguelph.ca>
For some time, I've been seeing reports of NFS-related issues that get resolved by the user either disabling TSO or reducing the rsize/wsize to 32K. I now think I know why this is happening, although the evidence is just coming in. (I have no hardware/software that does TSO, so I never see these problems during testing.)

A 64K NFS read reply, readdir reply, or write request results in the krpc handing the TCP socket an mbuf list with 34 entries via sosend(). Now, I am really rusty w.r.t. TCP, but it looks like this will result in a TCP/IP header + 34 data mbufs being handed to the network device driver, if if_hw_tsomax has the default setting of 65535 (max IP datagram).

At a glance, many drivers use a scatter/gather list of around 32 elements for transmission. If the mbuf list doesn't fit in this scatter/gather list (which looks to me like it will be the case), then the driver either calls m_defrag() or m_collapse() to try and fix the problem.

This seems like a serious problem to me:
1 - If m_collapse()/m_defrag() fails, the transmit doesn't happen and things wedge until a TCP timeout retransmit gets things going again. It looks like m_defrag() is less likely to fail, but generates a lot of overhead. m_collapse() seems to be less overhead, but seems less likely to succeed. (Since m_defrag() is called with M_NOWAIT, it can fail in that extreme case. I'm not sure if it will fail otherwise?)

So, how to fix this?
1 - Change NFS to use 4K clusters for these 64K reads/writes, reducing the mbuf list from 34->18. Preliminary patches for this are being tested.
--> However, this seems to be more of a work-around than a fix.
2 - As soon as a driver needs to call m_defrag() or m_collapse() because the length of the TSO transmit mbuf list is too long, reduce if_hw_tsomax by a significant amount to try and get tcp_output() to generate shorter mbuf lists. Not great, but at least better than calling m_defrag()/m_collapse() over and over and over again.
--> As a starting point, instrumenting the device drivers to keep counts of calls to m_defrag()/m_collapse(), and of failed calls, would help confirm how serious this problem is.
3 - ??? Any ideas from folks familiar with TSO and these drivers?

rick

ps: Until this gets resolved, please tell anyone with serious NFS performance/reliability issues to try either disabling TSO or doing client mounts with "-o rsize=32768,wsize=32768". I'm not sure how many believe me when I tell them, but at least I now have a theory as to why it can help a lot.