From owner-freebsd-bugs Mon Mar 16 08:10:13 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id IAA10495 for freebsd-bugs-outgoing; Mon, 16 Mar 1998 08:10:13 -0800 (PST) (envelope-from owner-freebsd-bugs@FreeBSD.ORG)
Received: (from gnats@localhost) by hub.freebsd.org (8.8.8/8.8.8) id IAA10458; Mon, 16 Mar 1998 08:10:04 -0800 (PST) (envelope-from gnats)
Received: from brookfield.ans.net (brookfield-ef0.brookfield.ans.net [204.148.1.20]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id IAA09221 for ; Mon, 16 Mar 1998 08:05:19 -0800 (PST) (envelope-from curtis@brookfield.ans.net)
Received: (from curtis@localhost) by brookfield.ans.net (8.8.5/8.8.5) id LAA02255; Mon, 16 Mar 1998 11:05:16 -0500 (EST)
Message-Id: <199803161605.LAA02255@brookfield.ans.net>
Date: Mon, 16 Mar 1998 11:05:16 -0500 (EST)
From: Curtis Villamizar
Reply-To: curtis@brookfield.ans.net
To: FreeBSD-gnats-submit@FreeBSD.ORG
X-Send-Pr-Version: 3.2
Subject: kern/6032: poor TCP performance using FDDI over long delay path
Sender: owner-freebsd-bugs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

>Number: 6032
>Category: kern
>Synopsis: poor TCP performance using FDDI over long delay path
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: change-request
>Submitter-Id: current-users
>Arrival-Date: Mon Mar 16 08:10:01 PST 1998
>Last-Modified:
>Originator: curtis@ans.net
>Organization: ANS Communications
>Release: FreeBSD 2.2.5-RELEASE i386
>Environment:
FreeBSD is being used here as a TCP load for performance testing of network equipment. The problems encountered here will be seen by high performance applications run over the Internet and over long delay paths such as satellite. The simplest setup is two PCs with DEC PCI FDDI cards and a network with a long RTT between them.
[Note: this was run on 2.2.5 but the bug report was filled in on a 2.2.1 system in case send-pr reports 2.2.1.]

>Description:
Change the window size to a large value (for example 128KB). Expect to get 40-80 Mb/s. Instead FreeBSD yields about 1 MB/s. BSDI and other BSD or *ix flavors yield 40-80 Mb/s as expected. A big part of the problem is the function tcp_mss in tcp_input.c, which sets the window size back to a small value.

>How-To-Repeat:
Run ttcp or netperf over a long delay path with a 128KB window. Source for ttcp is freely available (source on request if you don't have it).

>Fix:
The email message below sums it up. I never did look into the reason why increasing MCLSHIFT would result in an unusable kernel, so I'm sending this in as one bug report. If I get a chance I'll look at the MCLSHIFT problem and also try to figure out why setting NMBCLUSTERS to 2048 or above was a problem. You can do whatever you want with these patches. This is simply a performance issue.

Subject: FreeBSD performance problem solved
Date: Thu, 19 Feb 1998 22:23:07 -0500
From: Curtis Villamizar

The FreeBSD performance problem we had run into previously has now been solved. It may not be the best way for the general FreeBSD audience but it is completely solved for our purposes.

The executive summary is:

- the kernel no longer resets the window size back to a small value for no apparent reason (see below)
- we now can use just under a 1MB window (about the same as BSDI)
- some kernel tuning (page buffer size, number of clusters) was done to make FDDI MTU work slightly faster
- we get 20 Mb/s with 192 KB window and 70 msec RTT
- we get 77 Mb/s with 896 KB window and 70 msec RTT (6.7 sec transfer)
- we get 88 Mb/s with 896 KB window and 70 msec RTT (47 sec transfer)
- we get 89 Mb/s with 896 KB window and 70 msec RTT (184 sec transfer)
- these are slightly better than the BSDI figures (I think? Bill?)

The 2GB transfer in just over 3 minutes is getting quite close to FDDI line rate.
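
As a rough cross-check of the figures in the summary, a TCP connection can have at most one window of data in flight per round trip, so throughput is bounded by window/RTT. The sketch below computes that bound for the two window sizes quoted. It is not from the original message; the function names are made up for illustration, and it assumes 1 KB = 1024 bytes and 1 Mb = 10^6 bits.

```c
#include <stdio.h>

/*
 * Upper bound on throughput in Mb/s (10^6 bits per second) for a given
 * window and round-trip time: one full window per RTT.
 * Assumption: window_kbytes is in units of 1024 bytes.
 */
double window_limit_mbps(double window_kbytes, double rtt_msec)
{
    double bits_per_rtt = window_kbytes * 1024.0 * 8.0;
    double rtt_sec = rtt_msec / 1000.0;

    return bits_per_rtt / rtt_sec / 1e6;
}

/* Print the bound for the two window sizes quoted in the summary. */
void print_window_limits(void)
{
    printf("192 KB window, 70 ms RTT: %.1f Mb/s\n",
           window_limit_mbps(192.0, 70.0));
    printf("896 KB window, 70 ms RTT: %.1f Mb/s\n",
           window_limit_mbps(896.0, 70.0));
}
```

Calling print_window_limits() from a main() gives a bound of roughly 22 Mb/s for the 192 KB window, in line with the measured 20 Mb/s. For the 896 KB window the bound comes out slightly above FDDI's nominal 100 Mb/s, so at that size the link rather than the window would be expected to be the limit.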
The gory details are listed below. I'll be sending separate bug reports to the FreeBSD team on the tcp_mss issue and the inability to change MCLSHIFT or increase NMBCLUSTERS to 2048.

Curtis

All the kernel stuff is in /sys, which is really a symbolic link to /usr/src/sys. Some of the key directories are netinet, where all the ip, udp, and tcp code is; kern, where all the socket code is; vm, where the virtual memory code is; and sys, where the system header files are.

The main culprit was the function tcp_mss in tcp_input.c. This function is called when a TCP SYN or SYN ACK arrives. Its purpose in life is to adjust the initial MSS and, when doing so, also adjust the buffer size if appropriate. One of the new "features" of tcp_mss is that it now looks up the route that would be used for the socket return path and unconditionally resets the send and recv buffer sizes if there is a sendspace or recvspace parameter on the route, even if the buffer sizes had been set by a setsockopt.

When I found this in the code my first reaction was to not touch the source and just explicitly set the sendspace or recvspace on the route to 10/8. This effort was foiled by the fact that tcp_mss seems to have picked up the wrong route. I then decided to get rid of the problem for good and just change the code so it will only increase the buffer sizes according to the route, but never decrease them.
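
The "increase but never decrease" rule described above can be modelled outside the kernel as a small function. This is only an illustrative sketch, not kernel code: choose_bufsize and its parameter names are hypothetical, route_pipe stands in for the route's sendpipe or recvpipe metric (0 when unset), and sock_hiwat stands in for the socket buffer's high-water mark, which is what a setsockopt call would have changed.

```c
/*
 * Hypothetical stand-alone model of the buffer size choice.
 * route_pipe: per-route sendpipe/recvpipe value, 0 if the route has none.
 * sock_hiwat: current socket buffer high-water mark.
 */
unsigned long choose_bufsize(unsigned long route_pipe,
                             unsigned long sock_hiwat)
{
    unsigned long bufsize = route_pipe;

    /* No per-route value: keep what the socket already has. */
    if (bufsize == 0)
        bufsize = sock_hiwat;

    /* Per-route value smaller than the socket's: never shrink. */
    if (bufsize < sock_hiwat)
        bufsize = sock_hiwat;

    return bufsize;
}
```

With a 128KB socket buffer, a route value of 16KB leaves the buffer at 128KB, while a route value of 256KB raises it to 256KB. Without the second check, the 16KB route value would win, which is the behaviour the report complains about.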
The patch is:

*** tcp_input.c.orig Thu Feb 19 21:56:49 1998
--- tcp_input.c Thu Feb 19 21:56:14 1998
***************
*** 2075,2080 ****
--- 2075,2082 ----
          if ((bufsize = rt->rt_rmx.rmx_sendpipe) == 0)
  #endif
                  bufsize = so->so_snd.sb_hiwat;
+         if (bufsize < so->so_snd.sb_hiwat)
+                 bufsize = so->so_snd.sb_hiwat;
          if (bufsize < mss)
                  mss = bufsize;
          else {
***************
*** 2089,2094 ****
--- 2091,2098 ----
          if ((bufsize = rt->rt_rmx.rmx_recvpipe) == 0)
  #endif
                  bufsize = so->so_rcv.sb_hiwat;
+         if (bufsize < so->so_rcv.sb_hiwat)
+                 bufsize = so->so_rcv.sb_hiwat;
          if (bufsize > mss) {
                  bufsize = roundup(bufsize, mss);
                  if (bufsize > sb_max)

Another change is to SB_MAX (which can also be changed with sysctl).

*** sys/socketvar.h.orig Thu Feb 19 22:00:24 1998
--- sys/socketvar.h Tue Feb 3 21:30:31 1998
***************
*** 90,96 ****
          short sb_flags; /* flags, see below */
          short sb_timeo; /* timeout for read/write */
  } so_rcv, so_snd;
! #define SB_MAX (256*1024) /* default for max chars in sockbuf */
  #define SB_LOCK 0x01 /* lock on data queue */
  #define SB_WANT 0x02 /* someone is waiting to lock */
  #define SB_WAIT 0x04 /* someone is waiting for data/space */
--- 90,96 ----
          short sb_flags; /* flags, see below */
          short sb_timeo; /* timeout for read/write */
  } so_rcv, so_snd;
! #define SB_MAX (1024*1024) /* default for max chars in sockbuf */
  #define SB_LOCK 0x01 /* lock on data queue */
  #define SB_WANT 0x02 /* someone is waiting to lock */
  #define SB_WAIT 0x04 /* someone is waiting for data/space */

The change to the page size makes a full MTU packet fit within a page and allows the kernel code to do less copying.

*** vm/vm_param.h.orig Thu Feb 19 22:00:38 1998
--- vm/vm_param.h Tue Feb 3 23:04:46 1998
***************
*** 77,83 ****
   * The machine independent pages are refered to as PAGES. A page
   * is some number of hardware pages, depending on the target machine.
   */
! #define DEFAULT_PAGE_SIZE 4096
  #if 0
--- 77,83 ----
   * The machine independent pages are refered to as PAGES. A page
   * is some number of hardware pages, depending on the target machine.
   */
! #define DEFAULT_PAGE_SIZE 8192
  #if 0

One other thing that needs to be done is changing the total number of mbuf clusters allocated to the kernel. This can be done in the config file. Neither BSDI nor FreeBSD would take a very large number, for no apparent reason. I added the following to the config file for the testnet kernel (i386/conf/testnet-pc).

options NMBCLUSTERS=1024

We could increase this to something over 1024. At the POC lab it would take 2048. This is sort of odd since that would have only been 4 MB dedicated to clusters on a 64 MB machine. This could be a magical power of two boundary for some other reason that I wasn't able to locate in the source code.

I was never successful in increasing the cluster size from 2048 to 8192 (increasing MCLSHIFT from 11 to 13). Again, there are dependencies on the relative size of some things in the kernel that aren't documented (and might be regarded as bugs).

Increasing NMBCLUSTERS to 2048 or more, or increasing MCLSHIFT from 11 to 13, will have to be exercises for a later date. These are tuning beyond what we really need. Fooling with these latter optimizations gave us unusable kernels in the POC lab, so I didn't want to play with this unless I was within walking distance of the reset button and had a console and keyboard.

>Audit-Trail:
>Unformatted:

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-bugs" in the body of the message
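
For reference, the cluster arithmetic quoted in the report (a 2048-byte cluster for MCLSHIFT 11, 8192 bytes for MCLSHIFT 13, and 4 MB for 2048 clusters) can be reproduced with the short sketch below. The helper names are made up for illustration. The only assumption is that the cluster size is 1 << MCLSHIFT, which matches the 11 -> 2048 and 13 -> 8192 pairs given in the text.

```c
/* Bytes per mbuf cluster for a given MCLSHIFT (assumed to be 1 << MCLSHIFT). */
unsigned long cluster_bytes(unsigned int mclshift)
{
    return 1UL << mclshift;
}

/* Total memory covered by NMBCLUSTERS clusters of that size, in bytes. */
unsigned long cluster_pool_bytes(unsigned long nmbclusters,
                                 unsigned int mclshift)
{
    return nmbclusters * cluster_bytes(mclshift);
}
```

With these, cluster_pool_bytes(1024, 11) is 2 MB for the NMBCLUSTERS=1024 setting used in the config file, and cluster_pool_bytes(2048, 11) is the 4 MB figure mentioned for the 64 MB machine.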