From owner-freebsd-hackers Fri Jul 19 18:03:30 2002
Date: Fri, 19 Jul 2002 18:03:05 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200207200103.g6K135Ap081155@apollo.backplane.com>
To: freebsd-hackers@freebsd.org, freebsd-net@freebsd.org
Subject: Another go at bandwidth delay product pipeline limiting for TCP

    Ok, I am having another go at trying to implement a bandwidth delay
    product calculation to limit the number of inflight packets.  The
    idea behind this feature is twofold:

    (1) If you have huge TCP buffers and there is no packet loss, our
        TCP stack will happily build up potentially hundreds of outgoing
        packets even though most of them just sit in the interface queue
        (or, worse, in your router's interface queue!).

    (2) If you have a bandwidth constriction, such as a modem, this
        feature attempts to place only as many packets in the pipeline
        as are necessary to fill it, which means that you can type in
        one window while sending large amounts of data (scp, ftp) in
        another.

    Note that this is a transmitter-side solution, not a receiver-side
    solution.  It will not help your typing if you are downloading a lot
    of stuff and the remote end builds up a lot of packets on your ISP's
    router.  Theoretically we should also be able to restrict the window
    we advertise, but that is a much more difficult problem.

    This code is highly experimental, so the sysctls are set up for
    debugging (and the feature is disabled by default).  I'm sure a lot
    of tuning can be done.  The sysctls are as follows:

        net.inet.tcp.inflight_enable    default off (0)
        net.inet.tcp.inflight_debug     default on  (1)
        net.inet.tcp.inflight_min       default 1024
        net.inet.tcp.inflight_max       default seriously large number

    Under normal operating conditions the min default would usually be
    at least 4096; for debugging it is useful to allow it to be 1024.
    Note that the code will not internally allow the inflight size to
    drop below 2 * maxseg (two segments).

    This code calculates the bandwidth delay product and artificially
    closes the transmit window to that value.
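    To experiment with it, the knobs can be flipped at runtime with
    sysctl(8).  A minimal test session might look like this (the min
    value shown is just the "normal operation" figure suggested above):

        sysctl -w net.inet.tcp.inflight_enable=1
        sysctl -w net.inet.tcp.inflight_debug=1
        sysctl -w net.inet.tcp.inflight_min=4096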
    The bandwidth delay product for the purposes of the transmit window
    calculation is:

        bytes_in_flight = end_to_end_bandwidth * srtt

    Examples:

        Transport    Bandwidth        Ping       Bandwidth delay product (-s 1440)

        GigE         100 MBytes/sec     1.00 ms  100000 bytes
        100BaseTX     10 MBytes/sec     0.65 ms    6500 bytes
        10BaseT        1 MByte/sec      1.00 ms    1000 bytes
        T1           170 KBytes/sec     5.00 ms     850 bytes
        DSL          120 KBytes/sec    20.00 ms    2400 bytes
        ISDN          14 KBytes/sec    40.00 ms     560 bytes
        56K modem    5.6 KBytes/sec   120.00 ms     672 bytes
        Slow client   50 KBytes/sec   200.00 ms   10000 bytes

    Now let's say you have a TCP send buffer of 128K, the remote end has
    a receive buffer of 128K, and window scaling works.  On a 100BaseTX
    connection with no packet loss your TCP sender will queue up to 91
    packets to the interface even though it only really needs to queue
    up 5 packets.  With net.inet.tcp.inflight_enable turned on, the TCP
    sender will only queue up 4 packets.  On the GigE link, which
    actually needs 69 packets in flight, 69 packets will be queued up.
    That's what this code is supposed to do.

    This is my second attempt.  I tried this last year too, but it was
    too messy.  This time I think I've got it down to where it isn't as
    messy.
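    To make the 100BaseTX and GigE numbers above concrete, here is a
    throwaway user-space sketch of the same arithmetic.  It works in
    plain bytes/sec and milliseconds rather than the tick units the
    kernel code below uses, and the function name is invented for the
    sketch:

        #include <stdio.h>

        /* bytes needed in flight to fill the pipe: bandwidth * rtt */
        static unsigned long
        bdp_bytes(unsigned long bw_bytes_sec, double rtt_ms)
        {
                return ((unsigned long)(bw_bytes_sec * rtt_ms / 1000.0));
        }

        int
        main(void)
        {
                /* 100BaseTX: 10 MBytes/sec, 0.65 ms ping */
                printf("100BaseTX: %lu bytes = %lu segments of 1440\n",
                    bdp_bytes(10000000, 0.65),
                    bdp_bytes(10000000, 0.65) / 1440);

                /* GigE: 100 MBytes/sec, 1.00 ms ping */
                printf("GigE:      %lu bytes = %lu segments of 1440\n",
                    bdp_bytes(100000000, 1.00),
                    bdp_bytes(100000000, 1.00) / 1440);
                return (0);
        }

    Those two divisions give the 4-segment and 69-segment figures used
    in the example above.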
                                                -Matt

                                                Matthew Dillon

Index: tcp_input.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_input.c,v
retrieving revision 1.165
diff -u -r1.165 tcp_input.c
--- tcp_input.c 19 Jul 2002 18:27:39 -0000      1.165
+++ tcp_input.c 20 Jul 2002 00:38:15 -0000
@@ -1008,6 +1008,8 @@
 	else if (tp->t_rtttime && SEQ_GT(th->th_ack, tp->t_rtseq))
 		tcp_xmit_timer(tp, ticks - tp->t_rtttime);
 
+	tcp_xmit_bandwidth_limit(tp, th->th_ack);
+
 	acked = th->th_ack - tp->snd_una;
 	tcpstat.tcps_rcvackpack++;
 	tcpstat.tcps_rcvackbyte += acked;
@@ -1805,6 +1807,8 @@
 		tcp_xmit_timer(tp, ticks - to.to_tsecr + 1);
 	else if (tp->t_rtttime && SEQ_GT(th->th_ack, tp->t_rtseq))
 		tcp_xmit_timer(tp, ticks - tp->t_rtttime);
+
+	tcp_xmit_bandwidth_limit(tp, th->th_ack);
 
 	/*
 	 * If all outstanding data is acked, stop retransmit
Index: tcp_output.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_output.c,v
retrieving revision 1.65
diff -u -r1.65 tcp_output.c
--- tcp_output.c        23 Jun 2002 21:25:36 -0000      1.65
+++ tcp_output.c        20 Jul 2002 00:38:15 -0000
@@ -164,6 +164,7 @@
 	sendalot = 0;
 	off = tp->snd_nxt - tp->snd_una;
 	win = min(tp->snd_wnd, tp->snd_cwnd);
+	win = min(win, tp->snd_bwnd);
 	flags = tcp_outflags[tp->t_state];
 
 	/*
@@ -773,7 +774,8 @@
 			tp->snd_max = tp->snd_nxt;
 		/*
 		 * Time this transmission if not a retransmission and
-		 * not currently timing anything.
+		 * not currently timing anything.  Also calculate
+		 * the bandwidth (8 segment average).
 		 */
 		if (tp->t_rtttime == 0) {
 			tp->t_rtttime = ticks;
Index: tcp_subr.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_subr.c,v
retrieving revision 1.137
diff -u -r1.137 tcp_subr.c
--- tcp_subr.c  18 Jul 2002 19:06:12 -0000      1.137
+++ tcp_subr.c  20 Jul 2002 00:38:16 -0000
@@ -144,6 +144,22 @@
 SYSCTL_INT(_net_inet_tcp, OID_AUTO, isn_reseed_interval, CTLFLAG_RW,
     &tcp_isn_reseed_interval, 0, "Seconds between reseeding of ISN secret");
 
+static int tcp_inflight_enable = 0;
+SYSCTL_INT(_net_inet_tcp, OID_AUTO, inflight_enable, CTLFLAG_RW,
+    &tcp_inflight_enable, 0, "Enable automatic TCP inflight data limiting");
+
+static int tcp_inflight_debug = 1;
+SYSCTL_INT(_net_inet_tcp, OID_AUTO, inflight_debug, CTLFLAG_RW,
+    &tcp_inflight_debug, 0, "Debug TCP inflight calculations");
+
+static int tcp_inflight_min = 1024;
+SYSCTL_INT(_net_inet_tcp, OID_AUTO, inflight_min, CTLFLAG_RW,
+    &tcp_inflight_min, 0, "Lower-bound for TCP inflight window");
+
+static int tcp_inflight_max = TCP_MAXWIN << TCP_MAX_WINSHIFT;
+SYSCTL_INT(_net_inet_tcp, OID_AUTO, inflight_max, CTLFLAG_RW,
+    &tcp_inflight_max, 0, "Upper-bound for TCP inflight window");
+
 static void	tcp_cleartaocache(void);
 static struct inpcb *tcp_notify(struct inpcb *, int);
 
@@ -547,6 +563,7 @@
 	tp->t_rttmin = tcp_rexmit_min;
 	tp->t_rxtcur = TCPTV_RTOBASE;
 	tp->snd_cwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
+	tp->snd_bwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
 	tp->snd_ssthresh = TCP_MAXWIN << TCP_MAX_WINSHIFT;
 	tp->t_rcvtime = ticks;
 	/*
@@ -1509,3 +1526,129 @@
 tcp_cleartaocache()
 {
 }
+
+/*
+ * Calculate the bandwidth based on received acks every 8
+ * maximal segments and smooth the result.
+ *
+ * The nominal snd_bwnd calculation is (bandwidth * rtt),
+ * the amount of data required to keep the network pipe
+ * full.  However, we cannot simply make this calculation
+ * because our adjustment of snd_bwnd based on it will
+ * be highly unstable, producing positive feedback if we are
+ * too low and also producing positive feedback if we are
+ * too high.
+ *
+ * In order to stabilize the calculation we have to increase
+ * bwnd a little, measure the bandwidth, then decrease bwnd
+ * a little and measure the rtt.  The resulting calculation
+ * should then be stable.
+ */
+void
+tcp_xmit_bandwidth_limit(struct tcpcb *tp, tcp_seq ack_seq)
+{
+	u_long bw;
+
+	/*
+	 * If inflight_enable is disabled in the middle of a tcp connection,
+	 * make sure snd_bwnd is effectively disabled.
+	 */
+	if (tcp_inflight_enable == 0) {
+		tp->snd_bwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
+		tp->snd_bandwidth = 0;
+	}
+
+	/*
+	 * The base period is once every 8 maximal segments.
+	 */
+	if (tcp_inflight_enable == 0 ||
+	    (int)(ack_seq - tp->t_bw_rtseq) < tp->t_maxseg * 8 ||
+	    tp->t_bw_rtttime == ticks) {
+		return;
+	}
+
+	/*
+	 * Calculate the bandwidth.
+	 */
+	if (tp->t_bw_rtttime) {
+		bw = (ack_seq - tp->t_bw_rtseq) * hz /
+		     (ticks - tp->t_bw_rtttime);
+	} else {
+		bw = tp->snd_bandwidth;
+	}
+	tp->t_bw_rtseq = ack_seq;
+	tp->t_bw_rtttime = ticks;
+	if (tp->snd_bandwidth == 0)
+		tp->snd_bandwidth = bw;
+	else
+		tp->snd_bandwidth = (tp->snd_bandwidth * 3 + bw) >> 2;
+
+	/*
+	 * Initial conditions.
+	 */
+	if (bw && tp->snd_bwnd == TCP_MAXWIN << TCP_MAX_WINSHIFT) {
+		tp->snd_bwnd = (u_int64_t)tp->snd_bandwidth * tp->t_srtt /
+		    (hz << TCP_RTT_SHIFT);
+	}
+
+	/*
+	 * Calculate the bandwidth delay product and cycle through
+	 * our state machine.
+	 */
+	++tp->t_bw_state;
+
+	switch(tp->t_bw_state & 0x0F) {
+	case 0x00:
+		/*
+		 * Save the bandwidth and increase bwnd.
+		 */
+		tp->t_bw_bandwidth = tp->snd_bandwidth;
+		tp->snd_bwnd += tp->t_maxseg;
+		break;
+	case 0x04:
+		/*
+		 * If the bandwidth does not go up by at least maxseg / 4,
+		 * cycle back to neutral.
+		 */
+		if (tp->snd_bandwidth <= tp->t_bw_bandwidth + tp->t_maxseg / 4)
+			tp->snd_bwnd -= tp->t_maxseg;
+		break;
+	case 0x08:
+		/*
+		 * Save the bandwidth and decrease bwnd.
+		 */
+		tp->t_bw_bandwidth = tp->snd_bandwidth;
+		tp->snd_bwnd -= tp->t_maxseg;
+		break;
+	case 0x0C:
+		/*
+		 * If the bandwidth goes down by more than maxseg / 4,
+		 * cycle back to neutral.  Otherwise keep the change.
+		 *
+		 * Note: in the bwnd-too-high case the bandwidth does not
+		 * usually change much, so we tend to keep the change,
+		 * which means we tend to decrease bwnd.  This stabilizes
+		 * the algorithm.
+		 */
+		if (tp->snd_bandwidth <= tp->t_bw_bandwidth - tp->t_maxseg / 4)
+			tp->snd_bwnd += tp->t_maxseg;
+		break;
+	default:
+		break;		/* no action */
+	}
+	if (tcp_inflight_debug) {
+		static int ltick;
+
+		if ((unsigned int)(ticks - ltick) > hz) {
+			printf("BW %lu (%lu) BWND %lu srtt %d\n",
+			    tp->snd_bandwidth, bw, tp->snd_bwnd, tp->t_srtt);
+			ltick = ticks;
+		}
+	}
+	if (tp->snd_bwnd < tcp_inflight_min)
+		tp->snd_bwnd = tcp_inflight_min;
+	if (tp->snd_bwnd < tp->t_maxseg * 2)
+		tp->snd_bwnd = tp->t_maxseg * 2;
+	if (tp->snd_bwnd > tcp_inflight_max)
+		tp->snd_bwnd = tcp_inflight_max;
+}
+
Index: tcp_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_usrreq.c,v
retrieving revision 1.76
diff -u -r1.76 tcp_usrreq.c
--- tcp_usrreq.c        13 Jun 2002 23:14:58 -0000      1.76
+++ tcp_usrreq.c        20 Jul 2002 00:38:16 -0000
@@ -875,6 +875,7 @@
 	tp->t_state = TCPS_SYN_SENT;
 	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
 	tp->iss = tcp_new_isn(tp);
+	tp->t_bw_rtseq = tp->iss;
 	tcp_sendseqinit(tp);
 
 	/*
@@ -961,6 +962,7 @@
 	tp->t_state = TCPS_SYN_SENT;
 	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
 	tp->iss = tcp_new_isn(tp);
+	tp->t_bw_rtseq = tp->iss;
 	tcp_sendseqinit(tp);
 
 	/*
Index: tcp_var.h
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_var.h,v
retrieving revision 1.82
diff -u -r1.82 tcp_var.h
--- tcp_var.h   19 Jul 2002 18:27:39 -0000      1.82
+++ tcp_var.h   20 Jul 2002 00:38:17 -0000
@@ -124,10 +124,12 @@
 
 	u_long	snd_wnd;		/* send window */
 	u_long	snd_cwnd;		/* congestion-controlled window */
+	u_long	snd_bwnd;		/* bandwidth-controlled window */
 	u_long	snd_ssthresh;		/* snd_cwnd size threshold for
 					 * for slow start exponential to
 					 * linear switch
 					 */
+	u_long	snd_bandwidth;		/* calculated bandwidth or 0 */
 	tcp_seq	snd_recover;		/* for use in fast recovery */
 
 	u_int	t_maxopd;		/* mss plus options */
@@ -137,6 +139,11 @@
 	int	t_rtttime;		/* round trip time */
 	tcp_seq	t_rtseq;		/* sequence number being timed */
 
+	int	t_bw_rtttime;		/* used for bandwidth calculation */
+	tcp_seq	t_bw_rtseq;		/* used for bandwidth calculation */
+	int	t_bw_state;		/* used for snd_bwnd calculation */
+	u_long	t_bw_bandwidth;		/* used for snd_bwnd calculation */
+
 	int	t_rxtcur;		/* current retransmit value (ticks) */
 	u_int	t_maxseg;		/* maximum segment size */
 	int	t_srtt;			/* smoothed round-trip time */
@@ -473,6 +480,7 @@
 struct tcpcb *
 	 tcp_timers(struct tcpcb *, int);
 void	 tcp_trace(int, int, struct tcpcb *, void *, struct tcphdr *, int);
+void	 tcp_xmit_bandwidth_limit(struct tcpcb *tp, tcp_seq ack_seq);
 void	 syncache_init(void);
 void	 syncache_unreach(struct in_conninfo *, struct tcphdr *);
 int	 syncache_expand(struct in_conninfo *, struct tcphdr *,
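    For anyone who wants to poke at the probe cycle without booting a
    kernel, here is a rough user-space rendition of the sampling,
    smoothing, and bwnd state machine from tcp_xmit_bandwidth_limit()
    above.  The structure and names are invented for the sketch (the
    real code works off acked sequence numbers, ticks, and the tcpcb),
    and it feeds itself synthetic, perfectly steady ack samples:

        #include <stdio.h>

        #define HZ      100             /* assumed clock rate for the sketch */
        #define MAXSEG  1460

        struct bwstate {
                unsigned long snd_bandwidth;    /* smoothed bytes/sec, 0 = unset */
                unsigned long snd_bwnd;         /* bandwidth-limited window */
                unsigned long bw_bandwidth;     /* snapshot taken by the probe */
                int bw_state;                   /* probe cycle position */
        };

        /*
         * One sampling step: 'acked' bytes were acked over 'dticks' ticks.
         * Mirrors the smoothing and the four-phase probe in the patch:
         * bump bwnd and see if bandwidth rises, drop bwnd and see if it
         * falls.
         */
        static void
        bw_sample(struct bwstate *bs, unsigned long acked, int dticks)
        {
                unsigned long bw = acked * HZ / dticks;

                /* 3/4 old + 1/4 new, as in the patch */
                if (bs->snd_bandwidth == 0)
                        bs->snd_bandwidth = bw;
                else
                        bs->snd_bandwidth = (bs->snd_bandwidth * 3 + bw) >> 2;

                switch (++bs->bw_state & 0x0F) {
                case 0x00:      /* save bandwidth, probe upward */
                        bs->bw_bandwidth = bs->snd_bandwidth;
                        bs->snd_bwnd += MAXSEG;
                        break;
                case 0x04:      /* no real gain?  back to neutral */
                        if (bs->snd_bandwidth <= bs->bw_bandwidth + MAXSEG / 4)
                                bs->snd_bwnd -= MAXSEG;
                        break;
                case 0x08:      /* save bandwidth, probe downward */
                        bs->bw_bandwidth = bs->snd_bandwidth;
                        bs->snd_bwnd -= MAXSEG;
                        break;
                case 0x0C:      /* real loss?  back to neutral, else keep */
                        if (bs->snd_bandwidth <= bs->bw_bandwidth - MAXSEG / 4)
                                bs->snd_bwnd += MAXSEG;
                        break;
                }
                if (bs->snd_bwnd < 2 * MAXSEG)
                        bs->snd_bwnd = 2 * MAXSEG;
        }

        int
        main(void)
        {
                struct bwstate bs = { 0, 64 * 1024, 0, 0 };
                int i;

                /* feed 16 identical samples: ~8 segments acked every 10 ticks */
                for (i = 0; i < 16; ++i) {
                        bw_sample(&bs, 8 * MAXSEG, 10);
                        printf("sample %2d: bw %lu bwnd %lu\n",
                            i, bs.snd_bandwidth, bs.snd_bwnd);
                }
                return (0);
        }

    With perfectly steady input the smoothed bandwidth settles at once,
    so the 0x04 and 0x0C phases keep bwnd hovering near its starting
    point; feed it noisier samples to watch the probe back the window
    off.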