Date:      Fri, 19 Jul 2002 18:03:05 -0700 (PDT)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        freebsd-hackers@freebsd.org, freebsd-net@freebsd.org
Subject:   Another go at bandwidth delay product pipeline limiting for TCP
Message-ID:  <200207200103.g6K135Ap081155@apollo.backplane.com>

    Ok, I am having another go at trying to implement a bandwidth
    delay product calculation to limit the number of inflight packets.

    The idea behind this feature is twofold:

    (1) If you have huge TCP buffers and there is no packet loss our
	TCP stack will happily build up potentially hundreds of outgoing
	packets even though most of them just sit in the interface queue
	(or, worse, in your router's interface queue!).

    (2) If you have a bandwidth constriction, such as a modem, this feature
	attempts to place only as many packets in the pipeline as is necessary
	to fill the pipeline, which means that you can type in one window
	and send large amounts of data (scp, ftp) in another.

    Note that this is a transmitter-side solution, not a receiver-side
    solution.  This will not help your typing if you are downloading a
    lot of stuff and the remote end builds up a lot of packets on your
    ISP's router.  Theoretically we should be able to also restrict the
    window we advertise but that is a much more difficult problem.

    This code is highly experimental, so the sysctls are set up for
    debugging (and the feature is disabled by default).  I'm sure a lot of
    tuning can be done.  The sysctls are as follows:

    net.inet.tcp.inflight_enable	default off (0)
    net.inet.tcp.inflight_debug		default on  (1)
    net.inet.tcp.inflight_min		default 1024
    net.inet.tcp.inflight_max		default seriously large number

    Under normal operating conditions the min default would usually be
    at least 4096.  For debugging it is useful to allow it to be 1024.
    Note that the code will not internally allow the inflight size to
    drop below 2 * maxseg (two segments).

    This code calculates the bandwidth delay product and artificially
    closes the transmit window to that value.  The bandwidth delay product
    for the purposes of transmit window calculation is:

	bytes_in_flight = end_to_end_bandwidth * srtt

    Examples:
	
     Transport      Bandwidth        Ping (-s 1440)  Bandwidth delay product
     GigE           100 MBytes/sec     1.00 ms       100000 bytes
     100BaseTX       10 MBytes/sec     0.65 ms         6500 bytes
     10BaseT          1 MByte/sec      1.00 ms         1000 bytes
     T1             170 KBytes/sec     5.00 ms          850 bytes
     DSL            120 KBytes/sec    20.00 ms         2400 bytes
     ISDN            14 KBytes/sec    40.00 ms          560 bytes
     56K modem      5.6 KBytes/sec   120.00 ms          672 bytes
     Slow client     50 KBytes/sec   200.00 ms        10000 bytes

    Now let's say you have a TCP send buffer of 128K and the remote end has a
    receive buffer of 128K, and window scaling works.  On a 100BaseTX
    connection with no packet loss your TCP sender will queue up to 
    91 packets to the interface even though it only really needs to queue
    up 5 packets.  With net.inet.tcp.inflight_enable turned on, the TCP
    sender will only queue up 4 packets.  On the GigE link which
    actually needs 69 packets in flight, 69 packets will be queued up.

    That's what this code is supposed to do.  This is my second attempt;
    I tried this last year as well, but it came out too messy.  This time
    I think I've got it down to something much cleaner.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

Index: tcp_input.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_input.c,v
retrieving revision 1.165
diff -u -r1.165 tcp_input.c
--- tcp_input.c	19 Jul 2002 18:27:39 -0000	1.165
+++ tcp_input.c	20 Jul 2002 00:38:15 -0000
@@ -1008,6 +1008,8 @@
 				else if (tp->t_rtttime &&
 					    SEQ_GT(th->th_ack, tp->t_rtseq))
 					tcp_xmit_timer(tp, ticks - tp->t_rtttime);
+				tcp_xmit_bandwidth_limit(tp, th->th_ack);
+
 				acked = th->th_ack - tp->snd_una;
 				tcpstat.tcps_rcvackpack++;
 				tcpstat.tcps_rcvackbyte += acked;
@@ -1805,6 +1807,8 @@
 			tcp_xmit_timer(tp, ticks - to.to_tsecr + 1);
 		else if (tp->t_rtttime && SEQ_GT(th->th_ack, tp->t_rtseq))
 			tcp_xmit_timer(tp, ticks - tp->t_rtttime);
+
+		tcp_xmit_bandwidth_limit(tp, th->th_ack);
 
 		/*
 		 * If all outstanding data is acked, stop retransmit
Index: tcp_output.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_output.c,v
retrieving revision 1.65
diff -u -r1.65 tcp_output.c
--- tcp_output.c	23 Jun 2002 21:25:36 -0000	1.65
+++ tcp_output.c	20 Jul 2002 00:38:15 -0000
@@ -164,6 +164,7 @@
 	sendalot = 0;
 	off = tp->snd_nxt - tp->snd_una;
 	win = min(tp->snd_wnd, tp->snd_cwnd);
+	win = min(win, tp->snd_bwnd);
 
 	flags = tcp_outflags[tp->t_state];
 	/*
@@ -773,7 +774,8 @@
 			tp->snd_max = tp->snd_nxt;
 			/*
 			 * Time this transmission if not a retransmission and
-			 * not currently timing anything.
+			 * not currently timing anything.  Also calculate
+			 * the bandwidth (8 segment average)
 			 */
 			if (tp->t_rtttime == 0) {
 				tp->t_rtttime = ticks;
Index: tcp_subr.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_subr.c,v
retrieving revision 1.137
diff -u -r1.137 tcp_subr.c
--- tcp_subr.c	18 Jul 2002 19:06:12 -0000	1.137
+++ tcp_subr.c	20 Jul 2002 00:38:16 -0000
@@ -144,6 +144,22 @@
 SYSCTL_INT(_net_inet_tcp, OID_AUTO, isn_reseed_interval, CTLFLAG_RW,
     &tcp_isn_reseed_interval, 0, "Seconds between reseeding of ISN secret");
 
+static int	tcp_inflight_enable = 0;
+SYSCTL_INT(_net_inet_tcp, OID_AUTO, inflight_enable, CTLFLAG_RW,
+    &tcp_inflight_enable, 0, "Enable automatic TCP inflight data limiting");
+
+static int	tcp_inflight_debug = 1;
+SYSCTL_INT(_net_inet_tcp, OID_AUTO, inflight_debug, CTLFLAG_RW,
+    &tcp_inflight_debug, 0, "Debug TCP inflight calculations");
+
+static int	tcp_inflight_min = 1024;
+SYSCTL_INT(_net_inet_tcp, OID_AUTO, inflight_min, CTLFLAG_RW,
+    &tcp_inflight_min, 0, "Lower-bound for TCP inflight window");
+
+static int	tcp_inflight_max = TCP_MAXWIN << TCP_MAX_WINSHIFT;
+SYSCTL_INT(_net_inet_tcp, OID_AUTO, inflight_max, CTLFLAG_RW,
+    &tcp_inflight_max, 0, "Upper-bound for TCP inflight window");
+
 static void	tcp_cleartaocache(void);
 static struct inpcb *tcp_notify(struct inpcb *, int);
 
@@ -547,6 +563,7 @@
 	tp->t_rttmin = tcp_rexmit_min;
 	tp->t_rxtcur = TCPTV_RTOBASE;
 	tp->snd_cwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
+	tp->snd_bwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
 	tp->snd_ssthresh = TCP_MAXWIN << TCP_MAX_WINSHIFT;
 	tp->t_rcvtime = ticks;
         /*
@@ -1509,3 +1526,129 @@
 tcp_cleartaocache()
 {
 }
+
+/*
+ * Calculate the bandwidth based on received acks every 8
+ * maximal segments and smooth the result.
+ *
+ * The nominal snd_bwnd calculation is (bandwidth * rtt),
+ * the amount of data required to keep the network pipe
+ * full.  However, we cannot simply make this calculation
+ * because our adjustment of snd_bwnd based on it will
+ * be highly unstable, producing positive feedback if we are
+ * too low and also producing positive feedback if we are
+ * too high.
+ *
+ * In order to stabilize the calculation we have to increase
+ * bwnd a little, measure the bandwidth, then decrease bwnd
+ * a little and measure the rtt.  The resulting calculation
+ * should then be stable.
+ */
+void
+tcp_xmit_bandwidth_limit(struct tcpcb *tp, tcp_seq ack_seq)
+{
+	u_long bw;
+
+	/*
+	 * If inflight_enable is disabled in the middle of a tcp connection,
+	 * make sure snd_bwnd is effectively disabled.
+	 */
+	if (tcp_inflight_enable == 0) {
+	    tp->snd_bwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
+	    tp->snd_bandwidth = 0;
+	}
+
+	/*
+	 * Base periodic is once every 8 maximal segments.
+	 */
+	if (tcp_inflight_enable == 0 ||
+	    (int)(ack_seq - tp->t_bw_rtseq) < tp->t_maxseg * 8 ||
+	    tp->t_bw_rtttime == ticks) {
+		return;
+	}
+
+	/*
+	 * Calculate the bandwidth
+	 */
+	if (tp->t_bw_rtttime) {
+		bw = (ack_seq - tp->t_bw_rtseq) * hz /
+		    (ticks - tp->t_bw_rtttime);
+	} else {
+		bw = tp->snd_bandwidth;
+	}
+	tp->t_bw_rtseq = ack_seq;
+	tp->t_bw_rtttime = ticks;
+	if (tp->snd_bandwidth == 0)
+		tp->snd_bandwidth = bw;
+	else
+		tp->snd_bandwidth = (tp->snd_bandwidth * 3 + bw) >> 2;
+
+	/*
+	 * Initial Conditions
+	 */
+	if (bw && tp->snd_bwnd == TCP_MAXWIN << TCP_MAX_WINSHIFT) {
+		tp->snd_bwnd = (u_int64_t)tp->snd_bandwidth * tp->t_srtt /
+			(hz << TCP_RTT_SHIFT);
+	}
+
+	/*
+	 * calculate the bandwidth delay product and cycle through
+	 * our state machine.
+	 */
+	++tp->t_bw_state;
+
+	switch(tp->t_bw_state & 0x0F) {
+	case 0x00:
+		/*
+		 * Save the bandwidth and increase bwnd.
+		 */
+		tp->t_bw_bandwidth = tp->snd_bandwidth;
+		tp->snd_bwnd += tp->t_maxseg;
+		break;
+	case 0x04:
+		/*
+		 * If the bandwidth does not go up by at least maxseg / 4,
+		 * cycle back to neutral.
+		 */
+		if (tp->snd_bandwidth <= tp->t_bw_bandwidth + tp->t_maxseg / 4)
+			tp->snd_bwnd -= tp->t_maxseg;
+		break;
+	case 0x08:
+		/*
+		 * Save the bandwidth and decrease bwnd.
+		 */
+		tp->t_bw_bandwidth = tp->snd_bandwidth;
+		tp->snd_bwnd -= tp->t_maxseg;
+		break;
+	case 0x0C:
+		/*
+		 * If the bandwidth goes down by more than maxseg / 4,
+		 * cycle back to neutral.  Otherwise keep the change.
+		 *
+		 * Note: in the bwnd-too-high case the bandwidth does not
+		 * usually change much so we tend to keep the change,
+		 * which means we tend to decrease bwnd.  This stabilizes
+		 * the algorithm.
+		 */
+		if (tp->snd_bandwidth <= tp->t_bw_bandwidth - tp->t_maxseg / 4)
+		    tp->snd_bwnd += tp->t_maxseg;
+		break;
+	default:
+		break;	/* no action */
+	}
+	if (tcp_inflight_debug) {
+		static int ltick;
+		if ((unsigned int)(ticks - ltick) > hz) {
+			printf("BW %ld (%ld) BWND %ld srtt %d\n", 
+			    tp->snd_bandwidth, bw, tp->snd_bwnd, tp->t_srtt);
+			ltick = ticks;
+		}
+	}
+	if (tp->snd_bwnd < tcp_inflight_min)
+		tp->snd_bwnd = tcp_inflight_min;
+	if (tp->snd_bwnd < tp->t_maxseg * 2)
+		tp->snd_bwnd = tp->t_maxseg * 2;
+	if (tp->snd_bwnd > tcp_inflight_max)
+		tp->snd_bwnd = tcp_inflight_max;
+}
+
Index: tcp_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_usrreq.c,v
retrieving revision 1.76
diff -u -r1.76 tcp_usrreq.c
--- tcp_usrreq.c	13 Jun 2002 23:14:58 -0000	1.76
+++ tcp_usrreq.c	20 Jul 2002 00:38:16 -0000
@@ -875,6 +875,7 @@
 	tp->t_state = TCPS_SYN_SENT;
 	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
 	tp->iss = tcp_new_isn(tp);
+	tp->t_bw_rtseq = tp->iss;
 	tcp_sendseqinit(tp);
 
 	/*
@@ -961,6 +962,7 @@
 	tp->t_state = TCPS_SYN_SENT;
 	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
 	tp->iss = tcp_new_isn(tp);
+	tp->t_bw_rtseq = tp->iss;
 	tcp_sendseqinit(tp);
 
 	/*
Index: tcp_var.h
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_var.h,v
retrieving revision 1.82
diff -u -r1.82 tcp_var.h
--- tcp_var.h	19 Jul 2002 18:27:39 -0000	1.82
+++ tcp_var.h	20 Jul 2002 00:38:17 -0000
@@ -124,10 +124,12 @@
 
 	u_long	snd_wnd;		/* send window */
 	u_long	snd_cwnd;		/* congestion-controlled window */
+	u_long	snd_bwnd;		/* bandwidth-controlled window */
 	u_long	snd_ssthresh;		/* snd_cwnd size threshold for
 					 * for slow start exponential to
 					 * linear switch
 					 */
+	u_long	snd_bandwidth;		/* calculated bandwidth or 0 */
 	tcp_seq	snd_recover;		/* for use in fast recovery */
 
 	u_int	t_maxopd;		/* mss plus options */
@@ -137,6 +139,11 @@
 	int	t_rtttime;		/* round trip time */
 	tcp_seq	t_rtseq;		/* sequence number being timed */
 
+	int	t_bw_rtttime;		/* used for bandwidth calculation */
+	tcp_seq	t_bw_rtseq;		/* used for bandwidth calculation */
+	int	t_bw_state;		/* used for snd_bwnd calculation */
+	u_long	t_bw_bandwidth;		/* used for snd_bwnd calculation */
+
 	int	t_rxtcur;		/* current retransmit value (ticks) */
 	u_int	t_maxseg;		/* maximum segment size */
 	int	t_srtt;			/* smoothed round-trip time */
@@ -473,6 +480,7 @@
 struct tcpcb *
 	 tcp_timers(struct tcpcb *, int);
 void	 tcp_trace(int, int, struct tcpcb *, void *, struct tcphdr *, int);
+void	 tcp_xmit_bandwidth_limit(struct tcpcb *tp, tcp_seq ack_seq);
 void	 syncache_init(void);
 void	 syncache_unreach(struct in_conninfo *, struct tcphdr *);
 int	 syncache_expand(struct in_conninfo *, struct tcphdr *,

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message



