Date:      Sun, 15 Jul 2001 02:43:43 -0700 (PDT)
From:      Matt Dillon <dillon@earth.backplane.com>
To:        Matt Dillon <dillon@earth.backplane.com>
Cc:        Leo Bicknell <bicknell@ufp.org>, Drew Eckhardt <drew@PoohSticks.ORG>, hackers@FreeBSD.ORG
Subject:   eXperimental bandwidth delay product code (was Re: Network performance tuning.)
Message-ID:  <200107150943.f6F9hhx06763@earth.backplane.com>
References:  <200107130128.f6D1SFE59148@earth.backplane.com> <200107130217.f6D2HET67695@revolt.poohsticks.org> <20010712223042.A77503@ussenterprise.ufp.org> <200107131708.f6DH8ve65071@earth.backplane.com> <20010713132903.A21847@ussenterprise.ufp.org> <200107131847.f6DIlJv67457@earth.backplane.com>

    Ok, here is a patch set that tries to adjust the transmit congestion
    window and socket buffer space according to the bandwidth-delay
    product of the link.  THIS PATCH IS AGAINST STABLE!

    I make calculations based on bandwidth and round-trip-time.  I spent 
    a lot of time trying to write an algorithm that just used one or the
    other, but it turns out that bandwidth is only a stable metric when
    you are reducing the window, and rtt is only a stable metric when
    you are increasing the window.

    The algorithm is basically:  decrease the window until we notice
    that the throughput is going down, then increase the window until we
    notice the RTT is going up (indicating buffering in the network).
    However, it took quite a few hours for me to find something that
    worked across a wide range of bandwidths and pipe delays.  I had to
    deal with oscillations at high bandwidths, instability with the
    metrics being used in certain situations, and calculation overshoot
    and undershoot due to averaging.  The biggest breakthrough occurred when
    I stopped trying to time the code based on each ack coming back but
    instead timed it based on the round-trip-time interval (using the rtt
    calculation to trigger the windowing code).
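
    The asymmetric idea above (grow the window until RTT rises, shrink it
    until throughput drops) can be sketched as a toy user-space model.  This
    is not the kernel code; the function, the names, and the thresholds are
    illustrative only:

```c
/*
 * Toy model of the asymmetric window adjustment: in the growing phase,
 * add a segment per round trip until smoothed RTT exceeds its target
 * (queueing has begun); in the shrinking phase, remove a fraction of a
 * segment until measured bandwidth falls below its target.  All names
 * and thresholds are illustrative, not the kernel's.
 */
enum phase { GROWING, SHRINKING };

static long
adjust_window(long cwnd, long maxseg, long srtt, long srtt_target,
    long bw, long bw_target, enum phase *phase)
{
	if (*phase == GROWING) {
		if (srtt < srtt_target)
			cwnd += maxseg;		/* no queueing seen yet */
		else
			*phase = SHRINKING;	/* network is buffering */
	} else {
		if (bw > bw_target)
			cwnd -= maxseg / 3;	/* still above the knee */
		else
			*phase = GROWING;	/* lost throughput, back up */
	}
	return (cwnd);
}
```

    The smaller step in the shrinking phase mirrors the patch's use of
    `t_maxseg / 3` there, to avoid overshooting the saturation point.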

    I used dummynet (aka 'ipfw pipe') as well as my LAN and two T1s to
    test it.

    sysctls:

    net.inet.tcp.tcp_send_dynamic_enable

	0 -	disabled (old behavior) (default)
	1 -	enabled, no debugging output
	2 -	enabled, debug output to console (only really useful when
		testing one or two connections).

    net.inet.tcp.tcp_send_dynamic_min

	min buffering  (4096 default)

	This parameter specifies the absolute smallest buffer size the
	dynamic windowing code will go down to.  The default is 4096 bytes.
	You may want to raise it (e.g. to 8192) to avoid degenerate
	conditions on very high speed networks, or if you want to enforce
	a minimum amount of socket buffering.


    I got some pretty awesome results when I tested it... I was able to
    create a really slow, low-bandwidth dummynet link, start a transfer
    that utilized 100% of the bandwidth, and still type interactively in
    another xterm window going through the same dummynet.  There are
    immediate uses for something like this for people on modem links,
    not to mention many other situations.
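
    The bandwidth estimate that drives all of this (computed in the
    tcp_xmit_timer() changes below) boils down to: bytes acked between two
    RTT-timed sequence numbers, divided by the measured RTT in ticks,
    scaled by hz to bytes/sec, then averaged over two samples.  A minimal
    user-space sketch of that calculation, with illustrative names:

```c
/*
 * Sketch of the transmit-bandwidth estimate: sequence-number delta
 * (bytes acked over one round trip) divided by the RTT in clock ticks,
 * scaled by the tick rate hz to get bytes/sec, then blended 50/50 with
 * the previous sample to damp noise without a long-term average.
 */
static int
estimate_bw(unsigned int rtseq, unsigned int last_rtseq,
    int rtt_ticks, int hz, int prev_bw)
{
	int bw;

	if (rtt_ticks == 0)
		return (prev_bw);	/* no valid timing; keep old value */
	/* unsigned subtraction handles sequence-number wraparound */
	bw = (int)(rtseq - last_rtseq) * hz / rtt_ticks;
	return ((prev_bw + bw) / 2);	/* two-sample rolling average */
}
```

    The short two-sample average is deliberate: a long-term average reacts
    too slowly for the windowing algorithm, as noted in the patch comments.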

						    -Matt

Index: kern/uipc_socket.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.68.2.16
diff -u -r1.68.2.16 uipc_socket.c
--- kern/uipc_socket.c	2001/06/14 20:46:06	1.68.2.16
+++ kern/uipc_socket.c	2001/07/13 04:05:38
@@ -519,12 +519,44 @@
 			    snderr(so->so_proto->pr_flags & PR_CONNREQUIRED ?
 				   ENOTCONN : EDESTADDRREQ);
 		}
-		space = sbspace(&so->so_snd);
+
+		/*
+		 * Calculate the optimal write-buffer size and then reduce
+		 * by the amount already in use.  Special handling is required
+		 * to ensure that atomic writes still work as expected.
+		 *
+		 * Note: pru_sendpipe() only returns the optimal transmission
+		 * pipe size, which is roughly equivalent to what can be
+		 * transmitted and unacked.  To avoid excessive process 
+		 * wakeups we double the returned value for our recommended
+		 * buffer size.
+		 */
+		if (so->so_proto->pr_usrreqs->pru_sendpipe == NULL) {
+		    space = sbspace(&so->so_snd);
+		} else {
+		    space = (*so->so_proto->pr_usrreqs->pru_sendpipe)(so) * 2;
+		    if (atomic && space < resid + clen)
+			space = resid + clen;
+		    if (space < so->so_snd.sb_lowat)
+			space = so->so_snd.sb_lowat;
+		    if (space > so->so_snd.sb_hiwat)
+			space = so->so_snd.sb_hiwat;
+		    space = sbspace_using(&so->so_snd, space);
+		}
+
 		if (flags & MSG_OOB)
 			space += 1024;
+
+		/*
+		 * Error out if the request is impossible to satisfy.
+		 */
 		if ((atomic && resid > so->so_snd.sb_hiwat) ||
 		    clen > so->so_snd.sb_hiwat)
 			snderr(EMSGSIZE);
+
+		/*
+		 * Block if necessary.
+		 */
 		if (space < resid + clen && uio &&
 		    (atomic || space < so->so_snd.sb_lowat || space < clen)) {
 			if (so->so_state & SS_NBIO)
@@ -537,6 +569,7 @@
 			goto restart;
 		}
 		splx(s);
+
 		mp = &top;
 		space -= clen;
 		do {
Index: kern/uipc_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/uipc_usrreq.c,v
retrieving revision 1.54.2.5
diff -u -r1.54.2.5 uipc_usrreq.c
--- kern/uipc_usrreq.c	2001/03/05 13:09:01	1.54.2.5
+++ kern/uipc_usrreq.c	2001/07/13 03:56:02
@@ -427,7 +427,7 @@
 	uipc_connect2, pru_control_notsupp, uipc_detach, uipc_disconnect,
 	uipc_listen, uipc_peeraddr, uipc_rcvd, pru_rcvoob_notsupp,
 	uipc_send, uipc_sense, uipc_shutdown, uipc_sockaddr,
-	sosend, soreceive, sopoll
+	sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
 	
 /*
Index: net/raw_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/net/raw_usrreq.c,v
retrieving revision 1.18
diff -u -r1.18 raw_usrreq.c
--- net/raw_usrreq.c	1999/08/28 00:48:28	1.18
+++ net/raw_usrreq.c	2001/07/13 03:56:12
@@ -296,5 +296,5 @@
 	pru_connect2_notsupp, pru_control_notsupp, raw_udetach, 
 	raw_udisconnect, pru_listen_notsupp, raw_upeeraddr, pru_rcvd_notsupp,
 	pru_rcvoob_notsupp, raw_usend, pru_sense_null, raw_ushutdown,
-	raw_usockaddr, sosend, soreceive, sopoll
+	raw_usockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
Index: net/rtsock.c
===================================================================
RCS file: /home/ncvs/src/sys/net/rtsock.c,v
retrieving revision 1.44.2.4
diff -u -r1.44.2.4 rtsock.c
--- net/rtsock.c	2001/07/11 09:37:37	1.44.2.4
+++ net/rtsock.c	2001/07/13 03:56:16
@@ -266,7 +266,7 @@
 	pru_connect2_notsupp, pru_control_notsupp, rts_detach, rts_disconnect,
 	pru_listen_notsupp, rts_peeraddr, pru_rcvd_notsupp, pru_rcvoob_notsupp,
 	rts_send, pru_sense_null, rts_shutdown, rts_sockaddr,
-	sosend, soreceive, sopoll
+	sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
 
 /*ARGSUSED*/
Index: netatalk/ddp_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netatalk/ddp_usrreq.c,v
retrieving revision 1.17
diff -u -r1.17 ddp_usrreq.c
--- netatalk/ddp_usrreq.c	1999/04/27 12:21:14	1.17
+++ netatalk/ddp_usrreq.c	2001/07/13 03:56:25
@@ -581,5 +581,6 @@
 	at_setsockaddr,
 	sosend,
 	soreceive,
-	sopoll
+	sopoll, 
+	pru_sendpipe_notsupp
 };
Index: netatm/atm_aal5.c
===================================================================
RCS file: /home/ncvs/src/sys/netatm/atm_aal5.c,v
retrieving revision 1.6
diff -u -r1.6 atm_aal5.c
--- netatm/atm_aal5.c	1999/10/09 23:24:59	1.6
+++ netatm/atm_aal5.c	2001/07/13 03:56:40
@@ -101,7 +101,8 @@
 	atm_aal5_sockaddr,		/* pru_sockaddr */
 	sosend,				/* pru_sosend */
 	soreceive,			/* pru_soreceive */
-	sopoll				/* pru_sopoll */
+	sopoll,				/* pru_sopoll */
+	pru_sendpipe_notsupp		/* pru_sendpipe */
 };
 #endif
 
Index: netatm/atm_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netatm/atm_usrreq.c,v
retrieving revision 1.6
diff -u -r1.6 atm_usrreq.c
--- netatm/atm_usrreq.c	1999/08/28 00:48:39	1.6
+++ netatm/atm_usrreq.c	2001/07/13 03:58:57
@@ -73,6 +73,10 @@
 	pru_sense_null,			/* pru_sense */
 	atm_proto_notsupp1,		/* pru_shutdown */
 	atm_proto_notsupp3,		/* pru_sockaddr */
+	NULL,				/* pru_sosend */
+	NULL,				/* pru_soreceive */
+	NULL,				/* pru_sopoll */
+	pru_sendpipe_notsupp            /* pru_sendpipe */
 };
 #endif
 
Index: netgraph/ng_socket.c
===================================================================
RCS file: /home/ncvs/src/sys/netgraph/ng_socket.c,v
retrieving revision 1.11.2.3
diff -u -r1.11.2.3 ng_socket.c
--- netgraph/ng_socket.c	2001/02/02 11:59:27	1.11.2.3
+++ netgraph/ng_socket.c	2001/07/13 03:59:30
@@ -907,7 +907,8 @@
 	ng_setsockaddr,
 	sosend,
 	soreceive,
-	sopoll
+	sopoll,
+	pru_sendpipe_notsupp
 };
 
 static struct pr_usrreqs ngd_usrreqs = {
@@ -930,7 +931,8 @@
 	ng_setsockaddr,
 	sosend,
 	soreceive,
-	sopoll
+	sopoll,
+	pru_sendpipe_notsupp
 };
 
 /*
Index: netinet/ip_divert.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/ip_divert.c,v
retrieving revision 1.42.2.3
diff -u -r1.42.2.3 ip_divert.c
--- netinet/ip_divert.c	2001/02/27 09:41:15	1.42.2.3
+++ netinet/ip_divert.c	2001/07/13 03:59:47
@@ -540,5 +540,5 @@
 	pru_connect_notsupp, pru_connect2_notsupp, in_control, div_detach,
 	div_disconnect, pru_listen_notsupp, in_setpeeraddr, pru_rcvd_notsupp,
 	pru_rcvoob_notsupp, div_send, pru_sense_null, div_shutdown,
-	in_setsockaddr, sosend, soreceive, sopoll
+	in_setsockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
Index: netinet/raw_ip.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/raw_ip.c,v
retrieving revision 1.64.2.6
diff -u -r1.64.2.6 raw_ip.c
--- netinet/raw_ip.c	2001/07/03 11:01:46	1.64.2.6
+++ netinet/raw_ip.c	2001/07/13 03:59:56
@@ -680,5 +680,5 @@
 	pru_connect2_notsupp, in_control, rip_detach, rip_disconnect,
 	pru_listen_notsupp, in_setpeeraddr, pru_rcvd_notsupp,
 	pru_rcvoob_notsupp, rip_send, pru_sense_null, rip_shutdown,
-	in_setsockaddr, sosend, soreceive, sopoll
+	in_setsockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
Index: netinet/tcp_input.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_input.c,v
retrieving revision 1.107.2.15
diff -u -r1.107.2.15 tcp_input.c
--- netinet/tcp_input.c	2001/07/08 02:21:43	1.107.2.15
+++ netinet/tcp_input.c	2001/07/15 09:23:07
@@ -132,6 +132,14 @@
     &drop_synfin, 0, "Drop TCP packets with SYN+FIN set");
 #endif
 
+int      tcp_send_dynamic_enable = 0;
+SYSCTL_INT(_net_inet_tcp, OID_AUTO, tcp_send_dynamic_enable, CTLFLAG_RW,
+    &tcp_send_dynamic_enable, 0, "enable dynamic control of sendspace");
+int      tcp_send_dynamic_min = 4096;
+SYSCTL_INT(_net_inet_tcp, OID_AUTO, tcp_send_dynamic_min, CTLFLAG_RW,
+    &tcp_send_dynamic_min, 0, "set minimum dynamic buffer space");
+
+
 struct inpcbhead tcb;
 #define	tcb6	tcb  /* for KAME src sync over BSD*'s */
 struct inpcbinfo tcbinfo;
@@ -142,8 +150,9 @@
 	    struct tcphdr *, struct mbuf *, int));
 static int	 tcp_reass __P((struct tcpcb *, struct tcphdr *, int *,
 				struct mbuf *));
-static void	 tcp_xmit_timer __P((struct tcpcb *, int));
+static void	 tcp_xmit_timer __P((struct tcpcb *, int, tcp_seq));
 static int	 tcp_newreno __P((struct tcpcb *, struct tcphdr *));
+static void	tcp_ack_dynamic_cwnd(struct tcpcb *tp, struct socket *so);
 
 /* Neighbor Discovery, Neighbor Unreachability Detection Upper layer hint. */
 #ifdef INET6
@@ -931,12 +940,16 @@
 					tp->snd_nxt = tp->snd_max;
 					tp->t_badrxtwin = 0;
 				}
-				if ((to.to_flag & TOF_TS) != 0)
-					tcp_xmit_timer(tp,
-					    ticks - to.to_tsecr + 1);
-				else if (tp->t_rtttime &&
-					    SEQ_GT(th->th_ack, tp->t_rtseq))
-					tcp_xmit_timer(tp, ticks - tp->t_rtttime);
+				/*
+			         * note: do not include a sequence number 
+				 * for anything but t_rtttime timings, see
+				 * tcp_xmit_timer().
+				 */
+				if (tp->t_rtttime &&
+				    SEQ_GT(th->th_ack, tp->t_rtseq))
+					tcp_xmit_timer(tp, tp->t_rtttime, tp->t_rtseq);
+				else if ((to.to_flag & TOF_TS) != 0)
+					tcp_xmit_timer(tp, to.to_tsecr - 1, 0);
 				acked = th->th_ack - tp->snd_una;
 				tcpstat.tcps_rcvackpack++;
 				tcpstat.tcps_rcvackbyte += acked;
@@ -1927,11 +1940,14 @@
 		 * Since we now have an rtt measurement, cancel the
 		 * timer backoff (cf., Phil Karn's retransmit alg.).
 		 * Recompute the initial retransmit timer.
+		 *
+		 * note: do not include a sequence number for anything
+		 * but t_rtttime timings, see tcp_xmit_timer().
 		 */
-		if (to.to_flag & TOF_TS)
-			tcp_xmit_timer(tp, ticks - to.to_tsecr + 1);
-		else if (tp->t_rtttime && SEQ_GT(th->th_ack, tp->t_rtseq))
-			tcp_xmit_timer(tp, ticks - tp->t_rtttime);
+		if (tp->t_rtttime && SEQ_GT(th->th_ack, tp->t_rtseq))
+			tcp_xmit_timer(tp, tp->t_rtttime, tp->t_rtseq);
+		else if (to.to_flag & TOF_TS)
+			tcp_xmit_timer(tp, to.to_tsecr - 1, 0);
 
 		/*
 		 * If all outstanding data is acked, stop retransmit
@@ -1955,25 +1971,40 @@
 
 		/*
 		 * When new data is acked, open the congestion window.
-		 * If the window gives us less than ssthresh packets
-		 * in flight, open exponentially (maxseg per packet).
-		 * Otherwise open linearly: maxseg per window
-		 * (maxseg^2 / cwnd per packet).
-		 */
-		{
-		register u_int cw = tp->snd_cwnd;
-		register u_int incr = tp->t_maxseg;
-
-		if (cw > tp->snd_ssthresh)
-			incr = incr * incr / cw;
-		/*
+		 * We no longer use ssthresh because it just does not work
+		 * right.   Instead we try to avoid packet loss altogether
+		 * by avoiding excessive buffering of packet data in the
+		 * network.  
+		 *
 		 * If t_dupacks != 0 here, it indicates that we are still
 		 * in NewReno fast recovery mode, so we leave the congestion
 		 * window alone.
 		 */
-		if (tcp_do_newreno == 0 || tp->t_dupacks == 0)
-			tp->snd_cwnd = min(cw + incr,TCP_MAXWIN<<tp->snd_scale);
+
+		if (tcp_do_newreno == 0 || tp->t_dupacks == 0) {
+			if (tp->t_txbandwidth && tcp_send_dynamic_enable) {
+				tcp_ack_dynamic_cwnd(tp, so);
+			} else {
+				int incr = tp->t_maxseg;
+				if (tp->snd_cwnd > tp->snd_ssthresh)
+					incr = incr * incr / tp->snd_cwnd;
+				tp->snd_cwnd += incr;
+			}
+			/*
+			 * Enforce the minimum and maximum congestion window.
+			 * Remember, this whole section is hit when we get a
+			 * good ack so our window is at least 2 packets.
+			 */
+			if (tp->snd_cwnd > (TCP_MAXWIN << tp->snd_scale))
+				tp->snd_cwnd = TCP_MAXWIN << tp->snd_scale;
+			if (tp->snd_cwnd < tp->t_maxseg * 2)
+				tp->snd_cwnd = tp->t_maxseg * 2;
 		}
+
+		/*
+		 * Clean out buffered transmit data that we no longer need
+		 * to keep around.
+		 */
 		if (acked > so->so_snd.sb_cc) {
 			tp->snd_wnd -= so->so_snd.sb_cc;
 			sbdrop(&so->so_snd, (int)so->so_snd.sb_cc);
@@ -2531,19 +2562,135 @@
 	panic("tcp_pulloutofband");
 }
 
+/*
+ * Dynamically adjust the congestion window.  The sweet spot is slightly
+ * higher than the point where the bandwidth begins to degrade.  Beyond
+ * that, the extra packets wind up being buffered in the network.
+ *
+ * We use an asymmetric algorithm.  We increase the window until we see
+ * a 5% increase in the round-trip-time (SRTT).  We then assume that this
+ * the saturation point and decrease the window until we see a loss in
+ * bandwidth.
+ *
+ * This routine is master-timed off the round-trip time of the packet,
+ * allowing us to count round trips.  Since bandwidth changes need at
+ * least an rtt cycle to occur, this is much better than counting packets
+ * and should be independent of bandwidth, pipe size, etc...
+ */
+
+#define CWND_COUNT_START		2*1
+#define CWND_COUNT_DECR			2*3
+#define CWND_COUNT_INCR			(CWND_COUNT_DECR + 2*8)
+#define CWND_COUNT_STABILIZED		(CWND_COUNT_INCR + 2*4)
+#define CWND_COUNT_IMPROVING		(CWND_COUNT_STABILIZED + 2*2)
+#define CWND_COUNT_NOT_IMPROVING	(CWND_COUNT_IMPROVING + 2*8)
+
+static void
+tcp_ack_dynamic_cwnd(struct tcpcb *tp, struct socket *so)
+{
+	/*
+	 * Make adjustments only at every complete round trip.
+	 */
+	if ((tp->t_txbwcount & 1) == 0)
+		return;
+	++tp->t_txbwcount;
+	if (tp->t_txbwcount == CWND_COUNT_START) {
+		/*
+		 * Set a rtt performance loss target of 20%
+		 */
+		tp->t_last_txbandwidth = tp->t_srtt + tp->t_srtt / 5;
+	} else if (tp->t_txbwcount >= CWND_COUNT_DECR &&
+	    tp->t_txbwcount < CWND_COUNT_INCR &&
+	    tp->t_srtt < tp->t_last_txbandwidth) {
+		/*
+		 * Increase cwnd in maxseg chunks until we hit our target.  
+		 * The target represents the point where packets are starting
+		 * to be buffered significantly in the network.
+		 */
+		tp->snd_cwnd += tp->t_maxseg;
+		tp->t_txbwcount = CWND_COUNT_START;
+
+		/*
+		 * snap target, required to avoid oscillation at high
+		 * bandwidths
+		 */
+		if (tp->t_last_txbandwidth > tp->t_srtt + tp->t_srtt / 5)
+			tp->t_last_txbandwidth = tp->t_srtt + tp->t_srtt / 5;
+		/*
+		 * Switch directions if we hit the top.
+		 */
+		if (tp->snd_cwnd >= so->so_snd.sb_hiwat ||
+		    tp->snd_cwnd >= (TCP_MAXWIN << tp->snd_scale)) {
+			tp->snd_cwnd = min(so->so_snd.sb_hiwat, (TCP_MAXWIN << tp->snd_scale));
+			tp->t_txbwcount = CWND_COUNT_INCR - 2;
+		}
+	} else if (tp->t_txbwcount == CWND_COUNT_INCR) {
+		/*
+		 * We hit 5% performance loss.  Do nothing (wait until
+		 * we stabilize).
+		 */
+	} else if (tp->t_txbwcount == CWND_COUNT_STABILIZED) {
+		/*
+		 * srtt started to go up, we are at the pipe limit and
+		 * must be at the maximum bandwidth.  Reduce the window
+		 * size until we lose 5% of our bandwidth.  Use smaller
+		 * chunks to avoid overshooting.
+		 */
+		tp->t_last_txbandwidth = tp->t_txbandwidth - tp->t_txbandwidth / 20;
+		tp->snd_cwnd -= tp->t_maxseg / 3;
+	} else if (tp->t_txbwcount >= CWND_COUNT_IMPROVING && 
+	    tp->t_txbandwidth > tp->t_last_txbandwidth) {
+		/*
+		 * We saw an improvement, bump the window again, loop this
+		 * state.  If the pipeline isn't full then adding another
+		 * packet should improve bandwidth by t_maxseg.  Use seg / 3
+		 * to deal with any noise.
+		 */
+		tp->snd_cwnd -= tp->t_maxseg / 3;
+
+		/*
+		 * snap target, required to avoid oscillation at high
+		 * bandwidths
+		 */
+		tp->t_txbwcount = CWND_COUNT_STABILIZED;
+		if (tp->t_last_txbandwidth < tp->t_txbandwidth - tp->t_txbandwidth / 20)
+			tp->t_last_txbandwidth = tp->t_txbandwidth - tp->t_txbandwidth / 20;
+		/*
+		 * Switch directions if we hit bottom.
+		 */
+		if (tp->snd_cwnd < tcp_send_dynamic_min ||
+		    tp->snd_cwnd <= tp->t_maxseg * 2) {
+			tp->snd_cwnd = max(tcp_send_dynamic_min, tp->t_maxseg);
+			tp->t_txbwcount = 0;
+		}
+	} else if (tp->t_txbwcount >= CWND_COUNT_NOT_IMPROVING) {
+		/*
+		 * No improvement, start upward again.  Loop to recalculate
+		 * the -5%.  We can recalculate immediately and do not require
+		 * additional stabilization time.
+		 */
+		tp->snd_cwnd += tp->t_maxseg / 2;
+		tp->t_txbwcount = 0;
+	}
+}
+
 /*
- * Collect new round-trip time estimate
- * and update averages and current timeout.
+ * Collect new round-trip time estimate and update averages, current timeout,
+ * and transmit bandwidth.
  */
 static void
-tcp_xmit_timer(tp, rtt)
+tcp_xmit_timer(tp, rtttime, rtseq)
 	register struct tcpcb *tp;
-	int rtt;
+	int rtttime;
+	tcp_seq rtseq;
 {
-	register int delta;
+	int delta;
+	int rtt;
 
 	tcpstat.tcps_rttupdated++;
 	tp->t_rttupdated++;
+
+	rtt = ticks - rtttime;
 	if (tp->t_srtt != 0) {
 		/*
 		 * srtt is stored as fixed point with 5 bits after the
@@ -2582,8 +2729,30 @@
 		tp->t_srtt = rtt << TCP_RTT_SHIFT;
 		tp->t_rttvar = rtt << (TCP_RTTVAR_SHIFT - 1);
 	}
-	tp->t_rtttime = 0;
 	tp->t_rxtshift = 0;
+
+	/*
+	 * Calculate the transmit-side throughput, in bytes/sec.  This is
+	 * used to dynamically size the congestion window to the pipe.  We
+	 * average over 2 packets only.  rtseq is only passed for t_rtttime
+	 * based timings, which in turn only occur on an interval close to
+	 * the round trip time of the packet.  We have to do this in order
+	 * to get accurate bandwidths without having to take a long term
+	 * average, which blows up the dynamic windowing algorithm.
+	 */
+	if (rtseq && rtt) {
+		tp->t_rtttime = 0;
+		if (tp->t_last_rtseq) {
+			int bw;
+
+			bw = (rtseq - tp->t_last_rtseq) * hz / rtt;
+			bw = (tp->t_txbandwidth + bw) / 2;
+			tp->t_txbandwidth = bw;
+			tp->t_txbwcount |= 1;
+		}
+		tp->t_last_rtseq = rtseq;
+		tp->t_last_rtttime = rtttime;
+	}
 
 	/*
 	 * the retransmit should happen at rtt + 4 * rttvar.
Index: netinet/tcp_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_usrreq.c,v
retrieving revision 1.51.2.7
diff -u -r1.51.2.7 tcp_usrreq.c
--- netinet/tcp_usrreq.c	2001/07/08 02:21:44	1.51.2.7
+++ netinet/tcp_usrreq.c	2001/07/15 05:31:52
@@ -494,6 +494,47 @@
 }
 
 /*
+ * Calculate the optimal transmission pipe size.  This is used to limit the
+ * amount of data we allow to be buffered in order to reduce memory use,
+ * allowing connections to dynamically adjust to the bandwidth product of
+ * their links.
+ *
+ * For tcp we return approximately the congestion window size, which
+ * winds up being the bandwidth delay product in a lossless environment.
+ */
+static int
+tcp_usr_sendpipe(struct socket *so)
+{
+	struct inpcb *inp;
+	int size = so->so_snd.sb_hiwat;
+
+	if (tcp_send_dynamic_enable && (inp = sotoinpcb(so)) != NULL) {
+		struct tcpcb *tp;
+
+		if ((tp = intotcpcb(inp)) != NULL) {
+			size = tp->snd_cwnd;
+			if (size > tp->snd_wnd)
+				size = tp->snd_wnd;
+
+			/*
+			 * debugging & minimum transmit buffer availability
+			 */
+			if (tcp_send_dynamic_enable > 1) {
+				static int last_hz;
+
+				if (last_hz != ticks / hz) {
+					last_hz = ticks / hz;
+					printf("tcp_user_sendpipe: size=%d bw=%d lbw=%d count=%d srtt=%d\n", size, tp->t_txbandwidth, tp->t_last_txbandwidth, tp->t_txbwcount, tp->t_srtt);
+				}
+			}
+			if (size < tcp_send_dynamic_min)
+				size = tcp_send_dynamic_min;
+		}
+	}
+	return(size);
+}
+
+/*
  * Do a send by putting data in output queue and updating urgent
  * marker if URG set.  Possibly send more data.  Unlike the other
  * pru_*() routines, the mbuf chains are our responsibility.  We
@@ -674,7 +715,7 @@
 	tcp_usr_connect, pru_connect2_notsupp, in_control, tcp_usr_detach,
 	tcp_usr_disconnect, tcp_usr_listen, in_setpeeraddr, tcp_usr_rcvd,
 	tcp_usr_rcvoob, tcp_usr_send, pru_sense_null, tcp_usr_shutdown,
-	in_setsockaddr, sosend, soreceive, sopoll
+	in_setsockaddr, sosend, soreceive, sopoll, tcp_usr_sendpipe
 };
 
 #ifdef INET6
@@ -683,7 +724,7 @@
 	tcp6_usr_connect, pru_connect2_notsupp, in6_control, tcp_usr_detach,
 	tcp_usr_disconnect, tcp6_usr_listen, in6_mapped_peeraddr, tcp_usr_rcvd,
 	tcp_usr_rcvoob, tcp_usr_send, pru_sense_null, tcp_usr_shutdown,
-	in6_mapped_sockaddr, sosend, soreceive, sopoll
+	in6_mapped_sockaddr, sosend, soreceive, sopoll, tcp_usr_sendpipe
 };
 #endif /* INET6 */
 
Index: netinet/tcp_var.h
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_var.h,v
retrieving revision 1.56.2.7
diff -u -r1.56.2.7 tcp_var.h
--- netinet/tcp_var.h	2001/07/08 02:21:44	1.56.2.7
+++ netinet/tcp_var.h	2001/07/15 07:25:48
@@ -95,6 +95,7 @@
 #define	TF_SENDCCNEW	0x08000		/* send CCnew instead of CC in SYN */
 #define	TF_MORETOCOME	0x10000		/* More data to be appended to sock */
 #define	TF_LQ_OVERFLOW	0x20000		/* listen queue overflow */
+#define TF_BWSCANUP	0x40000
 	int	t_force;		/* 1 if forcing out a byte */
 
 	tcp_seq	snd_una;		/* send unacknowledged */
@@ -128,6 +129,11 @@
 	u_long	t_starttime;		/* time connection was established */
 	int	t_rtttime;		/* round trip time */
 	tcp_seq	t_rtseq;		/* sequence number being timed */
+	int	t_last_rtttime;
+	tcp_seq	t_last_rtseq;		/* sequence number being timed */
+	int	t_txbandwidth;		/* transmit bandwidth/delay */
+	int	t_last_txbandwidth;
+	int	t_txbwcount;
 
 	int	t_rxtcur;		/* current retransmit value (ticks) */
 	u_int	t_maxseg;		/* maximum segment size */
@@ -371,6 +377,8 @@
 extern	int tcp_do_newreno;
 extern	int ss_fltsz;
 extern	int ss_fltsz_local;
+extern	int tcp_send_dynamic_enable;
+extern	int tcp_send_dynamic_min;
 
 void	 tcp_canceltimers __P((struct tcpcb *));
 struct tcpcb *
Index: netinet/udp_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/udp_usrreq.c,v
retrieving revision 1.64.2.11
diff -u -r1.64.2.11 udp_usrreq.c
--- netinet/udp_usrreq.c	2001/07/03 11:01:47	1.64.2.11
+++ netinet/udp_usrreq.c	2001/07/13 04:00:17
@@ -923,6 +923,6 @@
 	pru_connect2_notsupp, in_control, udp_detach, udp_disconnect, 
 	pru_listen_notsupp, in_setpeeraddr, pru_rcvd_notsupp, 
 	pru_rcvoob_notsupp, udp_send, pru_sense_null, udp_shutdown,
-	in_setsockaddr, sosend, soreceive, sopoll
+	in_setsockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
 
Index: netinet6/raw_ip6.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet6/raw_ip6.c,v
retrieving revision 1.7.2.3
diff -u -r1.7.2.3 raw_ip6.c
--- netinet6/raw_ip6.c	2001/07/03 11:01:55	1.7.2.3
+++ netinet6/raw_ip6.c	2001/07/13 04:00:25
@@ -733,5 +733,5 @@
 	pru_connect2_notsupp, in6_control, rip6_detach, rip6_disconnect,
 	pru_listen_notsupp, in6_setpeeraddr, pru_rcvd_notsupp,
 	pru_rcvoob_notsupp, rip6_send, pru_sense_null, rip6_shutdown,
-	in6_setsockaddr, sosend, soreceive, sopoll
+	in6_setsockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
Index: netipx/ipx_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netipx/ipx_usrreq.c,v
retrieving revision 1.26.2.1
diff -u -r1.26.2.1 ipx_usrreq.c
--- netipx/ipx_usrreq.c	2001/02/22 09:44:18	1.26.2.1
+++ netipx/ipx_usrreq.c	2001/07/13 04:00:38
@@ -89,7 +89,7 @@
 	ipx_connect, pru_connect2_notsupp, ipx_control, ipx_detach,
 	ipx_disconnect, pru_listen_notsupp, ipx_peeraddr, pru_rcvd_notsupp,
 	pru_rcvoob_notsupp, ipx_send, pru_sense_null, ipx_shutdown,
-	ipx_sockaddr, sosend, soreceive, sopoll
+	ipx_sockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
 
 struct	pr_usrreqs ripx_usrreqs = {
@@ -97,7 +97,7 @@
 	ipx_connect, pru_connect2_notsupp, ipx_control, ipx_detach,
 	ipx_disconnect, pru_listen_notsupp, ipx_peeraddr, pru_rcvd_notsupp,
 	pru_rcvoob_notsupp, ipx_send, pru_sense_null, ipx_shutdown,
-	ipx_sockaddr, sosend, soreceive, sopoll
+	ipx_sockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
 
 /*
Index: netipx/spx_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netipx/spx_usrreq.c,v
retrieving revision 1.27.2.1
diff -u -r1.27.2.1 spx_usrreq.c
--- netipx/spx_usrreq.c	2001/02/22 09:44:18	1.27.2.1
+++ netipx/spx_usrreq.c	2001/07/13 04:00:46
@@ -107,7 +107,7 @@
 	spx_connect, pru_connect2_notsupp, ipx_control, spx_detach,
 	spx_usr_disconnect, spx_listen, ipx_peeraddr, spx_rcvd,
 	spx_rcvoob, spx_send, pru_sense_null, spx_shutdown,
-	ipx_sockaddr, sosend, soreceive, sopoll
+	ipx_sockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
 
 struct	pr_usrreqs spx_usrreq_sps = {
@@ -115,7 +115,7 @@
 	spx_connect, pru_connect2_notsupp, ipx_control, spx_detach,
 	spx_usr_disconnect, spx_listen, ipx_peeraddr, spx_rcvd,
 	spx_rcvoob, spx_send, pru_sense_null, spx_shutdown,
-	ipx_sockaddr, sosend, soreceive, sopoll
+	ipx_sockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
 
 void
Index: netkey/keysock.c
===================================================================
RCS file: /home/ncvs/src/sys/netkey/keysock.c,v
retrieving revision 1.1.2.2
diff -u -r1.1.2.2 keysock.c
--- netkey/keysock.c	2001/07/03 11:02:00	1.1.2.2
+++ netkey/keysock.c	2001/07/13 04:00:51
@@ -586,7 +586,7 @@
 	key_disconnect, pru_listen_notsupp, key_peeraddr,
 	pru_rcvd_notsupp,
 	pru_rcvoob_notsupp, key_send, pru_sense_null, key_shutdown,
-	key_sockaddr, sosend, soreceive, sopoll
+	key_sockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
 
 /* sysctl */
Index: netnatm/natm.c
===================================================================
RCS file: /home/ncvs/src/sys/netnatm/natm.c,v
retrieving revision 1.12
diff -u -r1.12 natm.c
--- netnatm/natm.c	2000/02/13 03:32:03	1.12
+++ netnatm/natm.c	2001/07/13 04:01:15
@@ -413,7 +413,7 @@
 	natm_usr_detach, natm_usr_disconnect, pru_listen_notsupp,
 	natm_usr_peeraddr, pru_rcvd_notsupp, pru_rcvoob_notsupp,
 	natm_usr_send, pru_sense_null, natm_usr_shutdown,
-	natm_usr_sockaddr, sosend, soreceive, sopoll
+	natm_usr_sockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
 };
 
 #else  /* !FREEBSD_USRREQS */
Index: sys/protosw.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/protosw.h,v
retrieving revision 1.28.2.2
diff -u -r1.28.2.2 protosw.h
--- sys/protosw.h	2001/07/03 11:02:01	1.28.2.2
+++ sys/protosw.h	2001/07/13 04:02:15
@@ -228,6 +228,7 @@
 				      struct mbuf **controlp, int *flagsp));
 	int	(*pru_sopoll) __P((struct socket *so, int events,
 				     struct ucred *cred, struct proc *p));
+	int	(*pru_sendpipe) __P((struct socket *so));
 };
 
 int	pru_accept_notsupp __P((struct socket *so, struct sockaddr **nam));
@@ -240,6 +241,7 @@
 int	pru_rcvd_notsupp __P((struct socket *so, int flags));
 int	pru_rcvoob_notsupp __P((struct socket *so, struct mbuf *m, int flags));
 int	pru_sense_null __P((struct socket *so, struct stat *sb));
+#define	pru_sendpipe_notsupp	NULL
 
 #endif /* _KERNEL */
 
Index: sys/socketvar.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/socketvar.h,v
retrieving revision 1.46.2.5
diff -u -r1.46.2.5 socketvar.h
--- sys/socketvar.h	2001/02/26 04:23:21	1.46.2.5
+++ sys/socketvar.h	2001/07/13 03:47:25
@@ -188,9 +188,11 @@
  * still be negative (cc > hiwat or mbcnt > mbmax).  Should detect
  * overflow and return 0.  Should use "lmin" but it doesn't exist now.
  */
-#define	sbspace(sb) \
-    ((long) imin((int)((sb)->sb_hiwat - (sb)->sb_cc), \
+#define sbspace_using(sb, hiwat) \
+    ((long) imin((int)((hiwat) - (sb)->sb_cc), \
 	 (int)((sb)->sb_mbmax - (sb)->sb_mbcnt)))
+
+#define	sbspace(sb)	sbspace_using(sb, (sb)->sb_hiwat)
 
 /* do we have to send all at once on a socket? */
 #define	sosendallatonce(so) \
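
A user-space sketch of how tcp_usr_sendpipe() in the tcp_usrreq.c hunk
above sizes the pipe (plain ints stand in for the socket and tcpcb
structures; the function name is mine, not the kernel's):

```c
/*
 * Sketch of the sendpipe sizing: default to the socket's high-water
 * mark; with dynamic sizing enabled, narrow to min(cwnd, snd_wnd) --
 * roughly what can be in flight unacked -- but never below the
 * tcp_send_dynamic_min floor.  sosend() then doubles this value for
 * the recommended buffer size to avoid excessive process wakeups.
 */
static int
sendpipe_size(int sb_hiwat, int cwnd, int snd_wnd,
    int dynamic_enable, int dynamic_min)
{
	int size = sb_hiwat;

	if (dynamic_enable) {
		size = cwnd;
		if (size > snd_wnd)
			size = snd_wnd;		/* peer's offered window */
		if (size < dynamic_min)
			size = dynamic_min;	/* enforce the floor */
	}
	return (size);
}
```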
