From owner-p4-projects@FreeBSD.ORG Sat Jul 11 10:34:04 2009 Return-Path: Delivered-To: p4-projects@freebsd.org Received: by hub.freebsd.org (Postfix, from userid 32767) id F19921065673; Sat, 11 Jul 2009 10:34:03 +0000 (UTC) Delivered-To: perforce@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9638A106566C for ; Sat, 11 Jul 2009 10:34:03 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from repoman.freebsd.org (repoman.freebsd.org [IPv6:2001:4f8:fff6::29]) by mx1.freebsd.org (Postfix) with ESMTP id 823E18FC14 for ; Sat, 11 Jul 2009 10:34:03 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from repoman.freebsd.org (localhost [127.0.0.1]) by repoman.freebsd.org (8.14.3/8.14.3) with ESMTP id n6BAY3na089267 for ; Sat, 11 Jul 2009 10:34:03 GMT (envelope-from andre@freebsd.org) Received: (from perforce@localhost) by repoman.freebsd.org (8.14.3/8.14.3/Submit) id n6BAY3Gg089265 for perforce@freebsd.org; Sat, 11 Jul 2009 10:34:03 GMT (envelope-from andre@freebsd.org) Date: Sat, 11 Jul 2009 10:34:03 GMT Message-Id: <200907111034.n6BAY3Gg089265@repoman.freebsd.org> X-Authentication-Warning: repoman.freebsd.org: perforce set sender to andre@freebsd.org using -f From: Andre Oppermann To: Perforce Change Reviews Cc: Subject: PERFORCE change 165925 for review X-BeenThere: p4-projects@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: p4 projects tree changes List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 11 Jul 2009 10:34:04 -0000 http://perforce.freebsd.org/chv.cgi?CH=165925 Change 165925 by andre@andre_t61 on 2009/07/11 10:33:34 Try a different approach to the SACK scoreboard using a ranged RB-tree instead of a TAILQ. The maintenance of the scoreboard is much simpler. Affected files ... .. //depot/projects/tcp_new/netinet/tcp_sack.c#3 edit .. //depot/projects/tcp_new/netinet/tcp_subr.c#6 edit .. //depot/projects/tcp_new/netinet/tcp_var.h#11 edit Differences ... ==== //depot/projects/tcp_new/netinet/tcp_sack.c#3 (text+ko) ==== @@ -1,6 +1,6 @@ /*- - * Copyright (c) 1982, 1986, 1988, 1990, 1993, 1994, 1995 - * The Regents of the University of California. + * Copyright (c) 2009 + * Andre Oppermann, Internet Business Solutions AG. * All rights reserved. * * Redistribution and use in source and binary forms, with or without @@ -11,7 +11,7 @@ * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. - * 4. Neither the name of the University nor the names of its contributors + * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * @@ -30,46 +30,6 @@ * @(#)tcp_sack.c 8.12 (Berkeley) 5/24/95 */ -/*- - * @@(#)COPYRIGHT 1.1 (NRL) 17 January 1995 - * - * NRL grants permission for redistribution and use in source and binary - * forms, with or without modification, of the software and documentation - * created at NRL provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * 3. All advertising materials mentioning features or use of this software - * must display the following acknowledgements: - * This product includes software developed by the University of - * California, Berkeley and its contributors. - * This product includes software developed at the Information - * Technology Division, US Naval Research Laboratory. - * 4. Neither the name of the NRL nor the names of its contributors - * may be used to endorse or promote products derived from this software - * without specific prior written permission. - * - * THE SOFTWARE PROVIDED BY NRL IS PROVIDED BY NRL AND CONTRIBUTORS ``AS - * IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED - * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A - * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NRL OR - * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, - * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, - * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR - * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF - * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING - * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation - * are those of the authors and should not be interpreted as representing - * official policies, either expressed or implied, of the US Naval - * Research Laboratory (NRL). - */ - #include __FBSDID("$FreeBSD: src/sys/netinet/tcp_sack.c,v 1.40 2007/05/11 11:21:43 rwatson Exp $"); @@ -119,11 +79,12 @@ #include #endif /* TCPDEBUG */ -#include +/* + * Store all SACK blocks of the scoreboard in a ranged red-black tree. + */ -extern struct uma_zone *sack_hole_zone; +SYSCTL_NODE(_net_inet_tcp, OID_AUTO, sack, CTLFLAG_RW, 0, "TCP SACK"); -SYSCTL_NODE(_net_inet_tcp, OID_AUTO, sack, CTLFLAG_RW, 0, "TCP SACK"); int tcp_do_sack = 1; SYSCTL_INT(_net_inet_tcp_sack, OID_AUTO, enable, CTLFLAG_RW, &tcp_do_sack, 0, "Enable/Disable TCP SACK support"); @@ -139,544 +100,168 @@ &tcp_sack_globalmaxholes, 0, "Global maximum number of TCP SACK holes"); -static int tcp_sack_globalholes = 0; -SYSCTL_INT(_net_inet_tcp_sack, OID_AUTO, globalholes, CTLFLAG_RD, - &tcp_sack_globalholes, 0, - "Global number of TCP SACK holes currently allocated"); +static uma_zone_t tcp_sackblock_zone; -/* - * This function is called upon receipt of new valid data (while not in - * header prediction mode), and it updates the ordered list of sacks. - */ void -tcp_update_sack_list(struct tcpcb *tp, tcp_seq rcv_start, tcp_seq rcv_end) +tcp_sack_init(void) { - /* - * First reported block MUST be the most recent one. Subsequent - * blocks SHOULD be in the order in which they arrived at the - * receiver. These two conditions make the implementation fully - * compliant with RFC 2018. - */ - struct sackblk head_blk, saved_blks[MAX_SACK_BLKS]; - int num_head, num_saved, i; - - INP_LOCK_ASSERT(tp->t_inpcb); - - /* Check arguments. */ - KASSERT(SEQ_LT(rcv_start, rcv_end), ("rcv_start < rcv_end")); - - /* SACK block for the received segment. */ - head_blk.start = rcv_start; - head_blk.end = rcv_end; - - /* - * Merge updated SACK blocks into head_blk, and save unchanged SACK - * blocks into saved_blks[]. num_saved will have the number of the - * saved SACK blocks. - */ - num_saved = 0; - for (i = 0; i < tp->rcv_numsacks; i++) { - tcp_seq start = tp->sackblks[i].start; - tcp_seq end = tp->sackblks[i].end; - if (SEQ_GEQ(start, end) || SEQ_LEQ(start, tp->rcv_nxt)) { - /* - * Discard this SACK block. - */ - } else if (SEQ_LEQ(head_blk.start, end) && - SEQ_GEQ(head_blk.end, start)) { - /* - * Merge this SACK block into head_blk. This SACK - * block itself will be discarded. - */ - if (SEQ_GT(head_blk.start, start)) - head_blk.start = start; - if (SEQ_LT(head_blk.end, end)) - head_blk.end = end; - } else { - /* - * Save this SACK block. - */ - saved_blks[num_saved].start = start; - saved_blks[num_saved].end = end; - num_saved++; - } - } - - /* - * Update SACK list in tp->sackblks[]. - */ - num_head = 0; - if (SEQ_GT(head_blk.start, tp->rcv_nxt)) { - /* - * The received data segment is an out-of-order segment. Put - * head_blk at the top of SACK list. - */ - tp->sackblks[0] = head_blk; - num_head = 1; - /* - * If the number of saved SACK blocks exceeds its limit, - * discard the last SACK block. - */ - if (num_saved >= MAX_SACK_BLKS) - num_saved--; - } - if (num_saved > 0) { - /* - * Copy the saved SACK blocks back. - */ - bcopy(saved_blks, &tp->sackblks[num_head], - sizeof(struct sackblk) * num_saved); - } - - /* Save the number of SACK blocks. */ - tp->rcv_numsacks = num_head + num_saved; + tcp_sackblock_zone = uma_zcreate("tcpsackblk", sizeof(struct tcp_sack_block), + NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); + uma_zone_set_max(tcp_sackblock_zone, tcp_sack_globalmaxholes); } -/* - * Delete all receiver-side SACK information. - */ -void -tcp_clean_sackreport(struct tcpcb *tp) +static __inline int +tcp_sack_cmp(struct tcp_sack_block *a, struct tcp_sack_block *b) { - int i; - - INP_LOCK_ASSERT(tp->t_inpcb); - tp->rcv_numsacks = 0; - for (i = 0; i < MAX_SACK_BLKS; i++) - tp->sackblks[i].start = tp->sackblks[i].end=0; + if (SEQ_LT(a->tsb_blk.end, b->tsb_blk.start)) + return (-1); + else if (SEQ_GT(a->tsb_blk.start, b->tsb_blk.end)) + return (1); + else + return (0); } -/* - * Allocate struct sackhole. - */ -static struct sackhole * -tcp_sackhole_alloc(struct tcpcb *tp, tcp_seq start, tcp_seq end) -{ - struct sackhole *hole; - - if (tp->snd_numholes >= tcp_sack_maxholes || - tcp_sack_globalholes >= tcp_sack_globalmaxholes) { - tcpstat.tcps_sack_sboverflow++; - return NULL; - } - - hole = (struct sackhole *)uma_zalloc(sack_hole_zone, M_NOWAIT); - if (hole == NULL) - return NULL; - - hole->start = start; - hole->end = end; - hole->rxmit = start; +RB_PROTOTYPE(tcp_sackblocks, tcp_sack_block, tsb_rb, tcp_sack_cmp); +RB_GENERATE(tcp_sackblocks, tcp_sack_block, tsb_rb, tcp_sack_cmp); - tp->snd_numholes++; - tcp_sack_globalholes++; - - return hole; -} - -/* - * Free struct sackhole. - */ static void -tcp_sackhole_free(struct tcpcb *tp, struct sackhole *hole) +tcp_sack_free(struct tcp_sack_block *tps, struct tcp_sack_block *tsb) { + struct tcp_sack_block *sb; - uma_zfree(sack_hole_zone, hole); - - tp->snd_numholes--; - tcp_sack_globalholes--; - - KASSERT(tp->snd_numholes >= 0, ("tp->snd_numholes >= 0")); - KASSERT(tcp_sack_globalholes >= 0, ("tcp_sack_globalholes >= 0")); + sb = RB_REMOVE(tcp_sackholes, tps, tsb); + KASSERT(sb != NULL, ("%s: RB_REMOVE failed", __func__)); + uma_zfree(tcp_sackblock_zone, tsb); } -/* - * Insert new SACK hole into scoreboard. - */ -static struct sackhole * -tcp_sackhole_insert(struct tcpcb *tp, tcp_seq start, tcp_seq end, - struct sackhole *after) +#ifdef INVARIANTS +static int +tcp_sack_verify(struct tcpcb *tp) { - struct sackhole *hole; + struct tcp_sack_block *tsb, *tsbn; - /* Allocate a new SACK hole. */ - hole = tcp_sackhole_alloc(tp, start, end); - if (hole == NULL) - return NULL; - - /* Insert the new SACK hole into scoreboard. */ - if (after != NULL) - TAILQ_INSERT_AFTER(&tp->snd_holes, after, hole, scblink); - else - TAILQ_INSERT_TAIL(&tp->snd_holes, hole, scblink); - - /* Update SACK hint. */ - if (tp->sackhint.nexthole == NULL) - tp->sackhint.nexthole = hole; - - return hole; + RB_FOREACH_SAFE(tsb, tcp_sackblocks, &tp->snd_sackblocks, tsbn) { + if (SEQ_GEQ(tsb->tsb_blk.start, tsb->tsb_blk.end) || + SEQ_LEQ(tsb->tsb_blk.start, tp->snd_una) || + SEQ_GT(tsb->tsb_blk.end, tp->snd_nxt) || + (tsbn != NULL && SEQ_GEQ(tsb->tsb_blk.end, tsbn->tsb_blk.start))) + return (0) + } + return (1); } - -/* - * Remove SACK hole from scoreboard. - */ -static void -tcp_sackhole_remove(struct tcpcb *tp, struct sackhole *hole) -{ +#endif - /* Update SACK hint. */ - if (tp->sackhint.nexthole == hole) - tp->sackhint.nexthole = TAILQ_NEXT(hole, scblink); - - /* Remove this SACK hole. */ - TAILQ_REMOVE(&tp->snd_holes, hole, scblink); - - /* Free this SACK hole. */ - tcp_sackhole_free(tp, hole); -} - -/* - * Process cumulative ACK and the TCP SACK option to update the scoreboard. - * tp->snd_holes is an ordered list of holes (oldest to newest, in terms of - * the sequence space). - */ int tcp_sack_doack(struct tcpcb *tp, struct tcpopt *to, tcp_seq th_ack) { - struct sackhole *cur, *temp; - struct sackblk sack, sack_blocks[TCP_MAX_SACK + 1], *sblkp; - int i, j, num_sack_blks; + int i; + int sacked = 0; /* the amount of newly sacked data */ + struct tcp_sack_block *tsb, *tsbn; + struct tcp_sack_block sack; - INP_LOCK_ASSERT(tp->t_inpcb); + /* Remove any blocks from the scoreboard when full acked. */ + RB_FOREACH_SAFE(tsb, tcp_sackblocks, &tp->snd_sackblocks, tsbn) { + if (SEQ_LT(th_ack, tqe->tsb_blk.start)) + break; + else + RB_REMOVE(tcp_sackblocks, &tp->snd_sackblocks, tsb); + } - num_sack_blks = 0; - /* - * If SND.UNA will be advanced by SEG.ACK, and if SACK holes exist, - * treat [SND.UNA, SEG.ACK) as if it is a SACK block. - */ - if (SEQ_LT(tp->snd_una, th_ack) && !TAILQ_EMPTY(&tp->snd_holes)) { - sack_blocks[num_sack_blks].start = tp->snd_una; - sack_blocks[num_sack_blks++].end = th_ack; - } - /* - * Append received valid SACK blocks to sack_blocks[], but only if we - * received new blocks from the other side. - */ - if (to->to_flags & TOF_SACK) { - for (i = 0; i < to->to_nsacks; i++) { - bcopy((to->to_sacks + i * TCPOLEN_SACK), - &sack, sizeof(sack)); - sack.start = ntohl(sack.start); - sack.end = ntohl(sack.end); - if (SEQ_GT(sack.end, sack.start) && - SEQ_GT(sack.start, tp->snd_una) && - SEQ_GT(sack.start, th_ack) && - SEQ_LT(sack.start, tp->snd_nxt) && - SEQ_GT(sack.end, tp->snd_una) && - SEQ_LEQ(sack.end, tp->snd_nxt)) - sack_blocks[num_sack_blks++] = sack; - } - } - /* - * Return if SND.UNA is not advanced and no valid SACK block is - * received. - */ - if (num_sack_blks == 0) + if ((to->t_flags & TOF_SACK) && to->to_nsacks == 0) { + /* remove all sack blocks, strange reneg */ + tcp_sack_flush(tp); + return (0); + } else if (!(to->t_flags & TOF_SACK)) return (0); - /* - * Sort the SACK blocks so we can update the scoreboard with just one - * pass. The overhead of sorting upto 4+1 elements is less than - * making upto 4+1 passes over the scoreboard. - */ - for (i = 0; i < num_sack_blks; i++) { - for (j = i + 1; j < num_sack_blks; j++) { - if (SEQ_GT(sack_blocks[i].end, sack_blocks[j].end)) { - sack = sack_blocks[i]; - sack_blocks[i] = sack_blocks[j]; - sack_blocks[j] = sack; - } - } - } - if (TAILQ_EMPTY(&tp->snd_holes)) - /* - * Empty scoreboard. Need to initialize snd_fack (it may be - * uninitialized or have a bogus value). Scoreboard holes - * (from the sack blocks received) are created later below - * (in the logic that adds holes to the tail of the - * scoreboard). - */ - tp->snd_fack = SEQ_MAX(tp->snd_una, th_ack); - /* - * In the while-loop below, incoming SACK blocks (sack_blocks[]) and - * SACK holes (snd_holes) are traversed from their tails with just - * one pass in order to reduce the number of compares especially when - * the bandwidth-delay product is large. - * - * Note: Typically, in the first RTT of SACK recovery, the highest - * three or four SACK blocks with the same ack number are received. - * In the second RTT, if retransmitted data segments are not lost, - * the highest three or four SACK blocks with ack number advancing - * are received. - */ - sblkp = &sack_blocks[num_sack_blks - 1]; /* Last SACK block */ - if (SEQ_LT(tp->snd_fack, sblkp->start)) { - /* - * The highest SACK block is beyond fack. Append new SACK - * hole at the tail. If the second or later highest SACK - * blocks are also beyond the current fack, they will be - * inserted by way of hole splitting in the while-loop below. - */ - temp = tcp_sackhole_insert(tp, tp->snd_fack,sblkp->start,NULL); - if (temp != NULL) { - tp->snd_fack = sblkp->end; - /* Go to the previous sack block. */ - sblkp--; - } else { - /* - * We failed to add a new hole based on the current - * sack block. Skip over all the sack blocks that - * fall completely to the right of snd_fack and - * proceed to trim the scoreboard based on the - * remaining sack blocks. This also trims the - * scoreboard for th_ack (which is sack_blocks[0]). - */ - while (sblkp >= sack_blocks && - SEQ_LT(tp->snd_fack, sblkp->start)) - sblkp--; - if (sblkp >= sack_blocks && - SEQ_LT(tp->snd_fack, sblkp->end)) - tp->snd_fack = sblkp->end; - } - } else if (SEQ_LT(tp->snd_fack, sblkp->end)) - /* fack is advanced. */ - tp->snd_fack = sblkp->end; - /* We must have at least one SACK hole in scoreboard. */ - KASSERT(!TAILQ_EMPTY(&tp->snd_holes), - ("SACK scoreboard must not be empty")); - cur = TAILQ_LAST(&tp->snd_holes, sackhole_head); /* Last SACK hole. */ - /* - * Since the incoming sack blocks are sorted, we can process them - * making one sweep of the scoreboard. - */ - while (sblkp >= sack_blocks && cur != NULL) { - if (SEQ_GEQ(sblkp->start, cur->end)) { - /* - * SACKs data beyond the current hole. Go to the - * previous sack block. - */ - sblkp--; + /* Integrate SACK blocks from segment. */ + for (i = 0; i < to->to_nsacks; i++) { + /* Copy SACK blocks from options section of TCP header. */ + bcopy((to->to_sacks + i * TCPOLEN_SACK), + &sack.tsb_blk, sizeof(sack.tsb_blk)); + sack.tsb_blk.start = ntohl(sack.tsb_blk.start); + sack.tsb_blk.end = ntohl(sack.tsb_blk.end); + + /* Validity checks on SACK blocks as received from sender. */ + if (SEQ_GT(sack.tsb_blk.start, sack.tsb_blk.end) || + SEQ_LEQ(sack.tsb_blk.start, th_ack) || + SEQ_GT(sack.tsb_blk.end, tp->snd_nxt)) continue; - } - if (SEQ_LEQ(sblkp->end, cur->start)) { - /* - * SACKs data before the current hole. Go to the - * previous hole. - */ - cur = TAILQ_PREV(cur, sackhole_head, scblink); + + /* XXXAO: Implicit-explicit reneg. */ + if (sack.start == sack.end) { + /* Remove all sackblocks. */ + tcp_sack_flush(tp); continue; } - tp->sackhint.sack_bytes_rexmit -= (cur->rxmit - cur->start); - KASSERT(tp->sackhint.sack_bytes_rexmit >= 0, - ("sackhint bytes rtx >= 0")); - if (SEQ_LEQ(sblkp->start, cur->start)) { - /* Data acks at least the beginning of hole. */ - if (SEQ_GEQ(sblkp->end, cur->end)) { - /* Acks entire hole, so delete hole. */ - temp = cur; - cur = TAILQ_PREV(cur, sackhole_head, scblink); - tcp_sackhole_remove(tp, temp); - /* - * The sack block may ack all or part of the - * next hole too, so continue onto the next - * hole. - */ + + /* Return match that has at least partial overlap to either side. */ + if ((tsb = RB_FIND(tcp_sackblocks, &tp->snd_sackblocks, &sack)) != NULL) { + /* within a block, was a duplicate retransmit, D-SACK */ + if (SEQ_GEQ(sack.tsb_blk.start, tsb->tsb_blk.start) && + SEQ_LEQ(sack.tsb_blk.end, tsb->tsb_blk.end)) { continue; - } else { - /* Move start of hole forward. */ - cur->start = sblkp->end; - cur->rxmit = SEQ_MAX(cur->rxmit, cur->start); + } + /* Extends the end, common case. */ + if (SEQ_GT(sack.tsb_blk.end, tsb->tsb_blk.end)) { + sacked += SEQ_DELTA(tsb->tsb_blk.end, sack.tsb_blk.end); + tsb->tsb_blk.end = sack.tsb_blk.end; + while ((tsbn = RB_NEXT(tcp_sackblocks, &tp->snd_sackblocks, tsb)) != NULL && + SEQ_GEQ(tsbn->tsb_blk.start, tsb->tsb_blk.end)) { + //sacked -= SEQ_DELTA(sack.tsb_blk.start, tsbn->tsb_blk.start); + if (SEQ_GT(tsbn->tsb_blk.end, tsb->tsb_blk.end)) + tsb->tsb_blk.end = tsbn->tsb_blk.end; + tcp_sack_free(&tp->snd_sackblocks, tsbn); + } } - } else { - /* Data acks at least the end of hole. */ - if (SEQ_GEQ(sblkp->end, cur->end)) { - /* Move end of hole backward. */ - cur->end = sblkp->start; - cur->rxmit = SEQ_MIN(cur->rxmit, cur->end); - } else { - /* - * ACKs some data in middle of a hole; need - * to split current hole - */ - temp = tcp_sackhole_insert(tp, sblkp->end, - cur->end, cur); - if (temp != NULL) { - if (SEQ_GT(cur->rxmit, temp->rxmit)) { - temp->rxmit = cur->rxmit; - tp->sackhint.sack_bytes_rexmit - += (temp->rxmit - - temp->start); - } - cur->end = sblkp->start; - cur->rxmit = SEQ_MIN(cur->rxmit, - cur->end); + /* Extends the start. */ + if (SEQ_LT(sack.tsb_blk.start, tsb->tsb_blk.start)) { + sacked += SEQ_DELTA(sack.tsb_blk.start, tsb->tsb_blk.start); + tsb->tsb_blk.start = sack.tsb_blk.start; + while ((tsbn = RB_PREV(tcp_sackblocks, &tp->snd_sackblocks, tsb)) != NULL && + SEQ_GEQ(tsbn->tsb_blk.end, tsb->tsb_blk.start)) { + //sacked -= SEQ_DELTA(); + if (SEQ_LT(tsbn->tsb_blk.start, tsb->tsb_blk.start)) + tsb->tsb_blk.start = tsbn->tsb_blk.start; + tcp_sack_free(&tp->snd_sackblocks, tsbn); } } - } - tp->sackhint.sack_bytes_rexmit += (cur->rxmit - cur->start); - /* - * Testing sblkp->start against cur->start tells us whether - * we're done with the sack block or the sack hole. - * Accordingly, we advance one or the other. - */ - if (SEQ_LEQ(sblkp->start, cur->start)) - cur = TAILQ_PREV(cur, sackhole_head, scblink); - else - sblkp--; + } else if ((tsb = (struct sackblocks *)uma_zalloc(tcp_sackblock_zone, M_NOWAIT))) != NULL) { + sacked += SEQ_DELTA(sack.tsb_blk.start, sack.tsb_blk.end); + tsb->tsb_blk.start = sack.tsb_blk.start; + tsb->tsb_blk.end = sack.tsb_blk.end; + tsbn = RB_INSERT(tcp_sackblocks, &tp->snd_sackblocks, tsb); + KASSERT(tsbn == NULL, ("%s: RB_INSERT failed", __func__)); + } else + TCPSTAT_INC(); /* failed to allocate sackblock */ } - return (0); -} -/* - * Free all SACK holes to clear the scoreboard. - */ -void -tcp_free_sackholes(struct tcpcb *tp) -{ - struct sackhole *q; + KASSERT(tcp_sack_verify(tp), + ("%s: snd_sackblocks RB tree inconsistent", __func__)); - INP_LOCK_ASSERT(tp->t_inpcb); - while ((q = TAILQ_FIRST(&tp->snd_holes)) != NULL) - tcp_sackhole_remove(tp, q); - tp->sackhint.sack_bytes_rexmit = 0; - - KASSERT(tp->snd_numholes == 0, ("tp->snd_numholes == 0")); - KASSERT(tp->sackhint.nexthole == NULL, - ("tp->sackhint.nexthole == NULL")); + return (sacked); } -/* - * Partial ack handling within a sack recovery episode. Keeping this very - * simple for now. When a partial ack is received, force snd_cwnd to a value - * that will allow the sender to transmit no more than 2 segments. If - * necessary, a better scheme can be adopted at a later point, but for now, - * the goal is to prevent the sender from bursting a large amount of data in - * the midst of sack recovery. - */ void -tcp_sack_partialack(struct tcpcb *tp, struct tcphdr *th) +tcp_sack_flush(struct tcpcb *tp) { - int num_segs = 1; + struct tcp_sack_block *tsb, *tsbn; - INP_LOCK_ASSERT(tp->t_inpcb); - tcp_timer_activate(tp, TT_REXMT, 0); - //tp->t_rtttime = 0; - /* Send one or 2 segments based on how much new data was acked. */ - if (((th->th_ack - tp->snd_una) / tp->snd_mss) > 2) - num_segs = 2; - tp->snd_cwnd = (tp->sackhint.sack_bytes_rexmit + - (tp->snd_nxt - tp->sack_newdata) + num_segs * tp->snd_mss); - if (tp->snd_cwnd > tp->snd_ssthresh) - tp->snd_cwnd = tp->snd_ssthresh; - tp->t_flags |= TF_ACKNOW; - (void) tcp_output(tp); -} - -#if 0 -/* - * Debug version of tcp_sack_output() that walks the scoreboard. Used for - * now to sanity check the hint. - */ -static struct sackhole * -tcp_sack_output_debug(struct tcpcb *tp, int *sack_bytes_rexmt) -{ - struct sackhole *p; - - INP_LOCK_ASSERT(tp->t_inpcb); - *sack_bytes_rexmt = 0; - TAILQ_FOREACH(p, &tp->snd_holes, scblink) { - if (SEQ_LT(p->rxmit, p->end)) { - if (SEQ_LT(p->rxmit, tp->snd_una)) {/* old SACK hole */ - continue; - } - *sack_bytes_rexmt += (p->rxmit - p->start); - break; - } - *sack_bytes_rexmt += (p->rxmit - p->start); + RB_FOREACH_SAFE(tsb, tcp_sackholes, &tp->snd_sackholes, tsbn) { + RB_REMOVE(tcp_sackholes, &tp->snd_sackholes, tsb); + uma_zfree(tcp_sackbock_zone, tsb); } - return (p); } -#endif -/* - * Returns the next hole to retransmit and the number of retransmitted bytes - * from the scoreboard. We store both the next hole and the number of - * retransmitted bytes as hints (and recompute these on the fly upon SACK/ACK - * reception). This avoids scoreboard traversals completely. - * - * The loop here will traverse *at most* one link. Here's the argument. For - * the loop to traverse more than 1 link before finding the next hole to - * retransmit, we would need to have at least 1 node following the current - * hint with (rxmit == end). But, for all holes following the current hint, - * (start == rxmit), since we have not yet retransmitted from them. - * Therefore, in order to traverse more 1 link in the loop below, we need to - * have at least one node following the current hint with (start == rxmit == - * end). But that can't happen, (start == end) means that all the data in - * that hole has been sacked, in which case, the hole would have been removed - * from the scoreboard. - */ -struct sackhole * -tcp_sack_output(struct tcpcb *tp, int *sack_bytes_rexmt) +#ifdef DDB +static void +db_print_sackblocks(struct tcpcb *tp) { - struct sackhole *hole = NULL; + struct tcp_sack_block *tsb; - INP_LOCK_ASSERT(tp->t_inpcb); - *sack_bytes_rexmt = tp->sackhint.sack_bytes_rexmit; - hole = tp->sackhint.nexthole; - if (hole == NULL || SEQ_LT(hole->rxmit, hole->end)) - goto out; - while ((hole = TAILQ_NEXT(hole, scblink)) != NULL) { - if (SEQ_LT(hole->rxmit, hole->end)) { - tp->sackhint.nexthole = hole; - break; - } + RB_FOREACH(tsb, tcp_sackblocks, &tp->snd_sackblocks) { + db_printf(" sack block 0x%08x - 0x%08x\n", + tsb->tsb_blk.start, tsb->tsb_blk.end) } -out: - return (hole); } - -/* - * After a timeout, the SACK list may be rebuilt. This SACK information - * should be used to avoid retransmitting SACKed data. This function - * traverses the SACK list to see if snd_nxt should be moved forward. - */ -void -tcp_sack_adjust(struct tcpcb *tp) -{ - struct sackhole *p, *cur = TAILQ_FIRST(&tp->snd_holes); - - INP_LOCK_ASSERT(tp->t_inpcb); - if (cur == NULL) - return; /* No holes */ - if (SEQ_GEQ(tp->snd_nxt, tp->snd_fack)) - return; /* We're already beyond any SACKed blocks */ - /*- - * Two cases for which we want to advance snd_nxt: - * i) snd_nxt lies between end of one hole and beginning of another - * ii) snd_nxt lies between end of last hole and snd_fack - */ - while ((p = TAILQ_NEXT(cur, scblink)) != NULL) { - if (SEQ_LT(tp->snd_nxt, cur->end)) - return; - if (SEQ_GEQ(tp->snd_nxt, p->start)) - cur = p; - else { - tp->snd_nxt = p->start; - return; - } - } - if (SEQ_LT(tp->snd_nxt, cur->end)) - return; - tp->snd_nxt = tp->snd_fack; - return; -} +#endif ==== //depot/projects/tcp_new/netinet/tcp_subr.c#6 (text+ko) ==== @@ -161,8 +161,6 @@ SYSCTL_INT(_net_inet_tcp, OID_AUTO, isn_reseed_interval, CTLFLAG_RW, &tcp_isn_reseed_interval, 0, "Seconds between reseeding of ISN secret"); -uma_zone_t sack_hole_zone; - static struct inpcb *tcp_notify(struct inpcb *, int); static void tcp_isn_tick(void *); @@ -270,13 +268,12 @@ syncache_init(); tcp_hc_init(); tcp_reass_init(); + tcp_sack_init(); ISN_LOCK_INIT(); callout_init(&isn_callout, CALLOUT_MPSAFE); tcp_isn_tick(NULL); EVENTHANDLER_REGISTER(shutdown_pre_sync, tcp_fini, NULL, SHUTDOWN_PRI_DEFAULT); - sack_hole_zone = uma_zcreate("sackhole", sizeof(struct sackhole), - NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE); EVENTHANDLER_REGISTER(maxsockets_change, tcp_zone_change, NULL, EVENTHANDLER_PRI_ANY); } @@ -722,7 +719,7 @@ /* Disconnect offload device, if any. */ tcp_offload_detach(tp); - tcp_free_sackholes(tp); + tcp_sack_flush(tp); inp->inp_ppcb = NULL; tp->t_inpcb = NULL; uma_zfree(tcpcb_zone, tp); @@ -794,7 +791,7 @@ tcpb->t_segqlen--; tcp_reass_qsize--; } - tcp_clean_sackreport(tcpb); + tcp_sack_flush(tcpb); } INP_UNLOCK(inpb); } @@ -1530,7 +1527,7 @@ tcpstat.tcps_mturesent++; //tp->t_rtttime = 0; tp->snd_nxt = tp->snd_una; - tcp_free_sackholes(tp); + tcp_sack_flush(tp); tp->snd_recover = tp->snd_nxt; //if (tp->t_flags & TF_SACK_PERMIT) // EXIT_FASTRECOVERY(tp); ==== //depot/projects/tcp_new/netinet/tcp_var.h#11 (text+ko) ==== @@ -96,15 +96,13 @@ extern struct uma_zone *tcp_reass_zone; struct sackblk { - tcp_seq start; /* start seq no. of sack block */ - tcp_seq end; /* end seq no. */ + tcp_seq start; /* left */ + tcp_seq end; /* right */ }; -struct sackhole { - tcp_seq start; /* start seq no. of hole */ - tcp_seq end; /* end seq no. */ - tcp_seq rxmit; /* next seq. no in hole to be retransmitted */ - TAILQ_ENTRY(sackhole) scblink; /* scoreboard linkage */ +struct tcp_sack_block { + RB_ENTRY(tcp_sackblocks) tsb_rb; /* scoreboard linkage */ + struct sackblk tsb_blk; }; struct sackhint { @@ -252,15 +250,8 @@ #define TCPOOB_HADDATA 0x02 /* SACK related state */ + RB_HEAD(tcp_sackblocks, tcp_sack_block) snd_sackblocks; int snd_numholes; /* number of holes seen by sender */ - TAILQ_HEAD(sackhole_head, sackhole) snd_holes; - /* SACK scoreboard (sorted) */ - tcp_seq snd_fack; /* last seq number(+1) sack'd by rcv'r */ - int rcv_numsacks; /* XXXAO */ - struct sackblk sackblks[MAX_SACK_BLKS]; /* seq nos. of sack blocks */ - tcp_seq sack_newdata; /* New data xmitted in this recovery - episode starts at this seq number */ - struct sackhint sackhint; /* SACK scoreboard hint */ int snd_sacked; /* data currently ack'ed through SACK */ /* Congestion control algorithms */ @@ -651,12 +642,9 @@ tcp_seq tcp_new_isn(struct tcpcb *); int tcp_sack_doack(struct tcpcb *, struct tcpopt *, tcp_seq); -void tcp_update_sack_list(struct tcpcb *tp, tcp_seq rcv_laststart, tcp_seq rcv_lastend); -void tcp_clean_sackreport(struct tcpcb *tp); -void tcp_sack_adjust(struct tcpcb *tp); -struct sackhole *tcp_sack_output(struct tcpcb *tp, int *sack_bytes_rexmt); -void tcp_sack_partialack(struct tcpcb *, struct tcphdr *); -void tcp_free_sackholes(struct tcpcb *tp); +void tcp_sack_flush(struct tcpcb *tp); +void tcp_sack_init(void); + int tcp_newreno(struct tcpcb *, struct tcphdr *); u_long tcp_seq_subtract(u_long, u_long );