From: Mikolaj Golub
To: John Baldwin
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r221346 - head/sys/netinet
Date: Sat, 14 May 2011 13:37:19 +0300
Message-ID: <86pqnlbmao.fsf@kopusha.home.net>
In-Reply-To: <201105022105.p42L5q3j054498@svn.freebsd.org>
References: <201105022105.p42L5q3j054498@svn.freebsd.org>

Hi,

On Mon, 2 May 2011 21:05:52 +0000 (UTC) John Baldwin wrote:

JB> Author: jhb
JB> Date: Mon May 2 21:05:52 2011
JB> New Revision: 221346
JB> URL: http://svn.freebsd.org/changeset/base/221346

JB> Log:
JB>   Handle a rare edge case with nearly full TCP receive buffers. If a TCP
JB>   buffer fills up causing the remote sender to enter into persist mode, but
JB>   there is still room available in the receive buffer when a window probe
JB>   arrives (either due to window scaling, or due to the local application
JB>   very slowly draining data from the receive buffer), then the single byte
JB>   of data in the window probe is accepted. However, this can cause rcv_nxt
JB>   to be greater than rcv_adv. This condition will only last until the next
JB>   ACK packet is pushed out via tcp_output(), and since the previous ACK
JB>   advertised a zero window, the ACK should be pushed out while the TCP
JB>   pcb is write-locked.
JB>
JB>   During the window while rcv_nxt is greater than rcv_adv, a few places
JB>   would compute the remaining receive window via rcv_adv - rcv_nxt.
JB>   However, this value was then (uint32_t)-1. On a 64 bit machine this
JB>   could expand to a positive 2^32 - 1 when cast to a long.
JB>   In particular, when calculating the receive window in tcp_output(), the
JB>   result would be that the receive window was computed as 2^32 - 1 resulting
JB>   in advertising a far larger window to the remote peer than actually existed.
JB>
JB>   Fix various places that compute the remaining receive window to either
JB>   assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
JB>   window as full if rcv_nxt is greater than rcv_adv.
JB>
JB>   Reviewed by: bz
JB>   MFC after: 1 month

JB> Modified:
JB>   head/sys/netinet/tcp_input.c
JB>   head/sys/netinet/tcp_output.c
JB>   head/sys/netinet/tcp_timewait.c

JB> Modified: head/sys/netinet/tcp_input.c
JB> ==============================================================================
JB> --- head/sys/netinet/tcp_input.c Mon May 2 21:04:37 2011 (r221345)
JB> +++ head/sys/netinet/tcp_input.c Mon May 2 21:05:52 2011 (r221346)
JB> @@ -1831,6 +1831,9 @@ tcp_do_segment(struct mbuf *m, struct tc
JB>          win = sbspace(&so->so_rcv);
JB>          if (win < 0)
JB>              win = 0;
JB> +        KASSERT(SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt),
JB> +            ("tcp_input negative window: tp %p rcv_nxt %u rcv_adv %u", tp,
JB> +            tp->rcv_adv, tp->rcv_nxt));

I am hitting this assertion when running tests with HAST (both the primary
and the secondary HAST instances are on the same host). HAST synchronizes
data in MAXPHYS (131072-byte) blocks. The sender splits each block into
smaller chunks of MAX_SEND_SIZE (32768 bytes), while the receiver receives
the whole block by calling recv() with the MSG_WAITALL option.
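
For illustration, the arithmetic described in the commit log can be reproduced in userland. The sketch below is not kernel code: the macros are only written in the style of the kernel's SEQ_GT()/SEQ_GEQ(), the fixed-width types are an assumption, and the helper names (unguarded_window, guarded_window, window_sketch_selfcheck) are hypothetical. It uses the rcv_nxt and rcv_adv values from the p *tp output further down. Note that the KASSERT passes tp->rcv_adv and tp->rcv_nxt in the opposite order from its format string, so the two numbers in the panic message appear under swapped labels.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sequence-space comparisons in the style of the kernel's SEQ_GT() and
 * SEQ_GEQ() macros. The fixed-width casts are an assumption of this sketch.
 */
#define SEQ_GT(a, b)   ((int32_t)((uint32_t)(a) - (uint32_t)(b)) > 0)
#define SEQ_GEQ(a, b)  ((int32_t)((uint32_t)(a) - (uint32_t)(b)) >= 0)

/*
 * Hypothetical helper: the unguarded subtraction, widened to 64 bits the
 * way a cast to long behaves on a 64-bit machine.
 */
int64_t
unguarded_window(uint32_t rcv_adv, uint32_t rcv_nxt)
{
    uint32_t diff = rcv_adv - rcv_nxt;  /* wraps modulo 2^32 */

    return (int64_t)diff;               /* zero-extended, never negative */
}

/*
 * Hypothetical helper: the guarded form, which treats the previously
 * advertised window as fully consumed once rcv_nxt has passed rcv_adv.
 */
uint32_t
guarded_window(uint32_t rcv_adv, uint32_t rcv_nxt)
{
    if (SEQ_GT(rcv_adv, rcv_nxt))
        return rcv_adv - rcv_nxt;
    return 0;
}

/* Self-check using the values kgdb reports for this tcpcb. */
void
window_sketch_selfcheck(void)
{
    const uint32_t rcv_nxt = 1530593651u;
    const uint32_t rcv_adv = 1530593650u;

    /* The asserted condition does not hold: rcv_nxt is one past rcv_adv. */
    assert(!SEQ_GEQ(rcv_adv, rcv_nxt));
    /* Without a guard, the difference widens to 2^32 - 1. */
    assert(unguarded_window(rcv_adv, rcv_nxt) == INT64_C(4294967295));
    /* With the guard, the remaining window is zero. */
    assert(guarded_window(rcv_adv, rcv_nxt) == 0);
}
```

Calling window_sketch_selfcheck() from a small main() exercises all three cases. The one-byte overshoot matches tlen=1 in frame #12 of the backtrace, i.e. a single byte of data was accepted past the advertised edge.
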

FreeBSD kopusha.home.net 9.0-CURRENT FreeBSD 9.0-CURRENT #6 r221878: Sat May 14 11:44:42 EEST 2011 root@kopusha.home.net:/usr/obj/usr/src/sys/GENERIC i386

panic: tcp_input negative window: tp 0xc9777ad0 rcv_nxt 1530593650 rcv_adv 1530593651

#0  doadump () at pcpu.h:244
#1  0xc04ddac9 in db_fncall (dummy1=-1063410006, dummy2=0, dummy3=-1,
    dummy4=0xe67547f0 "\004Huæ") at /usr/src/sys/ddb/db_command.c:548
#2  0xc04ddeff in db_command (last_cmdp=0xc0fcdbfc, cmd_table=0x0, dopager=0)
    at /usr/src/sys/ddb/db_command.c:445
#3  0xc04ddfb4 in db_command_script (command=0xc0fceb04 "call doadump")
    at /usr/src/sys/ddb/db_command.c:516
#4  0xc04e2280 in db_script_exec (scriptname=0xe67548fc "kdb.enter.panic",
    warnifnotfound=Variable "warnifnotfound" is not available.
) at /usr/src/sys/ddb/db_script.c:302
#5  0xc04e2367 in db_script_kdbenter (eventname=0xc0e7d4ee "panic")
    at /usr/src/sys/ddb/db_script.c:324
#6  0xc04dff98 in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_main.c:228
#7  0xc09da822 in kdb_trap (type=3, code=0, tf=0xe6754a40)
    at /usr/src/sys/kern/subr_kdb.c:533
#8  0xc0ccc2eb in trap (frame=0xe6754a40) at /usr/src/sys/i386/i386/trap.c:719
#9  0xc0cb4fec in calltrap () at /usr/src/sys/i386/i386/exception.s:168
#10 0xc09da6aa in kdb_enter (why=0xc0e7d4ee "panic", msg=0xc0e7d4ee "panic")
    at cpufunc.h:71
#11 0xc09a5db4 in panic (
    fmt=0xc0ea2d6b "tcp_input negative window: tp %p rcv_nxt %u rcv_adv %u")
    at /usr/src/sys/kern/kern_shutdown.c:584
#12 0xc0b29cdb in tcp_do_segment (m=0xc9aac800, th=0xc9aac874, so=0xc9e0b680,
    tp=0xc9777ad0, drop_hdrlen=52, tlen=1, iptos=0 '\0', ti_locked=3)
    at /usr/src/sys/netinet/tcp_input.c:1834
#13 0xc0b2d309 in tcp_input (m=0xc9aac800, off0=20)
    at /usr/src/sys/netinet/tcp_input.c:1369
#14 0xc0ac4676 in ip_input (m=0xc9aac800)
    at /usr/src/sys/netinet/ip_input.c:765
#15 0xc0a6948a in swi_net (arg=0xc1425880) at /usr/src/sys/net/netisr.c:653
#16 0xc097c675 in intr_event_execute_handlers (p=0xc65bf578, ie=0xc6608280)
    at /usr/src/sys/kern/kern_intr.c:1257
#17 0xc097d559 in ithread_loop (arg=0xc65836a0)
    at /usr/src/sys/kern/kern_intr.c:1270
#18 0xc0979928 in fork_exit (callout=0xc097d4b0, arg=0xc65836a0,
    frame=0xe6754d28) at /usr/src/sys/kern/kern_fork.c:920
#19 0xc0cb5064 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:275
(kgdb) fr 12
#12 0xc0b29cdb in tcp_do_segment (m=0xc9aac800, th=0xc9aac874, so=0xc9e0b680,
    tp=0xc9777ad0, drop_hdrlen=52, tlen=1, iptos=0 '\0', ti_locked=3)
    at /usr/src/sys/netinet/tcp_input.c:1834
1834            KASSERT(SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt),
(kgdb) p *tp
$1 = {
  t_segq = {
    lh_first = 0x0
  },
  t_pspare = {0x0, 0x0},
  t_segqlen = 0,
  t_dupacks = 0,
  t_timers = 0xc9777cbc,
  t_inpcb = 0xc8c4cdc0,
  t_state = 9,
  t_flags = 525300,
  t_vnet = 0x0,
  snd_una = 1602773057,
  snd_max = 1602773057,
  snd_nxt = 1602773057,
  snd_up = 1602773057,
  snd_wl1 = 1530593650,
  snd_wl2 = 1602773057,
  iss = 1602772219,
  irs = 1499210531,
  rcv_nxt = 1530593651,
  rcv_adv = 1530593650,
  rcv_wnd = 671,
  rcv_up = 1530593650,
  snd_wnd = 70872,
  snd_cwnd = 29510,
  snd_spare1 = 0,
  snd_ssthresh = 1073725440,
  snd_spare2 = 0,
  snd_recover = 1602773057,
  t_maxopd = 16344,
  t_rcvtime = 175861,
  t_starttime = 171604,
  t_rtttime = 0,
  t_rtseq = 1602772220,
  t_bw_spare1 = 0,
  t_bw_spare2 = 0,
  t_rxtcur = 31,
  t_maxseg = 14336,
  t_srtt = 56,
  t_rttvar = 37,
  t_rxtshift = 0,
  t_rttmin = 3,
  t_rttbest = 40,
  t_rttupdated = 3,
  max_sndwnd = 71680,
  t_softerror = 0,
  t_oobflags = 0 '\0',
  t_iobc = 0 '\0',
  snd_scale = 3 '\003',
  rcv_scale = 3 '\003',
  request_r_scale = 3 '\003',
  ts_recent = 175361,
  ts_recent_age = 175361,
  ts_offset = 2326744134,
  last_ack_sent = 1530593651,
  snd_cwnd_prev = 0,
  snd_ssthresh_prev = 0,
  snd_recover_prev = 0,
  t_sndzerowin = 4,
  t_badrxtwin = 0,
  snd_limited = 0 '\0',
  snd_numholes = 0,
  snd_holes = {
    tqh_first = 0x0,
    tqh_last = 0xc9777bb8
  },
  snd_fack = 0,
  rcv_numsacks = 0,
  sackblks = {{
      start = 0,
      end = 0
    }, {
      start = 0,
      end = 0
    }, {
      start = 0,
      end = 0
    }, {
      start = 0,
      end = 0
    }, {
      start = 0,
      end = 0
    }, {
      start = 0,
      end = 0
    }},
  sack_newdata = 0,
  sackhint = {
    nexthole = 0x0,
    sack_bytes_rexmit = 0,
    last_sack_ack = 0,
    ispare = 0,
    _pad = {0, 0}
  },
  t_rttlow = 0,
  rfbuf_ts = 175361,
  rfbuf_cnt = 0,
  t_tu = 0x0,
  t_sndrexmitpack = 0,
  t_rcvoopack = 0,
  t_toe = 0x0,
  t_bytes_acked = 0,
  cc_algo = 0xc0fa6960,
  ccv = 0xc9777d5c,
  osd = 0xc9777d74,
  t_ispare = 0,
  t_pspare2 = {0x0, 0x0, 0x0, 0x0},
  _pad = {0 }
}
(kgdb) p *so
$2 = {
  so_count = 1,
  so_type = 1,
  so_options = 4,
  so_linger = 0,
  so_state = 2,
  so_qstate = 0,
  so_pcb = 0xc8c4cdc0,
  so_vnet = 0x0,
  so_proto = 0xc0fa56a8,
  so_head = 0x0,
  so_incomp = {
    tqh_first = 0x0,
    tqh_last = 0x0
  },
  so_comp = {
    tqh_first = 0x0,
    tqh_last = 0x0
  },
  so_list = {
    tqe_next = 0x0,
    tqe_prev = 0xc9e0ab88
  },
  so_qlen = 0,
  so_incqlen = 0,
  so_qlimit = 0,
  so_timeo = 0,
  so_error = 0,
  so_sigio = 0x0,
  so_oobmark = 0,
  so_aiojobq = {
    tqh_first = 0x0,
    tqh_last = 0xc9e0b6cc
  },
  so_rcv = {
    sb_sel = {
      si_tdlist = {
        tqh_first = 0x0,
        tqh_last = 0x0
      },
      si_note = {
        kl_list = {
          slh_first = 0x0
        },
        kl_lock = 0xc0970a90,
        kl_unlock = 0xc0970ac0,
        kl_assert_locked = 0xc0970e10,
        kl_assert_unlocked = 0xc0970de0,
        kl_lockarg = 0xc9e0b6f8
      },
      si_mtx = 0x0
    },
    sb_mtx = {
      lock_object = {
        lo_name = 0xc0e84d6b "so_rcv",
        lo_flags = 16973824,
        lo_data = 0,
        lo_witness = 0xc655a618
      },
      mtx_lock = 4
    },
    sb_sx = {
      lock_object = {
        lo_name = 0xc0e75c2c "so_rcv_sx",
        lo_flags = 36896768,
        lo_data = 0,
        lo_witness = 0xc65634b0
      },
      sx_lock = 3356203168
    },
    sb_state = 0,
    sb_mb = 0xc9aab800,
    sb_mbtail = 0xc88f0000,
    sb_lastrecord = 0xc9aab800,
    sb_sndptr = 0x0,
    sb_sndptroff = 0,
    sb_cc = 71010,
    sb_hiwat = 71680,
    sb_mbcnt = 85760,
    sb_mcnt = 23,
    sb_ccnt = 20,
    sb_mbmax = 262144,
    sb_ctl = 0,
    sb_lowat = 1,
    sb_timeo = 2000,
    sb_flags = 2052,
    sb_upcall = 0,
    sb_upcallarg = 0x0
  },
  so_snd = {
    sb_sel = {
      si_tdlist = {
        tqh_first = 0x0,
        tqh_last = 0x0
      },
      si_note = {
        kl_list = {
          slh_first = 0x0
        },
        kl_lock = 0xc0970a90,
        kl_unlock = 0xc0970ac0,
        kl_assert_locked = 0xc0970e10,
        kl_assert_unlocked = 0xc0970de0,
        kl_lockarg = 0xc9e0b78c
      },
      si_mtx = 0x0
    },
    sb_mtx = {
      lock_object = {
        lo_name = 0xc0e84d64 "so_snd",
        lo_flags = 16973824,
        lo_data = 0,
        lo_witness = 0xc655a5b0
      },
      mtx_lock = 4
    },
    sb_sx = {
      lock_object = {
        lo_name = 0xc0e75c22 "so_snd_sx",
        lo_flags = 36896768,
        lo_data = 0,
        lo_witness = 0xc6563448
      },
      sx_lock = 1
    },
    sb_state = 16,
    sb_mb = 0x0,
    sb_mbtail = 0x0,
    sb_lastrecord = 0x0,
    sb_sndptr = 0x0,
    sb_sndptroff = 0,
    sb_cc = 0,
    sb_hiwat = 43008,
    sb_mbcnt = 0,
    sb_mcnt = 0,
    sb_ccnt = 0,
    sb_mbmax = 262144,
    sb_ctl = 0,
    sb_lowat = 2048,
    sb_timeo = 2000,
    sb_flags = 2048,
    sb_upcall = 0,
    sb_upcallarg = 0x0
  },
  so_cred = 0xc879b700,
  so_label = 0x0,
  so_peerlabel = 0x0,
  so_gencnt = 1987,
  so_emuldata = 0x0,
  so_accf = 0x0,
  so_fibnum = 0,
  so_user_cookie = 0
}
(kgdb)

JB>          tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt));
JB>
JB>          /* Reset receive buffer auto scaling when not in bulk receive mode. */
JB> @@ -2868,7 +2871,10 @@ dodata: /* XXX */
JB>               * buffer size.
JB>               * XXX: Unused.
JB>               */
JB> -            len = so->so_rcv.sb_hiwat - (tp->rcv_adv - tp->rcv_nxt);
JB> +            if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt))
JB> +                len = so->so_rcv.sb_hiwat - (tp->rcv_adv - tp->rcv_nxt);
JB> +            else
JB> +                len = so->so_rcv.sb_hiwat;
JB>  #endif
JB>          } else {
JB>              m_freem(m);

JB> Modified: head/sys/netinet/tcp_output.c
JB> ==============================================================================
JB> --- head/sys/netinet/tcp_output.c Mon May 2 21:04:37 2011 (r221345)
JB> +++ head/sys/netinet/tcp_output.c Mon May 2 21:05:52 2011 (r221346)
JB> @@ -561,15 +561,21 @@ after_sack_rexmit:
JB>           * taking into account that we are limited by
JB>           * TCP_MAXWIN << tp->rcv_scale.
JB>           */
JB> -        long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
JB> -            (tp->rcv_adv - tp->rcv_nxt);
JB> +        long adv;
JB> +        int oldwin;
JB> +
JB> +        adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale);
JB> +        if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt)) {
JB> +            oldwin = (tp->rcv_adv - tp->rcv_nxt);
JB> +            adv -= oldwin;
JB> +        } else
JB> +            oldwin = 0;
JB>
JB>          /*
JB>           * If the new window size ends up being the same as the old
JB>           * size when it is scaled, then don't force a window update.
JB>           */
JB> -        if ((tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale ==
JB> -            (adv + tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale)
JB> +        if (oldwin >> tp->rcv_scale == (adv + oldwin) >> tp->rcv_scale)
JB>              goto dontupdate;
JB>          if (adv >= (long) (2 * tp->t_maxseg))
JB>              goto send;
JB> @@ -1008,7 +1014,8 @@ send:
JB>      if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
JB>          recwin < (long)tp->t_maxseg)
JB>          recwin = 0;
JB> -    if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
JB> +    if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) &&
JB> +        recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
JB>          recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
JB>      if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
JB>          recwin = (long)TCP_MAXWIN << tp->rcv_scale;

JB> Modified: head/sys/netinet/tcp_timewait.c
JB> ==============================================================================
JB> --- head/sys/netinet/tcp_timewait.c Mon May 2 21:04:37 2011 (r221345)
JB> +++ head/sys/netinet/tcp_timewait.c Mon May 2 21:05:52 2011 (r221346)
JB> @@ -242,6 +242,9 @@ tcp_twstart(struct tcpcb *tp)
JB>      /*
JB>       * Recover last window size sent.
JB>       */
JB> +    KASSERT(SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt),
JB> +        ("tcp_twstart negative window: tp %p rcv_nxt %u rcv_adv %u", tp,
JB> +        tp->rcv_adv, tp->rcv_nxt));
JB>      tw->last_win = (tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale;
JB>
JB>      /*

-- 
Mikolaj Golub