From: "Robert N. M. Watson"
To: Adrian Chadd
Cc: Craig Rodrigues, FreeBSD Net, "Bjoern A. Zeeb", Marko Zec
Subject: Re: VIMAGE UDP memory leak fix
Date: Sat, 22 Nov 2014 17:09:09 +0000
Message-Id: <85C7A32E-121D-495F-93C7-9D2B2F134FF6@FreeBSD.org>
List-Id: Networking and TCP/IP with FreeBSD

On 21 Nov 2014, at 17:40, Adrian Chadd wrote:

>>> Skimming through a bunch of hosts with moderately loaded hosts with
>>> reasonably high uptime I couldn't find one where net.inet.tcp.timer_race
>>> was not zero. Any suggestions how to best reproduce the race(s) in
>>> tcp_timer.c?
>>
>> They would likely occur only on very highly loaded hosts, as they require
>> race conditions to arise between TCP timers and TCP close. I think I did
>> manage to reproduce it at one stage, and left the counter in to see if we
>> could spot it in production, and I have had (multiple) reports of it in
>> deployed systems. I'm not sure it's worth trying to reproduce them, given
>> that knowledge -- we should simply fix them.
>
> Wasn't this just fixed by Julien @ Verisign?

I don't believe so, although it's the kind of thing Julien is very good at
fixing!

The issue here is that we can't call callout_drain() from contexts where we
finalise TCP connection close and attempt to free the inpcb. The 'easy' fix
is to create a taskqueue thread to do the callout_drain() in the event that
we discover that callout_stop() isn't able to guarantee that pending callouts
are neither in execution nor scheduled. We'd then defer the very tail of TCP
teardown to that asynchronous context rather than trying to do it to
completion in the current (and rather more sensitive) one. This would happen
only very infrequently, so it would have little overhead in practice, although
one would want to look carefully at the sync behaviour to make sure it wasn't
frequent enough that a backlog might build up.

> As for the vimage stability side of things - I'd really like to see
> some VIMAGE torture tests written. Stuff like "do a high rate TCP
> connection test whilst creating and destroying VIMAGEs."

... and even for non-VIMAGE. :-)

Robert
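
For illustration only, the deferral described above can be modelled in
userspace C. This is a rough sketch under stated assumptions, not kernel
code: every name in it (struct conn, model_timer_stop(), model_timer_drain(),
conn_close(), deferred_worker_run()) is hypothetical, the two model_timer_*
functions merely stand in for callout_stop() and callout_drain(), and the
"wait" inside the drain is simulated by clearing a flag rather than sleeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are FreeBSD kernel APIs. */
struct conn {
    bool timer_running;          /* a timer handler is executing right now */
    bool timer_pending;          /* a timer handler is scheduled to run */
    bool freed;                  /* stands in for freeing the inpcb */
    struct conn *next_deferred;  /* link in the deferred-teardown list */
};

/* Connections whose final teardown was handed to the worker. */
static struct conn *deferred_head = NULL;

/*
 * Stand-in for callout_stop(): never blocks. Cancels a pending timer and
 * returns true only if no handler can still be running afterwards.
 */
static bool model_timer_stop(struct conn *c)
{
    c->timer_pending = false;
    return !c->timer_running;
}

/*
 * Stand-in for callout_drain(): in a real system this may sleep until the
 * handler has finished. The wait is simulated here by clearing the flag.
 */
static void model_timer_drain(struct conn *c)
{
    c->timer_pending = false;
    c->timer_running = false;
}

static void conn_free(struct conn *c)
{
    assert(!c->freed);           /* catch a double free in the model */
    assert(!c->timer_running);   /* never free under a running handler */
    c->freed = true;
}

/*
 * Close path (must not sleep). Returns true if teardown completed inline,
 * false if the tail of teardown was deferred to the worker.
 */
static bool conn_close(struct conn *c)
{
    if (model_timer_stop(c)) {
        conn_free(c);
        return true;
    }
    c->next_deferred = deferred_head;
    deferred_head = c;
    return false;
}

/*
 * Worker (sleepable context): drain timers, then finish teardown.
 * Returns the number of connections processed, i.e. the backlog size.
 */
static int deferred_worker_run(void)
{
    int processed = 0;

    while (deferred_head != NULL) {
        struct conn *c = deferred_head;
        deferred_head = c->next_deferred;
        c->next_deferred = NULL;
        model_timer_drain(c);
        conn_free(c);
        processed++;
    }
    return processed;
}
```

In this model a connection with no running timer is freed inline by
conn_close(), while one whose handler is mid-execution is queued and only
freed once deferred_worker_run() has drained it. The worker's return value
is the backlog it had to clear, which is the quantity one would want to
watch in a real implementation.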