From owner-freebsd-net@FreeBSD.ORG  Fri Nov 21 11:22:29 2014
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id C20F8956;
 Fri, 21 Nov 2014 11:22:29 +0000 (UTC)
Received: from cyrus.watson.org (cyrus.watson.org [198.74.231.69])
 by mx1.freebsd.org (Postfix) with ESMTP id 78AEFAE;
 Fri, 21 Nov 2014 11:22:29 +0000 (UTC)
Received: from [10.108.126.128] (unknown [46.233.116.131])
 by cyrus.watson.org (Postfix) with ESMTPSA id 8EB8146B85;
 Fri, 21 Nov 2014 06:22:21 -0500 (EST)
Content-Type: text/plain; charset=windows-1252
Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\))
Subject: Re: VIMAGE UDP memory leak fix
From: "Robert N. M. Watson" <rwatson@FreeBSD.org>
In-Reply-To: <20141121120201.6c77ea5b@x23>
Date: Fri, 21 Nov 2014 11:22:18 +0000
Content-Transfer-Encoding: quoted-printable
Message-Id: <597BD146-88B1-47E6-A373-7004CFF8AEBA@FreeBSD.org>
References: <CAG=rPVehky00X4MuQQ-_Oe5ezWg52ZZrPASAh9GBy7baYv78CA@mail.gmail.com>
 <20141121002937.4f82daea@x23>
 <A4D676B3-6C50-47F7-8CFD-50B44FF4BE98@FreeBSD.org>
 <9300CB5F-6140-4C49-B026-EB69B0E8B37E@FreeBSD.org>
 <20141121120201.6c77ea5b@x23>
To: Marko Zec <zec@fer.hr>
X-Mailer: Apple Mail (2.1878.6)
Cc: Craig Rodrigues <rodrigc@freebsd.org>,
 FreeBSD Net <freebsd-net@freebsd.org>, "Bjoern A. Zeeb" <bz@FreeBSD.org>


On 21 Nov 2014, at 11:02, Marko Zec <zec@fer.hr> wrote:

> Now that we've found ourselves in this discussion, I'm really
> becoming curious why exactly do we need UMA_ZONE_NOFREE for network
> stack zones at all?   Admittedly, I always thought that the primary
> purpose of UMA_ZONE_NOFREE was to prevent uma_reclaim() from paging out
> _used_ zone pages, but reviewing the uma code reveals that this might
> not be the case, i.e. that NOFREE only prevents _unused_ pages to be
> freed by uma_reclaim().
>
> Moreover, all uma_zalloc() calls as far as I can see are flagged as
> M_NOWAIT and are followed by checks for allocation failures, so that
> part seems to be covered.
>
> So, what's really the problem which UMA_ZONE_NOFREE flagging is supposed
> to solve these days? (you claim that we clearly need it for TCP - why)?

UMA_ZONE_NOFREE tells UMA that it can't reclaim unused slabs for the
zone to be returned to the VM system for reuse elsewhere under memory
pressure. UMA memory isn't pageable, so there's no link to paging
policy: although soft-TLB systems might experience TLB miss exceptions
on UMA-allocated kernel memory, you should never experience a page
fault against it (in the absence of a bug). Reclaim of unused slabs can
happen, for example, if VM discovers it is low on free pages, in which
case it notifies various kernel subsystems that it is feeling a bit
cramped -- the same mechanism that, for example, triggers TCP to throw
away reassembly buffers that haven't yet been ACK'd (although they
might have been SACK'd). You might expect this to happen in situations
where first a large load spike happens for a particular UMA type (e.g.,
a DDoS opens lots of TCP connections), and then the objects are freed,
leading to lots of socket/inpcb slabs lying around unused, which
eventually VM will ask to be returned. It is highly desirable for
UMA_ZONE_NOFREE to be removed from zones wherever possible so that
memory can be returned under such circumstances, and it is not a good
feature that the flag is present anywhere.
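
As a rough illustration of the reclaim behaviour described above, here
is a user-space toy model in C. It is not UMA code: toy_zone, toy_slab,
TOY_ZONE_NOFREE and the toy_zone_*() functions are all made-up names,
and the real allocator is considerably more involved. The only point
it makes is that a reclaim pass hands back slabs with no live items,
unless the zone is flagged no-free, in which case nothing is returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag standing in for UMA_ZONE_NOFREE in this toy model. */
#define TOY_ZONE_NOFREE 0x1

struct toy_slab {
	struct toy_slab *next;
	int in_use;		/* items currently allocated from this slab */
};

struct toy_zone {
	struct toy_slab *slabs;	/* singly linked list of slabs */
	int flags;
};

/* Add one empty slab to the zone; returns false if malloc fails. */
static bool
toy_zone_grow(struct toy_zone *z)
{
	struct toy_slab *s = calloc(1, sizeof(*s));

	if (s == NULL)
		return (false);
	s->next = z->slabs;
	z->slabs = s;
	return (true);
}

/* Count the slabs currently held by the zone. */
static int
toy_zone_nslabs(const struct toy_zone *z)
{
	int n = 0;

	for (const struct toy_slab *s = z->slabs; s != NULL; s = s->next)
		n++;
	return (n);
}

/*
 * Model of a low-memory reclaim pass: release every slab that has no
 * live items, unless the zone is marked no-free.  Returns the number
 * of slabs handed back.
 */
static int
toy_zone_reclaim(struct toy_zone *z)
{
	struct toy_slab **sp = &z->slabs;
	int freed = 0;

	if (z->flags & TOY_ZONE_NOFREE)
		return (0);
	while (*sp != NULL) {
		struct toy_slab *s = *sp;

		if (s->in_use == 0) {
			*sp = s->next;	/* unlink, then give memory back */
			free(s);
			freed++;
		} else {
			sp = &s->next;
		}
	}
	return (freed);
}
```

In this model, a zone that grew during a load spike and then went idle
gives all of its empty slabs back on reclaim, while a zone created with
TOY_ZONE_NOFREE keeps them indefinitely.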

Subsystems pick up a dependence on UMA_ZONE_NOFREE if freed objects
might be referenced after free. My understanding is that this is pretty
old inherited behaviour from prior kernel memory allocators that didn't
know how to return memory to VM. Given that property, it was safe to
write code that might, for the purposes of efficiency, assume that it
could walk data structures of the type with fewer synchronisation
overheads -- or where synchronisation isn't possible (e.g., for direct
access to kernel memory via /dev/kmem). We have been attempting to
expunge those assumptions wherever possible -- these days, netstat uses
sysctl()s that acquire references to all live inpcbs, keeping them
valid while they are copied out (you can't hold low-level locks during
copyout(), as sysctl might encounter a paging event writing to user
memory). Convincing yourself that all such assumptions have been
removed is a moderate amount of work, and if you get it wrong, you get
use-after-free races that occur only in low-memory conditions, which
are quite hard to figure out (read: almost impossible).
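
The reference-then-copy-out pattern mentioned above can be sketched in
user-space C roughly as follows. This is an illustration under stated
assumptions, not the kernel's inpcb code: toy_pcb, toy_list_lock and
the toy_*() functions are invented names, a pthread mutex stands in
for the kernel lock, and a plain array store stands in for copyout().

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_SNAPSHOT 16	/* arbitrary bound for this sketch */

struct toy_pcb {
	struct toy_pcb *next;
	int refcount;		/* protected by toy_list_lock */
	int port;		/* immutable after insertion */
};

static pthread_mutex_t toy_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct toy_pcb *toy_list;

/* Create an entry holding one reference, owned by the list. */
static struct toy_pcb *
toy_pcb_insert(int port)
{
	struct toy_pcb *p = calloc(1, sizeof(*p));

	if (p == NULL)
		return (NULL);
	p->port = port;
	p->refcount = 1;
	pthread_mutex_lock(&toy_list_lock);
	p->next = toy_list;
	toy_list = p;
	pthread_mutex_unlock(&toy_list_lock);
	return (p);
}

/* Drop one reference; the memory is freed only when the last one goes. */
static void
toy_pcb_rele(struct toy_pcb *p)
{
	int last;

	pthread_mutex_lock(&toy_list_lock);
	last = (--p->refcount == 0);
	pthread_mutex_unlock(&toy_list_lock);
	if (last)
		free(p);
}

/* Unlink an entry and drop the list's reference to it. */
static void
toy_pcb_remove(struct toy_pcb *p)
{
	pthread_mutex_lock(&toy_list_lock);
	for (struct toy_pcb **pp = &toy_list; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == p) {
			*pp = p->next;
			break;
		}
	}
	pthread_mutex_unlock(&toy_list_lock);
	toy_pcb_rele(p);
}

/* Export up to max ports without holding the lock during the copy. */
static size_t
toy_export_ports(int *out, size_t max)
{
	struct toy_pcb *held[TOY_MAX_SNAPSHOT];
	size_t n = 0;

	if (max > TOY_MAX_SNAPSHOT)
		max = TOY_MAX_SNAPSHOT;

	/* Phase 1: take a reference on each entry under the lock. */
	pthread_mutex_lock(&toy_list_lock);
	for (struct toy_pcb *p = toy_list; p != NULL && n < max; p = p->next) {
		p->refcount++;
		held[n++] = p;
	}
	pthread_mutex_unlock(&toy_list_lock);

	/* Phase 2: copy out unlocked; the references keep memory valid. */
	for (size_t i = 0; i < n; i++)
		out[i] = held[i]->port;	/* stands in for copyout() */

	/* Phase 3: drop the references taken in phase 1. */
	for (size_t i = 0; i < n; i++)
		toy_pcb_rele(held[i]);
	return (n);
}
```

An entry removed by another thread during phase 2 stays allocated until
the exporter drops its reference, so the copy never touches freed
memory and the zone backing it would not need a no-free guarantee.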

Bjoern can say more about what motivated his specific comment -- I had
hoped that we'd quietly lost dependence on NOFREE over the last decade
and could finally garbage collect it, but perhaps not!

Robert