From owner-freebsd-arch@FreeBSD.ORG Sun Oct 5 06:37:57 2014
Date: Sun, 5 Oct 2014 08:37:51 +0200
From: Mateusz Guzik
To: Attilio Rao
Cc: Alan Cox, Konstantin Belousov, Johan Schuijt, "freebsd-arch@freebsd.org"
Subject: Re: [PATCH 1/2] Implement simple sequence counters with memory barriers.

On Sat, Oct 04, 2014 at 11:37:16AM +0200, Attilio Rao wrote:
> On Sat, Oct 4, 2014 at 7:28 AM, Mateusz Guzik wrote:
> > Reviving. Sorry everyone for such a big delay, $life.
> >
> > On Tue, Aug 19, 2014 at 02:24:16PM -0500, Alan Cox wrote:
> >> On Sat, Aug 16, 2014 at 8:26 PM, Mateusz Guzik wrote:
> >> > Well, my memory-barrier-and-so-on-fu is rather weak.
> >> >
> >> > I had another look at the issue. At least on amd64, it looks like only
> >> > a compiler barrier is required for both reads and writes.
> >> >
> >> > According to the AMD64 Architecture Programmer's Manual Volume 2: System
> >> > Programming, 7.2 Multiprocessor Memory Access Ordering:
> >> >
> >> > "Loads do not pass previous loads (loads are not reordered). Stores do
> >> > not pass previous stores (stores are not reordered)"
> >> >
> >> > Since the code modifying stuff only performs a series of writes and we
> >> > expect exclusive writers, I find it applicable to this scenario.
> >> >
> >> > I checked the Linux sources and generated assembly; they indeed issue
> >> > only a compiler barrier on amd64 (and for Intel processors as well).
> >> >
> >> > atomic_store_rel_int on amd64 seems fine in this regard, but the only
> >> > function for loads issues lock cmpxchg, which kills performance
> >> > (median 55693659 -> 12789232 ops in a microbenchmark) for no gain.
> >> >
> >> > Additionally, release and acquire semantics seem to be a stronger
> >> > guarantee than needed.
> >> >
> >>
> >> This statement left me puzzled and got me to look at our x86 atomic.h for
> >> the first time in years. It appears that our implementation of
> >> atomic_load_acq_int() on x86 is, umm ..., unconventional. That is, it is
> >> enforcing a constraint that simple acquire loads don't normally enforce.
> >> For example, the C11 stdatomic.h simple acquire load doesn't enforce this
> >> constraint. Moreover, our own implementation of atomic_load_acq_int() on
> >> ia64, where the mapping from atomic_load_acq_int() to machine instructions
> >> is straightforward, doesn't enforce this constraint either.
> >>
> >
> > By 'this constraint' I presume you mean a full memory barrier.
> >
> > It is unclear to me if one can just get rid of it currently. It
> > definitely would be beneficial.
> >
> > In the meantime, if for some reason a full barrier is still needed, we can
> > speed up concurrent load_acq of the same var considerably. There is no
> > need to lock cmpxchg on the same address. We should be able to replace
> > it with +/-:
> > lock add $0,(%rsp);
> > movl ...;
>
> When I looked into some AMD manual (I think the same one which reports
> using lock add $0,(%rsp)) I recall that the (reported) combined
> instruction latencies of "lock add" + "movl" are higher than that of the
> single "cmpxchg".
> Moreover, I think that the simple movl is going to lock the cache line
> anyway, so I doubt the "lock add" is going to provide any benefit. The
> only benefit I can think of is that we will be able to use _acq()
> barriers on read-only memory with this trick (which is not possible
> today, as the timecounters code can testify).
>
> Whether the latencies for "lock add" + "movl" have changed in the latest
> Intel processors I can't say for sure; it may be worth looking into.
>

I stated in my previous mail that it is faster, and I have a trivial
benchmark to back it up.

In fget_unlocked there is an atomic_load_acq at the beginning (I have
patches which get rid of it, btw). After the code is changed to
lock add + movl, we get a significant speed up in a microbenchmark of
15 threads going read -> fget_unlocked.

x vanilla-readpipe
+ lockadd-readpipe
    N        Min        Max     Median        Avg      Stddev
x  20   11073800   13429593   12266195   12190982   629380.16
+  20   53414354   54152272   53567250   53791945   322012.74
Difference at 95.0% confidence
        4.1601e+07 +/- 319962
        341.244% +/- 2.62458%
        (Student's t, pooled s = 499906)

This is on Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz.

Seems to make sense since we only read from the shared area and the
lock add is performed on addresses private to the executing threads.

fwiw, lock cmpxchg on %rsp gives a comparable speed up.

Of course one would need to actually measure this stuff to get a better
idea of what's really going on within the cpu.
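
For the archives, here is a minimal sketch of the two read-side variants
being compared, amd64 only. The function names are made up for
illustration and this is not the actual patch; the first one roughly
mirrors what the stock atomic_load_acq_int() boils down to, the second
is the proposed lock add + movl scheme:

#include <sys/types.h>

/*
 * Roughly the stock behaviour: a locked cmpxchg on the target address
 * itself.  Concurrent readers of the same variable keep stealing the
 * cache line from each other.
 */
static __inline u_int
load_acq_cmpxchg(volatile u_int *p)
{
	u_int res, src;

	res = 0;
	src = 0;
	__asm __volatile("lock; cmpxchgl %2,%1"
	    : "+a" (res), "+m" (*p)
	    : "r" (src)
	    : "memory", "cc");
	return (res);
}

/*
 * Proposed variant: a locked no-op on the thread's own stack provides
 * the full fence, followed by a plain load of the target.  Readers no
 * longer contend on the cache line holding the variable being read.
 */
static __inline u_int
load_acq_lockadd(volatile u_int *p)
{
	u_int res;

	__asm __volatile("lock; addl $0,(%%rsp)" : : : "memory", "cc");
	res = *p;
	return (res);
}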
-- 
Mateusz Guzik

From owner-freebsd-arch@FreeBSD.ORG Tue Oct 7 18:23:22 2014
Date: Tue, 7 Oct 2014 18:23:14 +0000
From: "Bjoern A. Zeeb"
Reply-To: bz@FreeBSD.org
To: arch@FreeBSD.org
Subject: PMC "unhalted-core-cycles" counted 'Branch Instruction Retired' [was: Fwd: svn commit: r272713 - head/sys/dev/hwpmc]

Hi,

in case you have used "unhalted-core-cycles" for any PMC measurements
in the last 17 months in HEAD or stable/10 (possibly more branches),
you might want to go and revisit your results.

In the interest of my research I'd appreciate it if you could also drop
me a private email (Reply-To set, don't reply to all for this unless
you want to) if this might have affected any of your work.

Best Regards,
Bjoern

Begin forwarded message:

> From: "Bjoern A. Zeeb"
> Subject: svn commit: r272713 - head/sys/dev/hwpmc
> Date: 7 Oct 2014 18:00:35 GMT
> To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
>
> Author: bz
> Date: Tue Oct 7 18:00:34 2014
> New Revision: 272713
> URL: https://svnweb.freebsd.org/changeset/base/272713
>
> Log:
>   Since introducing the extra mapping in r250103 for architectural performance
>   events we have actually counted 'Branch Instruction Retired' when people
>   asked for 'Unhalted core cycles' using the 'unhalted-core-cycles' event mask
>   mnemonic.
>
>   Reviewed by:	jimharris
>   Discussed with:	gnn, rwatson
>   MFC after:	3 days
>   Sponsored by:	DARPA/AFRL
>
> Modified:
>   head/sys/dev/hwpmc/hwpmc_core.c
>
> Modified: head/sys/dev/hwpmc/hwpmc_core.c
> ==============================================================================
> --- head/sys/dev/hwpmc/hwpmc_core.c	Tue Oct  7 17:39:30 2014	(r272712)
> +++ head/sys/dev/hwpmc/hwpmc_core.c	Tue Oct  7 18:00:34 2014	(r272713)
> @@ -1796,7 +1796,7 @@ iap_is_event_architectural(enum pmc_even
>  	switch (pe) {
>  	case PMC_EV_IAP_ARCH_UNH_COR_CYC:
>  		ae = CORE_AE_UNHALTED_CORE_CYCLES;
> -		*map = PMC_EV_IAP_EVENT_C4H_00H;
> +		*map = PMC_EV_IAP_EVENT_3CH_00H;
>  		break;
>  	case PMC_EV_IAP_ARCH_INS_RET:
>  		ae = CORE_AE_INSTRUCTION_RETIRED;
>

-- 
Bjoern A. Zeeb
"Come on. Learn, goddamn it.", WarGames, 1983

From owner-freebsd-arch@FreeBSD.ORG Wed Oct 8 01:37:04 2014
Date: Tue, 7 Oct 2014 18:37:03 -0700
From: Adrian Chadd
To: "freebsd-arch@freebsd.org"
Subject: [rfc] enumerating device / bus domain information
Hi,

Right now we're not enumerating any NUMA domain information about devices.

The more recent intel NUMA stuff has some extra affinity information
for devices that (eventually) will allow us to bind kernel/user
threads and/or memory allocation to devices to keep access local.
There's a penalty for DMAing in/out of remote memory, so we'll want to
figure out what counts as "local" for memory allocation and perhaps
constrain the CPU set that worker threads for a device run on.

This patch adds a few things:

* it adds a bus_if.m method for fetching the VM domain ID of a given
  device, or ENOENT if it's not in a VM domain;
* it adds some hooks to print the numa-domain of a device if it exists;
* it adds hooks in srat.c to store the original proximity-id values
  and uses them to map PXM to FreeBSD VM domain IDs;
* the ACPI code now has bus methods to enumerate which PXM (and thus
  which VM domain) a device is on.

The review for it is here:

https://reviews.freebsd.org/D906

Please ignore the vm_phys.c patch; it's purely for experimenting on my
side and won't be committed as part of this work.

Thanks,


-a

From owner-freebsd-arch@FreeBSD.ORG Wed Oct 8 15:38:07 2014
Date: Wed, 8 Oct 2014 17:38:06 +0200
From: Svatopluk Kraus
To: Alan Cox
Cc: alc@freebsd.org, FreeBSD Arch
Subject: Re: vm_page_array and VM_PHYSSEG_SPARSE

On Mon, Sep 29, 2014 at 3:00 AM, Alan Cox wrote:
> On 09/27/2014 03:51, Svatopluk Kraus wrote:
> > On Fri, Sep 26, 2014 at 8:08 PM, Alan Cox wrote:
> >>
>> On Wed, Sep 24, 2014 at 7:27 AM, Svatopluk Kraus wrote:
>>
>>> Hi,
>>>
>>> I and Michal are finishing the new ARM pmap-v6 code. There is one
>>> problem we've dealt with somehow, but now we would like to do it
>>> better. It's about physical pages which are allocated before the vm
>>> subsystem is initialized. While later on these pages could be found in
>>> vm_page_array when the VM_PHYSSEG_DENSE memory model is used, it's not
>>> true for the VM_PHYSSEG_SPARSE memory model. And the ARM world uses the
>>> VM_PHYSSEG_SPARSE model.
>>>
>>> It really would be nice to utilize vm_page_array for such preallocated
>>> physical pages even when the VM_PHYSSEG_SPARSE memory model is used.
>>> Things could be much easier then. In our case, it's about pages which
>>> are used for level 2 page tables. In the VM_PHYSSEG_SPARSE model, we
>>> have two sets of such pages. The first ones are preallocated and the
>>> second ones are allocated after the vm subsystem has been initialized.
>>> We must deal with each set differently. So the code is more complex
>>> and so is debugging.
>>>
>>> Thus we need some method of saying that some part of physical memory
>>> should be included in vm_page_array, but that the pages from that
>>> region should not be put on the free list during initialization. We
>>> think that such a possibility could be utilized in general. There could
>>> be a need for some physical space which:
>>>
>>> (1) is needed only during boot and later on can be freed and put to the
>>> vm subsystem,
>>>
>>> (2) is needed for something else and the vm_page_array code could be
>>> used without some kind of duplication.
>>>
>>> There is already some code which deals with blacklisted pages in the
>>> vm_page.c file. So the easiest way to deal with the presented situation
>>> is to add some callback to this part of the code which will be able to
>>> either exclude a whole phys_avail[i], phys_avail[i+1] region or single
>>> pages. As the biggest phys_avail region is used for vm subsystem
>>> allocations, there would need to be some more coding. (However,
>>> blacklisted pages are not dealt with in that part of the region.)
>>>
>>> We would like to know if there is any objection:
>>>
>>> (1) to dealing with the presented problem,
>>> (2) to dealing with the problem in the presented way.
>>>
>>> Some help is very appreciated. Thanks
>>>
>>
>> As an experiment, try modifying vm_phys.c to use dump_avail instead of
>> phys_avail when sizing vm_page_array. On amd64, where the same problem
>> exists, this allowed me to use VM_PHYSSEG_SPARSE. Right now, this is
>> probably my preferred solution. The catch being that not all
>> architectures implement dump_avail, but my recollection is that arm does.
>>
>
> > Frankly, I would prefer this too, but there is one big open question:
> > What is dump_avail for?
> >
>
> dump_avail[] is solving a similar problem in the minidump code, hence the
> prefix "dump_" in its name. In other words, the minidump code couldn't use
> phys_avail[] either, because it didn't describe the full range of physical
> addresses that might be included in a minidump, so dump_avail[] was
> created.
>
> There is already precedent for what I'm suggesting. dump_avail[] is
> already (ab)used outside of the minidump code on x86 to solve this same
> problem in x86/x86/nexus.c, and on arm in arm/arm/mem.c.
>
> > Using it for vm_page_array initialization and segmentation means that
> > phys_avail must be a subset of it. And this must be stated and be
> > visible enough.
> > Maybe it should even be checked in code. I like the idea of thinking
> > about dump_avail as something that describes all memory in a system,
> > but it's not how dump_avail is defined in the archs now.
>
> When you say "it's not how dump_avail is defined in archs now", I'm not
> sure whether you're talking about the code or the comments. In terms of
> code, dump_avail[] is a superset of phys_avail[], and I'm not aware of any
> code that would have to change. In terms of comments, I did a grep looking
> for comments defining what dump_avail[] is, because I couldn't remember
> any. I found one ... on arm. So, I don't think it's an onerous task
> changing the definition of dump_avail[]. :-)
>
> Already, as things stand today with dump_avail[] being used outside of the
> minidump code, one could reasonably argue that it should be renamed to
> something like phys_exists[].
>
> > I will experiment with it on Monday then. However, it's not only about
> > how memory segments are created in vm_phys.c, but it's about how the
> > vm_page_array size is computed in vm_page.c too.
>
> Yes, and there is also a place in vm_reserv.c that needs to change. I've
> attached the patch that I developed and tested a long time ago. It still
> applies cleanly and runs ok on amd64.
>

Well, I've created and tested a minimalistic patch which - I hope - is
committable. It runs ok on pandaboard (arm-v6) and solves the presented
problem. I would really appreciate it if this could be committed. Thanks.

BTW, while I was inspecting all the archs, I think that maybe it's time
to do what was done for busdma not long ago. There is a lot of similar
code across the archs which deals with physical memory and could be
generalized and put into kern/subr_physmem.c for common use. All work
with physical memory could be simplified to two arrays of regions:

phys_present[] ... describes all present physical memory regions
phys_exclude[] ... describes various exclusions from phys_present[]

Each excluded region will be labeled by flags saying what kind of
exclusion it is. Flags like NODUMP, NOALLOC, NOMANAGE, NOBOUNCE,
NOMEMRW could be combined. This idea is taken from sys/arm/arm/physmem.c.

All the other arrays like phys_managed[], phys_avail[], dump_avail[]
will be created from these phys_present[] and phys_exclude[]. This way
the bootstrap code in the archs could be simplified and unified. For
example, dealing with either hw.physmem or the page at PA 0x00000000
could be transparent.

I'm prepared to volunteer if the idea is ripe. However, I will be
looking for a mentor.
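
Roughly, I imagine the interface looking something like the sketch below
(modeled on sys/arm/arm/physmem.c; the names, flag values and limits are
only illustrative, nothing here is committed code). phys_avail[],
dump_avail[] and phys_managed[] would then be derived by walking
phys_present[] and subtracting the exclusions with the matching flags:

#include <sys/types.h>

#define	EXFLAG_NOALLOC	0x01	/* keep out of phys_avail[] */
#define	EXFLAG_NODUMP	0x02	/* keep out of dump_avail[] */
#define	EXFLAG_NOMANAGE	0x04	/* keep out of phys_managed[] */

#define	PHYS_MAXREGIONS	32	/* example limit only */

struct physmem_region {
	vm_paddr_t	addr;
	vm_size_t	size;
	uint32_t	flags;	/* EXFLAG_* for exclusions, else 0 */
};

static struct physmem_region phys_present[PHYS_MAXREGIONS];
static struct physmem_region phys_exclude[PHYS_MAXREGIONS];
static int npresent, nexclude;

/* Arch bootstrap code registers every RAM region the platform reports. */
static void
physmem_hardware_region(vm_paddr_t pa, vm_size_t sz)
{

	phys_present[npresent].addr = pa;
	phys_present[npresent].size = sz;
	npresent++;
}

/* Carve out preallocated page tables, the page at PA 0, hw.physmem, ... */
static void
physmem_exclude_region(vm_paddr_t pa, vm_size_t sz, uint32_t flags)
{

	phys_exclude[nexclude].addr = pa;
	phys_exclude[nexclude].size = sz;
	phys_exclude[nexclude].flags = flags;
	nexclude++;
}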
Svata

[Attachment: phys_managed.patch (base64 content omitted)]
From owner-freebsd-arch@FreeBSD.ORG Wed Oct 8 19:07:24 2014
Date: Wed, 8 Oct 2014 13:07:04 -0600
From: Warner Losh
To: Adrian Chadd
Cc: "freebsd-arch@freebsd.org"
Subject: Re: [rfc] enumerating device / bus domain information
On Oct 7, 2014, at 7:37 PM, Adrian Chadd wrote:

> Hi,
>
> Right now we're not enumerating any NUMA domain information about devices.
>
> The more recent intel NUMA stuff has some extra affinity information
> for devices that (eventually) will allow us to bind kernel/user
> threads and/or memory allocation to devices to keep access local.
> There's a penalty for DMAing in/out of remote memory, so we'll want to
> figure out what counts as "local" for memory allocation and perhaps
> constrain the CPU set that worker threads for a device run on.
>
> This patch adds a few things:
>
> * it adds a bus_if.m method for fetching the VM domain ID of a given
>   device, or ENOENT if it's not in a VM domain;

Maybe a default VM domain. All devices are in VM domains :) By default
today, we have only one VM domain, and that's the model that most of the
code expects…

Warner

> * it adds some hooks to print the numa-domain of a device if it exists;
> * it adds hooks in srat.c to store the original proximity-id values
>   and uses them to map PXM to FreeBSD VM domain IDs;
> * the ACPI code now has bus methods to enumerate which PXM (and thus
>   which VM domain) a device is on.
>
> The review for it is here:
>
> https://reviews.freebsd.org/D906
>
> Please ignore the vm_phys.c patch; it's purely for experimenting on my
> side and won't be committed as part of this work.
>
> Thanks,
>
>
> -a

From owner-freebsd-arch@FreeBSD.ORG Wed Oct 8 23:12:59 2014
Date: Wed, 8 Oct 2014 16:12:57 -0700
From: Adrian Chadd
To: Warner Losh
Cc: "freebsd-arch@freebsd.org"
Subject: Re: [rfc] enumerating device / bus domain information
On 8 October 2014 12:07, Warner Losh wrote:
>
> On Oct 7, 2014, at 7:37 PM, Adrian Chadd wrote:
>
>> Hi,
>>
>> Right now we're not enumerating any NUMA domain information about devices.
>>
>> The more recent intel NUMA stuff has some extra affinity information
>> for devices that (eventually) will allow us to bind kernel/user
>> threads and/or memory allocation to devices to keep access local.
>> There's a penalty for DMAing in/out of remote memory, so we'll want to
>> figure out what counts as "local" for memory allocation and perhaps
>> constrain the CPU set that worker threads for a device run on.
>>
>> This patch adds a few things:
>>
>> * it adds a bus_if.m method for fetching the VM domain ID of a given
>>   device, or ENOENT if it's not in a VM domain;
>
> Maybe a default VM domain. All devices are in VM domains :) By default
> today, we have only one VM domain, and that's the model that most of the
> code expects…

Right, and that doesn't change until you compile in with num domains > 1.

Then, CPUs and memory have VM domains, but devices may or may not have
a VM domain. There's no "default" VM domain defined if num domains > 1.

The devices themselves don't know about VM domains right now, so
there's nothing constraining things like IRQ routing, CPU set, memory
allocation, etc. The isilon team is working on extending the cpuset
and allocators to "know" about numa and I'm sure this stuff will fall
out of whatever they're working on.

So when I go to add sysctl and other tree knowledge for device -> vm
domain mapping I'm going to make them return -1 for "no domain."

(Things will get pretty hilarious later on if we have devices that are
"local" to two or more VM domains ..)
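
To make that concrete, a consumer would look something like the sketch
below. The bus_get_domain() name is only a placeholder for whatever
accessor ends up wrapping the bus_if.m method from D906; the -1
convention is the "no domain" value discussed above:

#include <sys/param.h>
#include <sys/bus.h>

/*
 * Illustrative only: report a device's NUMA domain, falling back to -1
 * ("no domain") when the bus cannot place the device in any domain.
 * bus_get_domain() stands in for the proposed bus_if.m accessor.
 */
static void
report_device_domain(device_t dev)
{
	int domain;

	if (bus_get_domain(dev, &domain) != 0)
		domain = -1;

	if (domain == -1)
		device_printf(dev, "no NUMA domain affinity\n");
	else
		device_printf(dev, "numa-domain %d\n", domain);
}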

-a

From owner-freebsd-arch@FreeBSD.ORG Thu Oct 9 19:40:18 2014
Date: Thu, 9 Oct 2014 12:40:15 -0700
From: Adrian Chadd
To: Konstantin Belousov, "freebsd-arch@freebsd.org"
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r272800 - head/sys/x86/acpica

On 9 October 2014 11:23, Konstantin Belousov wrote:
> I do not like it. Sorry for not looking at the web thing, I have very
> little time.
>
> It never was an intention that one proximity domain reported by ACPI
> was mapped to a single VM domain. VM could split domains (in terms of
> vm_domains) further for other reasons. The main motivation is that there
> is a 1:1 relation between domains/page queues/page queue locks/pagedaemons.
>
> I have patches in WIP stage which split firmware proximity domains
> further, to decrease congestion on the page queue locks. I wrote about
> this in the pgsql performance report.
>
> The short version is that there is/will be an N:1 relation between VM
> domains and proximity domains (which are what ACPI reports for devices).

Hi,

Well, we'll have to come up with an alternate design for all of this then.

If we're going to actively define VM domains to be more than 1:1 VM
domain to proximity domain then we're going to have to introduce
proximity domains as a separate construct to the VM/NUMA system.

(This is all fallout from this stuff not really being well defined and
multiple people having differing ideas of what things may mean.)
So let's flesh out what that's going to look like so we can mutate this
interface and the general NUMA side of things into something that's
useful.

It may be enough to store the PXM map (renumbered to origin from 0 and
made non-sparse) and then have a different mapping from PXM to VM domain.


-a

From owner-freebsd-arch@FreeBSD.ORG Fri Oct 10 03:54:02 2014
Date: Thu, 9 Oct 2014 21:53:52 -0600
From: Warner Losh
To: Adrian Chadd
Cc: "freebsd-arch@freebsd.org"
Subject: Re: [rfc] enumerating device / bus domain information

On Oct 8, 2014, at 5:12 PM, Adrian Chadd wrote:

> On 8 October 2014 12:07, Warner Losh wrote:
>>
>> On Oct 7, 2014, at 7:37 PM, Adrian Chadd wrote:
>>
>>> Hi,
>>>
>>> Right now we're not enumerating any NUMA domain information about devices.
>>>
>>> The more recent intel NUMA stuff has some extra affinity information
>>> for devices that (eventually) will allow us to bind kernel/user
>>> threads and/or memory allocation to devices to keep access local.
>>> There's a penalty for DMAing in/out of remote memory, so we'll want to
>>> figure out what counts as "local" for memory allocation and perhaps
>>> constrain the CPU set that worker threads for a device run on.
>>>
>>> This patch adds a few things:
>>>
>>> * it adds a bus_if.m method for fetching the VM domain ID of a given
>>>   device, or ENOENT if it's not in a VM domain;
>>
>> Maybe a default VM domain. All devices are in VM domains :) By default
>> today, we have only one VM domain, and that's the model that most of the
>> code expects…
>
> Right, and that doesn't change until you compile in with num domains > 1.

The first part of the statement doesn't change when the number of domains
is more than one. All devices are in a VM domain.

> Then, CPUs and memory have VM domains, but devices may or may not have
> a VM domain. There's no "default" VM domain defined if num domains > 1.

Please explain how a device cannot have a VM domain? For the
terminology I'm familiar with, to even get cycles to the device, you have
to have a memory address (or an I/O port). That memory address has to
necessarily map to some domain, even if that domain is equally sucky to
get to from all CPUs (as is the case with I/O ports). While there may not
be a "default" domain, by virtue of its physical location it has to have
one.

> The devices themselves don't know about VM domains right now, so
> there's nothing constraining things like IRQ routing, CPU set, memory
> allocation, etc. The isilon team is working on extending the cpuset
> and allocators to "know" about numa and I'm sure this stuff will fall
> out of whatever they're working on.

Why would the device need to know the domain? Why aren't the IRQs,
for example, steered to the appropriate CPU? Why doesn't the bus handle
allocating memory for it in the appropriate place? How does this "domain"
tie into memory allocation and thread creation?

> So when I go to add sysctl and other tree knowledge for device -> vm
> domain mapping I'm going to make them return -1 for "no domain."

Seems like there's too many things lumped together here. First off, how
can there be no domain? That just hurts my brain. It has to be in some
domain, or it can't be seen. Maybe this domain is one that sucks for
everybody to access, maybe it is one that's fast for some CPU or package
of CPUs to access, but it has to have a domain.

> (Things will get pretty hilarious later on if we have devices that are
> "local" to two or more VM domains ..)

Well, devices aren't local to domains, per se. Devices can communicate
with other components in a system at a given cost. One NUMA model is
"near" vs "far", where a single near domain exists and all the "far"
resources are quite costly. Other NUMA models may have a wider range of
costs, so that some resources are cheap, others are a little less cheap,
while others are downright expensive depending on how far across the
fabric of interconnects the messages need to travel. While one can model
this as a full 1-1 partitioning, that doesn't match all of the extant
implementations, even today. It is easy, but an imperfect match to the
underlying realities in many cases (though a very good match to x86,
which is mostly what we care about).
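
(For what it's worth, ACPI's SLIT already expresses that wider cost model
as an NxN matrix of relative distances, with 10 meaning local. Something
like the sketch below could pick the cheapest domain for a device's
proximity domain; the array and helper are invented purely for
illustration, not proposed code.)

#include <sys/types.h>

#define	NDOMAIN	4			/* example size only */

static uint8_t slit_dist[NDOMAIN][NDOMAIN];	/* relative distances from the SLIT */

static int
cheapest_domain(int dev_domain)
{
	int d, best;

	best = dev_domain;
	for (d = 0; d < NDOMAIN; d++)
		if (slit_dist[dev_domain][d] < slit_dist[dev_domain][best])
			best = d;
	return (best);
}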
Warner

From owner-freebsd-arch@FreeBSD.ORG Fri Oct 10 08:23:53 2014
Date: Fri, 10 Oct 2014 01:23:50 -0700
From: Adrian Chadd
To: Warner Losh
Cc: "freebsd-arch@freebsd.org"
Subject: Re: [rfc] enumerating device / bus domain information
... because for some topologies, as you've said, the devices are equal
cost to all CPUs/memories. In the current parlance they don't have a VM
domain and they're not assigned a proximity domain at all.

We currently don't have a concept of "all domains" when compiling a
kernel with a VM domain count of > 1. Now, should there be one? Yes.
Should it be -1? Who knows. Let's have that discussion. Maybe we can
just label -1 as the "default whole machine VM domain" and the behaviour
won't change.

And yes, I know about device costs. I was thinking about implementing
this using the device/CPU/memory _PXM and, for platforms with SLIT
tables, mapping out the actual costs to the various other proximity
domains. I'll eventually write a SLIT parser anyway, as we may want
memory allocation to be "local" vs varying levels of "less local" rather
than just "local" vs "non-local".

Now, whether the bus code can completely enumerate all of the
requirements for allocating device-local memory is another discussion.
But drivers right now don't consistently say "please route stuff
locally"; nor do they consistently say "please give me the cpuset of
things that are local so I can decide how many default queues to
allocate and where to constrain them to run" - they just arbitrarily
pick how much of what to run where and how.

One simple version of the device/bus locality for this stuff is now in
-HEAD. There's a little missing glue to do. Let's have the conversation
of "how should drivers do this stuff and have the defaults behave
consistently but be easily changed" so we can start experimenting with
different ways to do all of this and move towards something that'll
appear in -11.


-a

From owner-freebsd-arch@FreeBSD.ORG Fri Oct 10 15:58:08 2014
Date: Fri, 10 Oct 2014 11:14:50 -0400
From: John Baldwin
To: freebsd-arch@freebsd.org
Cc: Adrian Chadd
Subject: Re: [rfc] enumerating device / bus domain information

On Thursday, October 09, 2014 09:53:52 PM Warner Losh wrote:
> On Oct 8, 2014, at 5:12 PM, Adrian Chadd wrote:
> > On 8 October 2014 12:07, Warner Losh wrote:
October 2014 12:07, Warner Losh wrote: > >> On Oct 7, 2014, at 7:37 PM, Adrian Chadd wrote: > >>> Hi, > >>> > >>> Right now we're not enumerating any NUMA domain information about > >>> devices. > >>> > >>> The more recent intel NUMA stuff has some extra affinity information > >>> for devices that (eventually) will allow us to bind kernel/user > >>> threads and/or memory allocation to devices to keep access local. > >>> There's a penalty for DMAing in/out of remote memory, so we'll want to > >>> figure out what counts as "Local" for memory allocation and perhaps > >>> constrain the CPU set that worker threads for a device run on. > >>> > >>> This patch adds a few things: > >>> > >>> * it adds a bus_if.m method for fetching the VM domain ID of a given > >>> device; or ENOENT if it's not in a VM domain; > >> > >> Maybe a default VM domain. All devices are in VM domains :) By default > >> today, we have only one VM domain, and that's the model that most of the > >> code expects... > > > > Right, and that doesn't change until you compile in with num domains > 1. > > The first part of the statement doesn't change when the number of domains > is more than one. All devices are in a VM domain. > > > Then, CPUs and memory have VM domains, but devices may or may not have > > a VM domain. There's no "default" VM domain defined if num domains > > > 1. > > Please explain how a device cannot have a VM domain? For the > terminology I'm familiar with, to even get cycles to the device, you have to > have a memory address (or an I/O port). That memory address has to > necessarily map to some domain, even if that domain is equally sucky to get > to from all CPUs (as is the case with I/O ports). while there may not be a > "default" domain, by virtue of its physical location it has to have one. > > > The devices themselves don't know about VM domains right now, so > > there's nothing constraining things like IRQ routing, CPU set, memory > > allocation, etc. The isilon team is working on extending the cpuset > > and allocators to "know" about numa and I'm sure this stuff will fall > > out of whatever they're working on. > > Why would the device need to know the domain? Why aren't the IRQs, > for example, steered to the appropriate CPU? Why doesn't the bus handle > allocating memory for it in the appropriate place? How does this "domain" > tie into memory allocation and thread creation? Because that's not what you always want (though it often is). However, another reason is that system administrators want to know what devices are close to. You can sort of figure it out from devinfo on a modern x86 machine if you squint right, but isn't super obvious. I have a followup patch that adds a new per-device '%domain' sysctl node so that it is easier to see which domain a device is close to. In real-world experience this can be useful as it lets a sysadmin/developer know which CPUs to schedule processes on. (Note that it doesn't always mean you put them close to the device. Sometimes you have processes that are more important than others, so you tie those close to the NIC and shove the other ones over to the "wrong" domain because you don't care if they have higher latency.)
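As a rough sketch only, with hypothetical names and not the actual patch under review, the bus_if.m method being discussed could be given a pass-to-parent default, so that a host bridge which knows its locality answers for all of its children while the root of the tree reports ENOENT rather than guessing a domain:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/bus.h>

#include "bus_if.h"             /* generated from bus_if.m */

/*
 * Hypothetical default method: delegate the question to the parent bus.
 * A host-PCI bridge (or nexus) that knows its proximity domain would
 * override this; everything below it then inherits the answer.
 */
static int
bus_generic_get_domain(device_t bus, device_t child, int *domain)
{
        device_t parent;

        parent = device_get_parent(bus);
        if (parent != NULL)
                return (BUS_GET_DOMAIN(parent, bus, domain));
        /* Root of the device tree: no NUMA information available. */
        return (ENOENT);
}

/* A driver could then report its locality along these lines: */
static void
mydrv_report_domain(device_t dev)
{
        int domain;

        if (BUS_GET_DOMAIN(device_get_parent(dev), dev, &domain) == 0)
                device_printf(dev, "NUMA domain %d\n", domain);
        else
                device_printf(dev, "no NUMA domain information\n");
}

The ENOENT return mirrors the patch description quoted above: "unknown" stays distinguishable from "domain 0".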
> > So when I go to add sysctl and other tree knowledge for device -> vm > > domain mapping I'm going to make them return -1 for "no domain." > > Seems like there's too many things lumped together here. First off, how > can there be no domain. That just hurts my brain. It has to be in some > domain, or it can't be seen. Maybe this domain is one that sucks for > everybody to access, maybe it is one that's fast for some CPU or package of > CPUs to access, but it has to have a domain. They are not always tied to a single NUMA domain. On some dual-socket Nehalem/Westmere class machines with per-CPU memory controllers (so 2 NUMA domains) you will have a single I/O hub that is directly connected to both CPUs. Thus, all memory in the system is equi-distant for I/O (but not for CPU access). The other problem is that you simply may not know. Not all BIOSes correctly communicate this information for devices. For example, certain 1U Romley servers I have worked with properly enumerate CPU <--> memory relationships in the SRAT table, but they fail to include the necessary _PXM method in the top-level PCI bus devices (that correspond to the I/O hub). In that case, returning a domain of 0 may very well be wrong. (In fact, for these particular machines it mostly _is_ wrong as the expansion slots are all tied to NUMA domain 1, not 0.) > > (Things will get pretty hilarious later on if we have devices that are > > "local" to two or more VM domains ..) > > Well, devices aren't local to domains, per se. Devices can communicate with > other components in a system at a given cost. One NUMA model is "near" vs > "far" where a single near domain exists and all the "far" resources are > quite costly. Other NUMA models may have a wider range of costs so that > some resources are cheap, others are a little less cheap, while others are > down right expensive depending on how far across the fabric of > interconnects the messages need to travel. While one can model this as a > full 1-1 partitioning, that doesn't match all of the extant > implementations, even today. It is easy, but an imperfect match to the > underlying realities in many cases (though a very good match to x86, which > is mostly what we care about). Even x86 already has a notion of multiple layers of cost. You can get that today if you buy a 4 socket Intel system. It seems you might also get that if you get a dual socket Haswell system with more than 8 cores per package (due to the funky split-brain thing on higher core count Haswells). I believe AMD also ships CPUs that contain 2 NUMA domains within a single physical package as well. Note that the I/O thing is becoming far more urgent in the past few years on x86. With Nehalem/Westmere having I/O being remote or local didn't seem to matter very much (you could only measure very small differences in latency or throughput between the two scenarios in my experience). On Romley (Sandy Bridge) and later it can be a very substantial difference in terms of both latency and throughput.
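For the _PXM case specifically, the lookup being described is roughly the following. acpi_GetInteger() is the usual helper for evaluating an integer ACPI method, but the function name, the mapping helper and the refusal to fall back to domain 0 are written here as assumptions, not as code from the proposal:

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/bus.h>

#include <contrib/dev/acpica/include/acpi.h>
#include <dev/acpica/acpivar.h>

/*
 * Sketch: derive a VM domain for a device from its ACPI _PXM method.
 * Returns 0 and fills *domain on success, or ENOENT when the BIOS did
 * not provide _PXM (as on the Romley boxes described above), so that
 * callers never mistake "unknown" for domain 0.
 */
static int
acpi_device_get_domain(device_t dev, int *domain)
{
        ACPI_HANDLE handle;
        int pxm;

        handle = acpi_get_handle(dev);
        if (handle == NULL)
                return (ENOENT);

        /* _PXM yields an ACPI proximity domain, not a VM domain. */
        if (ACPI_FAILURE(acpi_GetInteger(handle, "_PXM", &pxm)))
                return (ENOENT);

        /* Assumed helper: translate the SRAT proximity domain index. */
        *domain = acpi_map_pxm_to_vm_domainid(pxm);
        if (*domain < 0)
                return (ENOENT);
        return (0);
}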
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Fri Oct 10 18:07:07 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 1E964F13; Fri, 10 Oct 2014 18:07:07 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B4DC9D09; Fri, 10 Oct 2014 18:07:06 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id s9AI70eV035985 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 10 Oct 2014 21:07:00 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua s9AI70eV035985 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id s9AI702f035984; Fri, 10 Oct 2014 21:07:00 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 10 Oct 2014 21:07:00 +0300 From: Konstantin Belousov To: John Baldwin Subject: Re: [rfc] enumerating device / bus domain information Message-ID: <20141010180700.GS2153@kib.kiev.ua> References: <838B58B2-22D6-4AA4-97D5-62E87101F234@bsdimp.com> <4435143.bthBSP8NlX@ralph.baldwin.cx> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4435143.bthBSP8NlX@ralph.baldwin.cx> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home Cc: Adrian Chadd , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 10 Oct 2014 18:07:07 -0000 On Fri, Oct 10, 2014 at 11:14:50AM -0400, John Baldwin wrote: > On Thursday, October 09, 2014 09:53:52 PM Warner Losh wrote: > > On Oct 8, 2014, at 5:12 PM, Adrian Chadd wrote: > > > On 8 October 2014 12:07, Warner Losh wrote: > > >> On Oct 7, 2014, at 7:37 PM, Adrian Chadd wrote: > > >>> Hi, > > >>> > > >>> Right now we're not enumerating any NUMA domain information about > > >>> devices. > > >>> > > >>> The more recent intel NUMA stuff has some extra affinity information > > >>> for devices that (eventually) will allow us to bind kernel/user > > >>> threads and/or memory allocation to devices to keep access local. > > >>> There's a penalty for DMAing in/out of remote memory, so we'll want to > > >>> figure out what counts as "Local" for memory allocation and perhaps > > >>> constrain the CPU set that worker threads for a device run on. > > >>> > > >>> This patch adds a few things: > > >>> > > >>> * it adds a bus_if.m method for fetching the VM domain ID of a given > > >>> device; or ENOENT if it's not in a VM domain; > > >> > > >> Maybe a default VM domain. All devices are in VM domains :) By default > > >> today, we have only one VM domain, and that's the model that most of the > > >> code expects... > > > > > > Right, and that doesn't change until you compile in with num domains > 1.
> > > > The first part of the statement doesn't change when the number of domains > > is more than one. All devices are in a VM domain. > > > > > Then, CPUs and memory have VM domains, but devices may or may not have > > > a VM domain. There's no "default" VM domain defined if num domains > > > > 1. > > > > Please explain how a device cannot have a VM domain? For the > > terminology I'm familiar with, to even get cycles to the device, you have to > > have a memory address (or an I/O port). That memory address has to > > necessarily map to some domain, even if that domain is equally sucky to get > > to from all CPUs (as is the case with I/O ports). while there may not be a > > "default" domain, by virtue of its physical location it has to have one. > > > > > The devices themselves don't know about VM domains right now, so > > > there's nothing constraining things like IRQ routing, CPU set, memory > > > allocation, etc. The isilon team is working on extending the cpuset > > > and allocators to "know" about numa and I'm sure this stuff will fall > > > out of whatever they're working on. > > > > Why would the device need to know the domain? Why aren't the IRQs, > > for example, steered to the appropriate CPU? Why doesn't the bus handle > > allocating memory for it in the appropriate place? How does this "domain" > > tie into memory allocation and thread creation? > > Because that's not what you always want (though it often is). However, > another reason is that system administrators want to know what devices > are close to. You can sort of figure it out from devinfo on a modern > x86 machine if you squint right, but isn't super obvious. I have a followup > patch that adds a new per-device '%domain' sysctl node so that it is > easier to see which domain a device is close to. In real-world experience > this can be useful as it lets a sysadmin/developer know which CPUs to > schedule processes on. (Note that it doesn't always mean you put them > close to the device. Sometimes you have processes that are more important > than others, so you tie those close to the NIC and shove the other ones over > to the "wrong" domain because you don't care if they have higher latency.) > > > > So when I go to add sysctl and other tree knowledge for device -> vm > > > domain mapping I'm going to make them return -1 for "no domain." > > > > Seems like there's too many things lumped together here. First off, how > > can there be no domain. That just hurts my brain. It has to be in some > > domain, or it can't be seen. Maybe this domain is one that sucks for > > everybody to access, maybe it is one that's fast for some CPU or package of > > CPUs to access, but it has to have a domain. > > They are not always tied to a single NUMA domain. On some dual-socket > Nehalem/Westmere class machines with per-CPU memory controllers (so 2 NUMA > domains) you will have a single I/O hub that is directly connected to both > CPUs. Thus, all memory in the system is equi-distant for I/O (but not for CPU > access). > > The other problem is that you simply may not know. Not all BIOSes correctly > communicate this information for devices. For example, certain 1U Romley > servers I have worked with properly enumerate CPU <--> memory relationships in > the SRAT table, but they fail to include the necessary _PXM method in the top- > level PCI bus devices (that correspond to the I/O hub). In that case, > returning a domain of 0 may very well be wrong.
(In fact, for these > particular machines it mostly _is_ wrong as the expansion slots are all tied > to NUMA domain 1, not 0.) > > > > (Things will get pretty hilarious later on if we have devices that are > > > "local" to two or more VM domains ..) > > > > Well, devices aren't local to domains, per se. Devices can communicate with > > other components in a system at a given cost. One NUMA model is "near" vs > > "far" where a single near domain exists and all the "far" resources are > > quite costly. Other NUMA models may have a wider range of costs so that > > some resources are cheap, others are a little less cheap, while others are > > down right expensive depending on how far across the fabric of > > interconnects the messages need to travel. While one can model this as a > > full 1-1 partitioning, that doesn't match all of the extant > > implementations, even today. It is easy, but an imperfect match to the > > underlying realities in many cases (though a very good match to x86, which > > is mostly what we care about). > > Even x86 already has a notion of multiple layers of cost. You can get that > today if you buy a 4 socket Intel system. It seems you might also get that if > you get a dual socket Haswell system with more than 8 cores per package (due > to the funky split-brain thing on higher core count Haswells). I believe AMD > also ships CPUs that contain 2 NUMA domains within a single physical package > as well. > > Note that the I/O thing is becoming far more urgent in the past few years on > x86. With Nehalem/Westmere having I/O being remote or local didn't seem to > matter very much (you could only measure very small differences in latency or > throughput between the two scenarios in my experience). On Romley (Sandy > Bridge) and later it can be a very substantial difference in terms of both > latency and throughput. This nicely augments my note of the unsuitability of the interface to return VM domain for the given device. I think that more correct is to return a bitset of the 'close enough' VM domains, where proximity is either explicitly asked by caller (like, belongs to, closer than two domains, etc) or just always return the best bitset. It would solve both the split proximity domains issue, and multi-uplink south bridge issue. Might be, it makes sense to add additional object layer of the HW proximity domain, which contain some set of VM domains, and function would return such HW proximity domain.
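A minimal sketch of what a "bitset of close enough domains" interface could look like, using the sys/bitset.h macros. The type, the function name and the cost cutoff are hypothetical placeholders for whatever SLIT-derived metric ends up being used, and the body simply falls back to the single-domain query until real distance data is parsed:

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/bus.h>
#include <sys/_bitset.h>
#include <sys/bitset.h>

#include "bus_if.h"             /* generated; for BUS_GET_DOMAIN() */

#define DEV_MAXDOMAINS  64                      /* upper bound on VM domains */
BITSET_DEFINE(dev_domainset, DEV_MAXDOMAINS);   /* defines struct dev_domainset */

/*
 * Hypothetical: fill 'mask' with every VM domain whose distance from
 * 'dev' is at or below 'maxcost' (a SLIT-style relative cost, where 10
 * means local).  Returns ENOENT if no locality data exists at all.
 */
static int
bus_get_domain_set(device_t dev, int maxcost, struct dev_domainset *mask)
{
        int domain;

        (void)maxcost;          /* unused until SLIT distances are parsed */
        BIT_ZERO(DEV_MAXDOMAINS, mask);

        /* Placeholder: report only the single closest domain for now. */
        if (BUS_GET_DOMAIN(device_get_parent(dev), dev, &domain) != 0)
                return (ENOENT);
        BIT_SET(DEV_MAXDOMAINS, domain, mask);
        return (0);
}

/* Callers with no locality data at all could treat every domain as local: */
static void
mydrv_pick_domains(device_t dev, struct dev_domainset *mask)
{
        if (bus_get_domain_set(dev, 10, mask) != 0)
                BIT_FILL(DEV_MAXDOMAINS, mask);
}

Whether the cutoff is a SLIT distance or a hop count matters less than the shape of the return value: a set rather than a single ID covers both the split proximity domain case and the multi-uplink south bridge case.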
From owner-freebsd-arch@FreeBSD.ORG Fri Oct 10 20:01:57 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 58D6EE78; Fri, 10 Oct 2014 20:01:57 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 30DC0BDF; Fri, 10 Oct 2014 20:01:57 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-70-85-31.nwrknj.fios.verizon.net [173.70.85.31]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id B96E1B94A; Fri, 10 Oct 2014 16:01:55 -0400 (EDT) From: John Baldwin To: Konstantin Belousov Subject: Re: [rfc] enumerating device / bus domain information Date: Fri, 10 Oct 2014 16:01:31 -0400 Message-ID: <4090343.RYS6GcFkXt@ralph.baldwin.cx> User-Agent: KMail/4.12.5 (FreeBSD/10.1-BETA2; KDE/4.12.5; amd64; ; ) In-Reply-To: <20141010180700.GS2153@kib.kiev.ua> References: <4435143.bthBSP8NlX@ralph.baldwin.cx> <20141010180700.GS2153@kib.kiev.ua> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Fri, 10 Oct 2014 16:01:55 -0400 (EDT) Cc: Adrian Chadd , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 10 Oct 2014 20:01:57 -0000 On Friday, October 10, 2014 09:07:00 PM Konstantin Belousov wrote: > On Fri, Oct 10, 2014 at 11:14:50AM -0400, John Baldwin wrote: > > Even x86 already has a notion of multiple layers of cost. You can get that > > today if you buy a 4 socket Intel system. It seems you might also get > > that if you get a dual socket Haswell system with more than 8 cores per > > package (due to the funky split-brain thing on higher core count > > Haswells). I believe AMD also ships CPUs that contain 2 NUMA domains > > within a single physical package as well. > > > > Note that the I/O thing is becoming far more urgent in the past few years > > on x86. With Nehalem/Westmere having I/O being remote or local didn't > > seem to matter very much (you could only measure very small differences > > in latency or throughput between the two scenarios in my experience). On > > Romley (Sandy Bridge) and later it can be a very substantial difference > > in terms of both latency and throughput. > > This nicely augments my note of the unsuitability of the interface to > return VM domain for the given device. I think that more correct is > to return a bitset of the 'close enough' VM domains, where proximity > is either explicitely asked by caller (like, belongs to, closer than > two domains, etc) or just always return the best bitset. It would > solve both the split proximity domains issue, and multi-uplink south > bridge issue. > > Might be, it makes sense to add additional object layer of the HW proximity > domain, which contain some set of VM domains, and function would return > such HW proximity domain. I know Jeff has some sort of structure he wants to use for describing NUMA policies. Perhaps that is something that can be reused. 
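For comparison, a policy description of that general shape might look something like the sketch below. This is only a guess at the kind of structure being referred to (an allowed-domain set, a selection strategy and an optional preferred domain), not the actual design:

#include <sys/_bitset.h>
#include <sys/bitset.h>

#define NPOL_MAXDOMAINS 64
BITSET_DEFINE(npol_domainset, NPOL_MAXDOMAINS);

/* How allocations are spread across the allowed domains. */
enum npol_strategy {
        NPOL_ROUNDROBIN,        /* stripe across the allowed set */
        NPOL_FIRSTTOUCH,        /* use the requesting CPU's domain */
        NPOL_PREFER,            /* try 'preferred' first, then the rest */
        NPOL_FIXED              /* never leave the allowed set */
};

/*
 * Hypothetical NUMA allocation policy, reusable for threads, memory
 * objects and devices alike.
 */
struct numa_policy {
        struct npol_domainset   allowed;        /* candidate domains */
        enum npol_strategy      strategy;       /* how to choose among them */
        int                     preferred;      /* domain for NPOL_PREFER */
};

Something of this shape would let the cpuset-style locality APIs mentioned below hand back a mask or a policy rather than a single domain number.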
However, we probably need to be further down the road to see what we actually need as our final interface here. In particular, I suspect we will have an orthogonal set of APIs to deal with CPU locality (i.e. Give me a cpuset of all CPUs in domain X or close to domain X, etc.). In as much as there are requests that are not bus-specific, I'd rather have drivers use those rather than having everything go through new-bus. (So that, for example, a multiqueue NIC driver could bind its queues to CPUs belonging to the same NUMA domain it is in rather than always using CPUs 0...N which is what all the Intel drivers do currently. Variations of this could also allow for more intelligent requests like "give me all CPUs close to N that are suitable for interrupts" which might include only one SMT thread per core.) Also, this is orthogonal to overloading the word "VM domain" to mean something that is a subset of a given NUMA domain. I think regardless that it probably makes sense to use a different term to describe more finely-grained partitions of NUMA domains. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Sat Oct 11 17:58:17 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 036FE6F5; Sat, 11 Oct 2014 17:58:17 +0000 (UTC) Received: from mail-la0-x236.google.com (mail-la0-x236.google.com [IPv6:2a00:1450:4010:c03::236]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 2AC8E338; Sat, 11 Oct 2014 17:58:15 +0000 (UTC) Received: by mail-la0-f54.google.com with SMTP id gm9so4906221lab.27 for ; Sat, 11 Oct 2014 10:58:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:date:message-id:subject:from:to:content-type; bh=k4aPXNhC9puvxX745aPjHytM9M3nuMibMpdbM6ns3Vo=; b=LKTAA4rcQL2O+pM+bR9ZtrwKQDzikqkT7dJuDyLqgkebiSm5xFl4MugHUwLQ9+lhqM PneTb06I32/F7No1of28HDkUAtg8s/CxVbtD7xVBI31PGeDZpZGsNdj8xABIYBez2cjK yK/SyJvLhii/nl4hswz4/TkvzKw28hGiE3hjq/PXR9So+dwdMhYdSlcpTqJUSBPlbU/S OLG6uJRUEAAVivMomxkvmYPDKYH3xXCcgljyDpHeH2Tb7DrLzJYCUpnOeRjk+FBydVUy UzxC5I6uV6tsZU5ImuygB8vjarOcKzW35bjBBDkVbQlsD/w1CuwtovjP75Cd/5gdAEdR n18w== MIME-Version: 1.0 X-Received: by 10.152.3.167 with SMTP id d7mr12815188lad.17.1413050293953; Sat, 11 Oct 2014 10:58:13 -0700 (PDT) Sender: crodr001@gmail.com Received: by 10.112.131.66 with HTTP; Sat, 11 Oct 2014 10:58:13 -0700 (PDT) Date: Sat, 11 Oct 2014 10:58:13 -0700 X-Google-Sender-Auth: maglq4jH-2LPOPBwiqzx-dileUY Message-ID: Subject: Enabling VIMAGE by default for FreeBSD 11? From: Craig Rodrigues To: freebsd-net@freebsd.org, "freebsd-virtualization@freebsd.org" , freebsd-arch Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 11 Oct 2014 17:58:17 -0000 Hi, What action items are left to enable VIMAGE by default for FreeBSD 11? Not everyone uses bhyve, so VIMAGE is quite useful when using jails. 
-- Craig From owner-freebsd-arch@FreeBSD.ORG Sat Oct 11 20:20:34 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 9AEDC9A; Sat, 11 Oct 2014 20:20:34 +0000 (UTC) Received: from mail.ipfw.ru (mail.ipfw.ru [IPv6:2a01:4f8:120:6141::2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 5CF6A2BA; Sat, 11 Oct 2014 20:20:34 +0000 (UTC) Received: from secured.by.ipfw.ru ([95.143.220.47] helo=[10.0.0.120]) by mail.ipfw.ru with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.82 (FreeBSD)) (envelope-from ) id 1Xcz9s-000CZl-CW; Sat, 11 Oct 2014 20:04:40 +0400 Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: Enabling VIMAGE by default for FreeBSD 11? From: "Alexander V. Chernikov" In-Reply-To: Date: Sun, 12 Oct 2014 00:20:30 +0400 Content-Transfer-Encoding: quoted-printable Message-Id: References: To: Craig Rodrigues X-Mailer: Apple Mail (2.1878.6) Cc: freebsd-net@freebsd.org, "freebsd-virtualization@freebsd.org" , freebsd-arch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 11 Oct 2014 20:20:34 -0000 On 11 Oct 2014, at 21:58, Craig Rodrigues wrote: > Hi, > > What action items are left to enable VIMAGE by default for FreeBSD 11? Are there any test results showing the performance implications on different network-related workloads? > > Not everyone uses bhyve, so VIMAGE is quite useful when using jails. > > -- > Craig > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" >