From owner-freebsd-net@FreeBSD.ORG  Wed Sep 26 04:54:09 2007
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 644EC16A41A
	for <freebsd-net@FreeBSD.org>; Wed, 26 Sep 2007 04:54:09 +0000 (UTC)
	(envelope-from jmg@hydrogen.funkthat.com)
Received: from hydrogen.funkthat.com (gate.funkthat.com [69.17.45.168])
	by mx1.freebsd.org (Postfix) with ESMTP id 2B14513C458
	for <freebsd-net@FreeBSD.org>; Wed, 26 Sep 2007 04:54:09 +0000 (UTC)
	(envelope-from jmg@hydrogen.funkthat.com)
Received: from hydrogen.funkthat.com (gkvdzsioz46jvzap@localhost.funkthat.com
	[127.0.0.1])
	by hydrogen.funkthat.com (8.13.6/8.13.3) with ESMTP id l8Q4s261083635; 
	Tue, 25 Sep 2007 21:54:02 -0700 (PDT)
	(envelope-from jmg@hydrogen.funkthat.com)
Received: (from jmg@localhost)
	by hydrogen.funkthat.com (8.13.6/8.13.3/Submit) id l8Q4s2hA083634;
	Tue, 25 Sep 2007 21:54:02 -0700 (PDT) (envelope-from jmg)
Date: Tue, 25 Sep 2007 21:54:01 -0700
From: John-Mark Gurney <gurney_j@resnet.uoregon.edu>
To: Hans Petter Selasky <hselasky@c2i.net>
Message-ID: <20070926045401.GB47467@funkthat.com>
Mail-Followup-To: Hans Petter Selasky <hselasky@c2i.net>,
	freebsd-arch@FreeBSD.org
References: <200709260131.49156.hselasky@c2i.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200709260131.49156.hselasky@c2i.net>
User-Agent: Mutt/1.4.2.1i
X-Operating-System: FreeBSD 5.4-RELEASE-p6 i386
X-Files: The truth is out there
X-URL: http://resnet.uoregon.edu/~gurney_j/
X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-3.0
	(hydrogen.funkthat.com [127.0.0.1]);
	Tue, 25 Sep 2007 21:54:03 -0700 (PDT)
X-Mailman-Approved-At: Wed, 26 Sep 2007 12:56:02 +0000
Cc: freebsd-arch@FreeBSD.org
Subject: Re: Request for feedback on common data backstore in the kernel
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: John-Mark Gurney <gurney_j@resnet.uoregon.edu>
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Sep 2007 04:54:09 -0000

Hans Petter Selasky wrote this message on Wed, Sep 26, 2007 at 01:31 +0200:
> Please keep me CC'ed, since I'm not on all these lists.
> 
> In the kernel we currently have two different data backstores:
> 
> struct mbuf
> 
> and 
> 
> struct buf
> 
> These two backstores serve two different device types. "mbufs" are for network 
> devices and "buf" is for disk devices.

I don't see how this relates to the rest of your email, but even though
they are used similarly, their normal size is quite different...  mbufs
normally contain 64-256 byte packets, w/ large file transfers attaching
a 2k cluster (which comes from a different pool than the core mbuf) to
the mbuf...  buf is usually something like 16k-64k...

> Problem:
> 
> The current backstores are loaded into DMA by using the BUS-DMA framework. 
> This appears not to be too fast according to Kip Macy. See:
> 
> http://perforce.freebsd.org/chv.cgi?CH=126455

This only works on x86/amd64 because of the direct mapped memory that
they support..  This would completely break arches like sparc64 that
require an iommu to translate the addresses...  and also doesn't address
keeping the buffers in sync on arches like arm...   sparc64 may have many
gigs of memory, but only a 2GB window for mapping main memory...

It sounds like the x86/amd64 bus_dma implementation needs to be improved
to run more quickly...  As w/ all things, you can hardcode stuff, but then
you lose portability...

> Some ideas I have:
> 
> When a buffer is out of range for a hardware device and a data-copy is 
> needed I want to simply copy that data in smaller parts to/from a 
> pre-allocated bounce buffer. I want to avoid allocating this buffer 
> when "bus_dmamap_load()" is called.
> 
> For pre-allocated USB DMA memory I currently have:
> 
> struct usbd_page
> 
> struct usbd_page {
>         void                    *buffer; // virtual address
>         bus_size_t              physaddr; // as seen by one of my devices
>         bus_dma_tag_t           tag;
>         bus_dmamap_t            map;
>         uint32_t                length;
> };
> 
> Mostly only "length == PAGE_SIZE" is allowed. When USB allocates DMA memory it 
> allocates the same size all the way and that is PAGE_SIZE bytes.

I could see attaching preallocated memory to a tag, and having maps
that attempt to use this memory, but that's something else...
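
For concreteness, the chunked copy through a preallocated bounce buffer
described in the quoted text could look roughly like the sketch below.
This is plain userland C, not the bus_dma API: BOUNCE_SIZE, struct bounce,
bounce_copy_out() and the sink callback are all hypothetical names made up
for illustration, and the callback just stands in for "hand this chunk to
the hardware"...

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 4096	/* hypothetical: one page, allocated up front */

/* Hypothetical stand-in for "hand this chunk to the device". */
typedef void (*chunk_fn)(const void *chunk, size_t len, void *arg);

struct bounce {
	unsigned char	buf[BOUNCE_SIZE];	/* never allocated at load time */
};

/* Toy "device" that records what it was given, for demonstration only. */
struct sink {
	unsigned char	data[3 * BOUNCE_SIZE];
	size_t		len;
	size_t		max_chunk;
};

static void
sink_append(const void *chunk, size_t len, void *arg)
{
	struct sink *s = arg;

	assert(s->len + len <= sizeof(s->data));
	memcpy(s->data + s->len, chunk, len);
	s->len += len;
	if (len > s->max_chunk)
		s->max_chunk = len;
}

/*
 * Push len bytes from src through the preallocated bounce buffer in
 * pieces of at most BOUNCE_SIZE bytes.  Returns the number of chunks.
 */
static size_t
bounce_copy_out(struct bounce *b, const void *src, size_t len,
    chunk_fn fn, void *arg)
{
	const unsigned char *p = src;
	size_t n, chunks = 0;

	assert(b != NULL && fn != NULL);
	while (len > 0) {
		n = len < BOUNCE_SIZE ? len : BOUNCE_SIZE;
		memcpy(b->buf, p, n);	/* copy into memory the device can reach */
		fn(b->buf, n, arg);	/* device only ever sees the bounce buffer */
		p += n;
		len -= n;
		chunks++;
	}
	return (chunks);
}
```

Note that this only shows the copy loop; it does nothing about iommu
mappings or cache syncing, which is what bus_dma has to handle on the
other arches...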

> If two different PCI controllers want to communicate directly passing DMA 
> buffers, technically one would need to translate the physical address for 
> device 1 to the physical address as seen by device 2. If this translation 
> table is sorted, the search will be rather quick. Another approach is to 
> limit the number of translations:
> 
> #define N_MAX_PCI_TRANSLATE 4
> 
> struct usbd_page {
>         void                    *buffer; // virtual address
>         bus_size_t              physaddr[N_MAX_PCI_TRANSLATE];
>         bus_dma_tag_t           tag;
>         bus_dmamap_t            map;
>         uint32_t                length;
> };
> 
> Then PCI device 1 on bus X can use physaddr[0] and PCI device 2 on bus Y can 
> use physaddr[1]. If the physaddr[] is equal to some magic then the DMA buffer 
> is not reachable and must be bounced.
> 
> Then when two PCI devices talk together all they need to pass is a structure 
> like this:
> 
> struct usbd_page_cache {
>         struct usbd_page        *page_start;
>         uint32_t                page_offset_buf;
>         uint32_t                page_offset_end;
> };
> 
> And the required DMA address is looked up in some nanos.
> 
> Has someone been thinking about this topic before ?

There is no infrastructure to support passing dma addresses between hardware
devices, and this is completely unrelated to the issues raised above...  This
requires the ability to pass in a map to a tag and create a new map...
It is possible, as on the sun4v where you have two iommu's..  You'd have
to program one iommu to point to the other one, to support that...  But
it is rare to see devices dma directly to each other...  You usually
end up dma'ing to main memory, and then having the other device dma it
out of memory..  The only time you need to dma between devices is if one
has local memory, and the other device is able to sanely populate it...
This is very rare...

Also, the list of PCI buses can get quite long.. With PCIe, each device is
now its own PCI bus, so you're starting to see PCI bus counts in the
10's and 20's, if not higher..  having an array of all of those, and
calculating them and filling them out sounds like a huge expense...
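
The sorted-table variant mentioned in the quoted text would at least avoid
a fixed array slot per bus.  A rough sketch of that lookup, using the
standard bsearch(3), might be the following; struct bus_xlate, the bus_id
field, xlate_lookup() and the PHYS_UNREACHABLE magic are hypothetical names
for illustration and do not exist anywhere in the tree...

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PHYS_UNREACHABLE UINT64_MAX	/* hypothetical "must bounce" magic */

struct bus_xlate {
	uint32_t	bus_id;		/* hypothetical PCI bus identifier */
	uint64_t	physaddr;	/* page address as seen from that bus */
};

/* bsearch(3) comparator: key is a bus id, elem is a table entry. */
static int
xlate_cmp(const void *key, const void *elem)
{
	uint32_t id = *(const uint32_t *)key;
	const struct bus_xlate *x = elem;

	return ((id > x->bus_id) - (id < x->bus_id));
}

/*
 * Look up the address of a page as seen from bus_id.  The table must be
 * sorted by bus_id.  Returns PHYS_UNREACHABLE if there is no entry, i.e.
 * the page cannot be reached from that bus and has to be bounced.
 */
static uint64_t
xlate_lookup(const struct bus_xlate *tab, size_t n, uint32_t bus_id)
{
	const struct bus_xlate *x;

	assert(tab != NULL || n == 0);
	x = bsearch(&bus_id, tab, n, sizeof(*tab), xlate_cmp);
	return (x != NULL ? x->physaddr : PHYS_UNREACHABLE);
}
```

The lookup itself is cheap (O(log n) per page), but somebody still has to
compute and fill in an entry per reachable bus for every page, which is
the expense I'm worried about...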

I'm a bit puzzled as to what you wanted to solve, as the problem you
stated doesn't relate to the solutions you were thinking about...  Maybe
I'm missing something?  Can you give me an example of where cxgb is
writing to the memory on another pci bus, and not main memory?

P.S. I redirected to -arch as this seems more related than the other
lists...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."