From owner-freebsd-net@FreeBSD.ORG Wed Sep 26 04:54:09 2007
Date: Tue, 25 Sep 2007 21:54:01 -0700
From: John-Mark Gurney
To: Hans Petter Selasky
Cc: freebsd-arch@FreeBSD.org
Subject: Re: Request for feedback on common data backstore in the kernel
Message-ID: <20070926045401.GB47467@funkthat.com>
In-Reply-To: <200709260131.49156.hselasky@c2i.net>
References: <200709260131.49156.hselasky@c2i.net>
List-Id: Networking and TCP/IP with FreeBSD

Hans Petter Selasky wrote this message on Wed, Sep 26, 2007 at 01:31 +0200:
> Please keep me CC'ed, hence I'm not on all these lists.
>
> In the kernel we currently have two different data backstores:
>
> struct mbuf
>
> and
>
> struct buf
>
> These two backstores serve two different device types. "mbufs" are for network
> devices and "buf" is for disk devices.

I don't see how this relates to the rest of your email, but even though
they are used similarly, their normal size is quite different...  mbufs
normally contain 64-256 byte packets, w/ large file transfers attaching
a 2k cluster (which comes from a different pool than the core mbuf) to
the mbuf...  buf is usually something like 16k-64k...

> Problem:
>
> The current backstores are loaded into DMA by using the BUS-DMA framework.
> This appears not to be too fast according to Kip Macy. See:
>
> http://perforce.freebsd.org/chv.cgi?CH=126455

This only works on x86/amd64 because of the direct mapped memory that
they support..  This would completely break arches like sparc64 that
require an iommu to translate the addresses...  and also doesn't address
keeping the buffers in sync on arches like arm...  sparc64 may have many
gigs of memory, but only a 2GB window for mapping main memory...

It sounds like the x86/amd64 bus_dma implementation needs to be improved
to run more quickly...

As w/ all things, you can hardcode stuff, but then you lose portability...

> Some ideas I have:
>
> When a buffer is out of range for a hardware device and a data-copy is
> needed I want to simply copy that data in smaller parts to/from a
> pre-allocated bounce buffer. I want to avoid allocating this buffer
> when "bus_dmamap_load()" is called.
>
> For pre-allocated USB DMA memory I currently have:
>
> struct usbd_page
>
> struct usbd_page {
>   void *buffer; // virtual address
>   bus_size_t physaddr; // as seen by one of my devices
>   bus_dma_tag_t tag;
>   bus_dmamap_t map;
>   uint32_t length;
> };
>
> Mostly only "length == PAGE_SIZE" is allowed. When USB allocates DMA memory it
> allocates the same size all the way and that is PAGE_SIZE bytes.

I could see attaching preallocated memory to a tag, and having maps that
attempt to use this memory, but that's something else...

> If two different PCI controllers want to communicate directly passing DMA
> buffers, technically one would need to translate the physical address for
> device 1 to the physical address as seen by device 2. If this translation
> table is sorted, the search will be rather quick. Another approach is to
> limit the number of translations:
>
> #define N_MAX_PCI_TRANSLATE 4
>
> struct usbd_page {
>   void *buffer; // virtual address
>   bus_size_t physaddr[N_MAX_PCI_TRANSLATE];
>   bus_dma_tag_t tag;
>   bus_dmamap_t map;
>   uint32_t length;
> };
>
> Then PCI device 1 on bus X can use physaddr[0] and PCI device 2 on bus Y can
> use physaddr[1]. If the physaddr[] is equal to some magic then the DMA buffer
> is not reachable and must be bounced.
>
> Then when two PCI devices talk together all they need to pass is a structure
> like this:
>
> struct usbd_page_cache {
>   struct usbd_page *page_start;
>   uint32_t page_offset_buf;
>   uint32_t page_offset_end;
> };
>
> And the required DMA address is looked up in some nanos.
>
> Has someone been thinking about this topic before ?

There is no infrastructure to support passing dma addresses between
hardware devices, and it is completely unrelated to the issues raised
above...  This requires the ability to pass in a map to a tag and create
a new map...  It is possible, as on the sun4v where you have two
iommu's..  You'd have to program one iommu to point to the other one, to
support that...
But it is rare to see devices dma directly to each other...  You usually
end up dma'ing to main memory, and then having the other device dma it
out of memory..  The only time you need to dma between devices is if one
has local memory, and the other device is able to sanely populate it...
This is very rare...

Also, the list of PCI buses can get quite long..  With PCIe, each device
is now its own PCI bus, so you're starting to see PCI bus counts in the
10's and 20's, if not higher..  having an array of all of those, and
calculating them and filling them out sounds like a huge expense...

I'm a bit puzzled as to what you wanted to solve, as the problem you
stated doesn't relate to the solutions you were thinking about...  Maybe
I'm missing something?  Can you give me an example of where cxgb is
writing to the memory on another pci bus, and not main memory?

P.S. I redirected to -arch as this seems more related than the other
lists...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."