Date:      Mon, 11 Aug 2014 18:52:37 -0700
From:      Navdeep Parhar <np@FreeBSD.org>
To:        Adrian Chadd <adrian@freebsd.org>
Cc:        Alan Cox <alc@freebsd.org>, Victor Balada Diaz <victor@bsdes.net>, Sushanth Rai <sushanth_rai@yahoo.com>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject:   Re: Support for zero copy sockets
Message-ID:  <53E97365.6040405@FreeBSD.org>
In-Reply-To: <CAJ-VmomEU7MAB1m_%2BQv9PMD6Yv9PbezDG2ncS1h1cQ1n_yQn=A@mail.gmail.com>
References:  <1407171616.44440.YahooMailBasic@web181702.mail.ne1.yahoo.com>	<20140811082610.GF7828@equilibrium.bsdes.net>	<CAJ-VmonTYPz7qJ3WG52ADp69FdYQkKQ6D_DOM1piEybGwgOmWA@mail.gmail.com>	<CAJUyCcOmxwBqbtWUgVLw1z%2BbNzrd9jw76GYeKWwDgv17=4g-kw@mail.gmail.com>	<53E91578.3060209@FreeBSD.org> <CAJ-VmomEU7MAB1m_%2BQv9PMD6Yv9PbezDG2ncS1h1cQ1n_yQn=A@mail.gmail.com>

On 08/11/14 17:42, Adrian Chadd wrote:
> On 11 August 2014 12:11, Navdeep Parhar <np@freebsd.org> wrote:
>> There is zero copy receive (aka Direct Data Placement -- DDP) in the TOE
>> driver that accompanies cxgbe(4).  I have a tx zero copy implementation
>> for it as well (this is not in -current right now).  But all this code
>> is chip specific and applies only to TCP connections that are handled
>> by the TOE driver.  It doesn't rely on COW or page flipping.
>>
>> The reason I'm mentioning all of this here is that if anyone is thinking
>> of working on proper zero copy awareness (and APIs) at the socket layer
>> then count me in as an interested party.
>
> I'm not going to get into it just for now, as I have enough on my
> FreeBSD plate to do already.

I'm in the same situation.

>
> However, the thing that always irked me about the hardware based
> solutions is that they're great for a subset of problems - typically
> small sets of sockets. The real interesting problem for me is how to
> make it work for say, 500,000 or more concurrent TCP sessions.

The hardware based solutions that I'm familiar with can handle tens of
thousands of TCP sockets concurrently.  The protocol processing is
entirely on the chip, and when DDP is active the chip can DMA the payload
straight to its final destination -- typically a userspace buffer.  The
only VM operation involved is wiring and then unwiring the pages behind
the uio.

The complication is that the driver (cxgbe's t4_tom in this case) has
absolutely no idea what an application does (blocking read vs.
poll/select+read vs. aio_read vs. ...), so it makes some safe but
suboptimal choices.  It would be nice if there were an API (very vaguely
along the lines of madvise but for sockets, or maybe a sockopt knob)
that an application could use to provide hints about its behavior.  We
could also do with separate zero-copy flavors of the sosend/soreceive
usrreqs, and with more hints (per read/write operation) that might let
us avoid even the wire/unwire operation.
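
To make the hint idea a little more concrete, here is a rough sketch of
how a driver might turn such a hint into a receive strategy.  Everything
in it is hypothetical: the SO_RXHINT_* names, struct rx_plan, and
rx_plan_for_hint() do not exist in FreeBSD or in t4_tom, and the policy
(post the user buffer to the chip only when the buffer is known before
the data arrives) is an assumption made for illustration, not a
description of what the driver does today.

```c
#include <assert.h>
#include <stdbool.h>

/* HYPOTHETICAL hint values; no such socket option exists today. */
enum so_rx_hint {
	SO_RXHINT_NONE = 0,		/* application told us nothing */
	SO_RXHINT_BLOCKING_READ,	/* blocks in read() with a buffer ready */
	SO_RXHINT_POLL_THEN_READ,	/* waits in poll/select, reads afterwards */
	SO_RXHINT_AIO_READ		/* queues aio_read() buffers in advance */
};

/* HYPOTHETICAL result: what the driver would do for the next receive. */
struct rx_plan {
	bool post_user_buffer;	/* let the chip DMA straight into the buffer */
	bool keep_wired;	/* skip unwire/rewire between operations */
};

/*
 * Assumed policy: direct placement only pays off when the buffer is
 * known before the payload arrives.  With poll/select+read the data is
 * already sitting in the socket buffer by the time read() is called, so
 * the safe choice is the normal copy path.  buffer_reused stands in for
 * a per-operation hint that the same buffer will be used again.
 */
struct rx_plan
rx_plan_for_hint(enum so_rx_hint hint, bool buffer_reused)
{
	struct rx_plan plan = { false, false };

	assert(hint <= SO_RXHINT_AIO_READ);

	switch (hint) {
	case SO_RXHINT_BLOCKING_READ:
	case SO_RXHINT_AIO_READ:
		plan.post_user_buffer = true;
		plan.keep_wired = buffer_reused;
		break;
	case SO_RXHINT_POLL_THEN_READ:
	case SO_RXHINT_NONE:
	default:
		/* No usable information: stay on the safe copy path. */
		break;
	}
	return (plan);
}
```

With no hint at all the sketch falls back to the copy path, which is the
"safe but suboptimal" behavior mentioned above.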

Anyway, let's save this discussion for later, when either of us has the 
time to come up with a specific set of proposals for -net and -arch.

Regards,
Navdeep

>
> I can see a method of doing zero-copy writes to the network stack -
> look at what the AIO code does in the physical IO path for doing
> writes. It wires down the memory and stuffs it into the buffer.
>
> The thing I haven't yet sorted out is what to do about mappings in
> case kernel code wants to peek at the socket data payload for whatever
> reason.
>
> (And yes, reads are still a problem.)
>
>
>
> -a
>
