Date:      Tue, 27 May 2003 22:54:53 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Igor Sysoev <is@rambler-co.ru>
Cc:        arch@freebsd.org
Subject:   Re: sendfile(2) SF_NOPUSH flag proposal
Message-ID:  <3ED44F2D.DAF1FA08@mindspring.com>
References:  <Pine.BSF.4.21.0305272137250.49494-100000@is>

Igor Sysoev wrote:
> How do you suppose to coalesce the file pages?  Wire two or more pages
> to mbufs at once?

It's done by the network driver, using the network card's
scatter/gather DMA.


> Terry, I do not understand you.
> My argument is simple - I want to avoid the partial packets because it
> decreases the number of packets.  That's all.  There's nothing about
> amortized cost or total cost.  I do not even know what they are.

The total cost is the total overhead in packets to send a
given amount of data.  For a small amount of data, the total
cost is small, compared to the overhead involved in sending
the ethernet, IP, and TCP headers.

The amortized cost is how much an extra packet costs you to
send, relative to what you have to send anyway.  If you have
a lot of data to send, sending an extra packet or two is really
not very costly, since it's just one more packet out of hundreds.

If you argue there's a tiny amount of data, then the total
cost is important.

If you argue there's a lot of data, then the amortized cost
is important.

When you talk about extra packets being sent, you can't claim
that the amortized cost is important for a small amount of data,
or that the total cost is important for a huge amount of data.

Your focus on the number of packets, rather than on your ability
to move a total amount of data at or near the theoretical maximum,
makes no sense.
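
To put rough numbers on the distinction, here is a minimal C sketch.
The figures are assumptions, not something from this thread: a
1460-byte MSS and 54 bytes of Ethernet, IP, and TCP headers per
packet, with no options.  The function names are made up for
illustration.

```c
/* Assumed figures: 1460-byte MSS; 14 + 20 + 20 = 54 header bytes
 * (Ethernet + IP + TCP, no options) on every packet. */
enum { MSS = 1460, HDR_BYTES = 54 };

/* Fewest packets that can carry `payload` bytes (full packets plus
 * at most one partial packet at the end). */
static unsigned long min_packets(unsigned long payload)
{
    return (payload + MSS - 1) / MSS;
}

/* "Total cost": share of the bytes on the wire that are headers when
 * `payload` bytes are sent in `packets` packets. */
static double overhead_fraction(unsigned long payload, unsigned long packets)
{
    unsigned long hdr = packets * HDR_BYTES;

    return (double)hdr / (double)(hdr + payload);
}

/* "Amortized cost": `extra` packets relative to the packets that had
 * to be sent anyway.  Only meaningful for payload > 0. */
static double extra_packet_share(unsigned long payload, unsigned long extra)
{
    return (double)extra / (double)min_packets(payload);
}
```

Under those assumptions, 100 bytes of payload in one packet is about
35% header overhead, while for 1,000,000 bytes one extra packet is
one more out of 685, well under 1%.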


> > Actually, in this case, I'd just try to fix sendfile(2) to
> > do the packet coalescing I'd expect, given the relative
> > state of the TCP_NODELAY and TCP_NOPUSH option flags.
> 
> Actually, sendfile() already works according to TCP_NOPUSH flag.
> I do not know about TCP_NODELAY - I do not work with it.
> But if you turn TCP_NOPUSH on then sendfile() will send the full packets.
> If you turn TCP_NOPUSH off then sendfile() will send some packets partially
> filled. It's correct.

Sending some packets partially filled, instead of just the
last packet in a series partially filled, is *wrong*, IMO.


> > BTW: I'm still wary of the initial fault on the file data, if
> > it's not already in cache: arguably, it's better to start
> > sending the headers, and avoid the startup latency of delaying
> > sending the headers until the fault is satisfied: part of the
> > thing that's going to be eating your PCI bandwidth is the
> > disk I/O, and your disks are going to be the slowest data
> > sources/sinks in the whole equation.
> 
> I agree but after all it's 20ms or so delay.

Plus the delay for the NETISR.


> > In any case, I expect that this should be handled in the
> > context of TCP_NODELAY and TCP_NOPUSH, rather than by adding
> > options to work around an arguably broken sendfile(2).
> 
> sendfile() already works nice with TCP_NOPUSH.  I propose only the flags
> that allow to turn TCP_NOPUSH (actually TF_NOPUSH) on/off inside sendfile().
> Then in one syscall you can turn TCP_NOPUSH on, send the HTTP header, the file
> pages and turn TCP_NOPUSH off if all file pages are wired to mbuf's.
> And this TCP_NOPUSH state is not bound by sendfile() internals, you
> can control it via setsockopt/getsockopt(TCP_NOPUSH).

You're wrong about what TCP_NOPUSH is for; it's only for the
last packet of one system call being concatenated with the
first packet of another, to save partial packets between separate
system calls.

When you call sendfile with a file, headers, and trailers, you
are making *only one system call*.

"man 4 tcp" tells us:

     TCP_NOPUSH    By convention, the sender-TCP will set the ``push'' bit and
                   begin transmission immediately (if permitted) at the end of
                   every user call to write(2) or writev(2).  The TCP_NOPUSH
                   option is provided to allow servers to easily make use of
                   Transaction TCP (see ttcp(4)).  When the option is set to a
                   non-zero value, TCP will delay sending any data at all
                   until either the socket is closed, or the internal send
                   buffer is filled.

FWIW, here's what it tells us about TCP_NODELAY:

     TCP_NODELAY   Under most circumstances, TCP sends data when it is pre-
                   sented; when outstanding data has not yet been acknowl-
                   edged, it gathers small amounts of output to be sent in a
                   single packet once an acknowledgement is received.  For a
                   small number of clients, such as window systems that send a
                   stream of mouse events which receive no replies, this pack-
                   etization may cause significant delays.  The boolean option
                   TCP_NODELAY defeats this algorithm.

IMO, sendfile(2) should be acting the way you want it to act
simply because you have *NOT* set TCP_NODELAY.
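
For what it's worth, whether TCP_NODELAY is set on a given socket
can be checked with getsockopt(2).  A minimal sketch; the helper
names are hypothetical, and a freshly created TCP socket is assumed
to have the option off, which is the usual default:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Returns 1 if TCP_NODELAY is set on `fd`, 0 if it is not,
 * and -1 if getsockopt(2) fails. */
static int nodelay_is_set(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) == -1)
        return -1;
    return value != 0;
}

/* Sets (on != 0) or clears (on == 0) TCP_NODELAY on `fd`.
 * Returns 0 on success, -1 on error. */
static int set_nodelay(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```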

If you *do* set TCP_NOPUSH, then it should delay sending the
last partial packet until the timer fires, or until you write(2),
writev(2), sendfile(2), or send/sendto/sendmsg(2) more data.

NOTE: TCP_NOPUSH *specifically* mentions writev(2), which, like
sendfile(2), takes data from multiple discrete buffers and sends
it.
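
To illustrate the "multiple discrete buffers, one system call"
point, here is a minimal writev(2) sketch.  The function name and
the header/body split are made up for illustration; this shows
writev(2) only, not what sendfile(2) does internally:

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hands a header and a body, held in two separate buffers, to the
 * kernel in a single writev(2) call.  Returns the number of bytes
 * written, or -1 on error.  A short write is possible and is left
 * to the caller to handle. */
static ssize_t send_header_and_body(int fd, const char *header,
                                    const char *body)
{
    struct iovec iov[2];

    /* iov_base is not const-qualified, so the casts are needed;
     * writev(2) only reads from the buffers. */
    iov[0].iov_base = (void *)header;
    iov[0].iov_len = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = strlen(body);

    return writev(fd, iov, 2);
}
```

Because both buffers arrive in the same call, the stack sees them
as one run of data, with no system call boundary in the middle.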

Make sense now?  You think sendfile(2) needs options; I think
sendfile(2) is broken.

-- Terry


