FreeBSD Mail Archives

Date:      Fri, 08 Mar 2002 10:52:09 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Nate Williams <nate@yogotech.com>
Cc:        Julian Elischer <julian@elischer.org>, Poul-Henning Kamp <phk@critter.freebsd.dk>, arch@FreeBSD.ORG
Subject:   Re: Contemplating THIS change to signals. (fwd)
Message-ID:  <3C890859.4FB4F9D@mindspring.com>
References:  <15496.23508.148366.980354@caddis.yogotech.com> <Pine.BSF.4.21.0203080017330.46841-100000@InterJet.elischer.org> <15496.58430.16748.970354@caddis.yogotech.com>

Nate Williams wrote:
> > You'd be surprised then because once the send() is done, the network IO
> > will happen independently of the process.
> 
> I'm more thinking of send.  Once the send() system call has queued the
> data for sending, it's been 'sent' (ie; the stack has it, and will
> 'DTRT' with it).

The amount it queues onto the sockbuf is limited to the
available space.  Thus a send of a very large amount of
data means that only part of it gets queued for sending.

This is not a restartable situation, unless restart can
pick up where it left off, since the sending of a partial
load of data can modify peer state, as well, and that
state is not under the control of the user process on
your end.

So just returning an EINTR for a send (or sendfile, etc.)
means that the peer state is unknown.

At that point, you must abandon the connection.
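
As an illustration of the partial-queueing point above, here is a
minimal sketch, assuming a stream socket and plain POSIX send().
The helper name send_all and its out-parameter are hypothetical,
not anything in FreeBSD.  It keeps a running count, so a caller
that gets EINTR at least knows how much the local stack accepted.
The count says nothing about what the peer has received.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not a FreeBSD API): try to send len bytes and
 * always report in *sent how many bytes the socket layer accepted,
 * even when the call stops early.
 * Returns 0 when everything was queued, -1 with errno set otherwise.
 */
static int
send_all(int s, const void *buf, size_t len, size_t *sent)
{
	const char *p = buf;

	*sent = 0;
	while (*sent < len) {
		/* send() may queue only part of the request. */
		ssize_t n = send(s, p + *sent, len - *sent, 0);

		if (n < 0)
			return (-1);	/* includes EINTR; *sent marks the stop point */
		*sent += (size_t)n;
	}
	return (0);
}
```

A caller that sees -1 with errno set to EINTR could call again
with buf + sent and len - sent, which is the "pick up where it
left off" case.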

While HTTP is tolerant to connection abandonment (and, if
both the client and the server support ranges, can even
recover automatically), things like FTP servers are not
(an abandoned connection will not result in an automatic
"reget" or "reput" for most every FTP).

> > this is no different.
> 
> Except for read() or recvfrom() system calls, and potentially things
> like 'sendfile()'.  Also, write() may behave differently (since write
> involve disk writing, not network writing).

Yes.  Sendfile is very sensitive, since it is a loop to
fill the socket buffer up to its limit (as well as the send
window), and interrupting this loop without saving the
current state damages it.

Actually, this sort of begs that the sendfile interface be
modified to take a context structure, which is updated, so
that it can be resumed when interrupted.  The context at the
time of interrupt would need to reflect the reality of the
data that has been sent.
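
A rough userland sketch of what such a context could look like,
assuming a pread()/write() copy loop stands in for the in-kernel
sendfile loop.  The names struct xfer_ctx and xfer_resume are
hypothetical and are not part of any real sendfile() interface.
Treating an unexpected end of file as EIO is also an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical context: records how far the transfer has gotten. */
struct xfer_ctx {
	off_t	offset;		/* next file offset to read from */
	size_t	remaining;	/* bytes still to be handed off */
};

/*
 * Copy from file fd to descriptor s.  ctx is updated after every
 * successful write, so it always reflects what was actually handed
 * off.  Returns 0 when done, -1 with errno set (for example EINTR)
 * when stopped early; calling again with the same ctx resumes.
 */
static int
xfer_resume(int fd, int s, struct xfer_ctx *ctx)
{
	char buf[4096];

	while (ctx->remaining > 0) {
		size_t want = ctx->remaining < sizeof(buf) ?
		    ctx->remaining : sizeof(buf);
		ssize_t got = pread(fd, buf, want, ctx->offset);

		if (got < 0)
			return (-1);
		if (got == 0) {
			errno = EIO;	/* assumed policy: file shorter than expected */
			return (-1);
		}
		ssize_t put = write(s, buf, (size_t)got);

		if (put < 0)
			return (-1);
		/* Advance only by what was written; a short write re-reads. */
		ctx->offset += put;
		ctx->remaining -= (size_t)put;
	}
	return (0);
}
```

Because the context only moves forward by the amount written, an
interrupted call leaves it pointing at the first byte not yet
handed off, and a second call continues from there.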

For the read/recvfrom, this also kind of begs for a "recvfile",
since there's no way you can modify them without futzing with
the POSIX-ly correctness of the interface.

> > from the time you do the ^Z to the time the syscall thinks of returning is
> > how long? If you say 3 seconds then all that is different is that in my
> > case the data has been taken off the queue but previously it would have
> > still been on the queue, but since the process is stopped,
> > who can tell?
> 
> A lot can happen in 3 seconds. :)

Or not happen.  The dichotomy between a gigabit link on a server
and a 28k link on a dialup client damages a lot of the end-to-end
assumptions that interrupting with EINTR tries to make, since it
ignores the idea of a pool retention that is out of the control of
the sender, once the send is initiated.

For local disk, the problem is less (or at least, can be made less,
if you want to hack up uiomove and the write path), because you can
guarantee the relative atomicity of the operations.  If they are
initiated in block size increments on block boundaries, you can
actually make a 100% guarantee (some code mods to the current code
are required, but they are pretty trivial).  You won't avoid the
page-in-before-write-out in all cases, but you can avoid the
partial-write-complete-and-interrupted-leaving-indeterminate-state
case.
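
A userland analogy of the block-increment idea, as a sketch only:
the real guarantee would have to come from the kernel write path
(uiomove and friends), which this does not touch.  The helper
name write_blocks and the caller-supplied blksz are hypothetical.
It refuses requests that do not start and end on a block boundary
and reports progress in whole blocks.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes at offset off, one block per
 * pwrite() call.  Requests that are not block aligned are rejected
 * with EINVAL.  *done counts bytes in completed blocks, so a caller
 * stopped by an error knows the last whole block that went out.
 */
static int
write_blocks(int fd, const char *buf, size_t len, off_t off,
    size_t blksz, size_t *done)
{
	*done = 0;
	if (blksz == 0 || off < 0 || off % (off_t)blksz != 0 ||
	    len % blksz != 0) {
		errno = EINVAL;
		return (-1);
	}
	while (*done < len) {
		ssize_t n = pwrite(fd, buf + *done, blksz,
		    off + (off_t)*done);

		if (n < 0)
			return (-1);
		if ((size_t)n != blksz) {
			errno = EIO;	/* short block write: not block-clean */
			return (-1);
		}
		*done += blksz;
	}
	return (0);
}
```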

> > In fact if the data was already present then sleep() would
> > have never been called, so the blocking would (even now) happen
> > at the user boundary. All I'm doing is making it consistent.
> 
> Agreed.

Actually, this isn't true.  The wait for the window drain on a
socket write *does* result in a sleep.  Also the wait for a
subsequent page-in for a write spanning two pages without a
cluster adjacency on disk.

> Sorry, I meant 'kernel context' above.  My bad.  I'll repeat.
> 
> I'm still not getting a warm fuzzy that allowing the kernel context to
> complete and then block at the userland boundary is a good idea.  I'm
> not saying it's a bad idea, but I'm almost positive there are gremlins
> hiding in the details here. :)

What are very large gremlins called... "goblins"?

If they're there, then they're big.  Of course, the only
way to find out is to stop hypothesizing, and go look...

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3C890859.4FB4F9D>
