Date:      Wed, 20 Apr 2005 19:12:20 +0200
From:      Jilles Tjoelker <jilles@stack.nl>
To:        Brian Fundakowski Feldman <green@freebsd.org>
Cc:        freebsd-current@freebsd.org
Subject:   Re: NFS client/buffer cache deadlock
Message-ID:  <20050420171220.GB93623@stack.nl>
In-Reply-To: <20050420155233.GJ1157@green.homeunix.org>
References:  <20050419160258.GA12287@stack.nl> <20050419160900.GB12287@stack.nl> <20050419161616.GF1157@green.homeunix.org> <20050419204723.GG1157@green.homeunix.org> <20050420140409.GA77731@stack.nl> <20050420142448.GH1157@green.homeunix.org> <20050420143842.GB77731@stack.nl> <20050420152038.GI1157@green.homeunix.org> <20050420153528.GC77731@stack.nl> <20050420155233.GJ1157@green.homeunix.org>

On Wed, Apr 20, 2005 at 11:52:33AM -0400, Brian Fundakowski Feldman wrote:
> On Wed, Apr 20, 2005 at 05:35:28PM +0200, Marc Olzheim wrote:
> > On Wed, Apr 20, 2005 at 11:20:38AM -0400, Brian Fundakowski Feldman wrote:
> > > > Btw.: I'm not sure write(), writev() and pwrite() are allowed to do
> > > > short writes on regular files...?

> > > Our manpage is incorrect; POSIX states that they are (see earlier
> > > e-mail).  There really is no alternative -- we simply can't build
> > > an NFS transaction larger than our buffer cache can accommodate.
> > > Note that short writes won't happen for normal buffer sizes, only
> > > excessively large ones.  I really don't believe that writev() is meant
> > > to be used so that you can write gigantic data structures in a single
> > > transaction...

It is ok to return partial success if the first chunk of a large write
succeeded and a later chunk failed persistently, but not merely because
the write cannot be performed as a single NFS transaction.

> > Ah, I was reading the SUSv2 page:

> > http://www.opengroup.org/onlinepubs/009695399/functions/write.html

> > instead of the POSIX version.

> > But from neither of those can I deduce that it can return
> > with result < nbyte without that being a permanent condition.
> > What phrase makes you conclude that it can?

> This specific issue is not clear-cut; the best thing to do lies somewhere
> within the range of these scenarios:

> "If a write() requests that more bytes be written than there is room
> for (for example, [XSI] [Option Start] the process' file size limit
> or [Option End] the physical end of a medium), only as many bytes as
> there is room for shall be written. For example, suppose there is
> space for 20 bytes more in a file before reaching a limit. A write of
> 512 bytes will return 20. The next write of a non-zero number of bytes
> would give a failure return (except as noted below)."

This only applies to permanent conditions.

> "When attempting to write to a file descriptor (other than a pipe or
> FIFO) that supports non-blocking writes and cannot accept the data
> immediately:

>     * If the O_NONBLOCK flag is clear, write() shall block the calling
>     thread until the data can be accepted.

>     * If the O_NONBLOCK flag is set, write() shall not block the
>     thread. If some data can be written without blocking the thread,
>     write() shall write what it can and return the number of bytes
>     written. Otherwise, it shall return -1 and set errno to [EAGAIN]."

I think regular files do not support non-blocking writes, even if they
are on NFS; in any case, O_NONBLOCK is disabled by default.
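
To check the default on a given descriptor, something like the following
rough sketch would do. The helper name is_nonblocking() is made up for
illustration; it only uses fcntl(F_GETFL) and says nothing about whether
the underlying file actually honours the flag.

```c
#include <fcntl.h>

/*
 * is_nonblocking() is a hypothetical helper, not an existing API.
 * Returns 1 if O_NONBLOCK is set on fd, 0 if it is clear, and -1 if
 * fcntl() fails (errno is left as set by fcntl()).
 */
static int
is_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -1;
	return (flags & O_NONBLOCK) != 0;
}
```

A descriptor obtained from a plain open() without O_NONBLOCK should
report 0 here.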

> "[ENOBUFS] Insufficient resources were available in the system to
> perform the operation."

> I think the first is more useful behavior than the last.  Supporting it
> should be exactly the same as supporting what happens if the actual
> filesystem fills up.  In this case, the filesystem is being requested to
> write more "than there is room for."

The filesystem filling up is a totally different case, as attempting the
rest of the write is futile then.

In a lot of code, a short write() is treated as a (fairly) persistent
error.
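
For comparison, code that wants to cope with short writes has to loop
explicitly. A minimal sketch follows; write_all() is a hypothetical name,
and treating a zero return as EIO is my own assumption to keep the loop
from spinning, not something POSIX prescribes.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * write_all() is a hypothetical helper for illustration only.
 * It calls write() repeatedly until all len bytes have been written
 * or a real error occurs. Returns 0 on success, -1 on error with
 * errno set.
 */
static int
write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before any data was written */
			return -1;
		}
		if (n == 0) {
			/* Assumption: no progress is reported as an I/O error. */
			errno = EIO;
			return -1;
		}
		/* A short write is not an error here; resume after it. */
		p += n;
		len -= (size_t)n;
	}
	return 0;
}
```

Programs that call write() once and treat any result other than nbyte
as failure do not have such a loop, which is the concern here.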

-- 
Jilles Tjoelker


