Date:      Wed, 20 Apr 2005 13:28:39 -0400
From:      Brian Fundakowski Feldman <green@freebsd.org>
To:        Jilles Tjoelker <jilles@stack.nl>
Cc:        freebsd-current@freebsd.org
Subject:   Re: NFS client/buffer cache deadlock
Message-ID:  <20050420172839.GK1157@green.homeunix.org>
In-Reply-To: <20050420171220.GB93623@stack.nl>
References:  <20050419160900.GB12287@stack.nl> <20050419161616.GF1157@green.homeunix.org> <20050419204723.GG1157@green.homeunix.org> <20050420140409.GA77731@stack.nl> <20050420142448.GH1157@green.homeunix.org> <20050420143842.GB77731@stack.nl> <20050420152038.GI1157@green.homeunix.org> <20050420153528.GC77731@stack.nl> <20050420155233.GJ1157@green.homeunix.org> <20050420171220.GB93623@stack.nl>

On Wed, Apr 20, 2005 at 07:12:20PM +0200, Jilles Tjoelker wrote:
> On Wed, Apr 20, 2005 at 11:52:33AM -0400, Brian Fundakowski Feldman wrote:
> > On Wed, Apr 20, 2005 at 05:35:28PM +0200, Marc Olzheim wrote:
> > > On Wed, Apr 20, 2005 at 11:20:38AM -0400, Brian Fundakowski Feldman wrote:
> > > > > Btw., I'm not sure write(), writev() and pwrite() are allowed to do
> > > > > short writes on regular files...?
> 
> > > > Our manpage is incorrect; POSIX states that they are (see earlier
> > > > e-mail).  There really is no alternative -- we simply can't build
> > > > an NFS transaction larger than our buffer cache can accommodate.
> > > > Note that short writes won't happen for normal buffer sizes, only
> > > > excessively large ones.  I really don't believe that writev() is meant
> > > > to be used so that you can write gigantic data structures in a single
> > > > transaction...
> 
> It is ok to return partial success if the first chunk of a large write
> succeeded and a later chunk failed persistently, but not if it cannot be
> performed as a single NFS transaction.

What is your rationale for this?

> > > Ah, I was reading the SUSv2 page:
> 
> > > http://www.opengroup.org/onlinepubs/009695399/functions/write.html
> 
> > > instead of the POSIX version.
> 
> > > But in neither of those can I find anything saying that it can return
> > > with result < nbyte without that being a permanent condition.
> > > What phrase makes you conclude that it can?
> 
> > This specific issue is not clear-cut; the best thing to do lies somewhere
> > within the range of these scenarios:
> 
> > "If a write() requests that more bytes be written than there is room
> > for (for example, [XSI] [Option Start] the process' file size limit
> > or [Option End] the physical end of a medium), only as many bytes as
> > there is room for shall be written. For example, suppose there is
> > space for 20 bytes more in a file before reaching a limit. A write of
> > 512 bytes will return 20. The next write of a non-zero number of bytes
> > would give a failure return (except as noted below)."
> 
> This only applies to permanent conditions.
> 
> > "When attempting to write to a file descriptor (other than a pipe or
> > FIFO) that supports non-blocking writes and cannot accept the data
> > immediately:
> 
> >     * If the O_NONBLOCK flag is clear, write() shall block the calling
> >     thread until the data can be accepted.
> 
> >     * If the O_NONBLOCK flag is set, write() shall not block the
> >     thread. If some data can be written without blocking the thread,
> >     write() shall write what it can and return the number of bytes
> >     written. Otherwise, it shall return -1 and set errno to [EAGAIN]."
> 
> I think regular files do not support non-blocking writes, even if they
> are on NFS; in any case, O_NONBLOCK is disabled by default.

POSIX does not specify O_NONBLOCK semantics for regular files.  This
means we can do whatever is most useful.

> > "[ENOBUFS] Insufficient resources were available in the system to
> > perform the operation."
> 
> > I think the first is more useful behavior than the last.  Supporting it
> > should be exactly the same as supporting what happens if the actual
> > filesystem fills up.  In this case, the filesystem is being requested to
> > write more "than there is room for."
> 
> The filesystem filling up is a totally different case as attempting the
> rest of the write is futile in that case.

No, it isn't.  The filesystem may be not-full again soon, possibly
even what the program might consider "immediately".

> In a lot of code, a short write() is treated as a (fairly) persistent
> error.

I mentioned this several e-mails ago.  Plenty of software is also not
going to understand ENOBUFS.
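
For reference, a caller that wants to cope with short writes on any kind
of descriptor can simply loop until everything has been accepted.  This is
only a minimal sketch: write_all() is a hypothetical helper name, not an
existing libc function, and treating a zero-byte return as EIO is an
arbitrary choice made here to keep the loop from spinning.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: keep calling write() until all nbyte bytes have
 * been accepted.  Returns 0 on success, -1 with errno set on failure.
 */
static int
write_all(int fd, const void *buf, size_t nbyte)
{
	const char *p = buf;

	while (nbyte > 0) {
		ssize_t n = write(fd, p, nbyte);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, nothing written; retry */
			return (-1);		/* real error, errno set by write() */
		}
		if (n == 0) {
			/* No progress at all; assumption: report it as EIO. */
			errno = EIO;
			return (-1);
		}
		/* Short write: skip what was accepted and write the rest. */
		p += n;
		nbyte -= (size_t)n;
	}
	return (0);
}
```

Software written this way does not care whether the kernel splits a huge
request into several transactions; software that treats any short write()
as fatal does.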

-- 
Brian Fundakowski Feldman                           \'[ FreeBSD ]''''''''''\
  <> green@FreeBSD.org                               \  The Power to Serve! \
 Opinions expressed are my own.                       \,,,,,,,,,,,,,,,,,,,,,,\


