Date:      Fri, 28 Oct 2011 14:25:59 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        arch@freebsd.org
Subject:   [PATCH] fadvise(2) system call
Message-ID:  <201110281426.00013.jhb@freebsd.org>

I have been working for the last week or so on a patch to add an fadvise(2) 
system call.  It is somewhat similar to madvise(2) except that it operates on 
a file descriptor instead of a memory region.  It also only really makes sense 
for regular files and does not apply to other file descriptor types.

Just as with madvise(2) there are two types of advice that can be given.  One 
set specifies the access pattern for a specific region of the file while the 
second set results in immediate action.  The first set consists of FADV_NORMAL, 
FADV_SEQUENTIAL, FADV_RANDOM, and FADV_NOREUSE.  For these operations what I 
have done is to add an optional "advice region" to a file descriptor.  When a 
read(2) or write(2) is performed on a file, if the requested region falls 
completely within an active "advice region", then the associated advice is 
used to modify the IO_* flags passed down with the request.  FADV_NORMAL just 
uses the current IO_* flags including using sequential_heuristic() to 
determine the amount of read-ahead and/or clustering to perform.  FADV_RANDOM 
always passes a sequential count of zero to prevent read-ahead.  
FADV_SEQUENTIAL is the same as FADV_NORMAL for now (perhaps it should always 
be setting the maximum sequential count?).  FADV_NOREUSE passes a sequential 
count of zero and sets IO_DIRECT (as if the operation were performed on a file 
opened with O_DIRECT).
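
As a rough illustration of the rule described above, here is a minimal 
userland sketch of the containment check and the per-advice adjustments.  It 
is not code from the patch: the struct, enum, function names, and the 
SK_IO_DIRECT value are all hypothetical stand-ins for the real file 
descriptor state and IO_* flags, and the "heuristic" count is simply passed 
in rather than computed by sequential_heuristic().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the kernel uses FADV_* and IO_* definitions. */
enum sk_advice { SK_NORMAL, SK_SEQUENTIAL, SK_RANDOM, SK_NOREUSE };
#define SK_IO_DIRECT 0x0010	/* illustrative value only */

/* The single optional "advice region" attached to a descriptor. */
struct sk_advice_region {
	bool		active;
	int64_t		start;	/* first byte covered */
	int64_t		end;	/* last byte covered (inclusive) */
	enum sk_advice	advice;
};

struct sk_io_hints {
	int	seqcount;	/* sequential count used for read-ahead */
	int	ioflags;	/* flags passed down with the request */
};

/* True only if [off, off + len) lies completely inside the region. */
static bool
sk_region_covers(const struct sk_advice_region *r, int64_t off, int64_t len)
{
	if (!r->active || len <= 0)
		return (false);
	if (off < r->start || off > r->end)
		return (false);
	return (len - 1 <= r->end - off);
}

/* Adjust the hints for one read/write request according to the advice. */
static struct sk_io_hints
sk_apply_advice(const struct sk_advice_region *r, int64_t off, int64_t len,
    int heuristic_seqcount, int ioflags)
{
	struct sk_io_hints h = { heuristic_seqcount, ioflags };

	if (!sk_region_covers(r, off, len))
		return (h);		/* outside the region: no change */
	switch (r->advice) {
	case SK_NORMAL:
	case SK_SEQUENTIAL:		/* same as NORMAL for now */
		break;
	case SK_RANDOM:
		h.seqcount = 0;		/* suppress read-ahead */
		break;
	case SK_NOREUSE:
		h.seqcount = 0;
		h.ioflags |= SK_IO_DIRECT;	/* as if opened O_DIRECT */
		break;
	}
	return (h);
}
```

Note that a request which only partially overlaps the region is treated as 
if no advice had been given, which matches the "falls completely within" 
wording above.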

To simplify the implementation, only a single "advice region" is maintained 
for now (unlike madvise(2) which will split up vm map entries if necessary to 
ensure all requests are honored).  Since the advice is only advisory, I think 
this is an ok approach for now.  If we really had a valid use case, we could 
maybe add a list of advice regions, but then you have to deal with possibly 
splitting up read(2) or write(2) requests that span multiple advice regions, 
etc.  I didn't feel that this extra complexity was warranted for now.

The other two operations (FADV_WILLNEED and FADV_DONTNEED) are implemented via 
a new VOP_ADVISE().  The patch includes a default implementation 
(vop_stdadvise()) which is a no-op for FADV_WILLNEED (I couldn't come up with a 
filesystem-independent way to trigger an async read-ahead).  For FADV_DONTNEED 
it has a functional implementation which flushes all clean buffers from the 
vnode (via a new V_CLEANONLY mode for vinvalbuf()) and then moves any clean, 
unwired pages in the specified range of the file to the cache page queue 
(using a new vm_object_page_cache() routine).

Various versions of this patch have already been reviewed and/or glanced at by
alc@, kib@, and mdf@, but I'd like to open it for wider review before 
committing it.  I will likely also MFC it back to 8 after 9.0 is released.

The patch can be found at www.freebsd.org/~jhb/patches/fadvise.patch

You can read the description of posix_fadvise() (which this implements) here:

http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_fadvise.html
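
For anyone who has not used the interface, a caller would look roughly like 
the sketch below.  It uses the names from the standard (posix_fadvise() and 
the POSIX_FADV_* constants) rather than the FADV_* shorthand used in this 
message, and the two helper function names are hypothetical.  Per the 
standard, a length of zero means "through the end of the file", and the call 
returns an error number directly instead of setting errno.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: declare that the whole file will be read
 * sequentially.  Returns 0 on success or an error number.
 */
static int
hint_sequential(int fd)
{
	/* offset 0, length 0: the advice covers the entire file */
	return (posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL));
}

/*
 * Hypothetical helper: tell the system that the given range will not
 * be needed again soon, so cached data for it may be released.
 */
static int
drop_cached(int fd, off_t off, off_t len)
{
	return (posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED));
}
```

A typical pattern would be to call hint_sequential() right after open(2) 
and drop_cached() on each chunk once it has been consumed.  Since the advice 
is only advisory, callers should not rely on any particular caching 
behavior.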

-- 
John Baldwin


