Date:      Fri, 2 Oct 2015 18:50:36 -0500
From:      Alan Cox <alc@rice.edu>
To:        John Baldwin <jhb@freebsd.org>
Cc:        Mark Johnston <markj@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r288431 - in head/sys: kern sys vm
Message-ID:  <F3EF914A-8296-4833-BCF8-B9D878CAB80C@rice.edu>
In-Reply-To: <4276391.z2UvhhORjP@ralph.baldwin.cx>
References:  <201509302306.t8UN6UwX043736@repo.freebsd.org> <1837187.vUDrWYExQX@ralph.baldwin.cx> <20151002045842.GA18421@raichu> <4276391.z2UvhhORjP@ralph.baldwin.cx>


On Oct 2, 2015, at 10:59 AM, John Baldwin <jhb@freebsd.org> wrote:

> On Thursday, October 01, 2015 09:58:43 PM Mark Johnston wrote:
>> On Thu, Oct 01, 2015 at 09:32:45AM -0700, John Baldwin wrote:
>>> On Wednesday, September 30, 2015 11:06:30 PM Mark Johnston wrote:
>>>> Author: markj
>>>> Date: Wed Sep 30 23:06:29 2015
>>>> New Revision: 288431
>>>> URL: https://svnweb.freebsd.org/changeset/base/288431
>>>>
>>>> Log:
>>>>  As a step towards the elimination of PG_CACHED pages, rework the handling
>>>>  of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
>>>>  the head of the inactive queue instead of being cached.
>>>>
>>>>  This affects the implementation of POSIX_FADV_NOREUSE as well, since it
>>>>  works by applying POSIX_FADV_DONTNEED to file ranges after they have been
>>>>  read or written.  At that point the corresponding buffers may still be
>>>>  dirty, so the previous implementation would coalesce successive ranges and
>>>>  apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
>>>>  dirty buffers would eventually be cached.  To preserve this behaviour in an
>>>>  efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
>>>>  the pages backing a VMIO buf to be placed at the head of the inactive queue
>>>>  when the buf is released.  POSIX_FADV_NOREUSE then works by setting this
>>>>  flag in bufs that underlie the specified range.
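(For context, the userland access pattern POSIX_FADV_NOREUSE is aimed at
looks roughly like the following.  This is only a sketch of the interface
use, not code from the commit; the file name and buffer size are arbitrary.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char buf[1 << 16];
	ssize_t n;
	int error, fd;

	fd = open("bigfile.dat", O_RDONLY);	/* name is arbitrary */
	if (fd == -1) {
		perror("open");
		return (1);
	}

	/* Hint that the data will be read once and not revisited. */
	error = posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
	if (error != 0)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(error));

	/*
	 * One sequential pass; per the log message above, the kernel then
	 * applies DONTNEED to each range once it has been read.
	 */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		;				/* process buf here */
	if (n == -1)
		perror("read");

	close(fd);
	return (0);
}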
>>>
>>> Putting these pages back on the inactive queue completely defeats the primary
>>> purpose of DONTNEED and NOREUSE.  The primary purpose is to move the pages out
>>> of the VM object's tree of pages and into the free pool so that the application
>>> can instruct the VM to free memory more efficiently than relying on the page
>>> daemon.
>>>
>>> The implementation used cache pages instead of free as a cheap optimization so
>>> that if an application did something dumb where it used DONTNEED and then turned
>>> around and read the file it would not have to go to disk if the pages had not
>>> yet been reused.  In practice this didn't work out so well because PG_CACHE pages
>>> don't really work well.
>>>
>>> However, using PG_CACHE was secondary to the primary purpose of explicitly freeing
>>> memory that an application knew wasn't going to be reused and avoiding the need
>>> for the pagedaemon to run at all.  I think this should be freeing the pages instead
>>> of keeping them inactive.  If an application uses DONTNEED or NOREUSE and then
>>> turns around and rereads the file, it generally deserves to have to go to disk
>>> for it.
>>
>> A problem with this is that one application's DONTNEED or NOREUSE hint
>> would cause every application reading or writing that file to go to
>> disk, but posix_fadvise(2) is explicitly intended for applications that
>> wish to provide hints about their own access patterns. I realize that
>> it's typically used with application-private files, but that's not a
>> requirement of the interface. Deactivating (or caching) the backing
>> pages generally avoids this problem.
>
> I think it is not unreasonable to expect that fadvise() incurs system-wide
> effects.  A properly implemented WILLNEED that does read-ahead cannot work
> without incurring system-wide effects.  I had always assumed that fadvise()
> operated on a file, not a given process' view of a file (unlike, say,
> madvise which only operates on mappings and only indirectly affects
> file-backed data).
>


Can you elaborate on what you mean by “I had always assumed that
fadvise() operated on a file, …”?

Under the previous implementation, if you did an fadvise(DONTNEED) on a
file, in order to cache the file’s pages, those pages first had to be
unmapped from any address space.  (You can find this unmapping performed
by vm_page_try_to_cache().)  In other words, there was never any code
that said, “Is this a mapped page, and if it is, don’t cache it because
we’re actually performing an fadvise().”  So, to pick an extreme
example, if you did an fadvise(“libc.so”, DONTNEED), unless some process
had libc.so wired, then every single mapping to every single page of
libc.so was going to be destroyed and the pages moved to the cache.
However, because we moved the pages to the cache (rather than freeing
them), and libc.so is frequently accessed, a subsequent instruction
fetch would have faulted and been able to reactivate the cached page,
avoiding an I/O operation.  In other words, that we were caching the
pages targeted by fadvise() rather than simply freeing them mattered in
cases where the pages were in use/accessed by multiple processes.
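(To make that scenario concrete: the shared-file case above corresponds to
something as simple as the sketch below, where one process advises away
pages that other processes are still using.  The library path is
illustrative only; posix_fadvise() takes a descriptor, so the file is
opened read-only first.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	int error, fd;

	fd = open("/lib/libc.so.7", O_RDONLY);	/* illustrative path */
	if (fd == -1) {
		perror("open");
		return (1);
	}
	/* len == 0 means "from offset to the end of the file". */
	error = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	if (error != 0)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(error));
	close(fd);
	return (0);
}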


>>> I'm pretty sure I had mentioned this to Alan before.  I believe that the idea is
>>> that pagedaemon should be cheap enough that having it run anyway shouldn't be an
>>> issue, but I'm a bit skeptical of that. :)  Lock contention is always possible and
>>> having DONTNEED/NOREUSE move pages to PG_CACHE avoided lock contention with
>>> pagedaemon during application page faults (since pagedaemon potentially never has
>>> to run).
>>
>> That's true, but the page queue locking (and the pagedaemon's
>> manipulation of the page queue locks) has also become more fine-grained
>> since posix_fadvise(2) was added. In particular, from some reading of
>> sys/vm in stable/8, inactive queue scans used to be performed with the
>> global page queue lock held; it was only dropped to launder dirty pages.
>> Now, the page queue lock is split into separate locks for the active and
>> inactive page queues, and the pagedaemon drops the inactive queue lock
>> for each page in all but a few exceptional cases. Does the optimization
>> of freeing or caching DONTNEED pages buy us all that much now?
>>
>> Some synthetic testing in which an application writes out many large
>> (2G) files and calls posix_fadvise(FADV_DONTNEED) after each one shows
>> no significant difference in runtime if the buffer pages are deactivated
>> vs. freed. (My test just modifies vfs_vmio_unwire() to treat B_NOREUSE
>> identically to B_DIRECT.) Unsurprisingly, I see very little lock
>> contention in the latter case, but in the former, most of the lock
>> contention is short (i.e. the mutex is acquired while spinning), and
>> a large majority of the contention is on the free page queue mutex. If
>> lock contention there is a concern, wouldn't it be better to try and
>> address that directly rather than by bypassing the pagedaemon?
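(The write-then-DONTNEED pattern described above is roughly the following.
This is a sketch, not Mark's actual test program; the file count, file
size, and names are arbitrary.)

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define	NFILES		8
#define	FILESIZE	(2LL * 1024 * 1024 * 1024)	/* 2G per file */
#define	CHUNK		(1 << 20)

int
main(void)
{
	static char buf[CHUNK];
	char path[64];
	off_t off;
	int error, fd, i;

	memset(buf, 0xa5, sizeof(buf));
	for (i = 0; i < NFILES; i++) {
		snprintf(path, sizeof(path), "testfile.%d", i);
		fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd == -1) {
			perror("open");
			exit(1);
		}
		for (off = 0; off < FILESIZE; off += CHUNK) {
			if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
				perror("write");
				exit(1);
			}
		}
		/* Drop the hint once the whole file has been written. */
		error = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
		if (error != 0)
			fprintf(stderr, "posix_fadvise: %s\n", strerror(error));
		close(fd);
	}
	return (0);
}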
>
> The lock contention was related to one process faulting in a new page due to
> a malloc() while pagedaemon ran.  Also, it wasn't a steady type of contention
> that would show up in an average.  Instead, it was the outliers (which in the
> case on 8.x were on the order of 2 seconds) that were problematic.  I used a
> hack to log "long" wait times for specific processes to both debug this and
> evaluate the solution.  I have a test program lying around from when I last
> tested this.  I'll see what I can reproduce (before, it required a machine
> with at least 24GB of RAM to reproduce).
>
> The only foolproof way to reduce contention to zero is to eliminate one of
> the contending threads. :)  I do think there are situations where an
> application may be more informed about the optimal memory pattern for its
> workload than what the VM system can infer from heuristics.  Currently there
> is no other way to flush a file's contents from RAM.  If we had things like
> DONTNEED_I_MEAN_IT and DONTNEED_IM_NOT_SURE perhaps we could have a sliding
> scale, but at the moment the policy isn't that fine-grained.
>
> --
> John Baldwin
>
>



