Date:      Fri, 29 Aug 2014 14:20:44 -0700
From:      Peter Wemm <peter@wemm.org>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        src-committers@freebsd.org, Alan Cox <alc@rice.edu>, svn-src-all@freebsd.org, Dmitry Morozovsky <marck@rinet.ru>, "Matthew D. Fuller" <fullermd@over-yonder.net>, svn-src-head@freebsd.org
Subject:   Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
Message-ID:  <2714752.cWQfguSlQD@overcee.wemm.org>
In-Reply-To: <0B77E782B5004AEBA77E6A5D16924D83@multiplay.co.uk>
References:  <201408281950.s7SJo90I047213@svn.freebsd.org> <64121723.0IFfex9X4X@overcee.wemm.org> <0B77E782B5004AEBA77E6A5D16924D83@multiplay.co.uk>


On Friday 29 August 2014 21:42:15 Steven Hartland wrote:
> ----- Original Message -----
> From: "Peter Wemm" <peter@wemm.org>
>
> > On Friday 29 August 2014 20:51:03 Steven Hartland wrote:
> snip..
>
> > > Does Karl's explanation as to why this doesn't work above change
> > > your mind?
> >
> > Actually no, I would expect the code as committed would *cause* the
> > undesirable behavior that Karl described.
> >
> > ie: access a few large files and cause them to reside in cache.  Say
> > 50GB or so on a 200G ram machine.  We now have the state where:
> >
> > v_cache = 50GB
> > v_free = 1MB
> >
> > The rest of the vm system looks at vm_paging_needed(), which asks: are
> > we running short of "v_cache + v_free"?  Since there's 50.001GB free,
> > the answer is no.  It'll let v_free run right down to v_free_min
> > because of the giant pool of v_cache just sitting there, waiting to be
> > used.
> >
> > The zfs change, as committed will ignore all the free memory in the
> > form of v_cache.. and will be freaking out about how low v_free is
> > getting and will be sacrificing ARC in order to put more memory into
> > the v_free pool.
> >
> > As long as ARC keeps sacrificing itself this way, the free pages in
> > the v_cache pool won't get used.  When ARC finally runs out of pages
> > to give up to v_free, the kernel will start using the free pages from
> > v_cache.  Eventually it'll run down that v_cache free pool and arc
> > will be in a bare minimum state while this is happening.
> >
> > Meanwhile, ZFS ARC will be crippled.  This has consequences - it does
> > RCU like things from ARC to keep fragmentation under control.  With
> > ARC crippled, fragmentation will increase because there's less
> > opportunistic gathering of data from ARC.
> >
> > Granted, you have to get things freed from active/inactive to the
> > cache state, but once it's there, depending on the workload, it'll
> > mess with ARC.
>
> There's already a vm_paging_needed() check in there below so this will
> already be dealt with will it not?

No.

If you read the code that you changed, you won't get that far. The v_free
test comes before vm_paging_needed(), and if the v_free test triggers then
ARC will return pages and not look at the rest of the function.

If this function returns non-zero, ARC gives pages back:

static int
arc_reclaim_needed(void)
{
        if (kmem_free_count() < zfs_arc_free_target) {
                return (1);
        }
        /*
         * Cooperate with pagedaemon when it's time for it to scan
         * and reclaim some pages.
         */
        if (vm_paging_needed()) {
                return (1);
        }

ie: if v_free (ignoring v_cache free pages) gets below the threshold, stop
everything and discard ARC pages.
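
To make the ordering concrete, here is a rough userland model of the two
tests, fed with the 50GB cache / 1MB free scenario from earlier in the
thread.  This is only a sketch: struct vm_model, the model_* names and the
v_free_min / v_free_reserved values (25000 and 5000 pages) are made up for
illustration, not taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userland stand-in for the kernel counters, all in pages. */
struct vm_model {
        uint64_t v_free;        /* pages on the free queue */
        uint64_t v_cache;       /* pages on the cache queue */
        uint64_t free_target;   /* stands in for zfs_arc_free_target */
        uint64_t wakeup_thresh; /* stands in for vm_pageout_wakeup_thresh */
};

/* Models vm_paging_needed(): looks at v_free + v_cache together. */
int
model_paging_needed(const struct vm_model *m)
{
        assert(m != NULL);
        return (m->v_free + m->v_cache < m->wakeup_thresh);
}

/* Models the committed ordering: the v_free-only test runs first. */
int
model_arc_reclaim_needed(const struct vm_model *m)
{
        assert(m != NULL);
        if (m->v_free < m->free_target)
                return (1);
        if (model_paging_needed(m))
                return (1);
        return (0);
}

/* The scenario from the thread on 4K pages: 50GB in cache, 1MB free. */
struct vm_model
model_scenario(void)
{
        struct vm_model m;

        m.v_free = (1ULL << 20) / 4096;         /* 1MB  = 256 pages */
        m.v_cache = (50ULL << 30) / 4096;       /* 50GB = 13107200 pages */
        m.free_target = 4 * 25000 + 5000;       /* assumed min/reserved */
        m.wakeup_thresh = (25000 / 10) * 11;
        return (m);
}
```

With these numbers model_paging_needed() says there is no shortage, yet
model_arc_reclaim_needed() still reports that ARC must give pages back,
because the first test never looks at v_cache.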

The vm_paging_needed() code is a NO-OP at this point. It can never return
true.  Consider:
        vm_cnt.v_free_target = 4 * vm_cnt.v_free_min + vm_cnt.v_free_reserved;
vs
        vm_pageout_wakeup_thresh = (vm_cnt.v_free_min / 10) * 11;

zfs_arc_free_target defaults to vm_cnt.v_free_target, which is 400% of
v_free_min, and is compared against the smaller v_free pool.

vm_paging_needed() compares the total free pool (v_free + v_cache) against the
smaller wakeup threshold - 110% of v_free_min.

Comparing a larger value against a smaller target than the previous test will
never succeed unless you manually change the arc_free_target sysctl.
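
The arithmetic can be checked with a few lines of C.  Again a sketch only:
the model_* helpers are hypothetical, and they assume zfs_arc_free_target is
left at its default of v_free_target.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors: v_free_target = 4 * v_free_min + v_free_reserved */
uint64_t
model_free_target(uint64_t v_free_min, uint64_t v_free_reserved)
{
        assert(v_free_min <= (UINT64_MAX - v_free_reserved) / 4);
        return (4 * v_free_min + v_free_reserved);
}

/* Mirrors: vm_pageout_wakeup_thresh = (v_free_min / 10) * 11 */
uint64_t
model_wakeup_thresh(uint64_t v_free_min)
{
        return ((v_free_min / 10) * 11);
}

/*
 * Can the second test fire once the first one has passed?
 * Passing the first test means v_free >= free_target.  The smallest total
 * pool the second test can then see is v_free = free_target with
 * v_cache = 0, so it is enough to compare that against the wakeup
 * threshold.
 */
int
model_second_test_reachable(uint64_t v_free_min, uint64_t v_free_reserved)
{
        uint64_t smallest_total;

        smallest_total = model_free_target(v_free_min, v_free_reserved);
        return (smallest_total < model_wakeup_thresh(v_free_min));
}
```

For any v_free_min the target is at least 4 * v_free_min while the wakeup
threshold is at most 1.1 * v_free_min, so model_second_test_reachable()
returns 0 for every input under the default setting.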


Also, what about the magic numbers here:
u_int zfs_arc_free_target = (1 << 19); /* default before pagedaemon init only */

That's half a million pages, or 2GB of physical ram on a 4K page size system.
How is this going to work on early boot in the machines in the cluster with
less than 2GB of ram?
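
For the record, the page arithmetic as a small sketch (pages_to_bytes() and
target_exceeds_ram() are hypothetical helper names, not anything in the
tree):

```c
#include <assert.h>
#include <stdint.h>

/* Convert a page count to bytes, refusing inputs that would overflow. */
uint64_t
pages_to_bytes(uint64_t pages, uint64_t page_size)
{
        assert(page_size != 0 && pages <= UINT64_MAX / page_size);
        return (pages * page_size);
}

/* Is the free target larger than all of the RAM in the machine? */
int
target_exceeds_ram(uint64_t target_pages, uint64_t page_size,
    uint64_t ram_bytes)
{
        return (pages_to_bytes(target_pages, page_size) > ram_bytes);
}
```

With target_pages = 1 << 19 and a 4096 byte page this comes to exactly 2GB,
so on a 1GB machine the free count can never reach the early-boot target.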

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…