Date:      Sat, 20 Aug 2011 22:17:26 +0300
From:      Kostik Belousov <kostikbel@gmail.com>
To:        "Alexander V. Chernikov" <melifaro@ipfw.ru>
Cc:        alc@freebsd.org, freebsd-stable@freebsd.org, daniel@digsys.bg, perryh@pluto.rain.com, Alan Cox <alc@rice.edu>
Subject:   Re: 32GB limit per swap device?
Message-ID:  <20110820191726.GY17489@deviant.kiev.zoral.com.ua>
In-Reply-To: <4E500014.6030800@ipfw.ru>
References:  <4E4143A6.6030307@digsys.bg> <935F8EC2-88E0-45A3-BE8B-7210BE223BC5@mac.com> <4e42a0c0.e2t/9MF98O3HFjb1%perryh@pluto.rain.com> <4E4CCA6C.8020408@ipfw.ru> <CAJUyCcMc7m65c_XjHNFi0A4cHHySC1brLS7HdivstxeOi6uFQw@mail.gmail.com> <20110820174147.GW17489@deviant.kiev.zoral.com.ua> <4E4FFAD3.4090706@rice.edu> <4E500014.6030800@ipfw.ru>

--fXc9gqH37d6mfFz8
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Aug 20, 2011 at 10:42:28PM +0400, Alexander V. Chernikov wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Alan Cox wrote:
> > On 08/20/2011 12:41, Kostik Belousov wrote:
> >> On Sat, Aug 20, 2011 at 12:33:29PM -0500, Alan Cox wrote:
> >>> On Thu, Aug 18, 2011 at 3:16 AM, Alexander V. Chernikov <melifaro@ipfw.ru> wrote:
> >>>
> >>>> On 10.08.2011 19:16, perryh@pluto.rain.com wrote:
> >>>>
> >>>>> Chuck Swiger<cswiger@mac.com>   wrote:
> >>>>>
> >>>>>   On Aug 9, 2011, at 7:26 AM, Daniel Kalchev wrote:
> >>>>>>> I am trying to set up 64GB partitions for swap for a system that
> >>>>>>> has 64GB of RAM (with the idea to dump kernel core etc). But, on
> >>>>>>> 8-stable as of today I get:
> >>>>>>>
> >>>>>>> WARNING: reducing size to maximum of 67108864 blocks per swap unit
> >>>>>>>
> >>>>>>> Is there a workaround for this limitation?
> >>>>>>>
> >>>> Another interesting question:
> >>>>
> >>>> The swap pager operates in page-sized blocks (PAGE_SIZE=4k on common
> >>>> architectures).
> >>>>
> >>>> The block device size is passed to swaponsomething() as a number of
> >>>> _disk_ blocks (i.e. in units of DEV_BSIZE=512). After that, the maximum
> >>>> object count of the kernel b-lists (on top of which the swap pager is
> >>>> built) is enforced.
> >>>>
> >>>> The (possible) problem is that the real object count we will operate on
> >>>> is not the value passed to swaponsomething(), so the check is done in
> >>>> the wrong units.
> >>>>
> >>>> We should check the b-list limit against (X * DEV_BSIZE / PAGE_SIZE),
> >>>> which is roughly (X / 8), so we should be able to address 32*8=256G.
> >>>>
> >>>> The code should look like this:
> >>>>
> >>>> Index: vm/swap_pager.c
> >>>> ===================================================================
> >>>> --- vm/swap_pager.c     (revision 223877)
> >>>> +++ vm/swap_pager.c     (working copy)
> >>>> @@ -2129,6 +2129,15 @@ swaponsomething(struct vnode *vp, void *id, u_long
> >>>>         u_long mblocks;
> >>>>
> >>>>         /*
> >>>> +        * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
> >>>> +        * First chop nblks off to page-align it, then convert.
> >>>> +        *
> >>>> +        * sw->sw_nblks is in page-sized chunks now too.
> >>>> +        */
> >>>> +       nblks &= ~(ctodb(1) - 1);
> >>>> +       nblks = dbtoc(nblks);
> >>>> +
> >>>> +       /*
> >>>>          * If we go beyond this, we get overflows in the radix
> >>>>          * tree bitmap code.
> >>>>          */
> >>>> @@ -2138,14 +2147,6 @@ swaponsomething(struct vnode *vp, void *id, u_long
> >>>>                         mblocks);
> >>>>                 nblks = mblocks;
> >>>>         }
> >>>> -       /*
> >>>> -        * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
> >>>> -        * First chop nblks off to page-align it, then convert.
> >>>> -        *
> >>>> -        * sw->sw_nblks is in page-sized chunks now too.
> >>>> -        */
> >>>> -       nblks &= ~(ctodb(1) - 1);
> >>>> -       nblks = dbtoc(nblks);
> >>>>
> >>>>         sp = malloc(sizeof *sp, M_VMPGDATA, M_WAITOK | M_ZERO);
> >>>>         sp->sw_vp = vp;
> >>>>
> >>>>
> >>>> (move pages recalculation before b-list check)
> >>>>
> >>>>
> >>>> Can someone comment on this?
> >>>>
> >>>>
> >>> I believe that you are correct.  Have you tried testing this change on a
> >>> large swap device?
> I will try tomorrow.
>
> >> I probably agree too, but I am in the process of re-reading the swap
> >> code, and I do not quite believe in the limit.
> >>
> >
> > I'm uncertain whether the current limit, "0x40000000 /
> > BLIST_META_RADIX", is exact or not, but I doubt that it is too large.
>
> It is not exact.  It is a rough estimate of
> sizeof(blmeta_t) * X < 4G (blist_create() assumes that malloc() is not
> able to allocate more than 4G; I'm not sure if that is true these days),
> where X is the number of blocks we need to store. The actual number,
> however, is X / (1 + 1/BLIST_META_RADIX + 1/BLIST_META_RADIX^2 + ...),
> but it does not differ from X very much.
>
> A blist can be seen as a tree of radix trees, with the metainformation for
> all those radix trees allocated by a single allocation, which imposes this
> limit. The metainformation is used to find free blocks more quickly.
>
> A single linear allocation is required to advance to the next radix tree
> on the same level very fast:
>
>
> *   *   *   *   *
> **  **  **  **  **
> ********************
> ^^^
> Some kind of schema with 3 levels in the tree and BLIST_META_RADIX=2 (instead
> of 16).
>
>
>
> >
> >> When the initial code was committed, our daddr_t was 32bit; I checked
> >> the RELENG_4 sources. The current code uses int64_t for daddr_t. My
> >> impression right now is that we only utilize the low 32 bits of daddr_t.
> >>
> >> Especially interesting is the following typedef:
> >> typedef    uint32_t    u_daddr_t;    /* unsigned disk address */
> >> which (correctly) means that the typical mask (u_daddr_t)-1 is 0xffffffff.
> >>
> >> I wonder whether we could just use the full 64 bits and de facto remove
> >> the limitation on the swap partition size.
>
> This will double the size of struct blmeta_t and cause 2*X memory usage
> for every swap configuration.
No, daddr_t is already 64bit. Nothing will increase.
My point is the current limitation is artificial.

I think Alan's note referred to the number of radix tree nodes
required to cover a large swap partition. But it could be a good
temporary measure.

I expect to be able to provide some numeric evidence later.
>
> >
> > I would rather argue first that the subr_blist code should not be using
> > daddr_t at all.  The code is abusing daddr_t and defining u_daddr_t to
> > represent things that are not disk addresses.  Instead, it should either
> > define its own type or directly use (u)int*_t.  Then, as for choosing
> > between 32 and 64 bits, I'm skeptical of using this structure for
> > managing more than 32 bits worth of blocks, given the amount of RAM it
> > will use.
> >
> >
> >
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.14 (FreeBSD)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk5QABQACgkQwcJ4iSZ1q2kdXwCfWPN48wauijoGOQCUaalYnFCR
> BIgAnRLCuDmPwySp1gd0xf+UPG5nC7KJ
> =sP6M
> -----END PGP SIGNATURE-----

--fXc9gqH37d6mfFz8
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (FreeBSD)

iEYEARECAAYFAk5QCEYACgkQC3+MBN1Mb4g3VQCfYlGrzdJOUw3Z2pL0mAWpb9fK
6hsAoLHoHVBteVjYBCRBEfRGCbACp6HU
=BGLI
-----END PGP SIGNATURE-----

--fXc9gqH37d6mfFz8--


