Date: Sat, 20 Aug 2011 22:17:26 +0300
From: Kostik Belousov
To: "Alexander V. Chernikov"
Cc: alc@freebsd.org, freebsd-stable@freebsd.org, daniel@digsys.bg, perryh@pluto.rain.com, Alan Cox
Subject: Re: 32GB limit per swap device?
Message-ID: <20110820191726.GY17489@deviant.kiev.zoral.com.ua>
In-Reply-To: <4E500014.6030800@ipfw.ru>

On Sat, Aug 20, 2011 at 10:42:28PM +0400, Alexander V. Chernikov wrote:
> Alan Cox wrote:
> > On 08/20/2011 12:41, Kostik Belousov wrote:
> >> On Sat, Aug 20, 2011 at 12:33:29PM -0500, Alan Cox wrote:
> >>> On Thu, Aug 18, 2011 at 3:16 AM, Alexander V.
> >>> Chernikov wrote:
> >>>
> >>>> On 10.08.2011 19:16, perryh@pluto.rain.com wrote:
> >>>>
> >>>>> Chuck Swiger wrote:
> >>>>>
> >>>>> On Aug 9, 2011, at 7:26 AM, Daniel Kalchev wrote:
> >>>>>>> I am trying to set up 64GB partitions for swap for a system that
> >>>>>>> has 64GB of RAM (with the idea to dump kernel core etc). But, on
> >>>>>>> 8-stable as of today I get:
> >>>>>>>
> >>>>>>> WARNING: reducing size to maximum of 67108864 blocks per swap unit
> >>>>>>>
> >>>>>>> Is there a workaround for this limitation?
> >>>>>>>
> >>>> Another interesting question:
> >>>>
> >>>> The swap pager operates in page-sized blocks (PAGE_SIZE=4k on common
> >>>> architectures).
> >>>>
> >>>> The block device size is passed to swaponsomething() as a number of
> >>>> _disk_ blocks (e.g. DEV_BSIZE=512). After that, the maximum object
> >>>> count check of the kernel b-lists (on top of which the swap pager is
> >>>> built) is enforced.
> >>>>
> >>>> The (possible) problem is that the real object count we will operate
> >>>> on is not the value passed to swaponsomething(), since the check is
> >>>> done in the wrong units.
> >>>>
> >>>> We should check the b-list limit against the (X * DEV_BSIZE / PAGE_SIZE)
> >>>> value, which is roughly (X / 8), so we should be able to address
> >>>> 32*8=256G.
> >>>>
> >>>> The code should look like this:
> >>>>
> >>>> Index: vm/swap_pager.c
> >>>> ===================================================================
> >>>> --- vm/swap_pager.c (revision 223877)
> >>>> +++ vm/swap_pager.c (working copy)
> >>>> @@ -2129,6 +2129,15 @@ swaponsomething(struct vnode *vp, void *id, u_long
> >>>>         u_long mblocks;
> >>>>
> >>>>         /*
> >>>> +        * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
> >>>> +        * First chop nblks off to page-align it, then convert.
> >>>> +        *
> >>>> +        * sw->sw_nblks is in page-sized chunks now too.
> >>>> +        */
> >>>> +       nblks &= ~(ctodb(1) - 1);
> >>>> +       nblks = dbtoc(nblks);
> >>>> +
> >>>> +       /*
> >>>>          * If we go beyond this, we get overflows in the radix
> >>>>          * tree bitmap code.
> >>>>          */
> >>>> @@ -2138,14 +2147,6 @@ swaponsomething(struct vnode *vp, void *id, u_long
> >>>>                     mblocks);
> >>>>                 nblks = mblocks;
> >>>>         }
> >>>> -       /*
> >>>> -        * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
> >>>> -        * First chop nblks off to page-align it, then convert.
> >>>> -        *
> >>>> -        * sw->sw_nblks is in page-sized chunks now too.
> >>>> -        */
> >>>> -       nblks &= ~(ctodb(1) - 1);
> >>>> -       nblks = dbtoc(nblks);
> >>>>
> >>>>         sp = malloc(sizeof *sp, M_VMPGDATA, M_WAITOK | M_ZERO);
> >>>>         sp->sw_vp = vp;
> >>>>
> >>>> (move the page recalculation before the b-list check)
> >>>>
> >>>> Can someone comment on this?
> >>>>
> >>> I believe that you are correct. Have you tried testing this change on a
> >>> large swap device?
> I will try tomorrow.
>
> >> I probably agree too, but I am in the process of re-reading the swap
> >> code, and I do not quite believe in the limit.
> >>
> >
> > I'm uncertain whether the current limit, "0x40000000 /
> > BLIST_META_RADIX", is exact or not, but I doubt that it is too large.
>
> It is not exact. It is a rough estimate of
> sizeof(blmeta_t) * X < 4G (blist_create() assumes that malloc() cannot
> allocate more than 4G; I'm not sure if that is true these days), where
> X is the number of blocks we need to store. The actual number, however, is
> X / (1 + 1/BLIST_META_RADIX + 1/BLIST_META_RADIX^2 + ...), but it does not
> differ from X very much.
>
> A blist can be seen as a tree of radix trees, with the metainformation for
> all those radix trees allocated by a single allocation, which imposes this
> limit.
> Metainformation is used to find free blocks more quickly.
>
> A single linear allocation is required to advance to the next radix tree
> on the same level very fast:
>
>   * * * * *
>   ** ** ** ** **
>   ********************
>   ^^^
> Some kind of schema with 3 levels in the tree and BLIST_META_RADIX=2
> (instead of 16).
>
> >> When the initial code was committed, our daddr_t was 32bit, I checked
> >> the RELENG_4 sources. Current code uses int64_t for daddr_t. My
> >> impression right now is that we only utilize the low 32 bits of daddr_t.
> >>
> >> Especially interesting is the following typedef:
> >> typedef uint32_t u_daddr_t; /* unsigned disk address */
> >> which (correctly) means that the typical mask (u_daddr_t)-1 is 0xffffffff.
> >>
> >> I wonder whether we could just use the full 64 bits and de facto remove
> >> the limitation on the swap partition size.
>
> This will double the size of struct blmeta_t and cause 2*X memory usage
> for every swap configuration.

No, daddr_t is already 64bit. Nothing will increase.

My point is that the current limitation is artificial. I think Alan's note
referred to the number of radix tree nodes required to cover a large
swap partition. But it could be a good temporary measure.
I expect to be able to provide some numeric evidence later.

> >
> > I would rather argue first that the subr_blist code should not be using
> > daddr_t at all. The code is abusing daddr_t and defining u_daddr_t to
> > represent things that are not disk addresses. Instead, it should either
> > define its own type or directly use (u)int*_t. Then, as for choosing
> > between 32 and 64 bits, I'm skeptical of using this structure for
> > managing more than 32 bits worth of blocks, given the amount of RAM it
> > will use.