Date:      Sat, 20 Aug 2011 13:20:03 -0500
From:      Alan Cox <alc@rice.edu>
To:        Kostik Belousov <kostikbel@gmail.com>
Cc:        alc@freebsd.org, freebsd-stable@freebsd.org, perryh@pluto.rain.com, "Alexander V. Chernikov" <melifaro@ipfw.ru>, daniel@digsys.bg
Subject:   Re: 32GB limit per swap device?
Message-ID:  <4E4FFAD3.4090706@rice.edu>
In-Reply-To: <20110820174147.GW17489@deviant.kiev.zoral.com.ua>
References:  <4E4143A6.6030307@digsys.bg> <935F8EC2-88E0-45A3-BE8B-7210BE223BC5@mac.com> <4e42a0c0.e2t/9MF98O3HFjb1%perryh@pluto.rain.com> <4E4CCA6C.8020408@ipfw.ru> <CAJUyCcMc7m65c_XjHNFi0A4cHHySC1brLS7HdivstxeOi6uFQw@mail.gmail.com> <20110820174147.GW17489@deviant.kiev.zoral.com.ua>

On 08/20/2011 12:41, Kostik Belousov wrote:
> On Sat, Aug 20, 2011 at 12:33:29PM -0500, Alan Cox wrote:
>> On Thu, Aug 18, 2011 at 3:16 AM, Alexander V. Chernikov <melifaro@ipfw.ru> wrote:
>>
>>> On 10.08.2011 19:16, perryh@pluto.rain.com wrote:
>>>
>>>> Chuck Swiger <cswiger@mac.com> wrote:
>>>>
>>>>   On Aug 9, 2011, at 7:26 AM, Daniel Kalchev wrote:
>>>>>> I am trying to set up 64GB partitions for swap for a system that
>>>>>> has 64GB of RAM (with the idea to dump kernel core etc). But, on
>>>>>> 8-stable as of today I get:
>>>>>>
>>>>>> WARNING: reducing size to maximum of 67108864 blocks per swap unit
>>>>>>
>>>>>> Is there a workaround for this limitation?
>>>>>>
>>> Another interesting question:
>>>
>>> The swap pager operates in page-sized blocks (PAGE_SIZE=4k on common arch).
>>>
>>> Block device size is passed to swaponsomething() in number of _disk_ blocks
>>>   (e.g. in DEV_BSIZE=512). After that, the maximum object count of the kernel
>>> b-lists (on top of which the swap pager is built) is enforced.
>>>
>>> The (possible) problem is that the real object count we will operate on is
>>> not the value passed to swaponsomething(), since the check is done in the
>>> wrong units.
>>>
>>> We should check the b-list limit on the (X * DEV_BSIZE / PAGE_SIZE) value,
>>> which is roughly (X / 8), so we should be able to address 32*8=256G.
>>>
>>> The code should look like this:
>>>
>>> Index: vm/swap_pager.c
>>> ===================================================================
>>> --- vm/swap_pager.c     (revision 223877)
>>> +++ vm/swap_pager.c     (working copy)
>>> @@ -2129,6 +2129,15 @@ swaponsomething(struct vnode *vp, void *id, u_long
>>>         u_long mblocks;
>>>
>>>         /*
>>> +        * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
>>> +        * First chop nblks off to page-align it, then convert.
>>> +        *
>>> +        * sw->sw_nblks is in page-sized chunks now too.
>>> +        */
>>> +       nblks &= ~(ctodb(1) - 1);
>>> +       nblks = dbtoc(nblks);
>>> +
>>> +       /*
>>>
>>>          * If we go beyond this, we get overflows in the radix
>>>          * tree bitmap code.
>>>          */
>>> @@ -2138,14 +2147,6 @@ swaponsomething(struct vnode *vp, void *id, u_long
>>>                         mblocks);
>>>                 nblks = mblocks;
>>>         }
>>> -       /*
>>> -        * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
>>> -        * First chop nblks off to page-align it, then convert.
>>> -        *
>>> -        * sw->sw_nblks is in page-sized chunks now too.
>>> -        */
>>> -       nblks &= ~(ctodb(1) - 1);
>>> -       nblks = dbtoc(nblks);
>>>
>>>         sp = malloc(sizeof *sp, M_VMPGDATA, M_WAITOK | M_ZERO);
>>>         sp->sw_vp = vp;
>>>
>>>
>>> (move pages recalculation before b-list check)
>>>
>>>
>>> Can someone comment on this?
>>>
>>>
>> I believe that you are correct.  Have you tried testing this change on a
>> large swap device?
> I probably agree too, but I am in the process of re-reading the swap code,
> and I do not quite believe in the limit.
>

I'm uncertain whether the current limit, "0x40000000 / 
BLIST_META_RADIX", is exact or not, but I doubt that it is too large.
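
To make the units in this discussion concrete, here is a minimal sketch of the arithmetic, not the kernel code itself. The constants are assumptions for illustration: 512 and 4096 are the DEV_BSIZE and PAGE_SIZE values quoted above, and a BLIST_META_RADIX of 16 is inferred from the warning message (0x40000000 / 16 = 67108864). All `SK_` names and helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only (SK_ = "sketch"). */
#define SK_DEV_BSIZE        512u   /* disk block size quoted in the thread */
#define SK_PAGE_SIZE        4096u  /* "4k on common arch" */
#define SK_BLIST_META_RADIX 16u    /* inferred: 0x40000000 / 16 == 67108864 */
#define SK_BLOCKS_PER_PAGE  (SK_PAGE_SIZE / SK_DEV_BSIZE)	/* 8 */

/* The limit under discussion, in whatever unit it is applied to. */
static uint64_t
sk_blist_limit(void)
{
	return (0x40000000ull / SK_BLIST_META_RADIX);
}

/* Chop a DEV_BSIZE block count to a page boundary, then convert to pages. */
static uint64_t
sk_dev_blocks_to_pages(uint64_t nblks)
{
	nblks &= ~(uint64_t)(SK_BLOCKS_PER_PAGE - 1);
	return (nblks / SK_BLOCKS_PER_PAGE);
}

/* Clamp a count to the limit. */
static uint64_t
sk_clamp(uint64_t n)
{
	uint64_t limit = sk_blist_limit();

	return (n > limit ? limit : n);
}

/* Order before the patch: clamp the DEV_BSIZE count, then convert. */
static uint64_t
sk_pages_clamp_then_convert(uint64_t nblks)
{
	return (sk_dev_blocks_to_pages(sk_clamp(nblks)));
}

/* Order proposed by the patch: convert to pages, then clamp. */
static uint64_t
sk_pages_convert_then_clamp(uint64_t nblks)
{
	return (sk_clamp(sk_dev_blocks_to_pages(nblks)));
}

/* Usable swap in bytes for a given page count. */
static uint64_t
sk_pages_to_bytes(uint64_t pages)
{
	assert(pages <= UINT64_MAX / SK_PAGE_SIZE);
	return (pages * SK_PAGE_SIZE);
}
```

Under these assumptions, a 64GB device (134217728 blocks of 512 bytes) ends up with 32GB usable when the clamp is applied first and the full 64GB when the conversion is applied first. The same limit applied to page counts tops out at 256GB per device.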

> When the initial code was committed, our daddr_t was 32bit, I checked
> the RELENG_4 sources. Current code uses int64_t for daddr_t. My impression
> right now is that we only utilize the low 32bits of daddr_t.
>
> Especially interesting is the following typedef:
> typedef	uint32_t	u_daddr_t;	/* unsigned disk address */
> which (correctly) means that typical mask (u_daddr_t)-1 is 0xffffffff.
>
> I wonder whether we could just use full 64bit and de-facto remove the
> limitation on the swap partition size.
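
The width mismatch described above can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions: `sk_daddr_t` and `sk_u_daddr_t` are hypothetical stand-ins that mirror the 64-bit daddr_t and the quoted 32-bit u_daddr_t typedef, and the helper functions are not taken from the kernel sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types discussed above. */
typedef int64_t  sk_daddr_t;	/* 64-bit signed disk address */
typedef uint32_t sk_u_daddr_t;	/* mirrors the quoted 32-bit typedef */

/* The "typical mask": every bit of the unsigned type set. */
static uint64_t
sk_full_mask(void)
{
	return ((sk_u_daddr_t)-1);
}

/*
 * What survives when a 64-bit block number passes through the
 * 32-bit unsigned type: only the low 32 bits.
 */
static sk_daddr_t
sk_through_u_daddr(sk_daddr_t blk)
{
	assert(blk >= 0);	/* block numbers are non-negative */
	return ((sk_daddr_t)(sk_u_daddr_t)blk);
}
```

Any block number at or above 2^32 loses its upper bits in this round trip, which is why widening the signed type alone would not lift the limit.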

I would rather argue first that the subr_blist code should not be using 
daddr_t at all.  The code is abusing daddr_t and defining u_daddr_t to 
represent things that are not disk addresses.  Instead, it should either 
define its own type or directly use (u)int*_t.  Then, as for choosing 
between 32 and 64 bits, I'm skeptical of using this structure for 
managing more than 32 bits worth of blocks, given the amount of RAM it 
will use.
