Date:      Thu, 1 Dec 2011 08:11:27 -0800
From:      Peter Wemm <peter@wemm.org>
To:        Nathan Whitehorn <nwhitehorn@freebsd.org>
Cc:        alc@freebsd.org, Kostik Belousov <kostikbel@gmail.com>, Alan Cox <alan.l.cox@gmail.com>, Andreas Tobler <andreast-list@fgznet.ch>, FreeBSD Arch <freebsd-arch@freebsd.org>
Subject:   Re: powerpc64 malloc limit?
Message-ID:  <CAGE5yCrr54Y8E3Pw_AVHcKmnpaxkBNWBN1CQUE1ShO_SMyr2dg@mail.gmail.com>
In-Reply-To: <4ED792AC.4000501@freebsd.org>
References:  <4ED5BE19.70805@fgznet.ch> <20111130162236.GA50300@deviant.kiev.zoral.com.ua> <4ED65F70.7050700@fgznet.ch> <20111130170936.GB50300@deviant.kiev.zoral.com.ua> <4ED66B75.3060409@fgznet.ch> <CAGE5yCpe8rfZp3ErXrf_SFwY_KNYQDyF87TAypxajJa-FSqcpQ@mail.gmail.com> <CAJUyCcMPh818n-XxmBBCHUVJVZYQGaQN2AzGY9K8pEFm3rz-_w@mail.gmail.com> <4ED792AC.4000501@freebsd.org>

On Thu, Dec 1, 2011 at 6:43 AM, Nathan Whitehorn <nwhitehorn@freebsd.org> wrote:
> On 11/30/11 15:50, Alan Cox wrote:
>>
>> On Wed, Nov 30, 2011 at 12:12 PM, Peter Wemm <peter@wemm.org> wrote:
>>
>>> On Wed, Nov 30, 2011 at 9:44 AM, Andreas Tobler <andreast-list@fgznet.ch>
>>> wrote:
>>>>
>>>> On 30.11.11 18:09, Kostik Belousov wrote:
>>>>>
>>>>> On Wed, Nov 30, 2011 at 05:53:04PM +0100, Andreas Tobler wrote:
>>>>>>
>>>>>> On 30.11.11 17:22, Kostik Belousov wrote:
>>>>>>>
>>>>>>> On Wed, Nov 30, 2011 at 06:24:41AM +0100, Andreas Tobler wrote:
>>>>>>>>
>>>>>>>> All,
>>>>>>>>
>>>>>>>> while working on gcc I found a very strange situation which renders
>>>>>>>> my powerpc64 machine unusable.
>>>>>>>> The test case below tries to allocate as much memory as 'wanted'. The
>>>>>>>> same test case on amd64 returns w/o trying to allocate mem because
>>>>>>>> the size is far too big.
>>>>>>>>
>>>>>>>> I couldn't find the reason so far; that's why I'm here.
>>>>>>>>
>>>>>>>> As Nathan pointed out, VM_MAXUSER_ADDRESS is the biggest on
>>>>>>>> powerpc64:
>>>>>>>>
>>>>>>>> #define VM_MAXUSER_ADDRESS      (0x7ffffffffffff000UL)
>>>>>>>>
>>>>>>>> So, I'd expect the system to return an allocation error when a user
>>>>>>>> tries to allocate too much memory, not to actually attempt it and
>>>>>>>> become unusable. Iow, I'd expect the same situation on powerpc64 as I
>>>>>>>> see on amd64.
>>>>>>>>
>>>>>>>> Can anybody explain the situation to me? Why do I not have a working
>>>>>>>> limit on powerpc64?
>>>>>>>>
>>>>>>>> The machine itself has 7GB RAM and 12GB swap. The amd64 box I
>>>>>>>> compared against has around 4GB/4GB RAM/swap.
>>>>>>>>
>>>>>>>> TIA,
>>>>>>>> Andreas
>>>>>>>>
>>>>>>>> #include <stdlib.h>
>>>>>>>> #include <stdio.h>
>>>>>>>>
>>>>>>>> int main()
>>>>>>>> {
>>>>>>>>         void *p;
>>>>>>>>
>>>>>>>>         p = (void *) malloc (1152921504606846968ULL);
>>>>>>>>         if (p != NULL)
>>>>>>>>                 printf("p = %p\n", p);
>>>>>>>>
>>>>>>>>         printf("p = %p\n", p);
>>>>>>>>         return (0);
>>>>>>>> }
>>>>>>>
>>>>>>> First, you should provide details of what constitutes 'the unusable
>>>>>>> machine situation' on powerpc.
>>>>>>
>>>>>> I cannot log in anymore; everything is stuck except the core control
>>>>>> mechanisms, for example the fan controller.
>>>>>>
>>>>>> Top reports 'ugly' figures; below is from an earlier try:
>>>>>>
>>>>>> last pid:  6790;  load averages:  0.78,  0.84,  0.86    up 0+00:34:52  22:42:29
>>>>>> 47 processes:  1 running, 46 sleeping
>>>>>> CPU:  0.0% user,  0.0% nice, 15.4% system, 11.8% interrupt, 72.8% idle
>>>>>> Mem: 5912M Active, 570M Inact, 280M Wired, 26M Cache, 104M Buf, 352K Free
>>>>>> Swap: 12G Total, 9904M Used, 2383M Free, 80% Inuse, 178M Out
>>>>>>
>>>>>>   PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
>>>>>>  6768 andreast      1  52    0 1073741824G  6479M pfault  1   0:58 18.90% 31370.
>>>>>>
>>>>>> And after my mem and swap are full, I see swap_pager_getswapspace(16)
>>>>>> failed.
>>>>>>
>>>>>> In this state I can only power-cycle the machine.
>>>>>>
>>>>>>> That said, on amd64 the user map is between 0 and 0x7fffffffffff,
>>>>>>> which is obviously less than the requested allocation size of
>>>>>>> ~0x1000000000000000. If you look at the kdump output on amd64, you
>>>>>>> will see that malloc() tries to mmap() the area, fails, and retries
>>>>>>> with obreak(). The default virtual memory limit is unlimited, so my
>>>>>>> best guess is that on amd64 vm_map_findspace() returns immediately.
>>>>>>>
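
(This fallback is easy to watch, incidentally. Assuming Andreas's test
program is compiled to ./a.out, something like

	ktrace ./a.out
	kdump | grep -E 'mmap|break'

should show the failed mmap() followed by the break() retry on amd64.)
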
>>>>>>> On powerpc64, I see no reason why the vm_map_entry cannot be
>>>>>>> allocated, but please note that the vm object and pages shall be
>>>>>>> allocated only on demand.
>>>>>>>
>>>>>>> So I am curious how your machine breaks, and where.
>>>>>>
>>>>>> I would expect that the 'system' does not allow me to allocate that
>>>>>> much RAM.
>>>>>
>>>>> Is the issue with the machine going into limbo reproducible with the
>>>>> code you posted?
>>>>
>>>> If I understand you correctly, yes. I can launch the test case and the
>>>> machine is immediately unusable, meaning I can neither kill the process
>>>> nor log in. Also, top does not show anything useful.
>>>>
>>>> The original test case where I discovered this behavior behaves a bit
>>>> differently:
>>>>
>>>> http://gcc.gnu.org/viewcvs/trunk/libstdc%2B%2B-v3/testsuite/23_containers/vector/bool/modifiers/insert/31370.cc?revision=169421&view=markup
>>>>
>>>> Here I can follow how the RAM and swap are eaten up. Top is reporting
>>>> the figures. If everything is 'full', the swap pager errors start to
>>>> appear on the console.
>>>>
>>>>> Or, do you need to actually touch the pages in the allocated region ?
>>>>
>>>> If I have to, how would I do that?
>>>>
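
Touching them just means writing at least one byte into every page so
that each page is actually faulted in. A minimal sketch, with a 1 GiB
size picked purely for illustration:

	#include <stdlib.h>
	#include <unistd.h>

	int
	main(void)
	{
		size_t len = 1UL << 30;		/* 1 GiB, for illustration */
		size_t pagesz = (size_t)getpagesize();
		char *p = malloc(len);
		size_t i;

		if (p == NULL)
			return (1);
		/* One write per page forces each page to be faulted in. */
		for (i = 0; i < len; i += pagesz)
			p[i] = 1;
		return (0);
	}
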
>>>>> If the latter (and I do expect that), then how many pages do you need
>>>>> to touch before the machine breaks? Is it a single access that causes
>>>>> the havoc, or do you need to touch an amount approximately equal to
>>>>> RAM+swap?
>>>>
>>>> Andreas
>>>
>>> ia64 had some vaguely related excitement earlier in its life. If
>>> you created a 1TB sparse file and mmap'ed it over and over, tens,
>>> maybe hundreds of thousands of times, certain VM internal state got
>>> way out of hand. mmapping was fine, but unmapping took 36 hours of
>>> cpu runtime when I killed the process. It got so far out of hand
>>> because of the way ia64 handled just-in-time mappings on vhpt misses.
>>>
>> There is a fundamental scalability problem with the powerpc64/aim pmap.
>> See revision 212360. In a nutshell, unlike amd64, ia64, and most other
>> pmap implementations, the powerpc64/aim pmap implementation doesn't link
>> together all of the pv entries belonging to a pmap into a list. So, the
>> powerpc64/aim implementations of range operations, like pmap_remove(),
>> don't handle large, sparsely populated ranges as efficiently as ia64
>> does. Moreover, powerpc64/aim can't effectively implement
>> pmap_remove_pages(), and so it doesn't even try.
>>
>
> This is really irritating to fix. The PMAP layer is really designed
> around a tree-based page table layout, which PowerPC/AIM does not have.
> One fix for at least some issues might be to also link the PVO
> structures into the pmap -- some things would not be efficient, since it
> would be a flat list, but it's at least not that complicated.
> -Nathan

It's not entirely that simple.  The pmap layer exists so that the
upper level vm layers don't have to learn anything about page trees.
It's already been shown to work on tree-less systems like ia64.  On
MIPS, you could, if you wished, write a TLB miss handler that parsed
the upper layer vm structures if you were crazy enough.

So no, the pmap interface isn't designed around assuming trees. It's
just that the most often used pmap implementations happen to use them,
but not all of them do.

The upper layers do make some assumptions about what operations are
cheap vs. expensive, though, and on other pmaps we made implementation
changes to support observed use cases. The scenarios that Alan
mentioned remind me of what John Dyson was working on back in the late
90's, with things like buildworld speed (fork/exec/exit) and ld.so
mmap/munmap speed.

In particular, a second linked list was added to pv entries to speed up
an operation that needed to be faster, and APIs were added to access
the speedup. That was more about speeding up pv entry management than
assuming trees. This particular address-space rundown optimization
doesn't seem to exist in some of the other BSD platform ports that
people have used as models.
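
The shape of it, from memory (simplified, not the exact structures in
the tree), is a pv entry that sits on two lists at once: the per-page
chain it always had, plus a per-pmap chain, so tearing down an exiting
address space can walk just that pmap's own mappings instead of
scanning a huge, sparsely populated range:

	#include <sys/queue.h>

	struct pv_entry {
		pmap_t		pv_pmap;    /* pmap the mapping belongs to */
		vm_offset_t	pv_va;      /* virtual address of the mapping */
		TAILQ_ENTRY(pv_entry) pv_list;  /* all mappings of one page */
		TAILQ_ENTRY(pv_entry) pv_plist; /* all mappings in one pmap */
	};

pmap_remove_pages() can then run down pv_plist in a single pass, which
is exactly the operation Alan says powerpc64/aim can't implement
effectively today.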

There's another optimization that Apple uses that we don't. They
added an extra hash in there somewhere to deal with a certain
scalability issue when pv entry chains get very long due to their
prelinked shared library stuff.
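
As I remember it, the hash is keyed on something like the (pmap,
virtual address) pair, so removing one mapping finds its pv entry
directly instead of walking a chain that may be thousands of entries
long. Very roughly (names invented here, not Apple's actual code):

	#include <stdint.h>

	#define NPVHASH		4096	/* buckets; power of two */

	/* Map a (pmap pointer, virtual address) pair to a pv hash bucket. */
	#define PV_HASH(pm, va) \
		((((uintptr_t)(pm) >> 6) ^ ((va) >> 12)) & (NPVHASH - 1))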

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
"All of this is for nothing if we don't go to the stars" - JMS/B5
"If Java had true garbage collection, most programs would delete
themselves upon execution." -- Robert Sewell


