Date:      Thu, 30 Jan 2003 16:57:32 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Peter Wemm <peter@wemm.org>
Cc:        David Schultz <dschultz@uclink.Berkeley.EDU>, "Andrew R. Reiter" <arr@watson.org>, Scott Long <scott_long@btc.adaptec.com>, arch@FreeBSD.ORG
Subject:   Re: PAE (was Re: bus_dmamem_alloc_size())
Message-ID:  <3E39C9FC.3EAF3345@mindspring.com>
References:  <20030131003323.B42622A8A1@canning.wemm.org>

Peter Wemm wrote:
> I beg to differ about PSE36.  Since it still runs on 32 bit page tables,
> all PSE36 does is enable 4MB mappings that are targeted at above the 4G
> boundary.  It does this by shifting the PTD entries for 4MB pages across
> by 4 bits in order to squeeze the extra bits in.

Actually, they're 2M.  They eat a bit there.  8-).

> For some things this would be useful.  But remember you can *only* use
> it in 4MB chunks.  Our VM system isn't geared for that and we'd have
> to come up with an infrastructure to somehow get it to within reach of
> userland.  Maybe it could be used to provide backing store for things
> like system V shared memory, but the lack of size granularity would make
> it interesting.  And since it's 4MB chunks, forget paging and mmap etc.
> PSE36 really treats memory above 4G as second-class.

That's pretty much my point: the memory above 4G *is* second class,
in that it requires making memory below 4G *unavailable* in order
to make itself available, even if you use PAE.  The problem is one
of simultaneous access by multiple processes, and PSE36 at least
allows that, if badly, whereas PAE doesn't.

You're right about the VM system not being geared for it.  Going
to 2M instead of 4M "PSE pages" would be rather a pain, and that's
just one of a half dozen issues.

As to paging of 2M pages, I've actually always thought it needed to
be fixed so that large pages could be supported directly via paging.
It's not unreasonable to want to page at a ratio of 1:32,768, which
is what you would be getting with 2M pages over a 36 bit (64G)
physical address space.  Compared with 4K pages on a 4G system, at a
1:1,048,576 ratio, that's really only a reduction by a factor of 32
in the number of pageable objects mapping an entire address space.
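
For anyone who wants to check the ratios, here is a minimal sketch of
the arithmetic.  It assumes the 1:32,768 figure comes from 2M pages
spread over the full 36 bit (64G) physical address space; page_count()
and the size macros are names made up for this illustration, not
anything in the tree.

```c
#include <assert.h>
#include <stdint.h>

#define KB (UINT64_C(1) << 10)
#define MB (UINT64_C(1) << 20)
#define GB (UINT64_C(1) << 30)

/*
 * Hypothetical helper: number of pages of page_bytes each needed to
 * map mem_bytes of memory.  Requires an exact multiple.
 */
static uint64_t
page_count(uint64_t mem_bytes, uint64_t page_bytes)
{
	assert(page_bytes != 0 && mem_bytes % page_bytes == 0);
	return (mem_bytes / page_bytes);
}
```

With these definitions, page_count(64 * GB, 2 * MB) is 32,768 and
page_count(4 * GB, 4 * KB) is 1,048,576, so the number of pageable
objects drops by a factor of 32.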


> On the other hand, PAE treats all memory as "first class" and is useable
> everywhere.  The cost is that you need to do 64 bit idempotent writes to
> the page tables if you ever want to use it on SMP.  But at least it
> can be used for page cache, generic process data, malloc etc etc.
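
As a rough illustration of the 64 bit write requirement: a PAE page
table entry is 8 bytes wide, so on SMP the update has to land as one
store rather than two 32 bit halves.  The sketch below uses C11
atomics and reads "idempotent" as "atomic", which is an assumption on
my part; pae_pte_t, pte_store() and pte_load() are hypothetical names,
and a real pmap would also have to handle TLB shootdown, which this
does not attempt.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64 bit PAE-style page table entry. */
typedef _Atomic uint64_t pae_pte_t;

static_assert(sizeof(pae_pte_t) == 8, "PTE must be exactly 64 bits");

/* Publish a new entry with a single atomic 64 bit store. */
static void
pte_store(pae_pte_t *ptep, uint64_t newpte)
{
	atomic_store_explicit(ptep, newpte, memory_order_release);
}

/* Read an entry back with a single atomic 64 bit load. */
static uint64_t
pte_load(pae_pte_t *ptep)
{
	return (atomic_load_explicit(ptep, memory_order_acquire));
}
```

On a 32 bit x86 target a compiler would typically have to implement
these with something like cmpxchg8b, which is where the extra cost
comes from.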

It's usable, but not simultaneously.  A really good example here
would be buffer cache entries and mbufs, for something like a
"sendfile" operation.  If you have an FTP server with this
arrangement, and it's loaded enough to actually use the RAM, then
you will get FTP clients stalling each other at the driver level.

You could *maybe* get around it by making sure that the network
cards all did checksum offloading, were all capable of doing 64
bit addressing, and then pre-creating the mbuf list for the entire
"wired" region of the file, well in excess of the sendspace limit.
I've done that in a product or two (jacked around with ignoring
the sendspace limit, and putting huge chains of mbufs on a list).
But the cost of doing that is moving your mbufs to a 64 bit address
space, separate from the rest of the kernel.

If you don't separate inbound and outbound mbuf pools into 32 bit
and 64 bit pools, then you have to face the possibility of dealing
with the simultaneous access issue for, say, every mbuf in an mbuf
chain during an m_pullup operation.  The overhead for
several TCP streams where you are doing that would be killer.

I think it's probably better to acknowledge that the memory above
4G *is* second class, and then treat it as an L3 cache, and (maybe)
a DMA target for transfers *into* it, but not for transfers out.
It gets ugly fast, because of the cross-boundary stalls.  To me,
PAE is more like the segments in Windows 3.11; the OS has to be
built from the ground up to expect them, and use them properly.

8-(.

-- Terry




