Date:      Sat, 21 Sep 2013 03:09:24 +0200
From:      Cedric Blancher <cedric.blancher@gmail.com>
To:        Sebastian Kuzminsky <S.Kuzminsky@f5.com>
Cc:        Patrick Dung <patrick_dkt@yahoo.com.hk>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, "ivoras@freebsd.org" <ivoras@freebsd.org>
Subject:   Re: About Transparent Superpages and Non-transparent superapges
Message-ID:  <CALXu0UeA14y8Riy1ji777-gvW5CJb9soRoxg7fzB6kF1Nn=Cbg@mail.gmail.com>
In-Reply-To: <B3A1DB16-7919-4BFA-893C-5E8502F16C17@f5.com>
References:  <mailman.2681.1379448875.363.freebsd-hackers@freebsd.org> <1379520488.49964.YahooMailNeo@web193502.mail.sg3.yahoo.com> <22E7E628-E997-4B64-B229-92E425D85084@f5.com> <1379649991.82562.YahooMailNeo@web193502.mail.sg3.yahoo.com> <B3A1DB16-7919-4BFA-893C-5E8502F16C17@f5.com>

On 20 September 2013 17:20, Sebastian Kuzminsky <S.Kuzminsky@f5.com> wrote:
> On Sep 19, 2013, at 22:06 , Patrick Dung wrote:
>
>> >We at Line Rate (now F5) are developing support for 1 Gig superpages on amd64.  We're basing our work on 9.1.0 for now.
>> >
>> >An early preview is available here:
>> >
>> >https://github.com/Seb-LineRate/freebsd/tree/freebsd-9.1.0-1gig-pages-NOT-READY-2
>>
>> That is cool.
>>
>> What type of applications can take advantage of the 1Gb page size?
>> And is it transparent? Or applications need to be modified?
>
> It's transparent for the kernel: all of UMA and kmem_malloc()/kmem_free() is backed by 1 gig superpages.
>
> It's not transparent for userspace: applications need to pass a new flag to mmap() to get 1 gig pages.

That may be the wrong approach. What happens if x86 gets more
huge/largepage sizes like SPARC does (hint: Sign NDA with Intel and
AMD and get surprised, and then allocate 16 more bits for mmap() if
you wish to stick with your approach)? For example SPARC64 does 8k,
64k, 512k, 4M, 32M, 256M, 2GB and 256GB pages (actual page sizes
differ from MMU to MMU implementation, and can be probed via pagesize
-a).

A much better option would be to follow Solaris, which has APIs to
enumerate the available page sizes and then set one for the heap, the
stack, or a given address range (the last is how largepages get used
for file I/O via mmap()).
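
A rough sketch of the enumerate-then-choose pattern, not a definitive
implementation. getpagesizes() is the call the memcntl(2) page below
refers to (FreeBSD ships one with the same signature); the helper names
list_page_sizes() and pick_page_size(), the MAX_PAGE_SIZES cap, and the
sysconf() fallback for other systems are assumptions made for
illustration only.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PAGE_SIZES 16   /* arbitrary cap chosen for this sketch */

/*
 * Hypothetical helper: fill sizes[] with the page sizes the system
 * reports. Returns the number of entries written, or -1 on error.
 */
int list_page_sizes(size_t sizes[], int nelem)
{
    if (nelem < 1)
        return -1;
#if defined(__sun) || defined(__FreeBSD__)
    return getpagesizes(sizes, nelem);
#else
    /* Assumed fallback: only the base page size is known here. */
    long base = sysconf(_SC_PAGESIZE);
    if (base <= 0)
        return -1;
    sizes[0] = (size_t)base;
    return 1;
#endif
}

/*
 * Hypothetical helper: largest entry of sizes[0..n) that is no bigger
 * than len and divides it evenly. Returns 0 if no entry qualifies.
 * Does not assume sizes[] is sorted.
 */
size_t pick_page_size(const size_t sizes[], int n, size_t len)
{
    size_t best = 0;
    for (int i = 0; i < n; i++) {
        if (sizes[i] != 0 && sizes[i] <= len &&
            len % sizes[i] == 0 && sizes[i] > best)
            best = sizes[i];
    }
    return best;
}
```

The point of the design is that the application never hardcodes a
page size: it asks what the MMU offers and picks from that list, so a
new size needs no new mmap() flag.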

For example ksh93 uses this to use 64k pages for the stack (this
mainly aims at SPARC where 64k stack pages can be a real performance
booster if you shuffle a lot of strings via stack):
-----------
int main(int argc, char *argv[])
{
#if _lib_memcntl
        /* advise larger stack size */
        struct memcntl_mha mha;
        mha.mha_cmd = MHA_MAPSIZE_STACK;
        mha.mha_flags = 0;
        mha.mha_pagesize = 64 * 1024;
        (void)memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
#endif
        return(sh_main(argc, argv, (Shinit_f)0));
}
-----------
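
The same call shape should work for the heap. This is only a sketch:
MHA_MAPSIZE_BSSBRK comes from the memcntl(2) page quoted below, while
the helper name advise_heap_pagesize() and the ENOSYS fallback on
systems without memcntl() are assumptions for illustration, not ksh93
code.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: advise the system to back the process heap
 * with pages of the given size (0 lets the system choose).
 * Returns 0 on success, -1 with errno set on failure.
 */
int advise_heap_pagesize(size_t pagesize)
{
#if defined(__sun) && defined(MC_HAT_ADVISE)
    struct memcntl_mha mha;

    mha.mha_cmd = MHA_MAPSIZE_BSSBRK;
    mha.mha_flags = 0;              /* reserved, must be 0 */
    mha.mha_pagesize = pagesize;
    /* addr/len must be NULL/0 and attr/mask must be 0 for this command */
    return memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
#else
    /* Assumed fallback: report "not supported" where memcntl() is absent. */
    (void)pagesize;
    errno = ENOSYS;
    return -1;
#endif
}
```

Since the call is only advice, a caller would normally ignore a
failure and carry on with the default page size.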

Below is the memcntl(2) manpage describing the API:
---------------------------------------



System Calls                                           memcntl(2)



NAME
     memcntl - memory management control

SYNOPSIS
     #include <sys/types.h>
     #include <sys/mman.h>

     int memcntl(caddr_t addr, size_t len, int cmd, caddr_t arg,
          int attr, int mask);


DESCRIPTION
     The memcntl() function allows the calling process to apply a
     variety of control operations over the address space identi-
     fied by the  mappings  established  for  the  address  range
     [addr, addr + len).


     The addr argument must be a  multiple  of  the  pagesize  as
     returned by sysconf(3C). The scope of the control operations
     can be further defined with  additional  selection  criteria
     (in  the  form  of  attributes) according to the bit pattern
     contained in attr.


     The following attributes specify page mapping selection cri-
     teria:

     SHARED     Page is mapped shared.


     PRIVATE    Page is mapped private.



     The following attributes specify page  protection  selection
     criteria.  The  selection criteria are constructed by a bit-
     wise OR operation on  the  attribute  bits  and  must  match
     exactly.

     PROT_READ     Page can be read.


     PROT_WRITE    Page can be written.


     PROT_EXEC     Page can be executed.



     The following criteria may also be specified:




SunOS 5.11          Last change: 10 Apr 2007                    1






     PROC_TEXT    Process text.


     PROC_DATA    Process data.



     The PROC_TEXT attribute specifies all privately mapped  seg-
     ments  with  read  and execute permission, and the PROC_DATA
     attribute specifies all privately mapped segments with write
     permission.


     Selection criteria can be used to describe various  abstract
     memory objects within the address space on which to operate.
     If an operation shall not be constrained  by  the  selection
     criteria, attr must have the value 0.


     The operation to be performed is identified by the  argument
     cmd.  The  symbolic  names for the operations are defined in
     <sys/mman.h> as follows:

     MC_LOCK

         Lock in memory all pages in the  range  with  attributes
         attr.  A given page may be locked multiple times through
         different mappings; however,  within  a  given  mapping,
         page  locks do not nest. Multiple lock operations on the
         same address in the same process  will  all  be  removed
         with  a  single  unlock  operation. A page locked in one
         process and mapped in another (or visible through a dif-
         ferent  mapping  in  the  locking  process) is locked in
         memory as long as the locking process  does  neither  an
         implicit nor explicit unlock operation. If a locked map-
         ping is removed, or a page is deleted through file remo-
         val  or  truncation,  an  unlock operation is implicitly
         performed. If a writable MAP_PRIVATE page in the address
         range  is  changed,  the lock will be transferred to the
         private page.

         The arg argument is not used, but must be  0  to  ensure
         compatibility with potential future enhancements.


     MC_LOCKAS

         Lock in memory all pages mapped  by  the  address  space
         with attributes attr. The addr and len arguments are not
         used, but must be NULL and  0  respectively,  to  ensure
         compatibility  with  potential future enhancements.  The
         arg argument is a bit pattern built from the flags:



         MCL_CURRENT    Lock current mappings.


         MCL_FUTURE     Lock future mappings.

         The value of arg determines  whether  the  pages  to  be
         locked  are those currently mapped by the address space,
         those that will be mapped in the  future,  or  both.  If
         MCL_FUTURE  is specified, then all mappings subsequently
         added to the address space will be locked, provided suf-
         ficient memory is available.


     MC_SYNC

         Write to their backing storage  locations  all  modified
         pages  in  the  range  with attributes attr. Optionally,
         invalidate cache copies. The backing storage for a modified
         MAP_SHARED mapping is the file the page is mapped
         to; the backing storage for a modified MAP_PRIVATE mapping
         is its swap area. The arg argument is a bit pattern
         built from the flags used to control the behavior of the
         operation:

         MS_ASYNC         Perform asynchronous writes.


         MS_SYNC          Perform synchronous writes.


         MS_INVALIDATE    Invalidate mappings.

         MS_ASYNC Return immediately once  all  write  operations
         are scheduled; with MS_SYNC the function will not return
         until all write operations are completed.

         MS_INVALIDATE Invalidate all cached copies  of  data  in
         memory,  so that further references to the pages will be
         obtained by the system from their backing storage  loca-
         tions.  This  operation  should  be used by applications
         that require a memory object to be in a known state.


     MC_UNLOCK

         Unlock all pages in the range with attributes attr.  The
         arg argument is not used, but must be 0 to ensure
         compatibility with potential future enhancements.


     MC_UNLOCKAS




         Remove address space memory locks and locks on all pages
         in  the  address  space  with attributes attr. The addr,
         len, and arg arguments are not used, but must be NULL, 0
         and 0, respectively, to ensure compatibility with poten-
         tial future enhancements.


     MC_HAT_ADVISE

         Advise system how a region of user-mapped memory will be
         accessed.  The  arg argument is interpreted as a "struct
         memcntl_mha *". The following members are defined  in  a
         struct memcntl_mha:

           uint_t mha_cmd;
           uint_t mha_flags;
           size_t mha_pagesize;

         The accepted values for mha_cmd are:

           MHA_MAPSIZE_VA
           MHA_MAPSIZE_STACK
           MHA_MAPSIZE_BSSBRK

         The mha_flags member is reserved for future use and must
         always  be  set  to 0. The mha_pagesize member must be a
         valid size as obtained from getpagesizes(3C) or the con-
         stant value 0 to allow the system to choose an appropri-
         ate hardware address translation mapping size.

         MHA_MAPSIZE_VA  sets  the  preferred  hardware   address
         translation  mapping  size  of the region of memory from
         addr to addr + len. Both addr and len must be aligned to
         an  mha_pagesize  boundary.  The  entire virtual address
         region from addr to addr + len must not have any  holes.
         Permissions  within each mha_pagesize-aligned portion of
         the region must be consistent.  When  a  size  of  0  is
         specified,  the system selects an appropriate size based
         on the size and alignment of the memory region, type  of
         processor, and other considerations.

         MHA_MAPSIZE_STACK sets the  preferred  hardware  address
         translation  mapping  size  of  the  process main thread
         stack segment. The addr and len arguments must be  NULL
         and 0, respectively.

         MHA_MAPSIZE_BSSBRK sets the preferred  hardware  address
         translation  mapping  size of the process heap. The addr
         and len arguments must be NULL and 0, respectively.  See
         the  NOTES section of the ppgsz(1) manual page for addi-
         tional information on process heap alignment.




         The attr argument must be 0 for all MC_HAT_ADVISE operations.



     The mask argument must be 0; it is reserved for future use.


     Locks established with the lock operations are not inherited
     by  a  child  process  after fork(2). The memcntl() function
     fails if it attempts to lock  more  memory  than  a  system-
     specific limit.


     Due to the potential impact on system resources, the  opera-
     tions  MC_LOCKAS,  MC_LOCK,  MC_UNLOCKAS,  and MC_UNLOCK are
     restricted to privileged processes.

USAGE
     The memcntl() function subsumes the operations of plock(3C).


     MC_HAT_ADVISE is intended to improve performance of applica-
     tions  that  use  large amounts of memory on processors that
     support multiple hardware address translation mapping sizes;
     however,  it  should  be  used with care. Not all processors
     support all sizes with equal efficiency. Use of larger sizes
     may  also introduce extra overhead that could reduce perfor-
     mance or available memory.  Using large sizes for one appli-
     cation may reduce available resources for other applications
     and result in slower system wide performance.

RETURN VALUES
     Upon successful completion, memcntl() returns 0;  otherwise,
     it returns -1 and sets errno to indicate an error.

ERRORS
     The memcntl() function will fail if:

     EAGAIN    When the selection criteria match, some or all  of
               the  memory  identified by the operation could not
               be locked when MC_LOCK or MC_LOCKAS was specified,
               some  or  all mappings in the address range [addr,
               addr + len) are locked for I/O when  MC_HAT_ADVISE
               was  specified,  or  the  system  has insufficient
               resources when MC_HAT_ADVISE was specified.

               The cmd is MC_LOCK or MC_LOCKAS  and  locking  the
               memory identified by this operation would exceed a
               limit or resource control on locked memory.





     EBUSY     When the selection criteria match, some or all  of
               the  addresses in the range [addr, addr + len) are
               locked and MC_SYNC with the  MS_INVALIDATE  option
               was specified.


     EINVAL    The addr argument specifies invalid selection criteria
               or is not a multiple of the page size as
               returned by sysconf(3C); the addr and/or len
               argument  does not have the value 0 when MC_LOCKAS
               or MC_UNLOCKAS is specified; the arg  argument  is
               not valid for the function specified; mha_pagesize
               or mha_cmd is invalid; or MC_HAT_ADVISE is  speci-
               fied  and  not  all  pages in the specified region
               have the same access permissions within the  given
               size boundaries.


     ENOMEM    When the selection criteria match, some or all  of
               the  addresses in the range [addr, addr + len) are
               invalid for the address  space  of  a  process  or
               specify one or more pages which are not mapped.


     EPERM     The  {PRIV_PROC_LOCK_MEMORY}  privilege   is   not
               asserted  in the effective set of the calling pro-
               cess  and  MC_LOCK,   MC_LOCKAS,   MC_UNLOCK,   or
               MC_UNLOCKAS was specified.


ATTRIBUTES
     See attributes(5) for descriptions of the  following  attri-
     butes:



     ____________________________________________________________
    |       ATTRIBUTE TYPE        |       ATTRIBUTE VALUE       |
    |_____________________________|_____________________________|
    | MT-Level                    | MT-Safe                     |
    |_____________________________|_____________________________|


SEE ALSO
     ppgsz(1), fork(2), mmap(2),  mprotect(2),  getpagesizes(3C),
     mlock(3C),  mlockall(3C), msync(3C), plock(3C), sysconf(3C),
     attributes(5), privileges(5)








---------------------------------------
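
For the MHA_MAPSIZE_VA case, the manpage requires addr and len to be
aligned to mha_pagesize. A small sketch of that precondition check,
done before calling memcntl(); the helper names mha_va_aligned() and
round_up_to() are hypothetical and not part of any Solaris API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if addr and len are both multiples
 * of pagesize, as MHA_MAPSIZE_VA requires, else 0. A pagesize of 0
 * (system chooses) is accepted as-is; the separate requirement that
 * addr be a multiple of the base page size is not checked here.
 */
int mha_va_aligned(uintptr_t addr, size_t len, size_t pagesize)
{
    if (pagesize == 0)
        return 1;
    return addr % pagesize == 0 && len % pagesize == 0;
}

/*
 * Hypothetical helper: round addr up to the next pagesize boundary.
 * pagesize must be non-zero.
 */
uintptr_t round_up_to(uintptr_t addr, size_t pagesize)
{
    uintptr_t rem = addr % pagesize;
    return rem == 0 ? addr : addr + (pagesize - rem);
}
```

A caller mapping a file would round the start of the region up and
trim the length down to a multiple of the chosen size, then advise
only that aligned middle part.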

Ced
-- 
Cedric Blancher <cedric.blancher@gmail.com>
Institute Pasteur


