Date:      Sat, 21 Sep 2013 03:16:06 +0200
From:      Cedric Blancher <cedric.blancher@gmail.com>
To:        Sebastian Kuzminsky <S.Kuzminsky@f5.com>
Cc:        Patrick Dung <patrick_dkt@yahoo.com.hk>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, "ivoras@freebsd.org" <ivoras@freebsd.org>
Subject:   Re: About Transparent Superpages and Non-transparent superapges
Message-ID:  <CALXu0UdkTrJFR53VTZPJ2ENPhaGLgedPvTWsy7JVdDL_0QMSig@mail.gmail.com>
In-Reply-To: <CALXu0UeA14y8Riy1ji777-gvW5CJb9soRoxg7fzB6kF1Nn=Cbg@mail.gmail.com>
References:  <mailman.2681.1379448875.363.freebsd-hackers@freebsd.org> <1379520488.49964.YahooMailNeo@web193502.mail.sg3.yahoo.com> <22E7E628-E997-4B64-B229-92E425D85084@f5.com> <1379649991.82562.YahooMailNeo@web193502.mail.sg3.yahoo.com> <B3A1DB16-7919-4BFA-893C-5E8502F16C17@f5.com> <CALXu0UeA14y8Riy1ji777-gvW5CJb9soRoxg7fzB6kF1Nn=Cbg@mail.gmail.com>

[repost, the previous email was stuck because I used an old email address]

On 21 September 2013 03:09, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> On 20 September 2013 17:20, Sebastian Kuzminsky <S.Kuzminsky@f5.com> wrote:
>> On Sep 19, 2013, at 22:06 , Patrick Dung wrote:
>>
>>> >We at Line Rate (now F5) are developing support for 1 Gig superpages on amd64.  We're basing our work on 9.1.0 for now.
>>> >
>>> >An early preview is available here:
>>> >
>>> >https://github.com/Seb-LineRate/freebsd/tree/freebsd-9.1.0-1gig-pages-NOT-READY-2
>>>
>>> That is cool.
>>>
>>> What type of applications can take advantage of the 1GB page size?
>>> And is it transparent, or do applications need to be modified?
>>
>> It's transparent for the kernel: all of UMA and kmem_malloc()/kmem_free() is backed by 1 gig superpages.
>>
>> It's not transparent for userspace: applications need to pass a new flag to mmap() to get 1 gig pages.
>
> That may be the wrong approach. What happens if x86 gets more
> huge/large page sizes, as SPARC has (hint: sign an NDA with Intel and
> AMD and get surprised, then allocate 16 more bits for mmap() flags if
> you wish to stick with your approach)? For example, SPARC64 supports
> 8k, 64k, 512k, 4M, 32M, 256M, 2GB and 256GB pages (the actual page
> sizes differ from one MMU implementation to another, and can be
> probed via "pagesize -a").
>
> A much better option would be to follow the Solaris API, which has
> calls to enumerate the available page sizes and then to set one for
> the heap, the stack or a given address range (the last is how large
> pages are used for file I/O via mmap()).
>
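To make the enumeration step concrete, here is a minimal sketch. It assumes getpagesizes(), which the man page below lists as getpagesizes(3C) and which FreeBSD also declares in <sys/mman.h>. On any other system it falls back to the single base page size from sysconf(). The helper names probe_page_sizes() and print_page_sizes() are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>

/* Assumption: getpagesizes() is only available on Solaris and FreeBSD. */
#if defined(__sun) || defined(__FreeBSD__)
#define HAVE_GETPAGESIZES 1
#endif

/*
 * Hypothetical helper: fill sizes[] with up to max supported page sizes.
 * Returns the number of entries written, or -1 on error.
 */
int probe_page_sizes(size_t *sizes, int max)
{
    assert(sizes != NULL);
    if (max < 1)
        return -1;
#ifdef HAVE_GETPAGESIZES
    return getpagesizes(sizes, max);
#else
    /* Fallback: report only the base page size. */
    long base = sysconf(_SC_PAGESIZE);
    if (base <= 0)
        return -1;
    sizes[0] = (size_t)base;
    return 1;
#endif
}

/* Hypothetical helper: print one size per line, much like "pagesize -a". */
int print_page_sizes(void)
{
    size_t sizes[32];
    int n = probe_page_sizes(sizes, 32);

    for (int i = 0; i < n; i++)
        printf("%zu\n", sizes[i]);
    return n;
}
```

An application would call probe_page_sizes() once at startup and pick a size from the result, instead of hard-coding a flag per page size.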
> For example, ksh93 uses memcntl(2) to request 64k pages for the stack
> (this mainly targets SPARC, where 64k stack pages can be a real
> performance booster if you shuffle a lot of strings via the stack):
> -----------
> int main(int argc, char *argv[])
> {
> #if _lib_memcntl
>         /* advise larger stack size */
>         struct memcntl_mha mha;
>         mha.mha_cmd = MHA_MAPSIZE_STACK;
>         mha.mha_flags = 0;
>         mha.mha_pagesize = 64 * 1024;
>         (void)memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
> #endif
>         return(sh_main(argc, argv, (Shinit_f)0));
> }
> -----------
>
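A rough way to see why the larger stack page size can help: every mapped page needs its own translation, so fewer pages means less pressure on the TLB. The sketch below only does the page-count arithmetic. The function name pages_needed() and the 1 MB stack used in the note after it are assumptions for illustration, not figures from ksh93.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of pages of size page_size needed to
 * cover bytes, rounding up for a partial last page.
 */
size_t pages_needed(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return bytes / page_size + (bytes % page_size != 0);
}
```

For an assumed 1 MB stack, pages_needed(1024 * 1024, 8 * 1024) is 128, while pages_needed(1024 * 1024, 64 * 1024) is 16, so the same stack is covered by one eighth as many translations.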
> Below is the memcntl(2) manpage describing the API:
> ---------------------------------------
>
>
>
> System Calls                                           memcntl(2)
>
>
>
> NAME
>      memcntl - memory management control
>
> SYNOPSIS
>      #include <sys/types.h>
>      #include <sys/mman.h>
>
>      int memcntl(caddr_t addr, size_t len, int cmd, caddr_t arg,
>           int attr, int mask);
>
>
> DESCRIPTION
>      The memcntl() function allows the calling process to apply a
>      variety of control operations over the address space identified
>      by the mappings established for the address range
>      [addr, addr + len).
>
>
>      The addr argument must be a multiple of the pagesize as
>      returned by sysconf(3C). The scope of the control operations
>      can be further defined with additional selection criteria
>      (in the form of attributes) according to the bit pattern
>      contained in attr.
>
>
>      The following attributes specify page mapping selection cri-
>      teria:
>
>      SHARED     Page is mapped shared.
>
>
>      PRIVATE    Page is mapped private.
>
>
>
>      The following attributes specify page  protection  selection
>      criteria.  The  selection criteria are constructed by a bit-
>      wise OR operation on  the  attribute  bits  and  must  match
>      exactly.
>
>      PROT_READ     Page can be read.
>
>
>      PROT_WRITE    Page can be written.
>
>
>      PROT_EXEC     Page can be executed.
>
>
>
>      The following criteria may also be specified:
>
>
>
>
> SunOS 5.11          Last change: 10 Apr 2007                    1
>
>      PROC_TEXT    Process text.
>
>
>      PROC_DATA    Process data.
>
>
>
>      The PROC_TEXT attribute specifies all privately mapped  seg-
>      ments  with  read  and execute permission, and the PROC_DATA
>      attribute specifies all privately mapped segments with write
>      permission.
>
>
>      Selection criteria can be used to describe various abstract
>      memory objects within the address space on which to operate.
>      If an operation shall not be constrained by the selection
>      criteria, attr must have the value 0.
>
>
>      The operation to be performed is identified by the argument
>      cmd. The symbolic names for the operations are defined in
>      <sys/mman.h> as follows:
>
>      MC_LOCK
>
>          Lock in memory all pages in the range with attributes
>          attr. A given page may be locked multiple times through
>          different mappings; however,  within  a  given  mapping,
>          page  locks do not nest. Multiple lock operations on the
>          same address in the same process  will  all  be  removed
>          with  a  single  unlock  operation. A page locked in one
>          process and mapped in another (or visible through a dif-
>          ferent  mapping  in  the  locking  process) is locked in
>          memory as long as the locking process  does  neither  an
>          implicit nor explicit unlock operation. If a locked map-
>          ping is removed, or a page is deleted through file remo-
>          val  or  truncation,  an  unlock operation is implicitly
>          performed. If a writable MAP_PRIVATE page in the address
>          range  is  changed,  the lock will be transferred to the
>          private page.
>
>          The arg argument is not used, but must be 0 to ensure
>          compatibility with potential future enhancements.
>
>
>      MC_LOCKAS
>
>          Lock in memory all pages mapped by the address space
>          with attributes attr. The addr and len arguments are not
>          used, but must be NULL and 0, respectively, to ensure
>          compatibility with potential future enhancements. The
>          arg argument is a bit pattern built from the flags:
>
>          MCL_CURRENT    Lock current mappings.
>
>
>          MCL_FUTURE     Lock future mappings.
>
>          The value of arg determines whether the pages to be
>          locked  are those currently mapped by the address space,
>          those that will be mapped in the  future,  or  both.  If
>          MCL_FUTURE  is specified, then all mappings subsequently
>          added to the address space will be locked, provided suf-
>          ficient memory is available.
>
>
>      MC_SYNC
>
>          Write to their backing storage locations all modified
>          pages in the range with attributes attr. Optionally,
>          invalidate cache copies. The backing storage for a modified
>          MAP_SHARED mapping is the file the page is mapped to; the
>          backing storage for a modified MAP_PRIVATE mapping is its
>          swap area. The arg argument is a bit pattern built from the
>          flags used to control the behavior of the operation:
>
>          MS_ASYNC         Perform asynchronous writes.
>
>
>          MS_SYNC          Perform synchronous writes.
>
>
>          MS_INVALIDATE    Invalidate mappings.
>
>          With MS_ASYNC, the function returns immediately once all
>          write operations are scheduled; with MS_SYNC the function
>          will not return until all write operations are completed.
>
>          MS_INVALIDATE invalidates all cached copies of data in
>          memory, so that further references to the pages will be
>          obtained by the system from their backing storage
>          locations. This operation should be used by applications
>          that require a memory object to be in a known state.
>
>
>      MC_UNLOCK
>
>          Unlock all pages in the range with attributes attr. The
>          arg argument is not used, but must be 0 to ensure
>          compatibility with potential future enhancements.
>
>
>      MC_UNLOCKAS
>
>          Remove address space memory locks and locks on all pages
>          in the address space with attributes attr. The addr,
>          len, and arg arguments are not used, but must be NULL, 0
>          and 0, respectively, to ensure compatibility with
>          potential future enhancements.
>
>
>      MC_HAT_ADVISE
>
>          Advise system how a region of user-mapped memory will be
>          accessed. The arg argument is interpreted as a "struct
>          memcntl_mha *". The following members are defined  in  a
>          struct memcntl_mha:
>
>            uint_t mha_cmd;
>            uint_t mha_flags;
>            size_t mha_pagesize;
>
>          The accepted values for mha_cmd are:
>
>            MHA_MAPSIZE_VA
>            MHA_MAPSIZE_STACK
>            MHA_MAPSIZE_BSSBRK
>
>          The mha_flags member is reserved for future use and must
>          always  be  set  to 0. The mha_pagesize member must be a
>          valid size as obtained from getpagesizes(3C) or the con-
>          stant value 0 to allow the system to choose an appropri-
>          ate hardware address translation mapping size.
>
>          MHA_MAPSIZE_VA sets the preferred hardware address
>          translation mapping size of the region of memory from
>          addr to addr + len. Both addr and len must be aligned to
>          an mha_pagesize boundary. The entire virtual address
>          region from addr to addr + len must not have any holes.
>          Permissions  within each mha_pagesize-aligned portion of
>          the region must be consistent.  When  a  size  of  0  is
>          specified,  the system selects an appropriate size based
>          on the size and alignment of the memory region, type  of
>          processor, and other considerations.
>
>          MHA_MAPSIZE_STACK sets the preferred hardware address
>          translation mapping size of the process main thread
>          stack segment. The addr and len arguments must be NULL
>          and 0, respectively.
>
>          MHA_MAPSIZE_BSSBRK sets the preferred hardware address
>          translation mapping size of the process heap. The addr
>          and len arguments must be NULL and 0, respectively. See
>          the NOTES section of the ppgsz(1) manual page for
>          additional information on process heap alignment.
>
>          The attr argument must be 0 for all MC_HAT_ADVISE
>          operations.
>
>
>
>      The mask argument must be 0; it is reserved for future use.
>
>
>      Locks established with the lock operations are not inherited
>      by  a  child  process  after fork(2). The memcntl() function
>      fails if it attempts to lock  more  memory  than  a  system-
>      specific limit.
>
>
>      Due to the potential impact on system resources, the  opera-
>      tions  MC_LOCKAS,  MC_LOCK,  MC_UNLOCKAS,  and MC_UNLOCK are
>      restricted to privileged processes.
>
> USAGE
>      The memcntl() function subsumes the operations of plock(3C).
>
>
>      MC_HAT_ADVISE is intended to improve performance of applica-
>      tions  that  use  large amounts of memory on processors that
>      support multiple hardware address translation mapping sizes;
>      however,  it  should  be  used with care. Not all processors
>      support all sizes with equal efficiency. Use of larger sizes
>      may  also introduce extra overhead that could reduce perfor-
>      mance or available memory.  Using large sizes for one appli-
>      cation may reduce available resources for other applications
>      and result in slower system wide performance.
>
> RETURN VALUES
>      Upon successful completion, memcntl() returns 0;  otherwise,
>      it returns -1 and sets errno to indicate an error.
>
> ERRORS
>      The memcntl() function will fail if:
>
>      EAGAIN    When the selection criteria match, some or all  of
>                the  memory  identified by the operation could not
>                be locked when MC_LOCK or MC_LOCKAS was specified,
>                some or all mappings in the address range [addr,
>                addr + len) are locked for I/O when MC_HAT_ADVISE
>                was  specified,  or  the  system  has insufficient
>                resources when MC_HAT_ADVISE was specified.
>
>                The cmd is MC_LOCK or MC_LOCKAS and locking the
>                memory identified by this operation would exceed a
>                limit or resource control on locked memory.
>
>
>      EBUSY     When the selection criteria match, some or all  of
>                the addresses in the range [addr, addr + len) are
>                locked and MC_SYNC with the  MS_INVALIDATE  option
>                was specified.
>
>
>      EINVAL    The addr argument specifies invalid selection
>                criteria or is not a multiple of the page size as
>                returned by sysconf(3C); the addr and/or len
>                argument does not have the value 0 when MC_LOCKAS
>                or MC_UNLOCKAS is specified; the arg argument is
>                not valid for the function specified; mha_pagesize
>                or mha_cmd is invalid; or MC_HAT_ADVISE is
>                specified and not all pages in the specified region
>                have the same access permissions within the given
>                size boundaries.
>
>
>      ENOMEM    When the selection criteria match, some or all  of
>                the addresses in the range [addr, addr + len) are
>                invalid for the address  space  of  a  process  or
>                specify one or more pages which are not mapped.
>
>
>      EPERM     The  {PRIV_PROC_LOCK_MEMORY}  privilege   is   not
>                asserted  in the effective set of the calling pro-
>                cess  and  MC_LOCK,   MC_LOCKAS,   MC_UNLOCK,   or
>                MC_UNLOCKAS was specified.
>
>
> ATTRIBUTES
>      See attributes(5) for descriptions of the  following  attri-
>      butes:
>
>
>
>      +-----------------------------+-----------------------------+
>      |       ATTRIBUTE TYPE        |       ATTRIBUTE VALUE       |
>      +-----------------------------+-----------------------------+
>      | MT-Level                    | MT-Safe                     |
>      +-----------------------------+-----------------------------+
>
>
> SEE ALSO
>      ppgsz(1), fork(2), mmap(2),  mprotect(2),  getpagesizes(3C),
>      mlock(3C),  mlockall(3C), msync(3C), plock(3C), sysconf(3C),
>      attributes(5), privileges(5)
>
> ---------------------------------------
>
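The MHA_MAPSIZE_VA rules in the man page above (both addr and len aligned to mha_pagesize) are the part that is easy to get wrong, so here is a minimal sketch of the alignment step. The helper names align_up(), align_down(), aligned_subrange() and advise_page_size() are made up for this example, and the code assumes page sizes are powers of two. The memcntl() call is only compiled where <sys/mman.h> defines MC_HAT_ADVISE; elsewhere advise_page_size() simply returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>

/* Round x up to a multiple of align (align must be a power of two). */
uintptr_t align_up(uintptr_t x, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Round x down to a multiple of align (align must be a power of two). */
uintptr_t align_down(uintptr_t x, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return x & ~(align - 1);
}

/*
 * Shrink [addr, addr + len) to the largest sub-range whose start and
 * length are both multiples of pagesize. Returns the usable length
 * (0 if there is none) and stores the aligned start in *start_out.
 */
size_t aligned_subrange(uintptr_t addr, size_t len, size_t pagesize,
    uintptr_t *start_out)
{
    uintptr_t start = align_up(addr, pagesize);
    uintptr_t end = align_down(addr + len, pagesize);

    if (end <= start)
        return 0;
    *start_out = start;
    return (size_t)(end - start);
}

/*
 * Advise the system to map the aligned part of the region with
 * pagesize. Returns 0 on success, -1 on failure or when the
 * platform has no MC_HAT_ADVISE.
 */
int advise_page_size(void *addr, size_t len, size_t pagesize)
{
#ifdef MC_HAT_ADVISE
    struct memcntl_mha mha;
    uintptr_t start;
    size_t usable = aligned_subrange((uintptr_t)addr, len, pagesize, &start);

    if (usable == 0)
        return -1;
    mha.mha_cmd = MHA_MAPSIZE_VA;
    mha.mha_flags = 0;              /* reserved, must be 0 */
    mha.mha_pagesize = pagesize;    /* a size from getpagesizes(3C), or 0 */
    return memcntl((caddr_t)start, usable, MC_HAT_ADVISE,
        (caddr_t)&mha, 0, 0);
#else
    (void)addr; (void)len; (void)pagesize;
    return -1;
#endif
}
```

A caller would mmap() a file, pass the returned address and length to advise_page_size() with a size taken from getpagesizes(3C), and treat a -1 return as "carry on with the default page size", since the call is only advice.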
> Ced
> --
> Cedric Blancher <cedric.blancher@gmail.com>
> Institute Pasteur



-- 
Cedric Blancher <cedric.blancher@gmail.com>
Institute Pasteur


