Date:      Tue, 18 Feb 2014 05:01:05 +0000 (UTC)
From:      Luigi Rizzo <luigi@FreeBSD.org>
To:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-10@freebsd.org
Subject:   svn commit: r262151 - in stable/10: share/man/man4 sys/conf sys/dev/e1000 sys/dev/ixgbe sys/dev/netmap sys/dev/re sys/modules/netmap sys/net tools/tools/netmap
Message-ID:  <201402180501.s1I515E3038759@svn.freebsd.org>

Author: luigi
Date: Tue Feb 18 05:01:04 2014
New Revision: 262151
URL: http://svnweb.freebsd.org/changeset/base/262151

Log:
  MFH: sync the netmap code with the one in HEAD
  (enhanced VALE switch, netmap pipes, emulated netmap mode).
  See details in the log for svn 261909.

Deleted:
  stable/10/tools/tools/netmap/click-test.cfg
  stable/10/tools/tools/netmap/nm_util.c
  stable/10/tools/tools/netmap/nm_util.h
  stable/10/tools/tools/netmap/pcap.c
Modified:
  stable/10/share/man/man4/netmap.4
  stable/10/sys/conf/files
  stable/10/sys/dev/e1000/if_em.c
  stable/10/sys/dev/e1000/if_igb.c
  stable/10/sys/dev/e1000/if_lem.c
  stable/10/sys/dev/ixgbe/ixgbe.c
  stable/10/sys/dev/netmap/if_em_netmap.h
  stable/10/sys/dev/netmap/if_igb_netmap.h
  stable/10/sys/dev/netmap/if_lem_netmap.h
  stable/10/sys/dev/netmap/if_re_netmap.h
  stable/10/sys/dev/netmap/ixgbe_netmap.h
  stable/10/sys/dev/netmap/netmap.c
  stable/10/sys/dev/netmap/netmap_kern.h
  stable/10/sys/dev/netmap/netmap_mem2.c
  stable/10/sys/dev/re/if_re.c
  stable/10/sys/modules/netmap/Makefile
  stable/10/sys/net/netmap.h
  stable/10/sys/net/netmap_user.h
  stable/10/tools/tools/netmap/Makefile
  stable/10/tools/tools/netmap/README
  stable/10/tools/tools/netmap/bridge.c
  stable/10/tools/tools/netmap/pkt-gen.c
  stable/10/tools/tools/netmap/vale-ctl.c

Modified: stable/10/share/man/man4/netmap.4
==============================================================================
--- stable/10/share/man/man4/netmap.4	Tue Feb 18 04:38:26 2014	(r262150)
+++ stable/10/share/man/man4/netmap.4	Tue Feb 18 05:01:04 2014	(r262151)
@@ -1,4 +1,4 @@
-.\" Copyright (c) 2011 Matteo Landi, Luigi Rizzo, Universita` di Pisa
+.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
 .\" All rights reserved.
 .\"
 .\" Redistribution and use in source and binary forms, with or without
@@ -21,230 +21,636 @@
 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 .\" SUCH DAMAGE.
-.\" 
+.\"
 .\" This document is derived in part from the enet man page (enet.4)
 .\" distributed with 4.3BSD Unix.
 .\"
 .\" $FreeBSD$
-.\" $Id: netmap.4 11563 2012-08-02 08:59:12Z luigi $: stable/8/share/man/man4/bpf.4 181694 2008-08-13 17:45:06Z ed $
 .\"
-.Dd September 23, 2013
+.Dd February 13, 2014
 .Dt NETMAP 4
 .Os
 .Sh NAME
 .Nm netmap
 .Nd a framework for fast packet I/O
+.br
+.Nm VALE
+.Nd a fast VirtuAl Local Ethernet using the netmap API
+.br
+.Nm netmap pipes
+.Nd a shared memory packet transport channel
 .Sh SYNOPSIS
 .Cd device netmap
 .Sh DESCRIPTION
 .Nm
-is a framework for fast and safe access to network devices
-(reaching 14.88 Mpps at less than 1 GHz).
-.Nm
-uses memory mapped buffers and metadata
-(buffer indexes and lengths) to communicate with the kernel,
-which is in charge of validating information through
-.Pa ioctl()
+is a framework for extremely fast and efficient packet I/O
+for both userspace and kernel clients.
+It runs on FreeBSD and Linux,
+and includes
+.Nm VALE ,
+a very fast and modular in-kernel software switch/dataplane,
+and
+.Nm netmap pipes ,
+a shared memory packet transport channel.
+All these are accessed interchangeably with the same API.
+.Pp
+.Nm , VALE
 and
-.Pa select()/poll().
+.Nm netmap pipes
+are at least one order of magnitude faster than
+standard OS mechanisms
+(sockets, bpf, tun/tap interfaces, native switches, pipes),
+reaching 14.88 million packets per second (Mpps)
+with much less than one core on a 10 Gbit NIC,
+about 20 Mpps per core for VALE ports,
+and over 100 Mpps for netmap pipes.
+.Pp
+Userspace clients can dynamically switch NICs into
 .Nm
-can exploit the parallelism in multiqueue devices and
-multicore systems.
+mode and send and receive raw packets through
+memory mapped buffers.
+Similarly,
+.Nm VALE
+switch instances and ports, and
+.Nm netmap pipes
+can be created dynamically,
+providing high speed packet I/O between processes,
+virtual machines, NICs and the host stack.
 .Pp
 .Nm
+supports both non-blocking I/O through
+.Xr ioctl 2 ,
+and synchronization and blocking I/O through a file descriptor
+and standard OS mechanisms such as
+.Xr select 2 ,
+.Xr poll 2 ,
+.Xr epoll 2 ,
+.Xr kqueue 2 .
+.Nm VALE
+and
+.Nm netmap pipes
+are implemented by a single kernel module, which also emulates the
+.Nm
+API over standard drivers for devices without native
+.Nm
+support.
+For best performance,
+.Nm
 requires explicit support in device drivers.
-For a list of supported devices, see the end of this manual page.
-.Sh OPERATION
+.Pp
+In the rest of this (long) manual page we document
+various aspects of the
 .Nm
-clients must first open the
-.Pa open("/dev/netmap") ,
-and then issue an
-.Pa ioctl(...,NIOCREGIF,...)
-to bind the file descriptor to a network device.
-.Pp
-When a device is put in
-.Nm
-mode, its data path is disconnected from the host stack.
-The processes owning the file descriptor
-can exchange packets with the device, or with the host stack,
-through an mmapped memory region that contains pre-allocated
-buffers and metadata.
+and
+.Nm VALE
+architecture, features and usage.
+.Pp
+.Sh ARCHITECTURE
+.Nm
+supports raw packet I/O through a
+.Em port ,
+which can be connected to a physical interface
+.Em ( NIC ) ,
+to the host stack,
+or to a
+.Nm VALE
+switch.
+Ports use preallocated circular queues of buffers
+.Em ( rings )
+residing in an mmapped region.
+There is one ring for each transmit/receive queue of a
+NIC or virtual port.
+An additional ring pair connects to the host stack.
+.Pp
+After binding a file descriptor to a port, a
+.Nm
+client can send or receive packets in batches through
+the rings, and possibly implement zero-copy forwarding
+between ports.
+.Pp
+All NICs operating in
+.Nm
+mode use the same memory region,
+accessible to all processes that own
+.Pa /dev/netmap
+file descriptors bound to NICs.
+Independent
+.Nm VALE
+and
+.Nm netmap pipe
+ports
+by default use separate memory regions,
+but can be independently configured to share memory.
+.Pp
+.Sh ENTERING AND EXITING NETMAP MODE
+The following section describes the system calls to create
+and control
+.Nm netmap 
+ports (including
+.Nm VALE
+and
+.Nm netmap pipe
+ports).
+Simpler, higher level functions are described in section
+.Sx LIBRARIES .
+.Pp
+Ports and rings are created and controlled through a file descriptor,
+created by opening a special device
+.Dl fd = open("/dev/netmap");
+and then bound to a specific port with an
+.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
+.Pp
+.Nm
+has multiple modes of operation controlled by the
+.Vt struct nmreq
+argument.
+.Va arg.nr_name
+specifies the port name, as follows:
+.Bl -tag -width XXXX
+.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
+the data path of the NIC is disconnected from the host stack,
+and the file descriptor is bound to the NIC (one or all queues),
+or to the host stack;
+.It Dv valeXXX:YYY (arbitrary XXX and YYY)
+the file descriptor is bound to port YYY of a VALE switch called XXX,
+both dynamically created if necessary.
+The string cannot exceed IFNAMSIZ characters, and YYY cannot
+be the name of any existing OS network interface.
+.El
+.Pp
+On return,
+.Va arg
+indicates the size of the shared memory region,
+and the number, size and location of all the
+.Nm
+data structures, which can be accessed by mmapping the memory
+.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
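+.Pp
+Putting the pieces together, a minimal sketch of binding a port
+(the NIC name "em0" is illustrative and error handling is omitted):
+.Bd -literal
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+#include <string.h>
+#include <net/netmap.h>
+#include <net/netmap_user.h>
+
+struct nmreq arg;
+int fd = open("/dev/netmap", O_RDWR);
+
+memset(&arg, 0, sizeof(arg));
+arg.nr_version = NETMAP_API;    /* the kernel checks the API version */
+strncpy(arg.nr_name, "em0", sizeof(arg.nr_name));
+ioctl(fd, NIOCREGIF, &arg);     /* bind fd to the port */
+char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE,
+    MAP_SHARED, fd, 0);         /* map rings and buffers */
+.Ed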
 .Pp
 Non blocking I/O is done with special
-.Pa ioctl()'s ,
-whereas the file descriptor can be passed to
-.Pa select()/poll()
-to be notified about incoming packet or available transmit buffers.
-.Ss Data structures
-All data structures for all devices in
-.Nm
-mode are in a memory
-region shared by the kernel and all processes
-who open
-.Pa /dev/netmap
-(NOTE: visibility may be restricted in future implementations).
-All references between the shared data structure
-are relative (offsets or indexes). Some macros help converting
-them into actual pointers.
+.Xr ioctl 2 ;
+.Xr select 2
+and
+.Xr poll 2
+on the file descriptor permit blocking I/O.
+.Xr epoll 2
+and
+.Xr kqueue 2
+are supported on
+.Nm
+file descriptors only on some platforms (see
+.Sx SELECT, POLL, EPOLL, KQUEUE ) .
 .Pp
-The data structures in shared memory are the following:
+While a NIC is in
+.Nm
+mode, the OS will still believe the interface is up and running.
+OS-generated packets for that NIC end up in a
+.Nm
+ring, and another ring is used to send packets into the OS network stack.
+A
+.Xr close 2
+on the file descriptor removes the binding,
+and returns the NIC to normal mode (reconnecting the data path
+to the host stack), or destroys the virtual port.
+.Pp
+.Sh DATA STRUCTURES
+The data structures in the mmapped memory region are detailed in
+.In net/netmap.h ,
+which is the ultimate reference for the
+.Nm
+API. The main structures and fields are indicated below:
 .Bl -tag -width XXX
 .It Dv struct netmap_if (one per interface)
-indicates the number of rings supported by an interface, their
-sizes, and the offsets of the
-.Pa netmap_rings
-associated to the interface.
-The offset of a
-.Pa struct netmap_if
-in the shared memory region is indicated by the
-.Pa nr_offset
-field in the structure returned by the
-.Pa NIOCREGIF
-(see below).
 .Bd -literal
 struct netmap_if {
-    char ni_name[IFNAMSIZ]; /* name of the interface. */
-    const u_int ni_num_queues; /* number of hw ring pairs */
-    const ssize_t   ring_ofs[]; /* offset of tx and rx rings */
+    ...
+    const uint32_t   ni_flags;      /* properties              */
+    ...
+    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
+    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
+    uint32_t         ni_bufs_head;  /* head of extra bufs list */
+    ...
 };
 .Ed
+.Pp
+Indicates the number of available rings
+.Pa ( struct netmap_rings )
+and their position in the mmapped region.
+The number of tx and rx rings
+.Pa ( ni_tx_rings , ni_rx_rings )
+normally depends on the hardware.
+NICs also have an extra tx/rx ring pair connected to the host stack.
+.Em NIOCREGIF
+can also request additional unbound buffers in the same memory space,
+to be used as temporary storage for packets.
+.Pa ni_bufs_head
+contains the index of the first of these free buffers,
+which are connected in a list (the first uint32_t of each
+buffer being the index of the next buffer in the list).
+A 0 indicates the end of the list.
+.Pp
 .It Dv struct netmap_ring (one per ring)
-contains the index of the current read or write slot (cur),
-the number of slots available for reception or transmission (avail),
-and an array of
-.Pa slots
-describing the buffers.
-There is one ring pair for each of the N hardware ring pairs
-supported by the card (numbered 0..N-1), plus
-one ring pair (numbered N) for packets from/to the host stack.
 .Bd -literal
 struct netmap_ring {
-    const ssize_t buf_ofs;
-    const uint32_t num_slots; /* number of slots in the ring. */
-    uint32_t avail;           /* number of usable slots */
-    uint32_t cur;             /* 'current' index for the user side */
-    uint32_t reserved;        /* not refilled before current */
-
-    const uint16_t nr_buf_size;
-    uint16_t flags;
-    struct netmap_slot slot[0]; /* array of slots. */
+    ...
+    const uint32_t num_slots;   /* slots in each ring            */
+    const uint32_t nr_buf_size; /* size of each buffer           */
+    ...
+    uint32_t       head;        /* (u) first buf owned by user   */
+    uint32_t       cur;         /* (u) wakeup position           */
+    const uint32_t tail;        /* (k) first buf owned by kernel */
+    ...
+    uint32_t       flags;
+    struct timeval ts;          /* (k) time of last rxsync()     */
+    ...
+    struct netmap_slot slot[0]; /* array of slots                */
 }
 .Ed
-.It Dv struct netmap_slot (one per packet)
-contains the metadata for a packet: a buffer index (buf_idx),
-a buffer length (len), and some flags.
+.Pp
+Implements transmit and receive rings, with read/write
+pointers, metadata and an array of
+.Pa slots
+describing the buffers.
+.Pp
+.It Dv struct netmap_slot (one per buffer)
 .Bd -literal
 struct netmap_slot {
-    uint32_t buf_idx; /* buffer index */
-    uint16_t len;   /* packet length */
-    uint16_t flags; /* buf changed, etc. */
-#define NS_BUF_CHANGED  0x0001  /* must resync, buffer changed */
-#define NS_REPORT       0x0002  /* tell hw to report results
-                                 * e.g. by generating an interrupt
-                                 */
+    uint32_t buf_idx;           /* buffer index                 */
+    uint16_t len;               /* packet length                */
+    uint16_t flags;             /* buf changed, etc.            */
+    uint64_t ptr;               /* address for indirect buffers */
 };
 .Ed
+.Pp
+Describes a packet buffer, which normally is identified by
+an index and resides in the mmapped region.
 .It Dv packet buffers
-are fixed size (approximately 2k) buffers allocated by the kernel
-that contain packet data. Buffers addresses are computed through
-macros.
+Fixed size (normally 2 KB) packet buffers allocated by the kernel.
 .El
 .Pp
-Some macros support the access to objects in the shared memory
-region. In particular:
+The offset of the
+.Pa struct netmap_if
+in the mmapped region is indicated by the
+.Pa nr_offset
+field in the structure returned by
+.Pa NIOCREGIF .
+From there, all other objects are reachable through
+relative references (offsets or indexes).
+Macros and functions in <net/netmap_user.h>
+help convert them into actual pointers:
+.Pp
+.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
+.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
+.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
+.Pp
+.Dl char *buf = NETMAP_BUF(ring, buffer_index);
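+.Pp
+As an illustration, the extra buffers requested through
+.Pa NIOCREGIF
+(see
+.Pa ni_bufs_head
+above) can be collected with the same helpers.
+A minimal sketch, assuming nifp and a ring (here txr) were obtained
+as shown above:
+.Bd -literal
+uint32_t idx = nifp->ni_bufs_head;
+
+while (idx != 0) {      /* a 0 index terminates the list */
+    char *buf = NETMAP_BUF(txr, idx);
+    uint32_t next = *(uint32_t *)buf;   /* first word links to the next */
+
+    /* buf can be used here as temporary packet storage */
+    idx = next;
+}
+.Ed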
+.Sh RINGS, BUFFERS AND DATA I/O
+.Va Rings
+are circular queues of packets with three indexes/pointers
+.Va ( head , cur , tail ) ;
+one slot is always kept empty.
+The ring size
+.Va ( num_slots )
+should not be assumed to be a power of two.
+.br
+(NOTE: older versions of netmap used a head/count format to indicate
+the content of a ring).
+.Pp
+.Va head
+is the first slot available to userspace;
+.br
+.Va cur
+is the wakeup point:
+select/poll will unblock when
+.Va tail
+passes
+.Va cur ;
+.br
+.Va tail
+is the first slot reserved to the kernel.
+.Pp
+Slot indexes MUST only move forward;
+for convenience, the function
+.Dl nm_ring_next(ring, index)
+returns the next index modulo the ring size.
+.Pp
+.Va head
+and
+.Va cur
+are only modified by the user program;
+.Va tail
+is only modified by the kernel.
+The kernel only reads/writes the
+.Vt struct netmap_ring
+slots and buffers
+during the execution of a netmap-related system call.
+The only exceptions are slots (and buffers) in the range
+.Va tail\  . . . head-1 ,
+that are explicitly assigned to the kernel.
+.Pp
+.Ss TRANSMIT RINGS
+On transmit rings, after a
+.Nm
+system call, slots in the range
+.Va head\  . . . tail-1
+are available for transmission.
+User code should fill the slots sequentially
+and advance
+.Va head
+and
+.Va cur
+past slots ready to transmit.
+.Va cur
+may be moved further ahead if the user code needs
+more slots before further transmissions (see
+.Sx SCATTER GATHER I/O ) .
+.Pp
+At the next NIOCTXSYNC/select()/poll(),
+slots up to
+.Va head-1
+are pushed to the port, and
+.Va tail
+may advance if further slots have become available.
+Below is an example of the evolution of a TX ring:
+.Pp
 .Bd -literal
-struct netmap_if *nifp;
-struct netmap_ring *txring = NETMAP_TXRING(nifp, i);
-struct netmap_ring *rxring = NETMAP_RXRING(nifp, i);
-int i = txring->slot[txring->cur].buf_idx;
-char *buf = NETMAP_BUF(txring, i);
+    after the syscall, slots between cur and tail are (a)vailable
+              head=cur   tail
+               |          |
+               v          v
+     TX  [.....aaaaaaaaaaa.............]
+
+    user creates new packets to (T)ransmit
+                head=cur tail
+                    |     |
+                    v     v
+     TX  [.....TTTTTaaaaaa.............]
+
+    NIOCTXSYNC/poll()/select() sends packets and reports new slots
+                head=cur      tail
+                    |          |
+                    v          v
+     TX  [..........aaaaaaaaaaa........]
+.Ed
+.Pp
+select() and poll() will block if there is no space in the ring, i.e.
+.Dl ring->cur == ring->tail
+and return when new slots have become available.
+.Pp
+High speed applications may want to amortize the cost of system calls
+by preparing as many packets as possible before issuing them.
+.Pp
+A transmit ring with pending transmissions has
+.Dl ring->head != ring->tail + 1 (modulo the ring size).
+The function
+.Va int nm_tx_pending(ring)
+implements this test.
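+.Pp
+The following sketch of a transmit loop follows these rules
+(fd is the bound descriptor, txr a transmit ring, and
+make_packet() a placeholder for user code):
+.Bd -literal
+while (txr->head != txr->tail) {       /* slots available for transmit */
+    struct netmap_slot *slot = &txr->slot[txr->head];
+    char *buf = NETMAP_BUF(txr, slot->buf_idx);
+
+    slot->len = make_packet(buf);      /* fill the buffer */
+    txr->head = txr->cur = nm_ring_next(txr, txr->head);
+}
+ioctl(fd, NIOCTXSYNC, NULL);           /* push the whole batch */
+.Ed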
+.Pp
+.Ss RECEIVE RINGS
+On receive rings, after a
+.Nm
+system call, the slots in the range
+.Va head\& . . . tail-1
+contain received packets.
+User code should process them and advance
+.Va head
+and
+.Va cur
+past slots it wants to return to the kernel.
+.Va cur
+may be moved further ahead if the user code wants to
+wait for more packets
+without returning all the previous slots to the kernel.
+.Pp
+At the next NIOCRXSYNC/select()/poll(),
+slots up to
+.Va head-1
+are returned to the kernel for further receives, and
+.Va tail
+may advance to report new incoming packets.
+.br
+Below is an example of the evolution of an RX ring:
+.Bd -literal
+    after the syscall, there are some (h)eld and some (R)eceived slots
+           head  cur     tail
+            |     |       |
+            v     v       v
+     RX  [..hhhhhhRRRRRRRR..........]
+
+    user advances head and cur, releasing some slots and holding others
+               head cur  tail
+                 |  |     |
+                 v  v     v
+     RX  [..*****hhhRRRRRR...........]
+
+    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
+               head cur        tail
+                 |  |           |
+                 v  v           v
+     RX  [.......hhhRRRRRRRRRRRR....]
 .Ed
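+.Pp
+A matching receive loop sketch (rxr is a bound receive ring and
+consume_packet() a placeholder for user code):
+.Bd -literal
+ioctl(fd, NIOCRXSYNC, NULL);           /* look for received packets */
+while (rxr->head != rxr->tail) {       /* head..tail-1 hold packets */
+    struct netmap_slot *slot = &rxr->slot[rxr->head];
+    char *buf = NETMAP_BUF(rxr, slot->buf_idx);
+
+    consume_packet(buf, slot->len);
+    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->head);  /* release */
+}
+.Ed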
+.Pp
+.Sh SLOTS AND PACKET BUFFERS
+Normally, packets should be stored in the netmap-allocated buffers
+assigned to slots when ports are bound to a file descriptor.
+One packet is fully contained in a single buffer.
+.Pp
+The following flags affect slot and buffer processing:
+.Bl -tag -width XXX
+.It NS_BUF_CHANGED
+MUST be set whenever the buf_idx in the slot is changed.
+This can be used to implement
+zero-copy forwarding, see
+.Sx ZERO-COPY FORWARDING .
+.Pp
+.It NS_REPORT
+reports when this buffer has been transmitted.
+Normally,
+.Nm
+notifies transmit completions in batches, hence signals
+can be delayed indefinitely. This flag helps detect
+when packets have been sent and a file descriptor can be closed.
+.It NS_FORWARD
+When a ring is in 'transparent' mode (see
+.Sx TRANSPARENT MODE ) ,
+packets marked with this flag are forwarded to the other endpoint
+at the next system call, thus restoring (in a selective way)
+the connection between a NIC and the host stack.
+.It NS_NO_LEARN
+tells the forwarding code that the SRC MAC address for this
+packet must not be used in the learning bridge code.
+.It NS_INDIRECT
+indicates that the packet's payload is in a user-supplied buffer,
+whose user virtual address is in the 'ptr' field of the slot.
+The size can reach 65535 bytes.
+.br
+This is only supported on the transmit ring of
+.Nm VALE
+ports, and it helps reduce data copies in the interconnection
+of virtual machines.
+.It NS_MOREFRAG
+indicates that the packet continues with subsequent buffers;
+the last buffer in a packet must have the flag clear.
+.El
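+.Pp
+For instance, the buffer swap at the heart of zero-copy forwarding
+can be sketched as follows (rxr and txr are bound rings with,
+respectively, a packet and a free slot available):
+.Bd -literal
+struct netmap_slot *rs = &rxr->slot[rxr->head];
+struct netmap_slot *ts = &txr->slot[txr->head];
+uint32_t tmp = ts->buf_idx;
+
+ts->buf_idx = rs->buf_idx;      /* hand the full buffer to tx */
+rs->buf_idx = tmp;              /* give the spare buffer to rx */
+ts->len = rs->len;
+ts->flags |= NS_BUF_CHANGED;    /* both slots changed buffers */
+rs->flags |= NS_BUF_CHANGED;
+.Ed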
+.Sh SCATTER GATHER I/O
+Packets can span multiple slots if the
+.Va NS_MOREFRAG
+flag is set in all but the last slot.
+The maximum length of a chain is 64 buffers.
+This is normally used with
+.Nm VALE
+ports when connecting virtual machines, as they generate large
+TSO segments that are not split unless they reach a physical device.
+.Pp
+NOTE: The length field always refers to the individual
+fragment; the total length of a packet is not stored anywhere.
+.Pp
+On receive rings the macro
+.Va NS_RFRAGS(slot)
+indicates the remaining number of slots for this packet,
+including the current one.
+Slots with a value greater than 1 also have NS_MOREFRAG set.
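+.Pp
+A sketch of queueing one packet split across several fragments
+(nfrags, at most 64, and frag_len() are placeholders):
+.Bd -literal
+for (unsigned i = 0; i < nfrags; i++) {
+    struct netmap_slot *slot = &txr->slot[txr->head];
+
+    slot->len = frag_len(i);    /* length of this fragment only */
+    slot->flags = (i == nfrags - 1) ? 0 : NS_MOREFRAG;
+    txr->head = txr->cur = nm_ring_next(txr, txr->head);
+}
+.Ed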
 .Sh IOCTLS
 .Nm
-supports some ioctl() to synchronize the state of the rings
-between the kernel and the user processes, plus some
-to query and configure the interface.
-The former do not require any argument, whereas the latter
-use a
-.Pa struct netmap_req
-defined as follows:
+uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
+for non-blocking I/O. They take no argument.
+Two more ioctls (NIOCGINFO, NIOCREGIF) are used
+to query and configure ports, with the following argument:
 .Bd -literal
 struct nmreq {
-        char      nr_name[IFNAMSIZ];
-        uint32_t  nr_version;     /* API version */
-#define NETMAP_API      3         /* current version */
-        uint32_t  nr_offset;      /* nifp offset in the shared region */
-        uint32_t  nr_memsize;     /* size of the shared region */
-        uint32_t  nr_tx_slots;    /* slots in tx rings */
-        uint32_t  nr_rx_slots;    /* slots in rx rings */
-        uint16_t  nr_tx_rings;    /* number of tx rings */
-        uint16_t  nr_rx_rings;    /* number of tx rings */
-        uint16_t  nr_ringid;      /* ring(s) we care about */
-#define NETMAP_HW_RING  0x4000    /* low bits indicate one hw ring */
-#define NETMAP_SW_RING  0x2000    /* we process the sw ring */
-#define NETMAP_NO_TX_POLL 0x1000  /* no gratuitous txsync on poll */
-#define NETMAP_RING_MASK 0xfff    /* the actual ring number */
-        uint16_t        spare1;
-        uint32_t        spare2[4];
+    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
+    uint32_t  nr_version;        /* (i) API version                */
+    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
+    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
+    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
+    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
+    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
+    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
+    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
+    uint16_t  nr_cmd;            /* (i) special command            */
+    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
+    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
+    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
+    uint32_t  nr_flags;          /* (i/o) open mode                */
+    ...
 };
-
 .Ed
-A device descriptor obtained through
+.Pp
+A file descriptor obtained through
 .Pa /dev/netmap
-also supports the ioctl supported by network devices.
+also supports the ioctls supported by network devices; see
+.Xr netintro 4 .
 .Pp
-The netmap-specific
-.Xr ioctl 2
-command codes below are defined in
-.In net/netmap.h
-and are:
 .Bl -tag -width XXXX
 .It Dv NIOCGINFO
-returns information about the interface named in nr_name.
-On return, nr_memsize indicates the size of the shared netmap
-memory region (this is device-independent),
-nr_tx_slots and nr_rx_slots indicates how many buffers are in a
-transmit and receive ring,
-nr_tx_rings and nr_rx_rings indicates the number of transmit
-and receive rings supported by the hardware.
+returns EINVAL if the named port does not support netmap.
+Otherwise, it returns 0 and (advisory) information
+about the port.
+Note that all the information below can change before the
+interface is actually put in netmap mode.
 .Pp
-If the device does not support netmap, the ioctl returns EINVAL.
+.Bl -tag -width XX
+.It Pa nr_memsize
+indicates the size of the
+.Nm
+memory region. NICs in
+.Nm
+mode all share the same memory region,
+whereas
+.Nm VALE
+ports have independent regions for each port.
+.It Pa nr_tx_slots , nr_rx_slots
+indicate the size of transmit and receive rings.
+.It Pa nr_tx_rings , nr_rx_rings
+indicate the number of transmit
+and receive rings.
+Both the number and the size of the rings may be configured at runtime
+using interface-specific functions (e.g.
+.Xr ethtool 8 ) .
+.El
 .It Dv NIOCREGIF
-puts the interface named in nr_name into netmap mode, disconnecting
-it from the host stack, and/or defines which rings are controlled
-through this file descriptor.
-On return, it gives the same info as NIOCGINFO, and nr_ringid
-indicates the identity of the rings controlled through the file
+binds the port named in
+.Va nr_name
+to the file descriptor. For a physical device this also switches it into
+.Nm
+mode, disconnecting
+it from the host stack.
+Multiple file descriptors can be bound to the same port,
+with proper synchronization left to the user.
+.Pp
+.Dv NIOCREGIF can also bind a file descriptor to one endpoint of a
+.Em netmap pipe ,
+consisting of two netmap ports with a crossover connection.
+A netmap pipe shares the same memory space as the parent port,
+and is meant to enable configurations where a master process acts
+as a dispatcher towards slave processes.
+.Pp
+To enable this function, the
+.Pa nr_arg1
+field of the structure can be used as a hint to the kernel to
+indicate how many pipes we expect to use, so that it can reserve
+extra space in the memory region.
+.Pp
+On return, it gives the same info as NIOCGINFO,
+with
+.Pa nr_ringid
+and
+.Pa nr_flags
+indicating the identity of the rings controlled through the file
 descriptor.
 .Pp
-Possible values for nr_ringid are
+.Va nr_flags
+and
+.Va nr_ringid
+select which rings are controlled through this file descriptor.
+Possible values of
+.Pa nr_flags
+are indicated below, together with the naming schemes
+that application libraries (such as the
+.Nm nm_open
+function described below) can use to indicate the specific set of rings.
+In the examples below, "netmap:foo" is any valid netmap port name.
+.Pp
 .Bl -tag -width XXXXX
-.It 0
-default, all hardware rings
-.It NETMAP_SW_RING
-the ``host rings'' connecting to the host stack
-.It NETMAP_HW_RING + i
-the i-th hardware ring
+.It NR_REG_ALL_NIC                         "netmap:foo"
+(default) all hardware ring pairs
+.It NR_REG_SW               "netmap:foo^"
+the ``host rings'', connecting to the host stack.
+.It NR_REG_NIC_SW           "netmap:foo+"
+all hardware rings and the host rings
+.It NR_REG_ONE_NIC       "netmap:foo-i"
+only the i-th hardware ring pair, where the number is in
+.Pa nr_ringid ;
+.It NR_REG_PIPE_MASTER  "netmap:foo{i"
+the master side of the netmap pipe whose identifier (i) is in
+.Pa nr_ringid ;
+.It NR_REG_PIPE_SLAVE   "netmap:foo}i"
+the slave side of the netmap pipe whose identifier (i) is in
+.Pa nr_ringid .
+.Pp
+The identifier of a pipe must be thought of as part of the pipe name,
+and does not need to be sequential. On return the pipe
+will only have a single ring pair with index 0,
+irrespective of the value of i.
 .El
+.Pp
 By default, a
-.Nm poll
+.Xr poll 2
 or
-.Nm select
+.Xr select 2
 call pushes out any pending packets on the transmit ring, even if
 no write events are specified.
 The feature can be disabled by or-ing
-.Nm NETMAP_NO_TX_SYNC
-to nr_ringid.
-But normally you should keep this feature unless you are using
-separate file descriptors for the send and receive rings, because
-otherwise packets are pushed out only if NETMAP_TXSYNC is called,
-or the send queue is full.
-.Pp
-.Pa NIOCREGIF
-can be used multiple times to change the association of a
-file descriptor to a ring pair, always within the same device.
-.It Dv NIOCUNREGIF
-brings an interface back to normal mode.
+.Va NETMAP_NO_TX_POLL
+into the value written to
+.Va nr_ringid .
+When this flag is set,
+packets are transmitted only when
+.Va ioctl(NIOCTXSYNC)
+is called, or when select()/poll() are called with a write event
+(POLLOUT/wfdset) or with a full ring.
+.Pp
+When registering a virtual interface that is dynamically created on a
+.Xr vale 4
+switch, we can specify the desired number of rings (1 by default,
+and currently up to 16) using the nr_tx_rings and nr_rx_rings fields.
 .It Dv NIOCTXSYNC
 tells the hardware of new packets to transmit, and updates the
 number of slots available for transmission.
@@ -252,54 +658,387 @@ number of slots available for transmissi
 tells the hardware of consumed packets, and asks for newly available
 packets.
 .El
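+.Pp
+For example, a port can be queried before binding it
+(the port name is illustrative and error handling is omitted):
+.Bd -literal
+struct nmreq req;
+
+memset(&req, 0, sizeof(req));
+req.nr_version = NETMAP_API;
+strncpy(req.nr_name, "em0", sizeof(req.nr_name));
+if (ioctl(fd, NIOCGINFO, &req) == 0)    /* EINVAL: no netmap support */
+    printf("%u tx rings with %u slots\en",
+        req.nr_tx_rings, req.nr_tx_slots);
+.Ed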
+.Sh SELECT, POLL, EPOLL, KQUEUE
+.Xr select 2
+and
+.Xr poll 2
+on a
+.Nm
+file descriptor process rings as indicated in
+.Sx TRANSMIT RINGS
+and
+.Sx RECEIVE RINGS ,
+respectively when write (POLLOUT) and read (POLLIN) events are requested.
+Both block if no slots are available in the ring
+.Va ( ring->cur == ring->tail ) .
+Depending on the platform,
+.Xr epoll 2
+and
+.Xr kqueue 2
+are supported too.
+.Pp
+Packets in transmit rings are normally pushed out
+(and buffers reclaimed) even without
+requesting write events. Passing the NETMAP_NO_TX_POLL flag to
+.Em NIOCREGIF
+disables this feature.
+By default, receive rings are processed only if read
+events are requested. Passing the NETMAP_DO_RX_POLL flag to
+.Em NIOCREGIF
+updates receive rings even without read events.
+Note that on epoll and kqueue, NETMAP_NO_TX_POLL and NETMAP_DO_RX_POLL
+only have an effect when some event is posted for the file descriptor.
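+.Pp
+A typical event loop combines
+.Xr poll 2
+with the ring processing described above (a sketch; the rx/tx
+handling is as in
+.Sx TRANSMIT RINGS
+and
+.Sx RECEIVE RINGS ) :
+.Bd -literal
+struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT };
+
+for (;;) {
+    poll(&pfd, 1, 2000 /* ms */);
+    if (pfd.revents & POLLIN) {
+        /* drain receive rings as in RECEIVE RINGS */
+    }
+    if (pfd.revents & POLLOUT) {
+        /* fill transmit rings as in TRANSMIT RINGS */
+    }
+}
+.Ed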
+.Sh LIBRARIES
+The
+.Nm
+API is meant to be used directly, both because of its simplicity and
+for efficient integration with applications.
+.Pp
+For convenience, the
+.Va <net/netmap_user.h>
+header provides a few macros and functions to ease creating
+a file descriptor and doing I/O with a
+.Nm
+port. These are loosely modeled after the
+.Xr pcap 3
+API, to ease porting of libpcap-based applications to
+.Nm .
+To use these extra functions, programs should
+.Dl #define NETMAP_WITH_LIBS
+before
+.Dl #include <net/netmap_user.h>
+.Pp
+The following functions are available:
+.Bl -tag -width XXXXX
+.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
+similar to
+.Xr pcap_open ,
+binds a file descriptor to a port.
+.Bl -tag -width XX
+.It Va ifname
+is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
+.Nm VALE
+port.
+.It Va req
+provides the initial values for the argument to the NIOCREGIF ioctl.
+The nr_flags and nr_ringid values are overwritten by parsing
+ifname and flags, and other fields can be overridden through
+the other two arguments.
+.It Va arg
+points to a struct nm_desc containing arguments (e.g. from a previously
+opened file descriptor) that should override the defaults.
+The fields are used as described below.
+.It Va flags
+can be set to a combination of the following flags:
+.Va NETMAP_NO_TX_POLL ,
+.Va NETMAP_DO_RX_POLL
+(copied into nr_ringid);
+.Va NM_OPEN_NO_MMAP (if arg points to the same memory region,
+avoids the mmap and uses the values from it);
+.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg);
+.Va NM_OPEN_ARG1 ,
+.Va NM_OPEN_ARG2 ,
+.Va NM_OPEN_ARG3 (uses the fields from arg);
+.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg).
+.El
+.It Va int nm_close(struct nm_desc *d)
+closes the file descriptor, unmaps memory, frees resources.
+.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
+similar to pcap_inject(), pushes a packet to a ring; returns the size
+of the packet if successful, or 0 on error;
+.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
+similar to pcap_dispatch(), applies a callback to incoming packets;
+.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
+similar to pcap_next(), fetches the next packet.
+.El
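+.Pp
+A pcap-style capture loop built on these helpers could be sketched as
+follows (the callback body and port name are illustrative):
+.Bd -literal
+#define NETMAP_WITH_LIBS
+#include <net/netmap_user.h>
+#include <poll.h>
+
+static void
+cb(u_char *arg, const struct nm_pkthdr *h, const u_char *data)
+{
+    /* handle one packet of h->len bytes at data */
+}
+
+int main(void)
+{
+    struct nm_desc *d = nm_open("netmap:em0", NULL, 0, NULL);
+    struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };
+
+    for (;;) {
+        poll(&pfd, 1, 2000);
+        nm_dispatch(d, -1, cb, NULL);   /* -1: all available packets */
+    }
+    /* not reached: nm_close(d) releases fd and memory */
+}
+.Ed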
+.Sh SUPPORTED DEVICES
+.Nm
+natively supports the following devices:
+.Pp
+On FreeBSD:
+.Xr em 4 ,
+.Xr igb 4 ,
+.Xr ixgbe 4 ,
+.Xr lem 4 ,
+.Xr re 4 .
+.Pp
+On Linux:
+.Xr e1000 4 ,
+.Xr e1000e 4 ,
+.Xr igb 4 ,
+.Xr ixgbe 4 ,
+.Xr mlx4 4 ,
+.Xr forcedeth 4 ,
+.Xr r8169 4 .
+.Pp
+NICs without native support can still be used in
+.Nm
+mode through emulation. Performance is inferior to native netmap
+mode but still significantly higher than sockets, and approaching
+that of in-kernel solutions such as Linux's
+.Xr pktgen .
+.Pp
+Emulation is also available for devices with native netmap support,
+which can be used for testing or performance comparison.
+The sysctl variable
+.Va dev.netmap.admode
+globally controls how netmap mode is implemented.
+.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
+Some aspects of the operation of
+.Nm
+are controlled through sysctl variables on FreeBSD
+.Em ( dev.netmap.* )
+and module parameters on Linux
+.Em ( /sys/module/netmap_lin/parameters/* ) :
+.Pp
+.Bl -tag -width indent
+.It Va dev.netmap.admode: 0
+Controls the use of native or emulated adapter mode.
+0 uses the best available option, 1 forces native and
+fails if not available, 2 forces emulated hence never fails.
+.It Va dev.netmap.generic_ringsize: 1024
+Ring size used for emulated netmap mode
+.It Va dev.netmap.generic_mit: 100000
+Controls interrupt moderation for emulated mode
+.It Va dev.netmap.mmap_unreg: 0
+.It Va dev.netmap.fwd: 0
+Forces NS_FORWARD mode
+.It Va dev.netmap.flags: 0
+.It Va dev.netmap.txsync_retry: 2
+.It Va dev.netmap.no_pendintr: 1
+Forces recovery of transmit buffers on system calls
+.It Va dev.netmap.mitigate: 1
+Propagates interrupt mitigation to user processes
+.It Va dev.netmap.no_timestamp: 0
+Disables the update of the timestamp in the netmap ring
+.It Va dev.netmap.verbose: 0
+Verbose kernel messages
+.It Va dev.netmap.buf_num: 163840
+.It Va dev.netmap.buf_size: 2048
+.It Va dev.netmap.ring_num: 200
+.It Va dev.netmap.ring_size: 36864
+.It Va dev.netmap.if_num: 100
+.It Va dev.netmap.if_size: 1024
+Sizes and number of objects (netmap_if, netmap_ring, buffers)
+for the global memory region. The only parameter worth modifying is
+.Va dev.netmap.buf_num
+as it impacts the total amount of memory used by netmap.
+.It Va dev.netmap.buf_curr_num: 0
+.It Va dev.netmap.buf_curr_size: 0
+.It Va dev.netmap.ring_curr_num: 0
+.It Va dev.netmap.ring_curr_size: 0
+.It Va dev.netmap.if_curr_num: 0
+.It Va dev.netmap.if_curr_size: 0
+Actual values in use.
+.It Va dev.netmap.bridge_batch: 1024
+Batch size used when moving packets across a
+.Nm VALE
+switch. Values above 64 generally guarantee good
+performance.
+.El
 .Sh SYSTEM CALLS
 .Nm
 uses
-.Nm select
+.Xr select 2 ,
+.Xr poll 2 ,
+.Xr epoll 2
 and
-.Nm poll
-to wake up processes when significant events occur.
+.Xr kqueue 2
+to wake up processes when significant events occur, and
+.Xr mmap 2

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***


