From: Luigi Rizzo
Date: Fri, 1 Nov 2013 21:21:15 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r257529 - in head: share/man/man4 sys/conf sys/dev/e1000 sys/dev/ixgbe sys/dev/netmap sys/dev/re sys/net tools/tools/netmap

Author: luigi
Date: Fri Nov 1 21:21:14 2013
New Revision: 257529
URL: http://svnweb.freebsd.org/changeset/base/257529

Log:
  update to the latest netmap snapshot. This includes the following:
  - use separate memory regions for VALE ports
  - locking fixes
  - some simplifications in the NIC-specific routines
  - performance improvements for the VALE switch
  - some new features in the pkt-gen test program
  - documentation updates

  There are small API changes that require programs to be recompiled
  (NETMAP_API has been bumped so you will detect old binaries at runtime).
  In particular:
  - struct netmap_slot now is 16 bytes to support an extra pointer,
    which may save one data copy when using VALE ports or VMs;
  - the struct netmap_if has two extra fields;

  MFC after: 3 days

Modified:
  head/share/man/man4/netmap.4
  head/sys/conf/files
  head/sys/dev/e1000/if_em.c
  head/sys/dev/e1000/if_igb.c
  head/sys/dev/e1000/if_lem.c
  head/sys/dev/e1000/if_lem.h
  head/sys/dev/ixgbe/ixgbe.c
  head/sys/dev/netmap/if_em_netmap.h
  head/sys/dev/netmap/if_igb_netmap.h
  head/sys/dev/netmap/if_lem_netmap.h
  head/sys/dev/netmap/if_re_netmap.h
  head/sys/dev/netmap/ixgbe_netmap.h
  head/sys/dev/netmap/netmap.c
  head/sys/dev/netmap/netmap_kern.h
  head/sys/dev/netmap/netmap_mem2.c
  head/sys/dev/re/if_re.c
  head/sys/net/netmap.h
  head/tools/tools/netmap/nm_util.c
  head/tools/tools/netmap/pkt-gen.c

Modified: head/share/man/man4/netmap.4
==============================================================================
--- head/share/man/man4/netmap.4	Fri Nov 1 21:17:45 2013	(r257528)
+++ head/share/man/man4/netmap.4	Fri Nov 1 21:21:14 2013	(r257529)
@@ -1,4 +1,4 @@
-.\" Copyright (c) 2011 Matteo Landi, Luigi Rizzo, Universita` di Pisa
+.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
 .\" All rights reserved.
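The note above about bumping NETMAP_API suggests a runtime check; here is a minimal userspace sketch of one (not part of this commit: the helper name is invented, and the assumption that a mismatched nr_version makes NIOCGINFO fail is inferred from the log message).

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

/* Ask the kernel about 'ifname' using the API version this binary was
 * built against; a mismatch is expected to surface here as an error. */
static int
netmap_check_api(const char *ifname)
{
	struct nmreq req;
	int fd = open("/dev/netmap", O_RDWR);

	if (fd < 0)
		return (-1);
	memset(&req, 0, sizeof(req));
	strncpy(req.nr_name, ifname, sizeof(req.nr_name) - 1);
	req.nr_version = NETMAP_API;	/* 4 after this commit */
	if (ioctl(fd, NIOCGINFO, &req) < 0) {
		perror("NIOCGINFO");	/* no netmap support, or API mismatch */
		close(fd);
		return (-1);
	}
	printf("%s: %u tx / %u rx rings, %u slots per tx ring, %u bytes shared\n",
	    ifname, req.nr_tx_rings, req.nr_rx_rings,
	    req.nr_tx_slots, req.nr_memsize);
	close(fd);
	return (0);
}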
.\" .\" Redistribution and use in source and binary forms, with or without @@ -21,14 +21,13 @@ .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. -.\" +.\" .\" This document is derived in part from the enet man page (enet.4) .\" distributed with 4.3BSD Unix. .\" .\" $FreeBSD$ -.\" $Id: netmap.4 11563 2012-08-02 08:59:12Z luigi $: stable/8/share/man/man4/bpf.4 181694 2008-08-13 17:45:06Z ed $ .\" -.Dd September 23, 2013 +.Dd October 18, 2013 .Dt NETMAP 4 .Os .Sh NAME @@ -38,101 +37,230 @@ .Cd device netmap .Sh DESCRIPTION .Nm -is a framework for fast and safe access to network devices -(reaching 14.88 Mpps at less than 1 GHz). -.Nm -uses memory mapped buffers and metadata -(buffer indexes and lengths) to communicate with the kernel, -which is in charge of validating information through -.Pa ioctl() -and -.Pa select()/poll(). +is a framework for extremely fast and efficient packet I/O +(reaching 14.88 Mpps with a single core at less than 1 GHz) +for both userspace and kernel clients. +Userspace clients can use the netmap API +to send and receive raw packets through physical interfaces +or ports of the +.Xr VALE 4 +switch. +.Pp +.Nm VALE +is a very fast (reaching 20 Mpps per port) +and modular software switch, +implemented within the kernel, which can interconnect +virtual ports, physical devices, and the native host stack. +.Pp +.Nm +uses a memory mapped region to share packet buffers, +descriptors and queues with the kernel. +Simple +.Pa ioctl()s +are used to bind interfaces/ports to file descriptors and +implement non-blocking I/O, whereas blocking I/O uses +.Pa select()/poll() . .Nm can exploit the parallelism in multiqueue devices and multicore systems. .Pp +For the best performance, +.Nm +requires explicit support in device drivers; +a generic emulation layer is available to implement the .Nm -requires explicit support in device drivers. -For a list of supported devices, see the end of this manual page. -.Sh OPERATION +API on top of unmodified device drivers, +at the price of reduced performance +(but still better than what can be achieved with +sockets or BPF/pcap). +.Pp +For a list of devices with native .Nm -clients must first open the +support, see the end of this manual page. +.Pp +.Sh OPERATION - THE NETMAP API +.Nm +clients must first .Pa open("/dev/netmap") , and then issue an -.Pa ioctl(...,NIOCREGIF,...) -to bind the file descriptor to a network device. -.Pp -When a device is put in +.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg) +to bind the file descriptor to a specific interface or port. .Nm -mode, its data path is disconnected from the host stack. -The processes owning the file descriptor -can exchange packets with the device, or with the host stack, -through an mmapped memory region that contains pre-allocated -buffers and metadata. +has multiple modes of operation controlled by the +content of the +.Pa struct nmreq +passed to the +.Pa ioctl() . +In particular, the +.Em nr_name +field specifies whether the client operates on a physical network +interface or on a port of a +.Nm VALE +switch, as indicated below. Additional fields in the +.Pa struct nmreq +control the details of operation. +.Pp +.Bl -tag -width XXXX +.It Dv Interface name (e.g. 'em0', 'eth1', ... ) +The data path of the interface is disconnected from the host stack. +Depending on additional arguments, +the file descriptor is bound to the NIC (one or all queues), +or to the host stack. 
+.It Dv valeXXX:YYY (arbitrary XXX and YYY) +The file descriptor is bound to port YYY of a VALE switch called XXX, +where XXX and YYY are arbitrary alphanumeric strings. +The string cannot exceed IFNAMSIZ characters, and YYY cannot +matching the name of any existing interface. +.Pp +The switch and the port are created if not existing. +.It Dv valeXXX:ifname (ifname is an existing interface) +Flags in the argument control whether the physical interface +(and optionally the corrisponding host stack endpoint) +are connected or disconnected from the VALE switch named XXX. .Pp +In this case the +.Pa ioctl() +is used only for configuring the VALE switch, typically through the +.Nm vale-ctl +command. +The file descriptor cannot be used for I/O, and should be +.Pa close()d +after issuing the +.Pa ioctl(). +.El +.Pp +The binding can be removed (and the interface returns to +regular operation, or the virtual port destroyed) with a +.Pa close() +on the file descriptor. +.Pp +The processes owning the file descriptor can then +.Pa mmap() +the memory region that contains pre-allocated +buffers, descriptors and queues, and use them to +read/write raw packets. Non blocking I/O is done with special .Pa ioctl()'s , whereas the file descriptor can be passed to .Pa select()/poll() to be notified about incoming packet or available transmit buffers. -.Ss Data structures -All data structures for all devices in +.Ss DATA STRUCTURES +The data structures in the mmapped memory are described below +(see +.Xr sys/net/netmap.h +for reference). +All physical devices operating in .Nm -mode are in a memory -region shared by the kernel and all processes -who open +mode use the same memory region, +shared by the kernel and all processes who own .Pa /dev/netmap +descriptors bound to those devices (NOTE: visibility may be restricted in future implementations). +Virtual ports instead use separate memory regions, +shared only with the kernel. +.Pp All references between the shared data structure are relative (offsets or indexes). Some macros help converting them into actual pointers. -.Pp -The data structures in shared memory are the following: .Bl -tag -width XXX .It Dv struct netmap_if (one per interface) indicates the number of rings supported by an interface, their sizes, and the offsets of the .Pa netmap_rings associated to the interface. -The offset of a +.Pp .Pa struct netmap_if -in the shared memory region is indicated by the +is at offset .Pa nr_offset +in the shared memory region is indicated by the field in the structure returned by the .Pa NIOCREGIF (see below). .Bd -literal struct netmap_if { - char ni_name[IFNAMSIZ]; /* name of the interface. */ - const u_int ni_num_queues; /* number of hw ring pairs */ - const ssize_t ring_ofs[]; /* offset of tx and rx rings */ + char ni_name[IFNAMSIZ]; /* name of the interface. */ + const u_int ni_version; /* API version */ + const u_int ni_rx_rings; /* number of rx ring pairs */ + const u_int ni_tx_rings; /* if 0, same as ni_rx_rings */ + const ssize_t ring_ofs[]; /* offset of tx and rx rings */ }; .Ed .It Dv struct netmap_ring (one per ring) -contains the index of the current read or write slot (cur), -the number of slots available for reception or transmission (avail), +Contains the positions in the transmit and receive rings to +synchronize the kernel and the application, and an array of .Pa slots describing the buffers. 
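A minimal sketch of the open()/NIOCREGIF/mmap() sequence just described, which works both for a physical interface name and for a "valeXXX:YYY" port (not part of the commit; the wrapper name and error handling are illustrative).

#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

/* Bind a netmap descriptor to "em0" or to a VALE port such as "vale0:a". */
static struct netmap_if *
netmap_open_port(const char *name, int *fdp, void **memp, size_t *memsz)
{
	struct nmreq req;
	struct netmap_if *nifp;
	void *mem;
	int fd;

	fd = open("/dev/netmap", O_RDWR);
	if (fd < 0)
		return (NULL);
	memset(&req, 0, sizeof(req));
	strncpy(req.nr_name, name, sizeof(req.nr_name) - 1);
	req.nr_version = NETMAP_API;
	if (ioctl(fd, NIOCREGIF, &req) < 0)	/* creates the VALE port if needed */
		goto fail;
	mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED)
		goto fail;
	nifp = NETMAP_IF(mem, req.nr_offset);	/* relative offset -> pointer */
	*fdp = fd;
	*memp = mem;
	*memsz = req.nr_memsize;
	return (nifp);		/* a later close(fd) undoes the binding */
fail:
	close(fd);
	return (NULL);
}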
-There is one ring pair for each of the N hardware ring pairs -supported by the card (numbered 0..N-1), plus -one ring pair (numbered N) for packets from/to the host stack. +'reserved' is used in receive rings to tell the kernel the +number of slots after 'cur' that are still in usr +indicates how many slots starting from 'cur' +the +.Pp +Each physical interface has one +.Pa netmap_ring +for each hardware transmit and receive ring, +plus one extra transmit and one receive structure +that connect to the host stack. .Bd -literal struct netmap_ring { - const ssize_t buf_ofs; - const uint32_t num_slots; /* number of slots in the ring. */ - uint32_t avail; /* number of usable slots */ - uint32_t cur; /* 'current' index for the user side */ - uint32_t reserved; /* not refilled before current */ + const ssize_t buf_ofs; /* see details */ + const uint32_t num_slots; /* number of slots in the ring */ + uint32_t avail; /* number of usable slots */ + uint32_t cur; /* 'current' read/write index */ + uint32_t reserved; /* not refilled before current */ const uint16_t nr_buf_size; - uint16_t flags; - struct netmap_slot slot[0]; /* array of slots. */ + uint16_t flags; +#define NR_TIMESTAMP 0x0002 /* set timestamp on *sync() */ +#define NR_FORWARD 0x0004 /* enable NS_FORWARD for ring */ +#define NR_RX_TSTMP 0x0008 /* set rx timestamp in slots */ + struct timeval ts; + struct netmap_slot slot[0]; /* array of slots */ } .Ed +.Pp +In transmit rings, after a system call 'cur' indicates +the first slot that can be used for transmissions, +and 'avail' reports how many of them are available. +Before the next netmap-related system call on the file +descriptor, the application should fill buffers and +slots with data, and update 'cur' and 'avail' +accordingly, as shown in the figure below: +.Bd -literal + + cur + |----- avail ---| (after syscall) + v + TX [*****aaaaaaaaaaaaaaaaa**] + TX [*****TTTTTaaaaaaaaaaaa**] + ^ + |-- avail --| (before syscall) + cur +.Ed + +In receive rings, after a system call 'cur' indicates +the first slot that contains a valid packet, +and 'avail' reports how many of them are available. +Before the next netmap-related system call on the file +descriptor, the application can process buffers and +release them to the kernel updating +'cur' and 'avail' accordingly, as shown in the figure below. +Receive rings have an additional field called 'reserved' +to indicate how many buffers before 'cur' are still +under processing and cannot be released. +.Bd -literal + cur + |-res-|-- avail --| (after syscall) + v + RX [**rrrrrrRRRRRRRRRRRR******] + RX [**...........rrrrRRR******] + |res|--|flags >> 8) & 0xff) + uint64_t ptr; /* buffer address (indirect buffers) */ }; .Ed +The flags control how the the buffer associated to the slot +should be managed. .It Dv packet buffers -are fixed size (approximately 2k) buffers allocated by the kernel +are normally fixed size (2 Kbyte) buffers allocated by the kernel that contain packet data. Buffers addresses are computed through macros. .El .Pp +.Bl -tag -width XXX Some macros support the access to objects in the shared memory -region. In particular: -.Bd -literal -struct netmap_if *nifp; -struct netmap_ring *txring = NETMAP_TXRING(nifp, i); -struct netmap_ring *rxring = NETMAP_RXRING(nifp, i); -int i = txring->slot[txring->cur].buf_idx; -char *buf = NETMAP_BUF(txring, i); -.Ed +region. 
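The cur/avail conventions illustrated in the two diagrams above translate into loops such as the following sketch (illustrative code, not from the commit); a later NIOCTXSYNC/NIOCRXSYNC or poll() on the descriptor makes the updated indexes visible to the kernel.

#include <string.h>
#include <stdint.h>
#include <sys/types.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

/* Fill at most 'npkts' transmit slots starting at 'cur', then publish them
 * by advancing 'cur' and shrinking 'avail' as in the TX diagram above. */
static uint32_t
tx_ring_fill(struct netmap_ring *ring, uint32_t npkts, uint16_t pktlen)
{
	uint32_t i, cur = ring->cur;
	uint32_t n = (npkts < ring->avail) ? npkts : ring->avail;

	for (i = 0; i < n; i++) {
		struct netmap_slot *slot = &ring->slot[cur];
		char *buf = NETMAP_BUF(ring, slot->buf_idx);

		memset(buf, 0xa5, pktlen);	/* dummy payload */
		slot->len = pktlen;
		cur = (cur + 1 == ring->num_slots) ? 0 : cur + 1;
	}
	ring->avail -= n;
	ring->cur = cur;
	return (n);
}

/* Consume every available receive slot and give the buffers back to the
 * kernel; keeping some of them would mean raising 'reserved' instead. */
static uint32_t
rx_ring_drain(struct netmap_ring *ring)
{
	uint32_t cur = ring->cur;
	uint32_t bytes = 0;

	while (ring->avail > 0) {
		struct netmap_slot *slot = &ring->slot[cur];

		bytes += slot->len;	/* real code would inspect the buffer */
		cur = (cur + 1 == ring->num_slots) ? 0 : cur + 1;
		ring->avail--;
	}
	ring->cur = cur;
	return (bytes);
}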
In particular, +.It NETMAP_TXRING(nifp, i) +.It NETMAP_RXRING(nifp, i) +return the address of the i-th transmit and receive ring, +respectively, whereas +.It NETMAP_BUF(ring, buf_idx) +returns the address of the buffer with index buf_idx +(which can be part of any ring for the given interface). +.El +.Pp +Normally, buffers are associated to slots when interfaces are bound, +and one packet is fully contained in a single buffer. +Clients can however modify the mapping using the +following flags: +.Ss FLAGS +.Bl -tag -width XXX +.It NS_BUF_CHANGED +indicates that the buf_idx in the slot has changed. +This can be useful if the client wants to implement +some form of zero-copy forwarding (e.g. by passing buffers +from an input interface to an output interface), or +needs to process packets out of order. +.Pp +The flag MUST be used whenever the buffer index is changed. +.It NS_REPORT +indicates that we want to be woken up when this buffer +has been transmitted. This reduces performance but insures +a prompt notification when a buffer has been sent. +Normally, +.Nm +notifies transmit completions in batches, hence signals +can be delayed indefinitely. However, we need such notifications +before closing a descriptor. +.It NS_FORWARD +When the device is open in 'transparent' mode, +the client can mark slots in receive rings with this flag. +For all marked slots, marked packets are forwarded to +the other endpoint at the next system call, thus restoring +(in a selective way) the connection between the NIC and the +host stack. +.It NS_NO_LEARN +tells the forwarding code that the SRC MAC address for this +packet should not be used in the learning bridge +.It NS_INDIRECT +indicates that the packet's payload is not in the netmap +supplied buffer, but in a user-supplied buffer whose +user virtual address is in the 'ptr' field of the slot. +The size can reach 65535 bytes. +.Em This is only supported on the transmit ring of virtual ports +.It NS_MOREFRAG +indicates that the packet continues with subsequent buffers; +the last buffer in a packet must have the flag clear. +The maximum length of a chain is 64 buffers. +.Em This is only supported on virtual ports +.It ns_ctr +on receive rings, contains the number of remaining buffers +in a packet, including this one. +Slots with a value greater than 1 also have NS_MOREFRAG set. +The length refers to the individual buffer, there is no +field for the total length +XXX maybe put it in the ptr field ? +.Pp +On transmit rings, if NS_DST is set, it is passed to the lookup +function, which can use it e.g. as the index of the destination +port instead of doing an address lookup. +.El .Sh IOCTLS .Nm supports some ioctl() to synchronize the state of the rings @@ -166,13 +365,13 @@ between the kernel and the user processe to query and configure the interface. 
The former do not require any argument, whereas the latter use a -.Pa struct netmap_req +.Pa struct nmreq defined as follows: .Bd -literal struct nmreq { char nr_name[IFNAMSIZ]; uint32_t nr_version; /* API version */ -#define NETMAP_API 3 /* current version */ +#define NETMAP_API 4 /* current version */ uint32_t nr_offset; /* nifp offset in the shared region */ uint32_t nr_memsize; /* size of the shared region */ uint32_t nr_tx_slots; /* slots in tx rings */ @@ -184,8 +383,14 @@ struct nmreq { #define NETMAP_SW_RING 0x2000 /* we process the sw ring */ #define NETMAP_NO_TX_POLL 0x1000 /* no gratuitous txsync on poll */ #define NETMAP_RING_MASK 0xfff /* the actual ring number */ - uint16_t spare1; - uint32_t spare2[4]; + uint16_t nr_cmd; +#define NETMAP_BDG_ATTACH 1 /* attach the NIC */ +#define NETMAP_BDG_DETACH 2 /* detach the NIC */ +#define NETMAP_BDG_LOOKUP_REG 3 /* register lookup function */ +#define NETMAP_BDG_LIST 4 /* get bridge's info */ + uint16_t nr_arg1; + uint16_t nr_arg2; + uint32_t spare2[3]; }; .Ed @@ -200,15 +405,27 @@ command codes below are defined in and are: .Bl -tag -width XXXX .It Dv NIOCGINFO -returns information about the interface named in nr_name. -On return, nr_memsize indicates the size of the shared netmap -memory region (this is device-independent), -nr_tx_slots and nr_rx_slots indicates how many buffers are in a -transmit and receive ring, -nr_tx_rings and nr_rx_rings indicates the number of transmit -and receive rings supported by the hardware. -.Pp -If the device does not support netmap, the ioctl returns EINVAL. +returns EINVAL if the named device does not support netmap. +Otherwise, it returns 0 and (advisory) information +about the interface. +Note that all the information below can change before the +interface is actually put in netmap mode. +.Pp +.Pa nr_memsize +indicates the size of the netmap +memory region. Physical devices all share the same memory region, +whereas VALE ports may have independent regions for each port. +These sizes can be set through system-wise sysctl variables. +.Pa nr_tx_slots, nr_rx_slots +indicate the size of transmit and receive rings. +.Pa nr_tx_rings, nr_rx_rings +indicate the number of transmit +and receive rings. +Both ring number and sizes may be configured at runtime +using interface-specific functions (e.g. +.Pa sysctl +or +.Pa ethtool . .It Dv NIOCREGIF puts the interface named in nr_name into netmap mode, disconnecting it from the host stack, and/or defines which rings are controlled @@ -243,8 +460,11 @@ or the send queue is full. .Pa NIOCREGIF can be used multiple times to change the association of a file descriptor to a ring pair, always within the same device. -.It Dv NIOCUNREGIF -brings an interface back to normal mode. +.Pp +When registering a virtual interface that is dynamically created to a +.Xr vale 4 +switch, we can specify the desired number of rings (1 by default, +and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields. .It Dv NIOCTXSYNC tells the hardware of new packets to transmit, and updates the number of slots available for transmission. @@ -255,10 +475,20 @@ packets. .Sh SYSTEM CALLS .Nm uses -.Nm select +.Xr select 2 and -.Nm poll -to wake up processes when significant events occur. +.Xr poll 2 +to wake up processes when significant events occur, and +.Xr mmap 2 +to map memory. +.Pp +Applications may need to create threads and bind them to +specific cores to improve performance, using standard +OS primitives, see +.Xr pthread 3 . 
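The new nr_cmd field is how the vale-ctl tool mentioned earlier attaches or detaches interfaces; a hedged sketch of an attach request is shown below (bridge and interface names are arbitrary, and details not spelled out in the text are assumptions).

#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

/* Attach physical interface "em1" to the VALE switch "vale0".  The
 * descriptor is used only to configure the switch and is closed right
 * away, as the man page text above prescribes. */
static int
vale_attach_nic(void)
{
	struct nmreq req;
	int error, fd;

	fd = open("/dev/netmap", O_RDWR);
	if (fd < 0)
		return (-1);
	memset(&req, 0, sizeof(req));
	strncpy(req.nr_name, "vale0:em1", sizeof(req.nr_name) - 1);
	req.nr_version = NETMAP_API;
	req.nr_cmd = NETMAP_BDG_ATTACH;	/* from the nmreq definition above */
	error = ioctl(fd, NIOCREGIF, &req);
	close(fd);
	return (error);
}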
+In particular, +.Xr pthread_setaffinity_np 3 +may be of use. .Sh EXAMPLES The following code implements a traffic generator .Pp @@ -272,10 +502,10 @@ struct nmreq nmr; fd = open("/dev/netmap", O_RDWR); bzero(&nmr, sizeof(nmr)); strcpy(nmr.nr_name, "ix0"); -nmr.nr_version = NETMAP_API; -ioctl(fd, NIOCREG, &nmr); +nmr.nm_version = NETMAP_API; +ioctl(fd, NIOCREGIF, &nmr); p = mmap(0, nmr.nr_memsize, fd); -nifp = NETMAP_IF(p, nmr.offset); +nifp = NETMAP_IF(p, nmr.nr_offset); ring = NETMAP_TXRING(nifp, 0); fds.fd = fd; fds.events = POLLOUT; @@ -312,13 +542,17 @@ Usenix ATC'12, June 2012, Boston .An -nosplit The .Nm -framework has been designed and implemented at the +framework has been originally designed and implemented at the Universita` di Pisa in 2011 by .An Luigi Rizzo , -with help from +and further extended with help from .An Matteo Landi , .An Gaetano Catalli , -.An Giuseppe Lettieri . +.An Giuseppe Lettieri , +.An Vincenzo Maffione . .Pp .Nm -has been funded by the European Commission within FP7 Project CHANGE (257422). +and +.Nm VALE +have been funded by the European Commission within FP7 Projects +CHANGE (257422) and OPENLAB (287581). Modified: head/sys/conf/files ============================================================================== --- head/sys/conf/files Fri Nov 1 21:17:45 2013 (r257528) +++ head/sys/conf/files Fri Nov 1 21:21:14 2013 (r257529) @@ -1881,6 +1881,8 @@ dev/nand/nfc_if.m optional nand dev/ncv/ncr53c500.c optional ncv dev/ncv/ncr53c500_pccard.c optional ncv pccard dev/netmap/netmap.c optional netmap +dev/netmap/netmap_mem2.c optional netmap +# compile-with "${NORMAL_C} -Wconversion -Wextra" dev/nge/if_nge.c optional nge dev/nxge/if_nxge.c optional nxge \ compile-with "${NORMAL_C} ${NO_WSELF_ASSIGN}" Modified: head/sys/dev/e1000/if_em.c ============================================================================== --- head/sys/dev/e1000/if_em.c Fri Nov 1 21:17:45 2013 (r257528) +++ head/sys/dev/e1000/if_em.c Fri Nov 1 21:21:14 2013 (r257529) @@ -3836,8 +3836,7 @@ em_txeof(struct tx_ring *txr) EM_TX_LOCK_ASSERT(txr); #ifdef DEV_NETMAP - if (netmap_tx_irq(ifp, txr->me | - (NETMAP_LOCKED_ENTER | NETMAP_LOCKED_EXIT))) + if (netmap_tx_irq(ifp, txr->me)) return; #endif /* DEV_NETMAP */ @@ -4101,7 +4100,7 @@ em_setup_receive_ring(struct rx_ring *rx sizeof(struct e1000_rx_desc), EM_DBA_ALIGN); bzero((void *)rxr->rx_base, rsize); #ifdef DEV_NETMAP - slot = netmap_reset(na, NR_RX, 0, 0); + slot = netmap_reset(na, NR_RX, rxr->me, 0); #endif /* @@ -4433,8 +4432,10 @@ em_rxeof(struct rx_ring *rxr, int count, EM_RX_LOCK(rxr); #ifdef DEV_NETMAP - if (netmap_rx_irq(ifp, rxr->me | NETMAP_LOCKED_ENTER, &processed)) + if (netmap_rx_irq(ifp, rxr->me, &processed)) { + EM_RX_UNLOCK(rxr); return (FALSE); + } #endif /* DEV_NETMAP */ for (i = rxr->next_to_check, processed = 0; count != 0;) { Modified: head/sys/dev/e1000/if_igb.c ============================================================================== --- head/sys/dev/e1000/if_igb.c Fri Nov 1 21:17:45 2013 (r257528) +++ head/sys/dev/e1000/if_igb.c Fri Nov 1 21:21:14 2013 (r257529) @@ -3962,8 +3962,7 @@ igb_txeof(struct tx_ring *txr) mtx_assert(&txr->tx_mtx, MA_OWNED); #ifdef DEV_NETMAP - if (netmap_tx_irq(ifp, txr->me | - (NETMAP_LOCKED_ENTER|NETMAP_LOCKED_EXIT))) + if (netmap_tx_irq(ifp, txr->me)) return (FALSE); #endif /* DEV_NETMAP */ @@ -4829,8 +4828,10 @@ igb_rxeof(struct igb_queue *que, int cou BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); #ifdef DEV_NETMAP - if (netmap_rx_irq(ifp, rxr->me | NETMAP_LOCKED_ENTER, &processed)) 
+ if (netmap_rx_irq(ifp, rxr->me, &processed)) { + IGB_RX_UNLOCK(rxr); return (FALSE); + } #endif /* DEV_NETMAP */ /* Main clean loop */ Modified: head/sys/dev/e1000/if_lem.c ============================================================================== --- head/sys/dev/e1000/if_lem.c Fri Nov 1 21:17:45 2013 (r257528) +++ head/sys/dev/e1000/if_lem.c Fri Nov 1 21:21:14 2013 (r257529) @@ -2986,7 +2986,7 @@ lem_txeof(struct adapter *adapter) EM_TX_LOCK_ASSERT(adapter); #ifdef DEV_NETMAP - if (netmap_tx_irq(ifp, 0 | (NETMAP_LOCKED_ENTER|NETMAP_LOCKED_EXIT))) + if (netmap_tx_irq(ifp, 0)) return; #endif /* DEV_NETMAP */ if (adapter->num_tx_desc_avail == adapter->num_tx_desc) @@ -3455,8 +3455,10 @@ lem_rxeof(struct adapter *adapter, int c BUS_DMASYNC_POSTREAD); #ifdef DEV_NETMAP - if (netmap_rx_irq(ifp, 0 | NETMAP_LOCKED_ENTER, &rx_sent)) + if (netmap_rx_irq(ifp, 0, &rx_sent)) { + EM_RX_UNLOCK(adapter); return (FALSE); + } #endif /* DEV_NETMAP */ if (!((current_desc->status) & E1000_RXD_STAT_DD)) { Modified: head/sys/dev/e1000/if_lem.h ============================================================================== --- head/sys/dev/e1000/if_lem.h Fri Nov 1 21:17:45 2013 (r257528) +++ head/sys/dev/e1000/if_lem.h Fri Nov 1 21:21:14 2013 (r257529) @@ -265,6 +265,13 @@ #define PICOSECS_PER_TICK 20833 #define TSYNC_PORT 319 /* UDP port for the protocol */ +#ifdef NIC_PARAVIRT +#define E1000_PARA_SUBDEV 0x1101 /* special id */ +#define E1000_CSBAL 0x02830 /* csb phys. addr. low */ +#define E1000_CSBAH 0x02834 /* csb phys. addr. hi */ +#include +#endif /* NIC_PARAVIRT */ + /* * Bus dma allocation structure used by * e1000_dma_malloc and e1000_dma_free. @@ -437,6 +444,26 @@ struct adapter { boolean_t pcix_82544; boolean_t in_detach; +#ifdef NIC_SEND_COMBINING + /* 0 = idle; 1xxxx int-pending; 3xxxx int + d pending + tdt */ +#define MIT_PENDING_INT 0x10000 /* pending interrupt */ +#define MIT_PENDING_TDT 0x30000 /* both intr and tdt write are pending */ + uint32_t shadow_tdt; + uint32_t sc_enable; +#endif /* NIC_SEND_COMBINING */ +#ifdef BATCH_DISPATCH + uint32_t batch_enable; +#endif /* BATCH_DISPATCH */ + +#ifdef NIC_PARAVIRT + struct em_dma_alloc csb_mem; /* phys address */ + struct paravirt_csb *csb; /* virtual addr */ + uint32_t rx_retries; /* optimize rx loop */ + uint32_t tdt_csb_count;// XXX stat + uint32_t tdt_reg_count;// XXX stat + uint32_t tdt_int_count;// XXX stat + uint32_t guest_need_kick_count;// XXX stat +#endif /* NIC_PARAVIRT */ struct e1000_hw_stats stats; }; Modified: head/sys/dev/ixgbe/ixgbe.c ============================================================================== --- head/sys/dev/ixgbe/ixgbe.c Fri Nov 1 21:17:45 2013 (r257528) +++ head/sys/dev/ixgbe/ixgbe.c Fri Nov 1 21:21:14 2013 (r257529) @@ -3621,16 +3621,11 @@ ixgbe_txeof(struct tx_ring *txr) * means the user thread should not be woken up); * - the driver ignores tx interrupts unless netmap_mitigate=0 * or the slot has the DD bit set. - * - * When the driver has separate locks, we need to - * release and re-acquire txlock to avoid deadlocks. - * XXX see if we can find a better way. */ if (!netmap_mitigate || (kring->nr_kflags < kring->nkr_num_slots && txd[kring->nr_kflags].wb.status & IXGBE_TXD_STAT_DD)) { - netmap_tx_irq(ifp, txr->me | - (NETMAP_LOCKED_ENTER|NETMAP_LOCKED_EXIT)); + netmap_tx_irq(ifp, txr->me); } return; } @@ -4422,8 +4417,10 @@ ixgbe_rxeof(struct ix_queue *que) #ifdef DEV_NETMAP /* Same as the txeof routine: wakeup clients on intr. 
*/ - if (netmap_rx_irq(ifp, rxr->me | NETMAP_LOCKED_ENTER, &processed)) + if (netmap_rx_irq(ifp, rxr->me, &processed)) { + IXGBE_RX_UNLOCK(rxr); return (FALSE); + } #endif /* DEV_NETMAP */ for (i = rxr->next_to_check; count != 0;) { Modified: head/sys/dev/netmap/if_em_netmap.h ============================================================================== --- head/sys/dev/netmap/if_em_netmap.h Fri Nov 1 21:17:45 2013 (r257528) +++ head/sys/dev/netmap/if_em_netmap.h Fri Nov 1 21:21:14 2013 (r257529) @@ -43,35 +43,6 @@ static void em_netmap_block_tasks(struct static void em_netmap_unblock_tasks(struct adapter *); -static void -em_netmap_lock_wrapper(struct ifnet *ifp, int what, u_int queueid) -{ - struct adapter *adapter = ifp->if_softc; - - ASSERT(queueid < adapter->num_queues); - switch (what) { - case NETMAP_CORE_LOCK: - EM_CORE_LOCK(adapter); - break; - case NETMAP_CORE_UNLOCK: - EM_CORE_UNLOCK(adapter); - break; - case NETMAP_TX_LOCK: - EM_TX_LOCK(&adapter->tx_rings[queueid]); - break; - case NETMAP_TX_UNLOCK: - EM_TX_UNLOCK(&adapter->tx_rings[queueid]); - break; - case NETMAP_RX_LOCK: - EM_RX_LOCK(&adapter->rx_rings[queueid]); - break; - case NETMAP_RX_UNLOCK: - EM_RX_UNLOCK(&adapter->rx_rings[queueid]); - break; - } -} - - // XXX do we need to block/unblock the tasks ? static void em_netmap_block_tasks(struct adapter *adapter) @@ -137,7 +108,7 @@ em_netmap_reg(struct ifnet *ifp, int ono ifp->if_capenable |= IFCAP_NETMAP; na->if_transmit = ifp->if_transmit; - ifp->if_transmit = netmap_start; + ifp->if_transmit = netmap_transmit; em_init_locked(adapter); if ((ifp->if_drv_flags & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) == 0) { @@ -160,7 +131,7 @@ fail: * Reconcile kernel and user view of the transmit ring. */ static int -em_netmap_txsync(struct ifnet *ifp, u_int ring_nr, int do_lock) +em_netmap_txsync(struct ifnet *ifp, u_int ring_nr, int flags) { struct adapter *adapter = ifp->if_softc; struct tx_ring *txr = &adapter->tx_rings[ring_nr]; @@ -176,8 +147,6 @@ em_netmap_txsync(struct ifnet *ifp, u_in if (k > lim) return netmap_ring_reinit(kring); - if (do_lock) - EM_TX_LOCK(txr); bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map, BUS_DMASYNC_POSTREAD); @@ -202,8 +171,6 @@ em_netmap_txsync(struct ifnet *ifp, u_in u_int len = slot->len; if (addr == netmap_buffer_base || len > NETMAP_BUF_SIZE) { - if (do_lock) - EM_TX_UNLOCK(txr); return netmap_ring_reinit(kring); } @@ -252,8 +219,6 @@ em_netmap_txsync(struct ifnet *ifp, u_in /* update avail to what the kernel knows */ ring->avail = kring->nr_hwavail; - if (do_lock) - EM_TX_UNLOCK(txr); return 0; } @@ -262,7 +227,7 @@ em_netmap_txsync(struct ifnet *ifp, u_in * Reconcile kernel and user view of the receive ring. 
*/ static int -em_netmap_rxsync(struct ifnet *ifp, u_int ring_nr, int do_lock) +em_netmap_rxsync(struct ifnet *ifp, u_int ring_nr, int flags) { struct adapter *adapter = ifp->if_softc; struct rx_ring *rxr = &adapter->rx_rings[ring_nr]; @@ -270,16 +235,13 @@ em_netmap_rxsync(struct ifnet *ifp, u_in struct netmap_kring *kring = &na->rx_rings[ring_nr]; struct netmap_ring *ring = kring->ring; u_int j, l, n, lim = kring->nkr_num_slots - 1; - int force_update = do_lock || kring->nr_kflags & NKR_PENDINTR; + int force_update = (flags & NAF_FORCE_READ) || kring->nr_kflags & NKR_PENDINTR; u_int k = ring->cur, resvd = ring->reserved; k = ring->cur; if (k > lim) return netmap_ring_reinit(kring); - if (do_lock) - EM_RX_LOCK(rxr); - /* XXX check sync modes */ bus_dmamap_sync(rxr->rxdma.dma_tag, rxr->rxdma.dma_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); @@ -334,8 +296,6 @@ em_netmap_rxsync(struct ifnet *ifp, u_in void *addr = PNMB(slot, &paddr); if (addr == netmap_buffer_base) { /* bad buf */ - if (do_lock) - EM_RX_UNLOCK(rxr); return netmap_ring_reinit(kring); } @@ -364,8 +324,6 @@ em_netmap_rxsync(struct ifnet *ifp, u_in } /* tell userspace that there are new packets */ ring->avail = kring->nr_hwavail - resvd; - if (do_lock) - EM_RX_UNLOCK(rxr); return 0; } @@ -378,12 +336,11 @@ em_netmap_attach(struct adapter *adapter bzero(&na, sizeof(na)); na.ifp = adapter->ifp; - na.separate_locks = 1; + na.na_flags = NAF_BDG_MAYSLEEP; na.num_tx_desc = adapter->num_tx_desc; na.num_rx_desc = adapter->num_rx_desc; na.nm_txsync = em_netmap_txsync; na.nm_rxsync = em_netmap_rxsync; - na.nm_lock = em_netmap_lock_wrapper; na.nm_register = em_netmap_reg; netmap_attach(&na, adapter->num_queues); } Modified: head/sys/dev/netmap/if_igb_netmap.h ============================================================================== --- head/sys/dev/netmap/if_igb_netmap.h Fri Nov 1 21:17:45 2013 (r257528) +++ head/sys/dev/netmap/if_igb_netmap.h Fri Nov 1 21:21:14 2013 (r257529) @@ -39,38 +39,6 @@ /* - * wrapper to export locks to the generic code - */ -static void -igb_netmap_lock_wrapper(struct ifnet *ifp, int what, u_int queueid) -{ - struct adapter *adapter = ifp->if_softc; - - ASSERT(queueid < adapter->num_queues); - switch (what) { - case NETMAP_CORE_LOCK: - IGB_CORE_LOCK(adapter); - break; - case NETMAP_CORE_UNLOCK: - IGB_CORE_UNLOCK(adapter); - break; - case NETMAP_TX_LOCK: - IGB_TX_LOCK(&adapter->tx_rings[queueid]); - break; - case NETMAP_TX_UNLOCK: - IGB_TX_UNLOCK(&adapter->tx_rings[queueid]); - break; - case NETMAP_RX_LOCK: - IGB_RX_LOCK(&adapter->rx_rings[queueid]); - break; - case NETMAP_RX_UNLOCK: - IGB_RX_UNLOCK(&adapter->rx_rings[queueid]); - break; - } -} - - -/* * register-unregister routine */ static int @@ -92,7 +60,7 @@ igb_netmap_reg(struct ifnet *ifp, int on ifp->if_capenable |= IFCAP_NETMAP; na->if_transmit = ifp->if_transmit; - ifp->if_transmit = netmap_start; + ifp->if_transmit = netmap_transmit; igb_init_locked(adapter); if ((ifp->if_drv_flags & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) == 0) { @@ -114,7 +82,7 @@ fail: * Reconcile kernel and user view of the transmit ring. 
*/ static int -igb_netmap_txsync(struct ifnet *ifp, u_int ring_nr, int do_lock) +igb_netmap_txsync(struct ifnet *ifp, u_int ring_nr, int flags) { struct adapter *adapter = ifp->if_softc; struct tx_ring *txr = &adapter->tx_rings[ring_nr]; @@ -130,8 +98,6 @@ igb_netmap_txsync(struct ifnet *ifp, u_i if (k > lim) return netmap_ring_reinit(kring); - if (do_lock) - IGB_TX_LOCK(txr); bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map, BUS_DMASYNC_POSTREAD); @@ -153,6 +119,13 @@ igb_netmap_txsync(struct ifnet *ifp, u_i /* curr is the current slot in the nic ring */ union e1000_adv_tx_desc *curr = (union e1000_adv_tx_desc *)&txr->tx_base[l]; +#ifndef IGB_MEDIA_RESET +/* at the same time as IGB_MEDIA_RESET was defined, the + * tx buffer descriptor was renamed, so use this to revert + * back to the old name. + */ +#define igb_tx_buf igb_tx_buffer +#endif struct igb_tx_buf *txbuf = &txr->tx_buffers[l]; int flags = ((slot->flags & NS_REPORT) || j == 0 || j == report_frequency) ? @@ -162,8 +135,6 @@ igb_netmap_txsync(struct ifnet *ifp, u_i u_int len = slot->len; if (addr == netmap_buffer_base || len > NETMAP_BUF_SIZE) { - if (do_lock) - IGB_TX_UNLOCK(txr); return netmap_ring_reinit(kring); } @@ -223,8 +194,6 @@ igb_netmap_txsync(struct ifnet *ifp, u_i /* update avail to what the kernel knows */ ring->avail = kring->nr_hwavail; - if (do_lock) - IGB_TX_UNLOCK(txr); return 0; } @@ -233,7 +202,7 @@ igb_netmap_txsync(struct ifnet *ifp, u_i * Reconcile kernel and user view of the receive ring. */ static int -igb_netmap_rxsync(struct ifnet *ifp, u_int ring_nr, int do_lock) +igb_netmap_rxsync(struct ifnet *ifp, u_int ring_nr, int flags) { struct adapter *adapter = ifp->if_softc; struct rx_ring *rxr = &adapter->rx_rings[ring_nr]; @@ -241,16 +210,13 @@ igb_netmap_rxsync(struct ifnet *ifp, u_i struct netmap_kring *kring = &na->rx_rings[ring_nr]; struct netmap_ring *ring = kring->ring; u_int j, l, n, lim = kring->nkr_num_slots - 1; - int force_update = do_lock || kring->nr_kflags & NKR_PENDINTR; + int force_update = (flags & NAF_FORCE_READ) || kring->nr_kflags & NKR_PENDINTR; u_int k = ring->cur, resvd = ring->reserved; k = ring->cur; if (k > lim) return netmap_ring_reinit(kring); - if (do_lock) - IGB_RX_LOCK(rxr); - /* XXX check sync modes */ bus_dmamap_sync(rxr->rxdma.dma_tag, rxr->rxdma.dma_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); @@ -303,8 +269,6 @@ igb_netmap_rxsync(struct ifnet *ifp, u_i void *addr = PNMB(slot, &paddr); if (addr == netmap_buffer_base) { /* bad buf */ - if (do_lock) - IGB_RX_UNLOCK(rxr); return netmap_ring_reinit(kring); } @@ -332,8 +296,6 @@ igb_netmap_rxsync(struct ifnet *ifp, u_i } /* tell userspace that there are new packets */ ring->avail = kring->nr_hwavail - resvd; *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
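For completeness, a receive-side counterpart to the traffic generator shown in the EXAMPLES section above, kept in the same minimal style (interface name arbitrary, error handling omitted as in the original, not part of the commit).

#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

/* Count incoming packets on every hardware rx ring of em0. */
int
main(void)
{
	struct nmreq nmr;
	struct netmap_if *nifp;
	struct pollfd fds;
	void *p;
	uint64_t pkts = 0;
	uint32_t ri;
	int fd;

	fd = open("/dev/netmap", O_RDWR);
	memset(&nmr, 0, sizeof(nmr));
	strcpy(nmr.nr_name, "em0");
	nmr.nr_version = NETMAP_API;
	ioctl(fd, NIOCREGIF, &nmr);
	p = mmap(NULL, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	nifp = NETMAP_IF(p, nmr.nr_offset);
	fds.fd = fd;
	fds.events = POLLIN;
	for (;;) {
		poll(&fds, 1, -1);	/* wait for packets on any ring */
		for (ri = 0; ri < nifp->ni_rx_rings; ri++) {
			struct netmap_ring *ring = NETMAP_RXRING(nifp, ri);

			while (ring->avail > 0) {	/* release slots back */
				pkts++;
				ring->cur = (ring->cur + 1 == ring->num_slots) ?
				    0 : ring->cur + 1;
				ring->avail--;
			}
		}
	}
	/* not reached */
	close(fd);
	return (0);
}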