From owner-svn-src-all@FreeBSD.ORG Tue Oct 19 20:53:31 2010 Return-Path: Delivered-To: svn-src-all@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 403FB10656BD; Tue, 19 Oct 2010 20:53:31 +0000 (UTC) (envelope-from gibbs@FreeBSD.org) Received: from svn.freebsd.org (svn.freebsd.org [IPv6:2001:4f8:fff6::2c]) by mx1.freebsd.org (Postfix) with ESMTP id 2A9BD8FC08; Tue, 19 Oct 2010 20:53:31 +0000 (UTC) Received: from svn.freebsd.org (localhost [127.0.0.1]) by svn.freebsd.org (8.14.3/8.14.3) with ESMTP id o9JKrVgw026734; Tue, 19 Oct 2010 20:53:31 GMT (envelope-from gibbs@svn.freebsd.org) Received: (from gibbs@localhost) by svn.freebsd.org (8.14.3/8.14.3/Submit) id o9JKrV0u026729; Tue, 19 Oct 2010 20:53:31 GMT (envelope-from gibbs@svn.freebsd.org) Message-Id: <201010192053.o9JKrV0u026729@svn.freebsd.org> From: "Justin T. Gibbs" Date: Tue, 19 Oct 2010 20:53:30 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org X-SVN-Group: head MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Cc: Subject: svn commit: r214077 - in head/sys: conf dev/xen/balloon dev/xen/blkback dev/xen/blkfront dev/xen/control dev/xen/netfront dev/xen/xenpci i386/xen xen xen/evtchn xen/interface xen/interface/hvm xen/... X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 19 Oct 2010 20:53:31 -0000 Author: gibbs Date: Tue Oct 19 20:53:30 2010 New Revision: 214077 URL: http://svn.freebsd.org/changeset/base/214077 Log: Improve the Xen para-virtualized device infrastructure of FreeBSD: o Add support for backend devices (e.g. 
blkback) o Implement extensions to the Xen para-virtualized block API to allow for larger and more outstanding I/Os. o Import a completely rewritten block back driver with support for fronting I/O to both raw devices and files. o General cleanup and documentation of the XenBus and XenStore support code. o Robustness and performance updates for the block front driver. o Fixes to the netfront driver. Sponsored by: Spectra Logic Corporation sys/xen/xenbus/init.txt: Deleted: This file explains the Linux method for XenBus device enumeration and thus does not apply to FreeBSD's NewBus approach. sys/xen/xenbus/xenbus_probe_backend.c: Deleted: Linux version of backend XenBus service routines. It was never ported to FreeBSD. See xenbusb.c, xenbusb_if.m, xenbusb_front.c xenbusb_back.c for details of FreeBSD's XenBus support. sys/xen/xenbus/xenbusvar.h: sys/xen/xenbus/xenbus_xs.c: sys/xen/xenbus/xenbus_comms.c: sys/xen/xenbus/xenbus_comms.h: sys/xen/xenstore/xenstorevar.h: sys/xen/xenstore/xenstore.c: Split XenStore into its own tree. XenBus is a software layer built on top of XenStore. The old arrangement and the naming of some structures and functions blurred these lines making it difficult to discern what services are provided by which layer and at what times these services are available (e.g. during system startup and shutdown). sys/xen/xenbus/xenbus_client.c: sys/xen/xenbus/xenbus.c: sys/xen/xenbus/xenbus_probe.c: sys/xen/xenbus/xenbusb.c: sys/xen/xenbus/xenbusb.h: Split up XenBus code into methods available for use by client drivers (xenbus.c) and code used by the XenBus "bus code" to enumerate, attach, detach, and service bus drivers. sys/xen/reboot.c: sys/dev/xen/control/control.c: Add a XenBus front driver for handling shutdown, reboot, suspend, and resume events published in the XenStore. Move all PV suspend/reboot support from reboot.c into this driver. 
sys/xen/blkif.h: New file from Xen vendor with macros and structures used by a block back driver to service requests from a VM running a different ABI (e.g. amd64 back with i386 front). sys/conf/files: Adjust kernel build spec for new XenBus/XenStore layout and added Xen functionality. sys/dev/xen/balloon/balloon.c: sys/dev/xen/netfront/netfront.c: sys/dev/xen/blkfront/blkfront.c: sys/xen/xenbus/... sys/xen/xenstore/... o Rename XenStore APIs and structures from xenbus_* to xs_*. o Adjust to use of M_XENBUS and M_XENSTORE malloc types for allocation of objects returned by these APIs. o Adjust for changes in the bus interface for Xen drivers. sys/xen/xenbus/... sys/xen/xenstore/... Add Doxygen comments for these interfaces and the code that implements them. sys/dev/xen/blkback/blkback.c: o Rewrite the Block Back driver to attach properly via newbus, operate correctly in both PV and HVM mode regardless of domain (e.g. can be in a DOM other than 0), and to deal with the latest metadata available in XenStore for block devices. o Allow users to specify a file as a backend to blkback, in addition to character devices. Use the namei lookup of the backend path to automatically configure, based on file type, the appropriate backend method. The current implementation is limited to a single outstanding I/O at a time to file backed storage. sys/dev/xen/blkback/blkback.c: sys/xen/interface/io/blkif.h: sys/xen/blkif.h: sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/blkfront/block.h: Extend the Xen blkif API: Negotiable request size and number of requests. This change extends the information recorded in the XenStore allowing block front/back devices to negotiate for optimal I/O parameters. This has been achieved without sacrificing backward compatibility with drivers that are unaware of these protocol enhancements. 
The extensions center around the connection protocol which now includes these additions: o The back-end device publishes its maximum supported values for request I/O size, the number of page segments that can be associated with a request, the maximum number of requests that can be concurrently active, and the maximum number of pages that can be in the shared request ring. These values are published before the back-end enters the XenbusStateInitWait state. o The front-end waits for the back-end to enter either the InitWait or Initialize state. At this point, the front end limits its own capabilities to the lesser of the values it finds published by the backend, its own maximums, or, should any back-end data be missing in the store, the values supported by the original protocol. It then initializes its internal data structures including allocation of the shared ring, publishes its maximum capabilities to the XenStore and transitions to the Initialized state. o The back-end waits for the front-end to enter the Initialized state. At this point, the back end limits its own capabilities to the lesser of the values it finds published by the frontend, its own maximums, or, should any front-end data be missing in the store, the values supported by the original protocol. It then initializes its internal data structures, attaches to the shared ring and transitions to the Connected state. o The front-end waits for the back-end to enter the Connected state, transitions itself to the connected state, and can commence I/O. Although an updated front-end driver must be aware of the back-end's InitWait state, the back-end has been coded such that it can tolerate a front-end that skips this step and transitions directly to the Initialized state without waiting for the back-end. sys/xen/interface/io/blkif.h: o Increase BLKIF_MAX_SEGMENTS_PER_REQUEST to 255. This is the maximum number possible without changing the blkif request header structure (nr_segs is a uint8_t). 
o Add two new constants: BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK, and BLKIF_MAX_SEGMENTS_PER_SEGMENT_BLOCK. These respectively indicate the number of segments that can fit in the first ring-buffer entry of a request, and for each subsequent (sg element only) ring-buffer entry associated with the "header" ring-buffer entry of the request. o Add the blkif_request_segment_t typedef for segment elements. o Add the BLKRING_GET_SG_REQUEST() macro which wraps the RING_GET_REQUEST() macro and returns a properly cast pointer to an array of blkif_request_segment_ts. o Add the BLKIF_SEGS_TO_BLOCKS() macro which calculates the number of ring entries that will be consumed by a blkif request with the given number of segments. sys/xen/blkif.h: o Update for changes in interface/io/blkif.h macros. o Update the BLKIF_MAX_RING_REQUESTS() macro to take the ring size as an argument to allow this calculation on multi-page rings. o Add a companion macro to BLKIF_MAX_RING_REQUESTS(), BLKIF_RING_PAGES(). This macro determines the number of ring pages required in order to support a ring with the supplied number of request blocks. sys/dev/xen/blkback/blkback.c: sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/blkfront/block.h: o Negotiate with the other-end with the following limits: Request Size: MAXPHYS Max Segments: (MAXPHYS/PAGE_SIZE) + 1 Max Requests: 256 Max Ring Pages: Sufficient to support Max Requests with Max Segments. o Dynamically allocate request pools and segments-per-request. o Update ring allocation/attachment code to support a multi-page shared ring. o Update routines that access the shared ring to handle multi-block requests. sys/dev/xen/blkfront/blkfront.c: o Track blkfront allocations in a blkfront driver specific malloc pool. o Strip out XenStore transaction retry logic in the connection code. Transactions only need to be used when the update to multiple XenStore nodes must be atomic. That is not the case here. 
o Fully disable blkif_resume() until it can be fixed properly (it didn't work before this change). o Destroy bus-dma objects during device instance tear-down. o Properly handle backend devices with power-of-2 sector sizes larger than 512b. sys/dev/xen/blkback/blkback.c: Advertise support for and implement the BLKIF_OP_WRITE_BARRIER and BLKIF_OP_FLUSH_DISKCACHE blkif opcodes using BIO_FLUSH and the BIO_ORDERED attribute of bios. sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/blkfront/block.h: Fix various bugs in blkfront. o gnttab_alloc_grant_references() returns 0 for success and non-zero for failure. The check for < 0 is a leftover Linuxism. o When we negotiate with blkback and have to reduce some of our capabilities, print out the original and reduced capability before changing the local capability. So the user now gets the correct information. o Fix blkif_restart_queue_callback() formatting. Make sure we hold the mutex in that function before calling xb_startio(). o Fix a couple of KASSERT()s. o Fix a check in the xb_remove_* macro to be a little more specific. sys/xen/gnttab.h: sys/xen/gnttab.c: Define GNTTAB_LIST_END publicly as GRANT_REF_INVALID. sys/dev/xen/netfront/netfront.c: Use GRANT_REF_INVALID instead of driver private definitions of the same constant. sys/xen/gnttab.h: sys/xen/gnttab.c: Add the gnttab_end_foreign_access_references() API. This API allows a client to batch the release of an array of grant references, instead of coding a private for loop. The implementation takes advantage of this batching to reduce lock overhead to one acquisition and release per-batch instead of per-freed grant reference. While here, reduce the duration the gnttab_list_lock is held during gnttab_free_grant_references() operations. The search to find the tail of the incoming free list does not rely on global state and so can be performed without holding the lock. 
sys/dev/xen/xenpci/evtchn.c: sys/dev/xen/evtchn/evtchn.c: sys/xen/xen_intr.h: o Implement the bind_interdomain_evtchn_to_irqhandler API for HVM mode. This allows an HVM domain to serve back end devices to other domains. This API is already implemented for PV mode. o Synchronize the API between HVM and PV. sys/dev/xen/xenpci/xenpci.c: o Scan the full region of CPUID space in which the Xen VMM interface may be implemented. On systems using SuSE as a Dom0 where the Viridian API is also exported, the VMM interface is above the region we used to search. o Pass through bus_alloc_resource() calls so that XenBus drivers attaching on an HVM system can allocate unused physical address space from the nexus. The block back driver makes use of this facility. sys/i386/xen/xen_machdep.c: Use the correct type for accessing the statically mapped xenstore metadata. sys/xen/interface/hvm/params.h: sys/xen/xenstore/xenstore.c: Move hvm_get_parameter() to the correct global header file instead of as a private method to the XenStore. sys/xen/interface/io/protocols.h: Sync with vendor. sys/xen/interface/io/ring.h: Add macro for calculating the number of ring pages needed for an N deep ring. To avoid duplication within the macros, create and use the new __RING_HEADER_SIZE() macro. This macro calculates the size of the ring bookkeeping struct (producer/consumer indexes, etc.) that resides at the head of the ring. Add the __RING_PAGES() macro which calculates the number of shared ring pages required to support a ring with the given number of requests. These APIs are used to support the multi-page ring version of the Xen block API. sys/xen/interface/io/xenbus.h: Add comments. sys/xen/xenbus/... o Refactor the FreeBSD XenBus support code to allow for both front and backend device attachments. o Make use of new config_intr_hook capabilities to allow front and back devices to be probed/attached in parallel. 
o Fix bugs in probe/attach state machine that could cause the system to hang when confronted with a failure either in the local domain or in a remote domain to which one of our driver instances is attaching. o Publish all required state to the XenStore on device detach and failure. The majority of the missing functionality was for serving as a back end since the typical "hot-plug" scripts in Dom0 don't handle the case of cleaning up for a "service domain" that is not itself. o Add dynamic sysctl nodes exposing the generic ivars of XenBus devices. o Add doxygen style comments to the majority of the code. o Cleanup types, formatting, etc. sys/xen/xenbus/xenbusb.c: Common code used by both front and back XenBus busses. sys/xen/xenbus/xenbusb_if.m: Method definitions for a XenBus bus. sys/xen/xenbus/xenbusb_front.c: sys/xen/xenbus/xenbusb_back.c: XenBus bus specialization for front and back devices. MFC after: 1 month Added: head/sys/dev/xen/control/ head/sys/dev/xen/control/control.c (contents, props changed) head/sys/xen/blkif.h (contents, props changed) head/sys/xen/xenbus/xenbus.c - copied, changed from r212768, head/sys/xen/xenbus/xenbus_client.c head/sys/xen/xenbus/xenbusb.c - copied, changed from r212768, head/sys/xen/xenbus/xenbus_probe.c head/sys/xen/xenbus/xenbusb.h - copied, changed from r212768, head/sys/xen/xenbus/xenbus_probe.c head/sys/xen/xenbus/xenbusb_back.c (contents, props changed) head/sys/xen/xenbus/xenbusb_front.c (contents, props changed) head/sys/xen/xenbus/xenbusb_if.m (contents, props changed) head/sys/xen/xenstore/ head/sys/xen/xenstore/xenstore.c - copied, changed from r212768, head/sys/xen/xenbus/xenbus_xs.c head/sys/xen/xenstore/xenstore_dev.c - copied, changed from r212768, head/sys/xen/xenbus/xenbus_dev.c head/sys/xen/xenstore/xenstore_internal.h (contents, props changed) head/sys/xen/xenstore/xenstorevar.h - copied, changed from r212768, head/sys/xen/xenbus/xenbusvar.h Deleted: head/sys/xen/reboot.c head/sys/xen/xenbus/init.txt 
head/sys/xen/xenbus/xenbus_client.c head/sys/xen/xenbus/xenbus_comms.c head/sys/xen/xenbus/xenbus_comms.h head/sys/xen/xenbus/xenbus_dev.c head/sys/xen/xenbus/xenbus_probe.c head/sys/xen/xenbus/xenbus_probe_backend.c head/sys/xen/xenbus/xenbus_xs.c Modified: head/sys/conf/files head/sys/dev/xen/balloon/balloon.c head/sys/dev/xen/blkback/blkback.c head/sys/dev/xen/blkfront/blkfront.c head/sys/dev/xen/blkfront/block.h head/sys/dev/xen/netfront/netfront.c head/sys/dev/xen/xenpci/evtchn.c head/sys/dev/xen/xenpci/xenpci.c head/sys/i386/xen/xen_machdep.c head/sys/xen/evtchn/evtchn.c head/sys/xen/gnttab.c head/sys/xen/gnttab.h head/sys/xen/interface/grant_table.h head/sys/xen/interface/hvm/params.h head/sys/xen/interface/io/blkif.h head/sys/xen/interface/io/protocols.h head/sys/xen/interface/io/ring.h head/sys/xen/interface/io/xenbus.h head/sys/xen/xen_intr.h head/sys/xen/xenbus/xenbus_if.m head/sys/xen/xenbus/xenbusvar.h Modified: head/sys/conf/files ============================================================================== --- head/sys/conf/files Tue Oct 19 20:38:21 2010 (r214076) +++ head/sys/conf/files Tue Oct 19 20:53:30 2010 (r214077) @@ -3008,19 +3008,20 @@ xen/gnttab.c optional xen | xenhvm xen/features.c optional xen | xenhvm xen/evtchn/evtchn.c optional xen xen/evtchn/evtchn_dev.c optional xen | xenhvm -xen/reboot.c optional xen -xen/xenbus/xenbus_client.c optional xen | xenhvm -xen/xenbus/xenbus_comms.c optional xen | xenhvm -xen/xenbus/xenbus_dev.c optional xen | xenhvm xen/xenbus/xenbus_if.m optional xen | xenhvm -xen/xenbus/xenbus_probe.c optional xen | xenhvm -#xen/xenbus/xenbus_probe_backend.c optional xen -xen/xenbus/xenbus_xs.c optional xen | xenhvm +xen/xenbus/xenbus.c optional xen | xenhvm +xen/xenbus/xenbusb_if.m optional xen | xenhvm +xen/xenbus/xenbusb.c optional xen | xenhvm +xen/xenbus/xenbusb_front.c optional xen | xenhvm +xen/xenbus/xenbusb_back.c optional xen | xenhvm +xen/xenstore/xenstore.c optional xen | xenhvm 
+xen/xenstore/xenstore_dev.c optional xen | xenhvm dev/xen/balloon/balloon.c optional xen | xenhvm +dev/xen/blkfront/blkfront.c optional xen | xenhvm +dev/xen/blkback/blkback.c optional xen | xenhvm dev/xen/console/console.c optional xen dev/xen/console/xencons_ring.c optional xen -dev/xen/blkfront/blkfront.c optional xen | xenhvm +dev/xen/control/control.c optional xen | xenhvm dev/xen/netfront/netfront.c optional xen | xenhvm dev/xen/xenpci/xenpci.c optional xenpci dev/xen/xenpci/evtchn.c optional xenpci -dev/xen/xenpci/machine_reboot.c optional xenpci Modified: head/sys/dev/xen/balloon/balloon.c ============================================================================== --- head/sys/dev/xen/balloon/balloon.c Tue Oct 19 20:38:21 2010 (r214076) +++ head/sys/dev/xen/balloon/balloon.c Tue Oct 19 20:53:30 2010 (r214077) @@ -44,7 +44,7 @@ __FBSDID("$FreeBSD$"); #include #include #include -#include +#include #include #include @@ -406,20 +406,20 @@ set_new_target(unsigned long target) wakeup(balloon_process); } -static struct xenbus_watch target_watch = +static struct xs_watch target_watch = { .node = "memory/target" }; /* React to a change in the target key */ static void -watch_target(struct xenbus_watch *watch, +watch_target(struct xs_watch *watch, const char **vec, unsigned int len) { unsigned long long new_target; int err; - err = xenbus_scanf(XBT_NIL, "memory", "target", NULL, + err = xs_scanf(XST_NIL, "memory", "target", NULL, "%llu", &new_target); if (err) { /* This is ok (for domain0 at least) - so just return */ @@ -438,7 +438,7 @@ balloon_init_watcher(void *arg) { int err; - err = register_xenbus_watch(&target_watch); + err = xs_register_watch(&target_watch); if (err) printf("Failed to set balloon watcher\n"); Modified: head/sys/dev/xen/blkback/blkback.c ============================================================================== --- head/sys/dev/xen/blkback/blkback.c Tue Oct 19 20:38:21 2010 (r214076) +++ head/sys/dev/xen/blkback/blkback.c Tue Oct 19 
20:53:30 2010 (r214077) @@ -1,1055 +1,1919 @@ -/* - * Copyright (c) 2006, Cisco Systems, Inc. +/*- + * Copyright (c) 2009-2010 Spectra Logic Corporation * All rights reserved. * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions, and the following disclaimer, + * without modification. + * 2. Redistributions in binary form must reproduce at minimum a disclaimer + * substantially similar to the "NO WARRANTY" disclaimer below + * ("Disclaimer") and any redistribution must be conditioned upon + * including a substantially similar Disclaimer requirement for further + * binary redistribution. * - * 1. Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * 3. Neither the name of Cisco Systems, Inc. nor the names of its contributors - * may be used to endorse or promote products derived from this software - * without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" - * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE - * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE - * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR - * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF - * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS - * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN - * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) - * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE - * POSSIBILITY OF SUCH DAMAGE. + * NO WARRANTY + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING + * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGES. + * + * Authors: Justin T. Gibbs (Spectra Logic Corporation) + * Ken Merry (Spectra Logic Corporation) */ - #include __FBSDID("$FreeBSD$"); +/** + * \file blkback.c + * + * \brief Device driver supporting the vending of block storage from + * a FreeBSD domain to other domains. 
+ */ + #include #include -#include -#include #include -#include -#include -#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include #include #include -#include +#include +#include +#include #include -#include -#include -#include - -#include -#include -#include +#include #include +#include +#include + +#include #include #include -#include -#include -#include -#include -#include -#include -#include -#include -#include +#include +#include +#include +#include + +#include +#include + +#include + +/*--------------------------- Compile-time Tunables --------------------------*/ +/** + * The maximum number of outstanding request blocks (request headers plus + * additional segment blocks) we will allow in a negotiated block-front/back + * communication channel. + */ +#define XBB_MAX_REQUESTS 256 + +/** + * \brief Define to force all I/O to be performed on memory owned by the + * backend device, with a copy-in/out to the remote domain's memory. + * + * \note This option is currently required when this driver's domain is + * operating in HVM mode on a system using an IOMMU. + * + * This driver uses Xen's grant table API to gain access to the memory of + * the remote domains it serves. When our domain is operating in PV mode, + * the grant table mechanism directly updates our domain's page table entries + * to point to the physical pages of the remote domain. This scheme guarantees + * that blkback and the backing devices it uses can safely perform DMA + * operations to satisfy requests. In HVM mode, Xen may use a HW IOMMU to + * insure that our domain cannot DMA to pages owned by another domain. As + * of Xen 4.0, IOMMU mappings for HVM guests are not updated via the grant + * table API. For this reason, in HVM mode, we must bounce all requests into + * memory that is mapped into our domain at domain startup and thus has + * valid IOMMU mappings. 
+ */ +#define XBB_USE_BOUNCE_BUFFERS + +/** + * \brief Define to enable rudimentary request logging to the console. + */ +#undef XBB_DEBUG +/*---------------------------------- Macros ----------------------------------*/ +/** + * Custom malloc type for all driver allocations. + */ +MALLOC_DEFINE(M_XENBLOCKBACK, "xbbd", "Xen Block Back Driver Data"); -#if XEN_BLKBACK_DEBUG +#ifdef XBB_DEBUG #define DPRINTF(fmt, args...) \ - printf("blkback (%s:%d): " fmt, __FUNCTION__, __LINE__, ##args) + printf("xbb(%s:%d): " fmt, __FUNCTION__, __LINE__, ##args) #else -#define DPRINTF(fmt, args...) ((void)0) +#define DPRINTF(fmt, args...) do {} while(0) #endif -#define WPRINTF(fmt, args...) \ - printf("blkback (%s:%d): " fmt, __FUNCTION__, __LINE__, ##args) +/** + * The maximum mapped region size per request we will allow in a negotiated + * block-front/back communication channel. + */ +#define XBB_MAX_REQUEST_SIZE \ + MIN(MAXPHYS, BLKIF_MAX_SEGMENTS_PER_REQUEST * PAGE_SIZE) -#define BLKBACK_INVALID_HANDLE (~0) +/** + * The maximum number of segments (within a request header and accompanying + * segment blocks) per request we will allow in a negotiated block-front/back + * communication channel. + */ +#define XBB_MAX_SEGMENTS_PER_REQUEST \ + (MIN(UIO_MAXIOV, \ + MIN(BLKIF_MAX_SEGMENTS_PER_REQUEST, \ + (XBB_MAX_REQUEST_SIZE / PAGE_SIZE) + 1))) + +/** + * The maximum number of shared memory ring pages we will allow in a + * negotiated block-front/back communication channel. Allow enough + * ring space for all requests to be XBB_MAX_REQUEST_SIZE'd. + */ +#define XBB_MAX_RING_PAGES \ + BLKIF_RING_PAGES(BLKIF_SEGS_TO_BLOCKS(XBB_MAX_SEGMENTS_PER_REQUEST) \ + * XBB_MAX_REQUESTS) + +/*--------------------------- Forward Declarations ---------------------------*/ +struct xbb_softc; + +static void xbb_attach_failed(struct xbb_softc *xbb, int err, const char *fmt, + ...) 
__attribute__((format(printf, 3, 4))); +static int xbb_shutdown(struct xbb_softc *xbb); +static int xbb_detach(device_t dev); + +/*------------------------------ Data Structures -----------------------------*/ +/** + * \brief Object tracking an in-flight I/O from a Xen VBD consumer. + */ +struct xbb_xen_req { + /** + * Linked list links used to aggregate idle request in the + * request free pool (xbb->request_free_slist). + */ + SLIST_ENTRY(xbb_xen_req) links; + + /** + * Back reference to the parent block back instance for this + * request. Used during bio_done handling. + */ + struct xbb_softc *xbb; + + /** + * The remote domain's identifier for this I/O request. + */ + uint64_t id; + + /** + * Kernel virtual address space reserved for this request + * structure and used to map the remote domain's pages for + * this I/O, into our domain's address space. + */ + uint8_t *kva; + +#ifdef XBB_USE_BOUNCE_BUFFERS + /** + * Pre-allocated domain local memory used to proxy remote + * domain memory during I/O operations. + */ + uint8_t *bounce; +#endif -struct ring_ref { - vm_offset_t va; - grant_handle_t handle; - uint64_t bus_addr; + /** + * Base, psuedo-physical address, corresponding to the start + * of this request's kva region. + */ + uint64_t gnt_base; + + /** + * The number of pages currently mapped for this request. + */ + int nr_pages; + + /** + * The number of 512 byte sectors comprising this requests. + */ + int nr_512b_sectors; + + /** + * The number of struct bio requests still outstanding for this + * request on the backend device. This field is only used for + * device (rather than file) backed I/O. + */ + int pendcnt; + + /** + * BLKIF_OP code for this request. + */ + int operation; + + /** + * BLKIF_RSP status code for this request. + * + * This field allows an error status to be recorded even if the + * delivery of this status must be deferred. 
Deferred reporting + * is necessary, for example, when an error is detected during + * completion processing of one bio when other bios for this + * request are still outstanding. + */ + int status; + + /** + * Device statistics request ordering type (ordered or simple). + */ + devstat_tag_type ds_tag_type; + + /** + * Device statistics request type (read, write, no_data). + */ + devstat_trans_flags ds_trans_type; + + /** + * The start time for this request. + */ + struct bintime ds_t0; + + /** + * Array of grant handles (one per page) used to map this request. + */ + grant_handle_t *gnt_handles; }; +SLIST_HEAD(xbb_xen_req_slist, xbb_xen_req); -typedef struct blkback_info { - - /* Schedule lists */ - STAILQ_ENTRY(blkback_info) next_req; - int on_req_sched_list; - - struct xenbus_device *xdev; - XenbusState frontend_state; - - domid_t domid; - - int state; - int ring_connected; - struct ring_ref rr; - blkif_back_ring_t ring; - evtchn_port_t evtchn; - int irq; - void *irq_cookie; - - int ref_cnt; - - int handle; - char *mode; - char *type; - char *dev_name; - - struct vnode *vn; - struct cdev *cdev; - struct cdevsw *csw; - u_int sector_size; - int sector_size_shift; - off_t media_size; - u_int media_num_sectors; - int major; - int minor; - int read_only; - - struct mtx blk_ring_lock; - - device_t ndev; - - /* Stats */ - int st_rd_req; - int st_wr_req; - int st_oo_req; - int st_err_req; -} blkif_t; - -/* - * These are rather arbitrary. They are fairly large because adjacent requests - * pulled from a communication ring are quite likely to end up being part of - * the same scatter/gather request at the disc. - * - * ** TRY INCREASING 'blkif_reqs' IF WRITE SPEEDS SEEM TOO LOW ** - * - * This will increase the chances of being able to write whole tracks. - * 64 should be enough to keep us competitive with Linux. +/** + * \brief Configuration data for the shared memory request ring + * used to communicate with the front-end client of this + * this driver. 
*/ -static int blkif_reqs = 64; -TUNABLE_INT("xen.vbd.blkif_reqs", &blkif_reqs); +struct xbb_ring_config { + /** KVA address where ring memory is mapped. */ + vm_offset_t va; + + /** The pseudo-physical address where ring memory is mapped.*/ + uint64_t gnt_addr; + + /** + * Grant table handles, one per-ring page, returned by the + * hyperpervisor upon mapping of the ring and required to + * unmap it when a connection is torn down. + */ + grant_handle_t handle[XBB_MAX_RING_PAGES]; + + /** + * The device bus address returned by the hypervisor when + * mapping the ring and required to unmap it when a connection + * is torn down. + */ + uint64_t bus_addr[XBB_MAX_RING_PAGES]; + + /** The number of ring pages mapped for the current connection. */ + u_int ring_pages; + + /** + * The grant references, one per-ring page, supplied by the + * front-end, allowing us to reference the ring pages in the + * front-end's domain and to map these pages into our own domain. + */ + grant_ref_t ring_ref[XBB_MAX_RING_PAGES]; -static int mmap_pages; + /** The interrupt driven even channel used to signal ring events. */ + evtchn_port_t evtchn; +}; -/* - * Each outstanding request that we've passed to the lower device layers has a - * 'pending_req' allocated to it. Each buffer_head that completes decrements - * the pendcnt towards zero. When it hits zero, the specified domain has a - * response queued for it, with the saved 'id' passed back. +/** + * Per-instance connection state flags. 
*/ -typedef struct pending_req { - blkif_t *blkif; - uint64_t id; - int nr_pages; - int pendcnt; - unsigned short operation; - int status; - STAILQ_ENTRY(pending_req) free_list; -} pending_req_t; - -static pending_req_t *pending_reqs; -static STAILQ_HEAD(pending_reqs_list, pending_req) pending_free = - STAILQ_HEAD_INITIALIZER(pending_free); -static struct mtx pending_free_lock; - -static STAILQ_HEAD(blkback_req_sched_list, blkback_info) req_sched_list = - STAILQ_HEAD_INITIALIZER(req_sched_list); -static struct mtx req_sched_list_lock; - -static unsigned long mmap_vstart; -static unsigned long *pending_vaddrs; -static grant_handle_t *pending_grant_handles; - -static struct task blk_req_task; - -/* Protos */ -static void disconnect_ring(blkif_t *blkif); -static int vbd_add_dev(struct xenbus_device *xdev); - -static inline int vaddr_pagenr(pending_req_t *req, int seg) +typedef enum { - return (req - pending_reqs) * BLKIF_MAX_SEGMENTS_PER_REQUEST + seg; -} + /** + * The front-end requested a read-only mount of the + * back-end device/file. + */ + XBBF_READ_ONLY = 0x01, + + /** Communication with the front-end has been established. */ + XBBF_RING_CONNECTED = 0x02, + + /** + * Front-end requests exist in the ring and are waiting for + * xbb_xen_req objects to free up. + */ + XBBF_RESOURCE_SHORTAGE = 0x04, + + /** Connection teardown in progress. */ + XBBF_SHUTDOWN = 0x08 +} xbb_flag_t; + +/** Backend device type. */ +typedef enum { + /** Backend type unknown. */ + XBB_TYPE_NONE = 0x00, + + /** + * Backend type disk (access via cdev switch + * strategy routine). + */ + XBB_TYPE_DISK = 0x01, + + /** Backend type file (access vnode operations.). */ + XBB_TYPE_FILE = 0x02 +} xbb_type; + +/** + * \brief Structure used to memoize information about a per-request + * scatter-gather list. + * + * The chief benefit of using this data structure is it avoids having + * to reparse the possibly discontiguous S/G list in the original + * request. 
Due to the way that the mapping of the memory backing an + * I/O transaction is handled by Xen, a second pass is unavoidable. + * At least this way the second walk is a simple array traversal. + * + * \note A single Scatter/Gather element in the block interface covers + * at most 1 machine page. In this context a sector (blkif + * nomenclature, not what I'd choose) is a 512b aligned unit + * of mapping within the machine page referenced by an S/G + * element. + */ +struct xbb_sg { + /** The number of 512b data chunks mapped in this S/G element. */ + int16_t nsect; + + /** + * The index (0 based) of the first 512b data chunk mapped + * in this S/G element. + */ + uint8_t first_sect; + + /** + * The index (0 based) of the last 512b data chunk mapped + * in this S/G element. + */ + uint8_t last_sect; +}; -static inline unsigned long vaddr(pending_req_t *req, int seg) -{ - return pending_vaddrs[vaddr_pagenr(req, seg)]; -} +/** + * Character device backend specific configuration data. + */ +struct xbb_dev_data { + /** Cdev used for device backend access. */ + struct cdev *cdev; -#define pending_handle(_req, _seg) \ - (pending_grant_handles[vaddr_pagenr(_req, _seg)]) + /** Cdev switch used for device backend access. */ + struct cdevsw *csw; -static unsigned long -alloc_empty_page_range(unsigned long nr_pages) -{ - void *pages; - int i = 0, j = 0; - multicall_entry_t mcl[17]; - unsigned long mfn_list[16]; - struct xen_memory_reservation reservation = { - .extent_start = mfn_list, - .nr_extents = 0, - .address_bits = 0, - .extent_order = 0, - .domid = DOMID_SELF - }; + /** Used to hold a reference on opened cdev backend devices. */ + int dev_ref; +}; - pages = malloc(nr_pages*PAGE_SIZE, M_DEVBUF, M_NOWAIT); - if (pages == NULL) - return 0; +/** + * File backend specific configuration data. + */ +struct xbb_file_data { + /** Credentials to use for vnode backed (file based) I/O. */ + struct ucred *cred; + + /** + * \brief Array of io vectors used to process file based I/O. 
+ * + * Only a single file based request is outstanding per-xbb instance, + * so we only need one of these. + */ + struct iovec xiovecs[XBB_MAX_SEGMENTS_PER_REQUEST]; +#ifdef XBB_USE_BOUNCE_BUFFERS + + /** + * \brief Array of io vectors used to handle bouncing of file reads. + * + * Vnode operations are free to modify uio data during their + * execution. In the case of a read with bounce buffering active, + * we need some of the data from the original uio in order to + * bounce-out the read data. This array serves as the temporary + * storage for this saved data. + */ + struct iovec saved_xiovecs[XBB_MAX_SEGMENTS_PER_REQUEST]; + + /** + * \brief Array of memoized bounce buffer kva offsets used + * in the file based backend. + * + * Due to the way that the mapping of the memory backing an + * I/O transaction is handled by Xen, a second pass through + * the request sg elements is unavoidable. We memoize the computed + * bounce address here to reduce the cost of the second walk. + */ + void *xiovecs_vaddr[XBB_MAX_SEGMENTS_PER_REQUEST]; +#endif /* XBB_USE_BOUNCE_BUFFERS */ +}; - memset(mcl, 0, sizeof(mcl)); +/** + * Collection of backend type specific data. + */ +union xbb_backend_data { + struct xbb_dev_data dev; + struct xbb_file_data file; +}; - while (i < nr_pages) { - unsigned long va = (unsigned long)pages + (i++ * PAGE_SIZE); +/** + * Function signature of backend specific I/O handlers. + */ +typedef int (*xbb_dispatch_t)(struct xbb_softc *xbb, blkif_request_t *ring_req, + struct xbb_xen_req *req, int nseg, + int operation, int flags); - mcl[j].op = __HYPERVISOR_update_va_mapping; - mcl[j].args[0] = va; +/** + * Per-instance configuration data. + */ +struct xbb_softc { - mfn_list[j++] = vtomach(va) >> PAGE_SHIFT; + /** + * Task-queue used to process I/O requests. + */ + struct taskqueue *io_taskqueue; + + /** + * Single "run the request queue" task enqueued + * on io_taskqueue. + */ + struct task io_task; + + /** Device type for this instance. 
*/ + xbb_type device_type; + + /** NewBus device corresponding to this instance. */ + device_t dev; + + /** Backend specific dispatch routine for this instance. */ + xbb_dispatch_t dispatch_io; + + /** The number of requests outstanding on the backend device/file. */ + u_int active_request_count; + + /** Free pool of request tracking structures. */ + struct xbb_xen_req_slist request_free_slist; + + /** Array, sized at connection time, of request tracking structures. */ + struct xbb_xen_req *requests; + + /** + * Global pool of kva used for mapping remote domain ring + * and I/O transaction data. + */ + vm_offset_t kva; + + /** Pseudo-physical address corresponding to kva. */ + uint64_t gnt_base_addr; + + /** The size of the global kva pool. */ + int kva_size; + + /** + * \brief Cached value of the front-end's domain id. + * + * This value is used once for each mapped page in + * a transaction. We cache it to avoid incurring the + * cost of an ivar access every time this is needed. + */ + domid_t otherend_id; + + /** + * \brief The blkif protocol abi in effect. + * + * There are situations where the back and front ends can + * have a different, native abi (e.g. intel x86_64 and + * 32bit x86 domains on the same machine). The back-end + * always accommodates the front-end's native abi. That + * value is pulled from the XenStore and recorded here. + */ + int abi; + + /** + * \brief The maximum number of requests allowed to be in + * flight at a time. + * + * This value is negotiated via the XenStore. + */ + uint32_t max_requests; + + /** + * \brief The maximum number of segments (1 page per segment) + * that can be mapped by a request. + * + * This value is negotiated via the XenStore. + */ + uint32_t max_request_segments; + + /** + * The maximum size of any request to this back-end + * device. + * + * This value is negotiated via the XenStore. + */ + uint32_t max_request_size; + + /** Various configuration and state bit flags. 
*/ + xbb_flag_t flags; + + /** Ring mapping and interrupt configuration data. */ + struct xbb_ring_config ring_config; + + /** Runtime, cross-abi safe, structures for ring access. */ + blkif_back_rings_t rings; + + /** IRQ mapping for the communication ring event channel. */ + int irq; + + /** + * \brief Backend access mode flags (e.g. write, or read-only). + * + * This value is passed to us by the front-end via the XenStore. + */ + char *dev_mode; + + /** + * \brief Backend device type (e.g. "disk", "cdrom", "floppy"). + * + * This value is passed to us by the front-end via the XenStore. + * Currently unused. + */ + char *dev_type; + + /** + * \brief Backend device/file identifier. + * + * This value is passed to us by the front-end via the XenStore. + * We expect this to be a POSIX path indicating the file or + * device to open. + */ + char *dev_name; + + /** + * Vnode corresponding to the backend device node or file + * we are accessing. + */ + struct vnode *vn; + + union xbb_backend_data backend; + /** The native sector size of the backend. */ + u_int sector_size; + + /** log2 of sector_size. */ + u_int sector_size_shift; + + /** Size in bytes of the backend device or file. */ + off_t media_size; + + /** + * \brief media_size expressed in terms of the backend native + * sector size. + * + * (e.g. xbb->media_size >> xbb->sector_size_shift). + */ + uint64_t media_num_sectors; + + /** + * \brief Array of memoized scatter gather data computed during the + * conversion of blkif ring requests to internal xbb_xen_req + * structures. + * + * Ring processing is serialized so we only need one of these. + */ + struct xbb_sg xbb_sgs[XBB_MAX_SEGMENTS_PER_REQUEST]; + + /** Mutex protecting per-instance data. */ + struct mtx lock; + +#ifdef XENHVM + /** + * Resource representing allocated physical address space + * associated with our per-instance kva region. 
+ */ + struct resource *pseudo_phys_res; - xen_phys_machine[(vtophys(va) >> PAGE_SHIFT)] = INVALID_P2M_ENTRY; + /** Resource id for allocated physical address space. */ + int pseudo_phys_res_id; +#endif - if (j == 16 || i == nr_pages) { - mcl[j-1].args[MULTI_UVMFLAGS_INDEX] = UVMF_TLB_FLUSH|UVMF_LOCAL; + /** I/O statistics. */ + struct devstat *xbb_stats; +}; - reservation.nr_extents = j; +/*---------------------------- Request Processing ----------------------------*/ +/** + * Allocate an internal transaction tracking structure from the free pool. + * + * \param xbb Per-instance xbb configuration structure. + * + * \return On success, a pointer to the allocated xbb_xen_req structure. + * Otherwise NULL. + */ +static inline struct xbb_xen_req * +xbb_get_req(struct xbb_softc *xbb) +{ + struct xbb_xen_req *req; - mcl[j].op = __HYPERVISOR_memory_op; - mcl[j].args[0] = XENMEM_decrease_reservation; - mcl[j].args[1] = (unsigned long)&reservation; - - (void)HYPERVISOR_multicall(mcl, j+1); + req = NULL; + mtx_lock(&xbb->lock); - mcl[j-1].args[MULTI_UVMFLAGS_INDEX] = 0; - j = 0; + /* + * Do not allow new requests to be allocated while we + * are shutting down. + */ + if ((xbb->flags & XBBF_SHUTDOWN) == 0) { + if ((req = SLIST_FIRST(&xbb->request_free_slist)) != NULL) { + SLIST_REMOVE_HEAD(&xbb->request_free_slist, links); + xbb->active_request_count++; + } else { + xbb->flags |= XBBF_RESOURCE_SHORTAGE; } } - - return (unsigned long)pages; -} - -static pending_req_t * -alloc_req(void) -{ - pending_req_t *req; - mtx_lock(&pending_free_lock); - if ((req = STAILQ_FIRST(&pending_free))) { - STAILQ_REMOVE(&pending_free, req, pending_req, free_list); - STAILQ_NEXT(req, free_list) = NULL; - } - mtx_unlock(&pending_free_lock); - return req; + mtx_unlock(&xbb->lock); + return (req); } -static void -free_req(pending_req_t *req) +/** + * Return an allocated transaction tracking structure to the free pool. + * + * \param xbb Per-instance xbb configuration structure. 
+ * \param req The request structure to free. + */ +static inline void +xbb_release_req(struct xbb_softc *xbb, struct xbb_xen_req *req) { - int was_empty; - - mtx_lock(&pending_free_lock); - was_empty = STAILQ_EMPTY(&pending_free); - STAILQ_INSERT_TAIL(&pending_free, req, free_list); - mtx_unlock(&pending_free_lock); - if (was_empty) - taskqueue_enqueue(taskqueue_swi, &blk_req_task); -} + int wake_thread; -static void -fast_flush_area(pending_req_t *req) -{ - struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST]; - unsigned int i, invcount = 0; - grant_handle_t handle; - int ret; + mtx_lock(&xbb->lock); + wake_thread = xbb->flags & XBBF_RESOURCE_SHORTAGE; + xbb->flags &= ~XBBF_RESOURCE_SHORTAGE; + SLIST_INSERT_HEAD(&xbb->request_free_slist, req, links); + xbb->active_request_count--; - for (i = 0; i < req->nr_pages; i++) { - handle = pending_handle(req, i); - if (handle == BLKBACK_INVALID_HANDLE) - continue; - unmap[invcount].host_addr = vaddr(req, i); - unmap[invcount].dev_bus_addr = 0; - unmap[invcount].handle = handle; - pending_handle(req, i) = BLKBACK_INVALID_HANDLE; - invcount++; + if ((xbb->flags & XBBF_SHUTDOWN) != 0) { + /* + * Shutdown is in progress. See if we can + * progress further now that one more request + * has completed and been returned to the + * free pool. + */ + xbb_shutdown(xbb); } + mtx_unlock(&xbb->lock); - ret = HYPERVISOR_grant_table_op( - GNTTABOP_unmap_grant_ref, unmap, invcount); - PANIC_IF(ret); + if (wake_thread != 0) *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
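[Editor's note] The xbb_sg comments above describe how the driver memoizes, per S/G element, which 512b "sectors" of a machine page are mapped (first_sect, last_sect, nsect) so the second Xen-mandated pass over the request is a plain array walk. The sketch below illustrates that bookkeeping in isolation; `sg_memoize`, `struct sg_memo`, and the 4 KiB page assumption are illustrative, not names from the driver:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096
#define SKETCH_SECT_SIZE	512	/* blkif "sector": 512b unit within a page */

struct sg_memo {
	int16_t nsect;		/* number of 512b chunks mapped in this element */
	uint8_t first_sect;	/* 0-based index of the first mapped 512b chunk */
	uint8_t last_sect;	/* 0-based index of the last mapped 512b chunk */
};

/*
 * Memoize the sector span of one S/G element.  offset and len are in
 * bytes and must describe a 512b-aligned region within a single page.
 */
static void
sg_memoize(struct sg_memo *sg, unsigned int offset, unsigned int len)
{
	assert(offset % SKETCH_SECT_SIZE == 0);
	assert(len % SKETCH_SECT_SIZE == 0 && len != 0);
	assert(offset + len <= SKETCH_PAGE_SIZE);

	sg->first_sect = offset / SKETCH_SECT_SIZE;
	sg->last_sect = (offset + len) / SKETCH_SECT_SIZE - 1;
	sg->nsect = len / SKETCH_SECT_SIZE;
}
```

A full page maps sectors 0..7 (nsect == 8); a 2 KiB region at byte offset 1 KiB maps sectors 2..5. Caching these three small fields per element is what lets the second walk skip re-decoding the ring request.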
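[Editor's note] xbb_get_req() and xbb_release_req() above implement a mutex-guarded free pool with a "resource shortage" flag: an allocator that comes up empty sets the flag, and the next release clears it and tells the caller to restart ring processing. A minimal user-space sketch of that pattern, using pthreads in place of mtx(9) (names here are illustrative, not the driver's):

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

#define POOL_SHORTAGE	0x01	/* a consumer found the pool empty */
#define POOL_SHUTDOWN	0x02	/* no new allocations allowed */

struct pool_req {
	SLIST_ENTRY(pool_req) links;
};

struct req_pool {
	pthread_mutex_t lock;
	SLIST_HEAD(, pool_req) free_list;
	unsigned int active;	/* requests currently checked out */
	unsigned int flags;
};

/* Allocate from the free pool; NULL on shortage or during shutdown. */
static struct pool_req *
pool_get(struct req_pool *p)
{
	struct pool_req *req = NULL;

	pthread_mutex_lock(&p->lock);
	if ((p->flags & POOL_SHUTDOWN) == 0) {
		if ((req = SLIST_FIRST(&p->free_list)) != NULL) {
			SLIST_REMOVE_HEAD(&p->free_list, links);
			p->active++;
		} else {
			/* Remember that a consumer was starved. */
			p->flags |= POOL_SHORTAGE;
		}
	}
	pthread_mutex_unlock(&p->lock);
	return (req);
}

/* Return a request; nonzero result means "restart the stalled consumer". */
static int
pool_release(struct req_pool *p, struct pool_req *req)
{
	int wake;

	pthread_mutex_lock(&p->lock);
	wake = (p->flags & POOL_SHORTAGE) != 0;
	p->flags &= ~POOL_SHORTAGE;
	SLIST_INSERT_HEAD(&p->free_list, req, links);
	p->active--;
	pthread_mutex_unlock(&p->lock);
	return (wake);
}
```

The key design point mirrored from the driver: the shortage flag is read and cleared under the same lock as the list insert, so exactly one releaser observes it and re-kicks ring processing, with no lost wakeups.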