From: Warner Losh <imp@FreeBSD.org>
Date: Fri, 6 Mar 2015 00:24:21 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-projects@freebsd.org
Subject: svn commit: r279682 - in projects/iosched/sys: cam conf
Message-Id: <201503060024.t260OLFH077350@svn.freebsd.org>

Author: imp
Date: Fri Mar  6 00:24:21 2015
New Revision: 279682
URL: https://svnweb.freebsd.org/changeset/base/279682

Log:
  Implement the limiting code for the scheduler. Reads and writes go to
  different queues (physically now, logically eventually). The limit on
  the number of outstanding reads / writes can be controlled
  independently on a per-drive basis.

  Implement proper 32.32 * 32.32 multiplication and use it when we're
  multiplying two 32.32 numbers.

  Ensure the SD calculation is as robust as possible, and return 0 when
  we're unable to properly estimate it (such as early in startup when
  the long-term values haven't converged).

  It is looking like workloads vary too much for the fast-decaying
  average to be a useful metric. Implement an alternate method of
  estimating a steering point for unacceptable latency. We now train
  for the first few hundred I/Os to see what the drive will normally do
  unloaded, then use a multiple of the 95th-percentile latency to set a
  threshold for badness (to reflect the number of channels in the
  disk). We retain the old slow-decaying average as well.

  Create a read_bias parameter for breaking ties between reads and
  writes when both are present. This ensures that we make some progress
  on writes, while still balancing them against the reads that are
  competing with them. Set it to 100 to 1.

  Allow a tunable to control whether we do the Netflix scheduler stuff
  or not. Due to the large amount of state we keep, and due to the
  difficulty of 'draining' pending I/Os, we only allow it to be set at
  boot.
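[Editor's note: the 32.32 multiplication mentioned above lands later in the
diff as mul(). For readers who want to check the algebra, here is a minimal
standalone sketch of the same routine with a small self-test; mul() mirrors
the committed function, the test values are purely illustrative.]

    #include <stdint.h>
    #include <stdio.h>

    /*
     * 32.32 fixed point: the high 32 bits are the integer part, the low
     * 32 bits the fraction. (ah + al/R)(bh + bl/R) expands into four
     * partial products; renormalizing by R = 2^32 leaves the sum below.
     */
    static uint64_t
    mul(uint64_t a, uint64_t b)
    {
            uint64_t al = a & 0xffffffff, ah = a >> 32;
            uint64_t bl = b & 0xffffffff, bh = b >> 32;

            return ((ah * bh) << 32) + al * bh + ah * bl + ((al * bl) >> 32);
    }

    int
    main(void)
    {
            uint64_t a = (2ULL << 32) | 0x80000000;  /* 2.5  */
            uint64_t b = (4ULL << 32) | 0x40000000;  /* 4.25 */
            uint64_t p = mul(a, b);

            /* Expect 10 + 2684354560/2^32, i.e. 10.625. */
            printf("%llu + %llu/2^32\n",
                (unsigned long long)(p >> 32),
                (unsigned long long)(p & 0xffffffff));
            return 0;
    }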
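[Editor's note: the "slow decaying average" retained above is an
exponentially weighted moving average whose alpha is a power of two, so the
update needs only shifts and adds, and the SD estimate falls out of
sqrt(emss - ema^2). Below is a toy demonstration of those updates, not the
committed code: it uses plain integers instead of 32.32 sbintime_t values,
a much larger alpha so it converges quickly, and a 2^10 scale factor to
tame integer truncation. isqrt64 follows the same bit-at-a-time method as
the committed one.]

    #include <stdint.h>
    #include <stdio.h>

    #define ALPHA_BITS  5   /* toy alpha = 1/32; the commit uses 14 */
    #define SCALE       10  /* scale values by 2^10, mimicking 32.32 */

    /* Integer square root, same method as the commit's isqrt64. */
    static uint64_t
    isqrt64(uint64_t val)
    {
            uint64_t res = 0;
            uint64_t bit = 1ULL << 62;

            while (bit > val)
                    bit >>= 2;
            while (bit != 0) {
                    if (val >= res + bit) {
                            val -= res + bit;
                            res = (res >> 1) + bit;
                    } else
                            res >>= 1;
                    bit >>= 2;
            }
            return res;
    }

    int
    main(void)
    {
            int64_t ema = 0, emss = 0, y, var;
            int i;

            for (i = 0; i < 10000; i++) {
                    /* Synthetic latencies 90..110 (mean 100, sd ~6). */
                    y = (int64_t)(100 + i % 21 - 10) << SCALE;
                    /* ema_t = y*alpha + ema*(1-alpha), alpha = 2^-ALPHA_BITS */
                    ema = (y + (ema << ALPHA_BITS) - ema) >> ALPHA_BITS;
                    /* The kernel uses mul() here; its values are 32.32. */
                    emss = (y * y + (emss << ALPHA_BITS) - emss) >> ALPHA_BITS;
            }
            /* sd ~ sqrt(emss - ema^2); clamp to 0 until emss converges. */
            var = emss - ema * ema;
            printf("ema ~ %lld, sd ~ %lld\n",
                (long long)(ema >> SCALE),
                (long long)(var < 0 ? 0 :
                (int64_t)isqrt64((uint64_t)var) >> SCALE));
            return 0;
    }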
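[Editor's note: the read_bias tie-break works as a countdown. Each time a
write is eligible but reads are also queued, the counter is decremented and
the read wins; when the counter is exhausted, one write is released and the
counter recharges. A sketch of just that mechanism follows; the struct and
function names are illustrative, not the CAM API.]

    #include <stdio.h>

    struct tiebreak {
            int read_bias;          /* favor reads read_bias:1, e.g. 100 */
            int current_read_bias;  /* countdown until a write may pass */
    };

    /* Return 1 if a write may dispatch now, 0 to let a queued read go. */
    static int
    may_dispatch_write(struct tiebreak *t, int reads_queued)
    {
            if (reads_queued && t->current_read_bias > 0) {
                    t->current_read_bias--;     /* defer to the reads */
                    return 0;
            }
            t->current_read_bias = t->read_bias;  /* recharge the bias */
            return 1;
    }

    int
    main(void)
    {
            struct tiebreak t = { .read_bias = 100, .current_read_bias = 100 };
            int i, writes = 0;

            /* With reads always queued, one write passes per
             * read_bias + 1 decisions, so 1000 decisions release 9. */
            for (i = 0; i < 1000; i++)
                    writes += may_dispatch_write(&t, 1);
            printf("writes released: %d\n", writes);
            return 0;
    }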
Modified:
  projects/iosched/sys/cam/cam_iosched.c
  projects/iosched/sys/cam/cam_xpt.c
  projects/iosched/sys/conf/options

Modified: projects/iosched/sys/cam/cam_iosched.c
==============================================================================
--- projects/iosched/sys/cam/cam_iosched.c	Fri Mar  6 00:24:07 2015	(r279681)
+++ projects/iosched/sys/cam/cam_iosched.c	Fri Mar  6 00:24:21 2015	(r279682)
@@ -28,6 +28,10 @@
  * $FreeBSD$
  */
 
+#include "opt_cam.h"
+#include "opt_ddb.h"
+#define CAM_NETFLIX_IOSCHED 1	/* hack xxx */
+
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
@@ -47,7 +51,10 @@ __FBSDID("$FreeBSD$");
 #include
 #include
 
-static MALLOC_DEFINE(M_CAMSCHED, "CAM I/O Scheduler", "CAM I/O Scheduler buffers");
+#include
+
+static MALLOC_DEFINE(M_CAMSCHED, "CAM I/O Scheduler",
+    "CAM I/O Scheduler buffers");
 
 /*
  * Default I/O scheduler for FreeBSD. This implementation is just a thin veneer
@@ -55,13 +62,68 @@ static MALLOC_DEFINE(M_CAMSCHED, "CAM I/
  * for trims.
  */
 
+#ifdef CAM_NETFLIX_IOSCHED
+#define IOP_MAX_SKIP		50
+#define IOP_MAX_TRAINING	500
+#define ALPHA_BITS		14	/* ~32k events or about the last minute */
+
+SYSCTL_DECL(_kern_cam);
+static int do_netflix_iosched = 1;
+TUNABLE_INT("kern.cam.do_netflix_iosched", &do_netflix_iosched);
+SYSCTL_INT(_kern_cam, OID_AUTO, do_netflix_iosched, CTLFLAG_RD,
+    &do_netflix_iosched, 1,
+    "Enable Netflix I/O scheduler optimizations.");
+
+int iosched_debug = 0;
+
+struct iop_stats
+{
+	sbintime_t	data[IOP_MAX_TRAINING];	/* Data for training period */
+	sbintime_t	worst;		/* estimate of worst case latency */
+	int		outliers;	/* Number of outlier latency I/Os */
+	int		skipping;	/* Skipping I/Os when < IOP_MAX_SKIP */
+	int		training;	/* Training when < IOP_MAX_TRAINING */
+	/* Exp Moving Average, alpha = 1 / (1 << alpha_bits) */
+	sbintime_t	ema;
+	sbintime_t	emss;		/* Exp Moving sum of the squares */
+	sbintime_t	sd;		/* Last computed sd */
+};
+#endif
+
 struct cam_iosched_softc
 {
 	struct bio_queue_head bio_queue;
 	struct bio_queue_head trim_queue;
 				/* scheduler flags < 16, user flags >= 16 */
 	uint32_t	flags;
-	int		sort_io_queue;
+	int		sort_io_queue;
+#ifdef CAM_NETFLIX_IOSCHED
+	/* Number of pending transactions */
+	int		pending_reads;
+	int		pending_writes;
+	/* Have at least this many transactions in progress, if possible */
+	int		min_reads;
+	int		min_writes;
+	/* Maximum number of each type of transaction in progress */
+	int		max_reads;
+	int		max_writes;
+
+	int		trims;
+	int		reads;
+	int		writes;
+	int		queued_reads;
+	int		queued_writes;
+	int		in_reads;
+	int		in_writes;
+	int		out_reads;
+	int		out_writes;
+
+	int		read_bias;
+	int		current_read_bias;
+
+	struct bio_queue_head write_queue;
+	struct iop_stats read_stats, write_stats, trim_stats;
+#endif
 };
 
 /* Trim or similar currently pending completion */
@@ -70,6 +132,10 @@ struct cam_iosched_softc
 /* Periph drivers set these flags to indicate work */
 #define CAM_IOSCHED_FLAG_WORK_FLAGS	((0xffffu) << 16)
 
+static void
+cam_iosched_io_metric_update(struct cam_iosched_softc *isc,
+    sbintime_t sim_latency, int cmd, size_t size);
+
 static inline int
 cam_iosched_has_flagged_work(struct cam_iosched_softc *isc)
 {
@@ -79,7 +145,23 @@ cam_iosched_has_flagged_work(struct cam_
 static inline int
 cam_iosched_has_io(struct cam_iosched_softc *isc)
 {
-	return bioq_first(&(isc)->bio_queue) != NULL;
+#ifdef CAM_NETFLIX_IOSCHED
+	if (do_netflix_iosched) {
+		int can_write = bioq_first(&isc->write_queue) != NULL &&
+		    (isc->max_writes <= 0 ||
+		     isc->pending_writes < isc->max_writes);
+		int can_read = bioq_first(&isc->bio_queue) != NULL &&
+		    (isc->max_reads <= 0 ||
+		     isc->pending_reads < isc->max_reads);
+		if (iosched_debug > 2) {
+			printf("can write %d: pending_writes %d max_writes %d\n",
+			    can_write, isc->pending_writes, isc->max_writes);
+			printf("can read %d: pending_reads %d max_reads %d\n",
+			    can_read, isc->pending_reads, isc->max_reads);
+			printf("Queued reads %d writes %d\n",
+			    isc->queued_reads, isc->queued_writes);
+		}
+		return can_read || can_write;
+	}
+#endif
+	return bioq_first(&isc->bio_queue) != NULL;
 }
 
 static inline int
@@ -96,11 +178,28 @@ cam_iosched_has_more_trim(struct cam_ios
 static inline int
 cam_iosched_has_work(struct cam_iosched_softc *isc)
 {
+	if (iosched_debug > 2)
+		printf("has work: %d %d %d\n", cam_iosched_has_io(isc),
+		    cam_iosched_has_more_trim(isc),
+		    cam_iosched_has_flagged_work(isc));
+
 	return cam_iosched_has_io(isc) ||
 	    cam_iosched_has_more_trim(isc) ||
 	    cam_iosched_has_flagged_work(isc);
 }
 
+static void
+iop_stats_init(struct iop_stats *ios)
+{
+	ios->ema = 0;
+	ios->emss = 0;
+	ios->sd = 0;
+	ios->outliers = 0;
+	ios->skipping = 0;
+	ios->training = 0;
+}
+
 /*
  * Allocate the iosched structure. This also insulates callers from knowing
  * sizeof struct cam_iosched_softc.
@@ -112,9 +211,36 @@ cam_iosched_init(struct cam_iosched_soft
 	*iscp = malloc(sizeof(**iscp), M_CAMSCHED, M_NOWAIT | M_ZERO);
 	if (*iscp == NULL)
 		return ENOMEM;
+	if (iosched_debug)
+		printf("CAM IOSCHEDULER Allocating entry at %p\n", *iscp);
 	(*iscp)->sort_io_queue = -1;
 	bioq_init(&(*iscp)->bio_queue);
 	bioq_init(&(*iscp)->trim_queue);
+#ifdef CAM_NETFLIX_IOSCHED
+	if (do_netflix_iosched) {
+		bioq_init(&(*iscp)->write_queue);
+		(*iscp)->pending_reads = 0;
+		(*iscp)->pending_writes = 0;
+		(*iscp)->min_reads = 1;
+		(*iscp)->min_writes = 1;
+		(*iscp)->max_reads = 0;
+		(*iscp)->max_writes = 0;
+		(*iscp)->trims = 0;
+		(*iscp)->reads = 0;
+		(*iscp)->writes = 0;
+		(*iscp)->queued_reads = 0;
+		(*iscp)->queued_writes = 0;
+		(*iscp)->out_reads = 0;
+		(*iscp)->out_writes = 0;
+		(*iscp)->in_reads = 0;
+		(*iscp)->in_writes = 0;
+		(*iscp)->read_bias = 100;
+		(*iscp)->current_read_bias = 100;
+		iop_stats_init(&(*iscp)->read_stats);
+		iop_stats_init(&(*iscp)->write_stats);
+		iop_stats_init(&(*iscp)->trim_stats);
+	}
+#endif
 	return 0;
 }
 
@@ -141,6 +267,150 @@ void cam_iosched_sysctl_init(struct cam_
 	    OID_AUTO, "sort_io_queue", CTLFLAG_RW | CTLFLAG_MPSAFE,
 	    &isc->sort_io_queue, 0,
 	    "Sort IO queue to try and optimise disk access patterns");
+#ifdef CAM_NETFLIX_IOSCHED
+	if (!do_netflix_iosched)
+		return;
+
+	/*
+	 * Read stats
+	 */
+	SYSCTL_ADD_UQUAD(ctx, SYSCTL_CHILDREN(node),
+	    OID_AUTO, "read_ema", CTLFLAG_RD,
+	    &isc->read_stats.ema,
+	    "Fast Exponentially Weighted Moving Average for reads");
+	SYSCTL_ADD_UQUAD(ctx, SYSCTL_CHILDREN(node),
+	    OID_AUTO, "read_emss", CTLFLAG_RD,
+	    &isc->read_stats.emss,
+	    "Fast Exponentially Weighted Moving Sum of Squares for reads");
+	SYSCTL_ADD_UQUAD(ctx, SYSCTL_CHILDREN(node),
+	    OID_AUTO, "read_sd", CTLFLAG_RD,
+	    &isc->read_stats.sd,
+	    "Estimated SD for fast read ema");
+	SYSCTL_ADD_UQUAD(ctx, SYSCTL_CHILDREN(node),
+	    OID_AUTO, "read_bad_latency", CTLFLAG_RD,
+	    &isc->read_stats.worst,
+	    "read bad latency threshold");
+	SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node),
+	    OID_AUTO, "read_outliers", CTLFLAG_RD,
+	    &isc->read_stats.outliers, 0,
+	    "read latency outliers");
+
+	/*
+	 * Write stats
+	 */
+	SYSCTL_ADD_UQUAD(ctx, SYSCTL_CHILDREN(node),
+	    OID_AUTO, "write_ema", CTLFLAG_RD,
+	    &isc->write_stats.ema,
+	    "Fast Exponentially Weighted Moving Average for writes");
+	SYSCTL_ADD_UQUAD(ctx, SYSCTL_CHILDREN(node),
OID_AUTO, "write_emss", CTLFLAG_RD, + &isc->write_stats.emss, + "Fast Exponentially Weighted Moving Sum of Squares for writes"); + SYSCTL_ADD_UQUAD(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "write_sd", CTLFLAG_RD, + &isc->write_stats.sd, + "Estimated SD for fast write ema"); + SYSCTL_ADD_UQUAD(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "write_bad_latency", CTLFLAG_RD, + &isc->write_stats.worst, + "write bad latency threshold"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "write_outliers", CTLFLAG_RD, + &isc->write_stats.outliers, 0, + "write latency outliers"); + + /* + * Trim stats + */ + SYSCTL_ADD_UQUAD(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "trim_ema", CTLFLAG_RD, + &isc->trim_stats.ema, + "Fast Exponentially Weighted Moving Average for trims"); + SYSCTL_ADD_UQUAD(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "trim_emss", CTLFLAG_RD, + &isc->trim_stats.emss, + "Fast Exponentially Weighted Moving Sum of Squares for trims"); + SYSCTL_ADD_UQUAD(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "trim_sd", CTLFLAG_RD, + &isc->trim_stats.sd, + "Estimated SD for fast trim ema"); + SYSCTL_ADD_UQUAD(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "trim_bad_latency", CTLFLAG_RD, + &isc->trim_stats.worst, + "trim bad latency threshold"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "trim_outliers", CTLFLAG_RD, + &isc->trim_stats.outliers, 0, + "trim latency outliers"); + + /* + * Other misc knobs. + */ + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "pending_reads", CTLFLAG_RD, + &isc->pending_reads, 0, + "Instantaneous # of pending reads"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "pending_writes", CTLFLAG_RD, + &isc->pending_writes, 0, + "Instantaneous # of pending writes"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "trims", CTLFLAG_RD, + &isc->trims, 0, + "# of trims"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "writes", CTLFLAG_RD, + &isc->writes, 0, + "# of writes submitted to hardware"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "reads", CTLFLAG_RD, + &isc->reads, 0, + "# of reads submitted to hardware"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "queued_writes", CTLFLAG_RD, + &isc->queued_writes, 0, + "# of writes in the queue"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "queued_reads", CTLFLAG_RD, + &isc->queued_reads, 0, + "# of reads in the queue"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "in_writes", CTLFLAG_RD, + &isc->in_writes, 0, + "# of writes queued"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "in_reads", CTLFLAG_RD, + &isc->in_reads, 0, + "# of reads queued"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "out_writes", CTLFLAG_RD, + &isc->out_writes, 0, + "# of writes completed"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "out_reads", CTLFLAG_RD, + &isc->in_reads, 0, + "# of reads completed"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "read_bias", CTLFLAG_RW, + &isc->read_bias, 100, + "How biased towards read should we be"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "min_reads", CTLFLAG_RW, + &isc->min_reads, 0, + "min reads reserved in the queue"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "min_writes", CTLFLAG_RW, + &isc->min_writes, 0, + "min writes reserved in the queue"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, "max_reads", CTLFLAG_RW, + &isc->max_reads, 0, + "max reads reserved in the queue"); + SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(node), + OID_AUTO, 
"max_writes", CTLFLAG_RW, + &isc->max_writes, 0, + "max writes reserved in the queue"); +#endif } /* @@ -153,8 +423,58 @@ cam_iosched_flush(struct cam_iosched_sof { bioq_flush(&isc->bio_queue, stp, err); bioq_flush(&isc->trim_queue, stp, err); +#ifdef CAM_NETFLIX_IOSCHED + if (do_netflix_iosched) + bioq_flush(&isc->write_queue, stp, err); +#endif } +#ifdef CAM_NETFLIX_IOSCHED +static struct bio * +cam_iosched_get_write(struct cam_iosched_softc *isc) +{ + struct bio *bp; + + /* + * We control the write rate by controlling how many requests we send + * down to the drive at any one time. Fewer requests limits the + * effects of both starvation when the requests take a while and write + * amplification when each request is causing more than one write to + * the NAND media. Limiting the queue depth like this will also limit + * the write throughput and give and reads that want to compete to + * compete unfairly. + */ + if (isc->max_writes > 0 && isc->pending_writes >= isc->max_writes) { + if (iosched_debug) + printf("Can't write because pending_writes %d and max_writes %d\n", + isc->pending_writes, isc->max_writes); + return NULL; + } + bp = bioq_first(&isc->write_queue); + if (bp == NULL) { + if (iosched_debug > 3) + printf("No writes present in write_queue\n"); + return NULL; + } + if (bioq_first(&isc->bio_queue) && isc->current_read_bias) { + if (iosched_debug) + printf("Reads present and current_read_bias is %d queued writes %d queued reads %d\n", isc->current_read_bias, isc->queued_writes, isc->queued_reads); + isc->current_read_bias--; + return NULL; + } + isc->current_read_bias = isc->read_bias; + bioq_remove(&isc->write_queue, bp); + if (bp->bio_cmd == BIO_WRITE) { + isc->queued_writes--; + isc->writes++; + isc->pending_writes++; + } + if (iosched_debug > 9) + printf("HWQ : %p %#x\n", bp, bp->bio_cmd); + return bp; +} +#endif + /* * Put back a trim that you weren't able to actually schedule this time. */ @@ -162,6 +482,7 @@ void cam_iosched_put_back_trim(struct cam_iosched_softc *isc, struct bio *bp) { bioq_insert_head(&isc->trim_queue, bp); + isc->trims--; } /* @@ -180,6 +501,7 @@ cam_iosched_next_trim(struct cam_iosched if (bp == NULL) return NULL; bioq_remove(&isc->trim_queue, bp); + isc->trims++; return bp; } @@ -212,17 +534,53 @@ cam_iosched_next_bio(struct cam_iosched_ struct bio *bp; /* - * First, see if we have a trim that can be scheduled. We can only send one + * See if we have a trim that can be scheduled. We can only send one * at a time down, so this takes that into account. + * + * XXX newer TRIM commands are queueable. Revisit this when we + * implement them. */ if ((bp = cam_iosched_get_trim(isc)) != NULL) return bp; +#ifdef CAM_NETFLIX_IOSCHED /* - * next, see if there's a normal I/O waiting. If so return that. + * See if we have any pending writes, and room in the queue for them, + * and if so, those are next. */ + if (do_netflix_iosched) { + if ((bp = cam_iosched_get_write(isc)) != NULL) + return bp; + } +#endif + + /* + * next, see if there's other, normal I/O waiting. If so return that. + */ +#ifdef CAM_NETFLIX_IOSCHED + /* + * For the netflix scheduler, bio_queue is only for reads, so enforce + * the limits here. 
+	 */
+	if (do_netflix_iosched) {
+		if (isc->max_reads > 0 && isc->pending_reads >= isc->max_reads)
+			return NULL;
+	}
+#endif
 	if ((bp = bioq_first(&isc->bio_queue)) != NULL) {
 		bioq_remove(&isc->bio_queue, bp);
+#ifdef CAM_NETFLIX_IOSCHED
+		if (do_netflix_iosched) {
+			if (bp->bio_cmd == BIO_READ) {
+				isc->queued_reads--;
+				isc->reads++;
+				isc->pending_reads++;
+			} else
+				printf("Found bio_cmd = %#x\n", bp->bio_cmd);
+		}
+#endif
+		if (iosched_debug > 9)
+			printf("HWQ : %p %#x\n", bp, bp->bio_cmd);
 		return bp;
 	}
 
@@ -247,12 +605,39 @@ cam_iosched_queue_work(struct cam_iosche
 	 * that the collapsing code requires this. Otherwise put
 	 * the work on the bio queue.
 	 */
-	if (bp->bio_cmd == BIO_DELETE)
+	if (bp->bio_cmd == BIO_DELETE) {
 		bioq_disksort(&isc->trim_queue, bp);
-	else if (cam_iosched_sort_queue(isc))
-		bioq_disksort(&isc->bio_queue, bp);
-	else
-		bioq_insert_tail(&isc->bio_queue, bp);
+	}
+#ifdef CAM_NETFLIX_IOSCHED
+	else if (do_netflix_iosched &&
+	    (bp->bio_cmd == BIO_WRITE || bp->bio_cmd == BIO_FLUSH)) {
+		if (cam_iosched_sort_queue(isc))
+			bioq_disksort(&isc->write_queue, bp);
+		else
+			bioq_insert_tail(&isc->write_queue, bp);
+		if (iosched_debug > 9)
+			printf("Qw : %p %#x\n", bp, bp->bio_cmd);
+		if (bp->bio_cmd == BIO_WRITE) {
+			isc->in_writes++;
+			isc->queued_writes++;
+		}
+	}
+#endif
+	else {
+		if (cam_iosched_sort_queue(isc))
+			bioq_disksort(&isc->bio_queue, bp);
+		else
+			bioq_insert_tail(&isc->bio_queue, bp);
+		if (iosched_debug > 9)
+			printf("Qr : %p %#x\n", bp, bp->bio_cmd);
+		if (bp->bio_cmd == BIO_READ) {
+			isc->in_reads++;
+			isc->queued_reads++;
+		} else if (bp->bio_cmd == BIO_WRITE) {
+			isc->in_writes++;
+			isc->queued_writes++;
+		}
+	}
 }
 
 /*
@@ -283,7 +668,33 @@ cam_iosched_trim_done(struct cam_iosched
 int
 cam_iosched_bio_complete(struct cam_iosched_softc *isc, struct bio *bp,
     union ccb *done_ccb)
 {
-	return 0;
+	int retval = 0;
+#ifdef CAM_NETFLIX_IOSCHED
+	if (!do_netflix_iosched)
+		return retval;
+
+	if (iosched_debug > 10)
+		printf("done: %p %#x\n", bp, bp->bio_cmd);
+	if (bp->bio_cmd == BIO_WRITE) {
+		retval = isc->max_writes > 0 &&
+		    isc->pending_writes == isc->max_writes;
+		if (isc->max_writes > 0 && retval && iosched_debug)
+			printf("NEEDS SCHED\n");
+		isc->out_writes++;
+		isc->pending_writes--;
+	} else if (bp->bio_cmd == BIO_READ) {
+		retval = isc->max_reads > 0 &&
+		    isc->pending_reads == isc->max_reads;
+		isc->out_reads++;
+		isc->pending_reads--;
+	} else if (bp->bio_cmd != BIO_FLUSH && bp->bio_cmd != BIO_DELETE) {
+		if (iosched_debug)
+			printf("Completing command with bio_cmd == %#x\n",
+			    bp->bio_cmd);
+	}
+
+	if (!(bp->bio_flags & BIO_ERROR))
+		cam_iosched_io_metric_update(isc, done_ccb->ccb_h.qos.sim_data,
+		    bp->bio_cmd, bp->bio_bcount);
+#endif
+	return retval;
 }
 
 /*
@@ -324,3 +735,249 @@ cam_iosched_clr_work_flags(struct cam_io
 {
 	isc->flags &= ~flags;
 }
+
+#ifdef CAM_NETFLIX_IOSCHED
+/*
+ * After the method presented in Jack Crenshaw's 1998 article "Integer
+ * Square Roots," reprinted at
+ * http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
+ * and well worth the read. Briefly, we find the power of 4 that's the
+ * largest smaller than val. We then check each smaller power of 4 to
+ * see if val is still bigger. The right shifts at each step divide
+ * the result by 2, which after successive application winds up
+ * accumulating the right answer. It could also have been accumulated
+ * using a separate root counter, but this code is smaller and faster
+ * than that method. This method is also integer size invariant.
+ * It returns floor(sqrt((float)val)), or the largest integer less than
+ * or equal to the square root.
+ */
+static uint64_t
+isqrt64(uint64_t val)
+{
+	uint64_t res = 0;
+	uint64_t bit = 1ULL << (sizeof(uint64_t) * NBBY - 2);
+
+	/*
+	 * Find the largest power of 4 smaller than val.
+	 */
+	while (bit > val)
+		bit >>= 2;
+
+	/*
+	 * Accumulate the answer, one bit at a time (we keep moving
+	 * them over since 2 is the square root of 4 and we test
+	 * powers of 4). We accumulate where we find the bit, but
+	 * the successive shifts land the bit in the right place
+	 * by the end.
+	 */
+	while (bit != 0) {
+		if (val >= res + bit) {
+			val -= res + bit;
+			res = (res >> 1) + bit;
+		} else
+			res >>= 1;
+		bit >>= 2;
+	}
+
+	return res;
+}
+
+/*
+ * a and b are 32.32 fixed point stored in a 64-bit word.
+ * Let al and bl be the .32 part of a and b.
+ * Let ah and bh be the 32 part of a and b.
+ * R is the radix and is 1 << 32
+ *
+ * a * b
+ * = (ah + al / R) * (bh + bl / R)
+ * = ah * bh + (al * bh + ah * bl) / R + al * bl / R^2
+ *
+ * After multiplication, we have to renormalize by multiplying by
+ * R, so we wind up with
+ *	ah * bh * R + al * bh + ah * bl + al * bl / R
+ * which turns out to be a very nice way to compute this value.
+ * So long as ah and bh are < 65536, there's no loss of high bits,
+ * and the low order bits are below the threshold of caring for
+ * this application.
+ */
+static uint64_t
+mul(uint64_t a, uint64_t b)
+{
+	uint64_t al, ah, bl, bh;
+
+	al = a & 0xffffffff;
+	ah = a >> 32;
+	bl = b & 0xffffffff;
+	bh = b >> 32;
+	return ((ah * bh) << 32) + al * bh + ah * bl + ((al * bl) >> 32);
+}
+
+static int
+cmp(const void *a, const void *b)
+{
+	return (int)(*(sbintime_t *)a - *(sbintime_t *)b);
+}
+
+static void
+cam_iosched_update(struct iop_stats *iop, sbintime_t sim_latency)
+{
+	sbintime_t y, yy;
+	uint64_t var;
+
+	/*
+	 * Skip the first ~50 I/Os since experience has shown the first few
+	 * are atypical.
+	 */
+	if (iop->skipping < IOP_MAX_SKIP) {
+		iop->skipping++;
+		return;
+	}
+
+	/*
+	 * After the first few, the next few hundred are typical of what we'd
+	 * expect.
+	 */
+	if (iop->training < IOP_MAX_TRAINING) {
+		iop->data[iop->training++] = sim_latency;
+		if (iop->training == IOP_MAX_TRAINING) {
+			qsort(iop->data, nitems(iop->data),
+			    sizeof(iop->data[0]), cmp);
+			/* 95th percentile */
+			iop->worst = iop->data[IOP_MAX_TRAINING * 95 / 100];
+			/*
+			 * We expect about 5% of the I/Os to be above the 95th
+			 * percentile. However, the training happens early in
+			 * boot where things are somewhat ideal compared to
+			 * the fully loaded system, so allow for a wide margin
+			 * of error before declaring things outliers by
+			 * multiplying that by 4 to allow for expected
+			 * degradation.
+			 */
+			iop->worst *= 4;
+		}
+		return;
+	}
+
+	if (sim_latency > iop->worst)
+		iop->outliers++;
+
+	/*
+	 * We determine whether this new measurement is an outlier before we
+	 * update our estimators. If sd is 0, then we have no reliable
+	 * estimate for SD yet, so don't bother seeing if this is an outlier.
+	 */
+	if (iop->sd != 0 && sim_latency > iop->ema + 5 * iop->sd) {
+//		printf("Outlier: EMA: %ju SD: %ju lat %ju\n", (intmax_t)iop->ema,
+//		    (intmax_t)iop->sd, (intmax_t)sim_latency);
+		iop->outliers++;
+	}
+
+	/*
+	 * Classic exponentially decaying average with a tiny alpha
+	 * (2 ^ -alpha_bits). For more info see the NIST statistical
+	 * handbook.
+	 *
+	 * ema_t = y_t * alpha + ema_t-1 * (1 - alpha)
+	 * alpha = 1 / (1 << alpha_bits)
+	 *
+	 * Since alpha is a power of two, we can compute this w/o any mult or
+	 * division.
+	 */
+	y = sim_latency;
+	iop->ema = (y + (iop->ema << ALPHA_BITS) - iop->ema) >> ALPHA_BITS;
+
+	yy = mul(y, y);
+	iop->emss = (yy + (iop->emss << ALPHA_BITS) - iop->emss) >> ALPHA_BITS;
+
+	/*
+	 * s_1 = sum of data
+	 * s_2 = sum of data * data
+	 * ema ~ mean (or s_1 / N)
+	 * emss ~ s_2 / N
+	 *
+	 * sd = sqrt((N * s_2 - s_1 ^ 2) / (N * (N - 1)))
+	 *    = sqrt(N * s_2 / (N * (N - 1)) - s_1 ^ 2 / (N * (N - 1)))
+	 *
+	 * N ~ 2 / alpha - 1
+	 * alpha < 1 / 16 (typically much less)
+	 * N > 31 --> N large, so N * (N - 1) is approx N * N
+	 *
+	 * substituting and rearranging:
+	 * sd ~ sqrt(s_2 / N - (s_1 / N) ^ 2)
+	 *    ~ sqrt(emss - ema ^ 2)
+	 * which is the formula used here to get a decent estimate of sd which
+	 * we use to detect outliers. Note that when first starting up, it
+	 * takes a while for the emss sum of squares estimator to converge on
+	 * a good value. During this time, it can be less than ema^2. We
+	 * compute an sd of 0 in that case, and ignore outliers.
+	 */
+	var = iop->emss - mul(iop->ema, iop->ema);
+	iop->sd = (int64_t)var < 0 ? 0 : isqrt64(var);
+}
+
+static void
+cam_iosched_io_metric_update(struct cam_iosched_softc *isc,
+    sbintime_t sim_latency, int cmd, size_t size)
+{
+	/* xxx Do we need to scale based on the size of the I/O ? */
+	switch (cmd) {
+	case BIO_READ:
+		cam_iosched_update(&isc->read_stats, sim_latency);
+		break;
+	case BIO_WRITE:
+		cam_iosched_update(&isc->write_stats, sim_latency);
+		break;
+	case BIO_DELETE:
+		cam_iosched_update(&isc->trim_stats, sim_latency);
+		break;
+	default:
+		break;
+	}
+}
+
+#ifdef DDB
+static int
+biolen(struct bio_queue_head *bq)
+{
+	int i = 0;
+	struct bio *bp;
+
+	TAILQ_FOREACH(bp, &bq->queue, bio_queue) {
+		i++;
+	}
+	return i;
+}
+
+/*
+ * Show the internal state of the I/O scheduler.
+ */
+DB_SHOW_COMMAND(iosched, cam_iosched_db_show)
+{
+	struct cam_iosched_softc *isc;
+
+	if (!have_addr) {
+		db_printf("Need addr\n");
+		return;
+	}
+	isc = (struct cam_iosched_softc *)addr;
+	db_printf("pending_reads: %d\n", isc->pending_reads);
+	db_printf("min_reads: %d\n", isc->min_reads);
+	db_printf("max_reads: %d\n", isc->max_reads);
+	db_printf("reads: %d\n", isc->reads);
+	db_printf("in_reads: %d\n", isc->in_reads);
+	db_printf("out_reads: %d\n", isc->out_reads);
+	db_printf("queued_reads: %d\n", isc->queued_reads);
+	db_printf("Current Q len %d\n", biolen(&isc->bio_queue));
+	db_printf("pending_writes: %d\n", isc->pending_writes);
+	db_printf("min_writes: %d\n", isc->min_writes);
+	db_printf("max_writes: %d\n", isc->max_writes);
+	db_printf("writes: %d\n", isc->writes);
+	db_printf("in_writes: %d\n", isc->in_writes);
+	db_printf("out_writes: %d\n", isc->out_writes);
+	db_printf("queued_writes: %d\n", isc->queued_writes);
+	db_printf("Current Q len %d\n", biolen(&isc->write_queue));
+	db_printf("read_bias: %d\n", isc->read_bias);
+	db_printf("current_read_bias: %d\n", isc->current_read_bias);
+	db_printf("Trim active? %s\n",
+	    (isc->flags & CAM_IOSCHED_FLAG_TRIM_ACTIVE) ?
"yes" : "no"); +} +#endif +#endif Modified: projects/iosched/sys/cam/cam_xpt.c ============================================================================== --- projects/iosched/sys/cam/cam_xpt.c Fri Mar 6 00:24:07 2015 (r279681) +++ projects/iosched/sys/cam/cam_xpt.c Fri Mar 6 00:24:21 2015 (r279682) @@ -3304,6 +3304,7 @@ xpt_run_devq(struct cam_devq *devq) lock = (mtx_owned(sim->mtx) == 0); if (lock) CAM_SIM_LOCK(sim); + work_ccb->ccb_h.qos.sim_data = sbinuptime(); // xxx uintprt_t too small 32bit platforms (*(sim->sim_action))(sim, work_ccb); if (lock) CAM_SIM_UNLOCK(sim); @@ -4462,6 +4463,8 @@ xpt_done(union ccb *done_ccb) if ((done_ccb->ccb_h.func_code & XPT_FC_QUEUED) == 0) return; + /* Store the time the ccb was in the sim */ + done_ccb->ccb_h.qos.sim_data = sbinuptime() - done_ccb->ccb_h.qos.sim_data; hash = (done_ccb->ccb_h.path_id + done_ccb->ccb_h.target_id + done_ccb->ccb_h.target_lun) % cam_num_doneqs; queue = &cam_doneqs[hash]; @@ -4482,6 +4485,8 @@ xpt_done_direct(union ccb *done_ccb) if ((done_ccb->ccb_h.func_code & XPT_FC_QUEUED) == 0) return; + /* Store the time the ccb was in the sim */ + done_ccb->ccb_h.qos.sim_data = sbinuptime() - done_ccb->ccb_h.qos.sim_data; xpt_done_process(&done_ccb->ccb_h); } Modified: projects/iosched/sys/conf/options ============================================================================== --- projects/iosched/sys/conf/options Fri Mar 6 00:24:07 2015 (r279681) +++ projects/iosched/sys/conf/options Fri Mar 6 00:24:21 2015 (r279682) @@ -326,6 +326,7 @@ CAM_DEBUG_TARGET opt_cam.h CAM_DEBUG_LUN opt_cam.h CAM_DEBUG_FLAGS opt_cam.h CAM_BOOT_DELAY opt_cam.h +CAM_NETFLIX_IOSCHED opt_cam.h SCSI_DELAY opt_scsi.h SCSI_NO_SENSE_STRINGS opt_scsi.h SCSI_NO_OP_STRINGS opt_scsi.h