From owner-freebsd-current@FreeBSD.ORG Fri Oct 11 22:39:56 2013
From: Maksim Yevmenkin <maksim.yevmenkin@gmail.com>
Date: Fri, 11 Oct 2013 15:39:53 -0700
To: John-Mark Gurney
Cc: Maksim Yevmenkin, "current@freebsd.org"
Subject: Re: [rfc] small bioq patch
Message-Id: <72DA2C4F-44F0-456D-8679-A45CE617F8E6@gmail.com>
In-Reply-To: <20131011215210.GY56872@funkthat.com>
List-Id: Discussions about the use of FreeBSD-current

> On Oct 11, 2013, at 2:52 PM, John-Mark Gurney wrote:
>
> Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
>> i would like to submit the attached bioq patch for review and
>> comments. this is a proof of concept. it helps smooth disk read
>> service times and appears to eliminate outliers. please see the
>> attached pictures (about a week's worth of data)
>>
>> - c034 "control", unmodified system
>> - c044 patched system
>
> Can you describe how you got this data?  Were you using the gstat
> code or some other code?

Yes, it's basically gstat data.

> Also, was your control system w/ the patch, but w/ the sysctl set to
> zero to possibly eliminate any code alignment issues?

Both systems use the same code base and build. The patched system has the
patch included; the "control" system does not. I can rerun my tests with
the sysctl set to zero and use that as the "control". So the answer to
your question is "no".

>> graphs show max/avg disk read service times for both systems across 36
>> spinning drives. both systems are relatively busy serving production
>> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
>> represent time when systems are refreshing their content, i.e. disks
>> are both reading and writing at the same time.
>
> Can you describe why you think this change makes an improvement?  Unless
> you're running 10k or 15k RPM drives, 128 seems like a large number.. as
> that's about half the number of IOPS that a normal HD handles in a second..

Our (Netflix) load is basically random disk I/O. We have tweaked the system
to ensure that our I/O path is "wide" enough, i.e. we read 1 MB per disk I/O
for the majority of requests. However, the offsets we read from are all over
the place. It appears that we are getting into a situation where larger
offsets are delayed because smaller offsets keep "jumping" ahead of them.
Forcing a bioq insert-tail operation, which effectively moves the insertion
point forward, seems to help us avoid that situation. And no, we don't use
10k or 15k drives, just regular enterprise 7200 RPM SATA drives.
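Just to illustrate what I mean by forcing the insert-tail operation, here
is a rough sketch of the idea. This is only an illustration, not the actual
patch: bioq_batchsize, bioq_disksort_batched and the "batched" counter are
made-up names for this example, and a real version would keep the counter
per queue (in struct bio_queue_head) rather than in a static. Setting the
sysctl to 0 falls back to the stock sorted insert:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>
#include <sys/bio.h>

static int bioq_batchsize = 128;	/* 0 = always use the sorted insert */
SYSCTL_INT(_kern, OID_AUTO, bioq_batchsize, CTLFLAG_RW,
    &bioq_batchsize, 0, "sorted bioq inserts allowed before forcing a tail insert");

void
bioq_disksort_batched(struct bio_queue_head *head, struct bio *bp)
{
	static int batched;	/* would be a per-queue field in a real patch */

	if (bioq_batchsize > 0 && ++batched >= bioq_batchsize) {
		batched = 0;
		/* Tail insert moves the insertion point past everything queued. */
		bioq_insert_tail(head, bp);
		return;
	}
	/* Otherwise do the usual elevator (sorted) insert. */
	bioq_disksort(head, bp);
}

With a default of 128 the queue is still sorted most of the time, but the
periodic tail insert keeps the insertion point moving forward, so a steady
stream of smaller offsets cannot keep delaying an older, larger one
indefinitely.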
> I assume you must be regularly seeing queue depths of 128+ for this
> code to make a difference, do you see that w/ gstat?

No, we don't see large (128+) queue sizes in the gstat data. The way I see
it, we don't have to have a deep queue here. We could just have a steady
stream of I/O requests where new, smaller offsets consistently "jump" ahead
of older, larger ones. In fact, the gstat data show a shallow queue of 5 or
fewer items.

> Also, do you see a similar throughput of the system?

Yes, we see almost identical throughput from both systems. I have not pushed
the system to its limit yet, but having much smoother disk read service
times is important for us because we use them as one of the components of
our system health metrics. We also need to ensure that a disk I/O request is
actually dispatched to the disk in a timely manner.

Thanks
Max