From: Maksim Yevmenkin <maksim.yevmenkin@gmail.com>
To: Maksim Yevmenkin, "current@freebsd.org"
Date: Tue, 15 Oct 2013 11:15:24 -0700
Subject: Re: [rfc] small bioq patch
In-Reply-To: <20131012001410.GA56872@funkthat.com>

On Fri, Oct 11, 2013 at 5:14 PM, John-Mark Gurney wrote:
> Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
>> > On Oct 11, 2013, at 2:52 PM, John-Mark Gurney wrote:
>> >
>> > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
>> >> i would like to submit the attached bioq patch for review and
>> >> comments. this is a proof of concept. it helps with smoothing disk read
>> >> service times and appears to eliminate outliers. please see the attached
>> >> pictures (about a week's worth of data)
>> >>
>> >> - c034 "control" unmodified system
>> >> - c044 patched system
>> >
>> > Can you describe how you got this data? Were you using the gstat
>> > code or some other code?
>>
>> Yes, it's basically gstat data.
>
> The reason I ask this is that I don't think the data you are getting
> from gstat is what you think it is... It accumulates time for a set
> of operations and then divides by the count... So I'm not sure if the
> stat improvements you are seeing are as meaningful as you might think
> they are...

yes, i'm aware of that. however, i'm not aware of "better" tools. we also
use dtrace and PCM/PMC. ktrace is not particularly useable for us because
it does not really work well when we push the system above 5 Gbps. in order
to actually see any "issues" we need to push the system to the 10 Gbps
range at least.
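to make the averaging concern concrete: a gstat-style per-interval figure is
accumulated time divided by operation count, so one very slow request mostly
disappears into the average while the per-interval maximum still shows it.
a toy userland illustration (not gstat or devstat code; the sample values
are made up):

#include <stdio.h>

int
main(void)
{
	/* hypothetical per-I/O service times (ms) seen in one polling interval */
	double ms[] = { 4, 5, 4, 6, 5, 4, 250, 5 };	/* one 250 ms outlier */
	double sum, max;
	int i, n;

	n = (int)(sizeof(ms) / sizeof(ms[0]));
	sum = max = 0;
	for (i = 0; i < n; i++) {
		sum += ms[i];
		if (ms[i] > max)
			max = ms[i];
	}
	/* gstat-style number: accumulated time divided by operation count */
	printf("avg %.1f ms, max %.1f ms\n", sum / n, max);
	return (0);
}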
>> >> graphs show max/avg disk read service times for both systems across 36
>> >> spinning drives. both systems are relatively busy serving production
>> >> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
>> >> represent time when systems are refreshing their content, i.e. disks
>> >> are both reading and writing at the same time.
>> >
>> > Can you describe why you think this change makes an improvement? Unless
>> > you're running 10k or 15k RPM drives, 128 seems like a large number.. as
>> > that's about half the number of IOPs that a normal HD handles in a second..
>>
>> Our (Netflix) load is basically random disk io. We have tweaked the system
>> to ensure that our io path is "wide" enough, i.e. we read 1MB per disk io
>> for the majority of the requests. However, the offsets we read from are
>> all over the place. It appears that we are getting into a situation where
>> larger offsets are getting delayed because smaller offsets are "jumping"
>> ahead of them. Forcing a bioq insert tail operation, and effectively moving
>> the insertion point, seems to help avoid getting into this situation. And,
>> no, we don't use 10k or 15k drives. Just regular enterprise 7200 SATA drives.
>
> I assume that the 1mb reads are then further broken up into 8 128kb
> reads? so it's more like every 16 reads in your work load that you
> insert the "ordered" io...

i'm not sure where 128kb comes from. are you referring to MAXPHYS/DFLTPHYS?
if so, then, no, we have increased *PHYS to 1MB.

> I want to make sure that we choose the right value for this number..
> What number of IOPs are you seeing?

generally we see < 100 IOPs per disk on a system pushing 10+ Gbps. i've
experimented with different numbers on our system and i did not see much of
a difference on our workload. i'm up to a value of 1024 now. higher numbers
seem to produce a slightly bigger difference between average and max time,
but i do not think it's statistically meaningful. the general shape of the
curve remains smooth for all values tried so far.
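a minimal sketch of the kind of change being discussed, not the actual patch:
leave most inserts on the normal sorted path, but force every N-th request in
through bioq_insert_tail(), which also moves the insertion point so that new
low-offset requests stop jumping ahead of requests already queued. the wrapper
function, the static counter (a real change would presumably keep it per
queue), and the threshold of 1024 taken from the numbers above are assumptions
for illustration only.

#include <sys/param.h>
#include <sys/bio.h>

static int bioq_batch = 1024;		/* interval between forced tail inserts */
static int bioq_count;			/* illustration only: not per-queue */

static void
bioq_disksort_batched(struct bio_queue_head *head, struct bio *bp)
{

	if (bioq_batch > 0 && ++bioq_count >= bioq_batch) {
		bioq_count = 0;
		bioq_insert_tail(head, bp);	/* also moves the insertion point */
	} else
		bioq_disksort(head, bp);	/* normal sorted (elevator) insert */
}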
[...]

>> > Also, do you see a similar throughput of the system?
>>
>> Yes. We do see almost identical throughput from both systems. I have not
>> pushed the system to its limit yet, but having a much smoother disk read
>> service time is important for us because we use it as one of the components
>> of our system health metrics. We also need to ensure that a disk io request
>> is actually dispatched to the disk in a timely manner.
>
> Per above, have you measured at the application layer that you are
> getting better latency times on your reads? Maybe by doing a ktrace
> of the io, and calculating times between read and return or something
> like that...

ktrace is not particularly useful. i can see if i can come up with a dtrace
probe or something. our application (or rather its clients) is _very_
sensitive to latency. having outliers in read service times is not very good
for us.

> Have you looked at the geom disk schedulers work that Luigi did a few
> years back? There have been known issues w/ our io scheduler for a
> long time... If you search the mailing lists, you'll see lots of
> reports of some processes starving out others, probably due to a
> similar issue... I've seen similar unfair behavior between processes,
> but haven't spent the time tracking it down...

yes, we have looked at it. it makes things worse for us, unfortunately.

> It does look like a good improvement though...
>
> Thanks for the work!

ok :) i'm interested to hear from people who have a different workload
profile, for example lots of iops, i.e. very small file reads or something
like that.

thanks,
max
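for what it's worth, the application-layer measurement suggested above (timing
between read issue and return) can be approximated in userland without ktrace.
this is only a rough sketch: the 1 MB read size mirrors the workload described
earlier, the offset is arbitrary, and a real test would need uncached data
(e.g. random offsets across a file much larger than RAM).

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	struct timespec t0, t1;
	ssize_t n;
	char *buf;
	int fd;

	if (argc != 2)
		errx(1, "usage: readlat <file>");
	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "open");
	if ((buf = malloc(1024 * 1024)) == NULL)
		err(1, "malloc");

	/* time a single 1 MB read from issue to return */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	n = pread(fd, buf, 1024 * 1024, 0);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("pread returned %zd bytes in %.3f ms\n", n,
	    (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);

	free(buf);
	close(fd);
	return (0);
}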