Date:      Fri, 11 Oct 2013 15:39:53 -0700
From:      Maksim Yevmenkin <maksim.yevmenkin@gmail.com>
To:        John-Mark Gurney <jmg@funkthat.com>
Cc:        Maksim Yevmenkin <emax@freebsd.org>, "current@freebsd.org" <current@freebsd.org>
Subject:   Re: [rfc] small bioq patch
Message-ID:  <72DA2C4F-44F0-456D-8679-A45CE617F8E6@gmail.com>
In-Reply-To: <20131011215210.GY56872@funkthat.com>
References:  <CAFPOs6pXhDjj1JTY0JNaw8g=zvtw9NgDVeJTQW-=31jwj321mQ@mail.gmail.com> <20131011215210.GY56872@funkthat.com>



> On Oct 11, 2013, at 2:52 PM, John-Mark Gurney <jmg@funkthat.com> wrote:
>
> Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
>> i would like to submit the attached bioq patch for review and
>> comments. this is a proof of concept. it helps with smoothing disk read
>> service times and appears to eliminate outliers. please see attached
>> pictures (about a week's worth of data)
>>
>> - c034 "control" unmodified system
>> - c044 patched system
>
> Can you describe how you got this data?  Were you using the gstat
> code or some other code?

Yes, it's basically gstat data.

> Also, was your control system w/ the patch, but w/ the sysctl set to
> zero to possibly eliminate any code alignment issues?

Both systems use the same code base and build. The patched system has the patch included; the "control" system does not. I can rerun my tests with the sysctl set to zero and use that as the "control". So, the answer to your question is "no".

>> graphs show max/avg disk read service times for both systems across 36
>> spinning drives. both systems are relatively busy serving production
>> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
>> represent time when systems are refreshing their content, i.e. disks
>> are both reading and writing at the same time.
>
> Can you describe why you think this change makes an improvement?  Unless
> you're running 10k or 15k RPM drives, 128 seems like a large number.. as
> that's about half the number of IOPS that a normal HD handles in a second..

Our (Netflix) load is basically random disk I/O. We have tweaked the system to ensure that our I/O path is "wide" enough, i.e. we read 1 MB per disk I/O for the majority of requests. However, the offsets we read from are all over the place. It appears that we are getting into a situation where reads at larger offsets are delayed because smaller offsets keep "jumping" ahead of them. Forcing a bioq insert-tail operation, which effectively moves the insertion point, seems to help us avoid getting into this situation. And no, we don't use 10k or 15k drives, just regular enterprise 7200 RPM SATA drives.
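To illustrate the idea, here is a throwaway userland model. It is NOT the actual patch; the struct layout, names, and the tiny cap used in main() are made up for this sketch. The concept is: do the usual offset-sorted (elevator) insert, but once any queued request we would jump ahead of has already been bypassed some number of times, fall back to a plain tail insert so nothing starves indefinitely.

/*
 * Throwaway userland sketch of the idea -- NOT the kernel patch.
 * Requests are normally inserted in offset-sorted (elevator) order;
 * once a queued request has been bypassed 'insert_cap' times, new
 * arrivals go to the tail instead.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/queue.h>

struct req {
	long long        offset;     /* byte offset of the read */
	int              bypassed;   /* later arrivals sorted in ahead of us */
	TAILQ_ENTRY(req) link;
};

TAILQ_HEAD(reqq, req);

static int insert_cap = 128;         /* sysctl-like knob; 0 = plain sort */

static void
reqq_insert(struct reqq *q, struct req *r)
{
	struct req *cur, *pos = NULL;

	/* Elevator sort position: first queued request with a larger offset. */
	TAILQ_FOREACH(cur, q, link) {
		if (cur->offset > r->offset) {
			pos = cur;
			break;
		}
	}
	if (pos == NULL) {
		TAILQ_INSERT_TAIL(q, r, link);
		return;
	}

	/* If anything we would jump ahead of has already been bypassed
	 * 'insert_cap' times, append at the tail instead. */
	if (insert_cap != 0) {
		for (cur = pos; cur != NULL; cur = TAILQ_NEXT(cur, link)) {
			if (cur->bypassed >= insert_cap) {
				TAILQ_INSERT_TAIL(q, r, link);
				return;
			}
		}
	}

	for (cur = pos; cur != NULL; cur = TAILQ_NEXT(cur, link))
		cur->bypassed++;             /* these just got jumped ahead of */
	TAILQ_INSERT_BEFORE(pos, r, link);
}

int
main(void)
{
	struct reqq q;
	struct req *r;
	long long offs[] = { 900000000LL, 1000, 2000, 3000, 4000, 5000 };
	size_t i;

	insert_cap = 2;                  /* tiny cap so the fallback is visible */
	TAILQ_INIT(&q);
	for (i = 0; i < sizeof(offs) / sizeof(offs[0]); i++) {
		r = calloc(1, sizeof(*r));
		r->offset = offs[i];
		reqq_insert(&q, r);
	}
	TAILQ_FOREACH(r, &q, link)
		printf("offset %9lld  bypassed %d\n", r->offset, r->bypassed);
	return (0);
}

In the real patch the queue is the bioq itself and the threshold is the sysctl mentioned above; the point here is only that bounding the number of sorted inserts ahead of an old request bounds its service time.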

> I assume you must be regularly seeing queue depths of 128+ for this
> code to make a difference, do you see that w/ gstat?

No, we don't see large (128+) queue sizes in the gstat data. The way I see it, we don't have to have a deep queue here. We could just have a steady stream of I/O requests where new, smaller offsets consistently "jump" ahead of older, larger offsets. In fact, the gstat data show a shallow queue of 5 or fewer items.
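A toy example of what I mean (again, just an illustration, not real code): if the drive completes as many requests per "tick" as arrive, the queue never gets deeper than 4, yet a single large-offset read can sit there tick after tick while smaller offsets keep sorting in ahead of it.

#include <stdio.h>

#define QMAX 16

int
main(void)
{
	long long q[QMAX];
	int depth = 0, tick, i, j, bypassed = 0;
	long long big = 900000000LL;    /* one read at a large offset */

	q[depth++] = big;

	for (tick = 1; tick <= 10; tick++) {
		/* three smaller-offset reads arrive and are sort-inserted */
		for (i = 0; i < 3; i++) {
			long long off = 1000LL * (tick * 3 + i);

			j = depth;
			while (j > 0 && q[j - 1] > off) {
				q[j] = q[j - 1];
				j--;
			}
			q[j] = off;
			depth++;
			bypassed++;     /* one more read ahead of 'big' */
		}
		printf("tick %2d: peak depth %d, %d reads bypassed the large offset\n",
		    tick, depth, bypassed);

		/* the drive completes the three smallest-offset reads */
		for (i = 0; i < 3; i++) {
			for (j = 1; j < depth; j++)
				q[j - 1] = q[j];
			depth--;
		}
	}
	return (0);
}

Arrivals and completions obviously interleave on the real system, but the effect is the same: the delay of the large-offset read is bounded by the arrival pattern, not by the queue depth.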

> Also, do you see a similar throughput of the system?

Yes, we see almost identical throughput from both systems. I have not pushed the system to its limit yet, but having much smoother disk read service times is important for us because we use them as one of the components of our system health metrics. We also need to ensure that disk I/O requests are actually dispatched to the disk in a timely manner.

Thanks
Max



