From: Maksim Yevmenkin <maksim.yevmenkin@gmail.com>
To: Maksim Yevmenkin, "current@freebsd.org"
Date: Tue, 15 Oct 2013 11:15:24 -0700
Subject: Re: [rfc] small bioq patch
In-Reply-To: <20131012001410.GA56872@funkthat.com>

On Fri, Oct 11, 2013 at 5:14 PM, John-Mark Gurney wrote:
> Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
>> > On Oct 11, 2013, at 2:52 PM, John-Mark Gurney wrote:
>> >
>> > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
>> >> i would like to submit the attached bioq patch for review and
>> >> comments. this is a proof of concept. it helps with smoothing disk read
>> >> service times and appears to eliminate outliers. please see the attached
>> >> pictures (about a week's worth of data)
>> >>
>> >> - c034 "control" unmodified system
>> >> - c044 patched system
>> >
>> > Can you describe how you got this data? Were you using the gstat
>> > code or some other code?
>>
>> Yes, it's basically gstat data.
>
> The reason I ask this is that I don't think the data you are getting
> from gstat is what you think it is... It accumulates time for a set
> of operations and then divides by the count... So I'm not sure if the
> stat improvements you are seeing are as meaningful as you might think
> they are...

yes, i'm aware of that. however, i'm not aware of "better" tools. we also
use dtrace and PCM/PMC. ktrace is not particularly useable for us because
it does not really work well when we push the system above 5 Gbps. in order
to actually see any "issues" we need to push the system to the 10 Gbps
range at least.
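to make the averaging concern concrete: a gstat-style per-interval figure is
accumulated time divided by operation count, so one very slow request mostly
disappears into the average while the per-interval maximum still shows it.
a toy userland illustration (not gstat or devstat code; the sample values
are made up):

#include <stdio.h>

int
main(void)
{
	/* hypothetical per-I/O service times (ms) seen in one polling interval */
	double ms[] = { 4, 5, 4, 6, 5, 4, 250, 5 };	/* one 250 ms outlier */
	double sum, max;
	int i, n;

	n = (int)(sizeof(ms) / sizeof(ms[0]));
	sum = max = 0;
	for (i = 0; i < n; i++) {
		sum += ms[i];
		if (ms[i] > max)
			max = ms[i];
	}
	/* gstat-style number: accumulated time divided by operation count */
	printf("avg %.1f ms, max %.1f ms\n", sum / n, max);
	return (0);
}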
>> >> graphs show max/avg disk read service times for both systems across 36
>> >> spinning drives. both systems are relatively busy serving production
>> >> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
>> >> represent time when systems are refreshing their content, i.e. disks
>> >> are both reading and writing at the same time.
>> >
>> > Can you describe why you think this change makes an improvement? Unless
>> > you're running 10k or 15k RPM drives, 128 seems like a large number.. as
>> > that's about half the number of IOPs that a normal HD handles in a second..
>>
>> Our (Netflix) load is basically random disk io. We have tweaked the system
>> to ensure that our io path is "wide" enough, i.e. we read 1MB per disk io
>> for the majority of the requests. However, the offsets we read from are
>> all over the place. It appears that we are getting into a situation where
>> larger offsets are getting delayed because smaller offsets are "jumping"
>> ahead of them. Forcing a bioq insert tail operation, and effectively moving
>> the insertion point, seems to help avoid getting into this situation. And,
>> no, we don't use 10k or 15k drives. Just regular enterprise 7200 SATA drives.
>
> I assume that the 1mb reads are then further broken up into 8 128kb
> reads? so it's more like every 16 reads in your work load that you
> insert the "ordered" io...

i'm not sure where 128kb comes from. are you referring to MAXPHYS/DFLTPHYS?
if so, then, no, we have increased *PHYS to 1MB.

> I want to make sure that we choose the right value for this number..
> What number of IOPs are you seeing?

generally we see < 100 IOPs per disk on a system pushing 10+ Gbps. i've
experimented with different numbers on our system and i did not see much of
a difference on our workload. i'm up to a value of 1024 now. higher numbers
seem to produce a slightly bigger difference between average and max time,
but i do not think it's statistically meaningful. the general shape of the
curve remains smooth for all values tried so far.
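a minimal sketch of the kind of change being discussed, not the actual patch:
leave most inserts on the normal sorted path, but force every N-th request in
through bioq_insert_tail(), which also moves the insertion point so that new
low-offset requests stop jumping ahead of requests already queued. the wrapper
function, the static counter (a real change would presumably keep it per
queue), and the threshold of 1024 taken from the numbers above are assumptions
for illustration only.

#include <sys/param.h>
#include <sys/bio.h>

static int bioq_batch = 1024;		/* interval between forced tail inserts */
static int bioq_count;			/* illustration only: not per-queue */

static void
bioq_disksort_batched(struct bio_queue_head *head, struct bio *bp)
{

	if (bioq_batch > 0 && ++bioq_count >= bioq_batch) {
		bioq_count = 0;
		bioq_insert_tail(head, bp);	/* also moves the insertion point */
	} else
		bioq_disksort(head, bp);	/* normal sorted (elevator) insert */
}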
[...]

>> > Also, do you see a similar throughput of the system?
>>
>> Yes. We do see almost identical throughput from both systems. I have not
>> pushed the system to its limit yet, but having a much smoother disk read
>> service time is important for us because we use it as one of the components
>> of our system health metrics. We also need to ensure that a disk io request
>> is actually dispatched to the disk in a timely manner.
>
> Per above, have you measured at the application layer that you are
> getting better latency times on your reads? Maybe by doing a ktrace
> of the io, and calculating times between read and return or something
> like that...

ktrace is not particularly useful. i can see if i can come up with a dtrace
probe or something. our application (or rather its clients) is _very_
sensitive to latency. having outliers in read service times is not very good
for us.

> Have you looked at the geom disk schedulers work that Luigi did a few
> years back? There have been known issues w/ our io scheduler for a
> long time... If you search the mailing lists, you'll see lots of
> reports of some processes starving out others, probably due to a
> similar issue... I've seen similar unfair behavior between processes,
> but haven't spent the time tracking it down...

yes, we have looked at it. it makes things worse for us, unfortunately.

> It does look like a good improvement though...
>
> Thanks for the work!

ok :) i'm interested to hear from people who have a different workload
profile, for example lots of iops, i.e. very small file reads or something
like that.

thanks,
max
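for what it's worth, the application-layer measurement suggested above (timing
between read issue and return) can be approximated in userland without ktrace.
this is only a rough sketch: the 1 MB read size mirrors the workload described
earlier, the offset is arbitrary, and a real test would need uncached data
(e.g. random offsets across a file much larger than RAM).

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	struct timespec t0, t1;
	ssize_t n;
	char *buf;
	int fd;

	if (argc != 2)
		errx(1, "usage: readlat <file>");
	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "open");
	if ((buf = malloc(1024 * 1024)) == NULL)
		err(1, "malloc");

	/* time a single 1 MB read from issue to return */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	n = pread(fd, buf, 1024 * 1024, 0);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("pread returned %zd bytes in %.3f ms\n", n,
	    (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);

	free(buf);
	close(fd);
	return (0);
}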