From owner-freebsd-stable@FreeBSD.ORG Wed Feb 15 00:10:22 2012 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EC61E106567C for ; Wed, 15 Feb 2012 00:10:22 +0000 (UTC) (envelope-from jdc@koitsu.dyndns.org) Received: from qmta03.emeryville.ca.mail.comcast.net (qmta03.emeryville.ca.mail.comcast.net [76.96.30.32]) by mx1.freebsd.org (Postfix) with ESMTP id CC0568FC1C for ; Wed, 15 Feb 2012 00:10:22 +0000 (UTC) Received: from omta17.emeryville.ca.mail.comcast.net ([76.96.30.73]) by qmta03.emeryville.ca.mail.comcast.net with comcast id ZyzN1i0091afHeLA30AN9c; Wed, 15 Feb 2012 00:10:22 +0000 Received: from koitsu.dyndns.org ([67.180.84.87]) by omta17.emeryville.ca.mail.comcast.net with comcast id a0AM1i00i1t3BNj8d0AM6X; Wed, 15 Feb 2012 00:10:22 +0000 Received: by icarus.home.lan (Postfix, from userid 1000) id DEB2B102C1E; Tue, 14 Feb 2012 16:10:20 -0800 (PST) Date: Tue, 14 Feb 2012 16:10:20 -0800 From: Jeremy Chadwick To: Victor Balada Diaz Message-ID: <20120215001020.GA9386@icarus.home.lan> References: <20120214091909.GP2010@equilibrium.bsdes.net> <20120214100513.GA94501@icarus.home.lan> <20120214135435.GQ2010@equilibrium.bsdes.net> <20120214141601.GA98986@icarus.home.lan> <4F3A83DE.3000200@ambtec.de> <20120214165029.GA1852@icarus.home.lan> <4F3A971F.9040407@omnilan.de> <20120214221527.GT2010@equilibrium.bsdes.net> <20120214230958.GA8434@icarus.home.lan> <20120214233420.GU2010@equilibrium.bsdes.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120214233420.GU2010@equilibrium.bsdes.net> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Harald Schmalzbauer , Alexander Motin , freebsd-stable@freebsd.org, Claudius Herder Subject: Re: problems with AHCI on FreeBSD 8.2 X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 15 Feb 2012 00:10:23 -0000 On Wed, Feb 15, 2012 at 12:34:20AM +0100, Victor Balada Diaz wrote: > On Tue, Feb 14, 2012 at 03:09:58PM -0800, Jeremy Chadwick wrote: > > On Tue, Feb 14, 2012 at 11:15:27PM +0100, Victor Balada Diaz wrote: > > > On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote: > > > > schrieb Jeremy Chadwick am 14.02.2012 17:50 (localtime): > > > > > On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote: > > > > >> Hello, > > > > >> > > > > >> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still > > > > >> persists on FreeBSD 9.0 release. > > > > >> > > > > >> Switching from ahci to ataahci resolved the problem for me too. > > > > >> > > > > >> I'm using gmirror for swap, system is on a zpool and the problem first > > > > >> occurred during a zpool scrub, but it is easily reproducible with dd. > > > > >> > > > > >> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1} > > > > >> of=/dev/null is not an issue. > > > > >> Sometimes I need to power off the server because after a reboot one disk > > > > >> is still missing. > > > > >> > > > > >> I really would like to help in this issue, so let me know if you need > > > > >> any more information. > > > > > I find it interesting that, at least so far, the only people reporting > > > > > problems of this type with the ahci.ko driver are people using Samsung > > > > > disks. The only difference is that your models are F1s while the OPs > > > > > are F2s. > > > > > > > > I saw such timeouts long ago and mav@ had a look at my postings and he > > > > mentioned it could be a NCQ problem. > > > > I suspected the disks firmware. > > > > I never tracked it down further, because after replacing the Samsung (F3 > > > > in that case) disks with hitachi ones solved all my problems and gave a > > > > big performance kick as well (with zfs). > > > > You can find the discussion here: > > > > http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html > > > > > > > > > > You gave me a good idea: try to disable NCQ and see if that's the fault. So > > > i went and applied the attached patch. After it, i can no longer reproduce > > > the issue with ahci driver. > > > > > > I know this is not a solution because it disables NCQ at controller level > > > instead of disk level, but at least we know for sure where the problem is. > > > > > > I think the solution would be to add a new quirk ADA_Q_NONCQ in sys/cam/ata/ata_da.c. > > > Quirks infraestructure is already built, so adding a new quirk for this seems > > > easy. > > > > > > Is someone interested? Do you think there is a better solution? > > > > > > If someone is interested i can build a patch to add ADA_Q_NONCQ quirk and add my drives > > > to it. > > > > I took a stab at this, but I don't feel confident this is the proper > > solution/method. I worry there's some sort of chicken-or-the-egg > > condition here (quirk setup/matching comes *after* SATA capabilities > > detection), or that it makes the code messier. Need mav@'s > > recommendations on this. > > > > Below is for RELENG_8. I should note I haven't tested if this works, or > > even compiles -- normally I don't provide such patches without testing > > so I apologise in advance / user beware. > > You're amazingly fast. Thanks for all your help :) > > You start applying the quirks before > > snprintf(announce_buf, sizeof(announce_buf), > "kern.cam.ada.%d.quirks", periph->unit_number); > quirks = softc->quirks; > TUNABLE_INT_FETCH(announce_buf, &quirks); > > So you're breaking quirk setting at boot time. I'm too tired to quite understand (in full) what's wrong with my patch, but I think you're referring to situations where someone would have kern.cam.ada.X.quirks set in loader.conf? If so, I believe that same situation would happen presently if someone set kern.cam.ada.X.quirks in their loader.conf to a value that did not contain bit #0 set to 1, and used one of the 4K sector disks listed in ada_quirk_table -- what's in loader.conf looks like it would overwrite whatever the kernel code bits chose automatically: 910 match = cam_quirkmatch((caddr_t)&cgd->ident_data, 911 (caddr_t)ada_quirk_table, 912 sizeof(ada_quirk_table)/sizeof(*ada_quirk_table), 913 sizeof(*ada_quirk_table), ata_identify_match); 914 if (match != NULL) 915 softc->quirks = ((struct ada_quirk_entry *)match)->quirks; 916 else 917 softc->quirks = ADA_Q_NONE; ... 931 snprintf(announce_buf, sizeof(announce_buf), 932 "kern.cam.ada.%d.quirks", periph->unit_number); 933 quirks = softc->quirks; 934 TUNABLE_INT_FETCH(announce_buf, &quirks); 935 softc->quirks = quirks; I read this to mean: Lines 910-917 -- if there's a device ID string match in ada_quirk_table, set softc->quirks to the content of that struct entry. So, for example, 4K sector disks would set softc->quirks to 0x01. Lines 931-933 -- assign the "quirks" variable to what softc->quirks currently contains. Thus, for 4K sector disks, quirks = 0x01. Line 934 -- load into "quirks" variable the contents of loader.conf entries pertaining to kern.cam.ada.N.quirks, if set. If someone had an entry in loader.conf saying kern.cam.ada.N.quirks=0 then yes, this would overwrite what the kernel "auto-chose". Line 935 -- assign softc->quirks = quirks, thus softc->quirks = 0x00, assuming someone set it to such in loader.conf. > See my attached patch. I can confirm it works for me. I thought of taking that approach, but for me it's "dirty". Here's what I mean by that: ADA_FLAG_CAN_NCQ gets set in softc->flags around line 892, but then roughly a hundred lines later, you clear the exact same flag you just set (based on quirk conditionals). I dunno how people feel about that, but my impression is that it's dirty/confusing. My opinion is to only set the bit once and not mess about with repeated if() statements that set it, then clear it, etc... -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB |