From owner-freebsd-stable@FreeBSD.ORG Mon Oct 27 17:22:44 2008
Message-ID: <4905F8BB.3080302@sh.cvut.cz>
Date: Mon, 27 Oct 2008 18:22:03 +0100
From: Vaclav Haisman
User-Agent: Thunderbird 2.0.0.17 (X11/20081017)
CC: freebsd-stable@freebsd.org
References: <4905951B.2050602@sh.cvut.cz> <20081027160828.GA24496@icarus.home.lan>
In-Reply-To: <20081027160828.GA24496@icarus.home.lan>
Content-Type: text/plain; charset=ISO-8859-1
Subject: Re: Short SMART check causes disk op timeouts
List-Id: Production branch of FreeBSD source code

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Jeremy Chadwick wrote:
> On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> Hi,
>> I have recently bought a new disk (Seagate 500G, ST3500320NS). I have
>> enabled SMART checking using the smartmontools as usual for the disk
>> (/dev/ad6 -a -S on -s (S/../.././03|L/../../7/03) -m root). The problem
>> is that each time the test runs I get messages like the following in
>> /var/log/messages:
>>
>> Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=836986454
>> Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=836986454
>> Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=836986454
>> Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464, length=16384)]error = 5
>>
>> And the SMART test results log on the disk contains a line like this:
>>
>> # 1  Short offline  Interrupted (host reset)  00%  297  -
>
> First and foremost, your above smartd.conf -s flags are conflicting.
> Your long offline test will never get run on Sunday; the short will run
> first, and the long won't ever start (because the short is already
> running). I would recommend telling the short test to run only between
> days 0-6, leaving Sunday solely for the long test. (I noticed this
> because the above "Interrupted" test indicates a short test was
> interrupted and not a long).

Thanks, I had not noticed the overlap at all.

>
> Second, your short offline test runs at 0300, but the errors you're
> seeing are at 0454 in the morning.
> A short offline test does not take 2 hours to run -- they take between
> 2-10 minutes -- unless the system is also in the middle of doing a lot
> of I/O, in which case the short test will be suspended.
>
> There are cronjobs (specifically periodic jobs) that run starting at
> 0301 in the morning ("periodic daily"), and many of those are I/O bound.
> This could possibly extend the length of the short test until 0454.
>
> Weekly periodic jobs run at 0415 in the morning, on Sundays. These also
> perform a lot of disk I/O, so it's possible that on Sunday specifically
> the short SMART test gets pushed back quite some time.
>
> Third, the DMA timeouts you're seeing are possibly caused by the drive
> taking too long when internally suspending the SMART test.
>
> In most cases, it's safe for SMART tests (short and long) to be run
> while the machine is operational, and disk I/O requests are being
> performed. When an I/O request comes and the disk is in the middle of
> performing a SMART test, the drive has to stop the SMART test (e.g.
> "suspend" it), complete the I/O request, then resume the SMART test.
>
> The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
> doesn't receive an acknowledgement back from the controller (disk)
> within 5 seconds, it'll report a timeout on whatever operation it was
> performing. I'm thinking the disk gets stuck in a "do the offline
> test, no wait stop there's an I/O request, okay it's done continue the
> test, no wait stop there's another I/O" loop.

Can I raise the timeout? Just for the sake of elimination.

>
> Another possibility is that your drive really *does* have a bad block at
> LBA 836986454, and that one of those cron/periodic jobs is what's
> noticing it, and that upon noticing a bad block, the drive more or less
> aborts the SMART test to perform internal remapping of the block.
>
> To confirm this, you would need to boot the SeaTools utilities from DOS
> or from a CD (see Seagate's site) and run a full sector scan (NOT the
> "quick" test). This takes a few hours. Assuming it comes back clean,
> then my above claim of the offline test taking too long to suspend is
> probably the case.
>
> Possibly this is a firmware bug in the drive -- you might consider
> mailing Seagate about this problem, although I'm doubting their Tier 1
> support will understand what the issue is.
>
> Is the block number always the same? Do you only see this error on
> Sundays? These are two questions which might help narrow things down.

Nope, the LBA is always different and I see it in the logs once every day.

>
>> This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
>> kernel.
>>
>> Now, does the timeout cause loss of any data? Is there anything besides
>> disabling the testing that I can do about it?
>
> Do you understand what short and long offline tests actually do and what
> they're used for? :-) If so, you'd know that running them periodically
> is more or less silly (IMHO).

I do not, not completely :) I think I just copied the settings from
somewhere and tweaked them a bit whenever I added a disk.

>
> If you're trying to accomplish a cheap version of disk scrubbing, e.g.
> scanning the entire disk for bad blocks and reporting them or having them
> automatically remapped by the drive, consider using sysutils/diskcheckd,
> which was made for this purpose. However, be aware of a problem I've
> run into with it (still needs someone clueful to figure out why this
> happens):
> http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/115853
>
> I do not advocate the use of periodic offline tests on disks, especially
> at such aggressive intervals (daily). In fact, I don't even know why
> Bruce added that option to smartd. There are only a few attributes in
> SMART which get updated on offline tests, so I fail to see the point.
>
> You shouldn't be doing what you're doing, IMHO. If you want to do
> these tests once every 2 weeks or once a month, that'd be a better idea.
> Stick with the short test, and do it during a time when disk I/O is
> very low (try something like 7am on a Saturday). Don't go with 2am
> if your system/environment honours Daylight Saving Time, because that
> could cause the test to run twice.

OK, I am taking the advice and have set longer intervals between checks.
Thanks for such an extensive answer.

- -- 
VH
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (FreeBSD)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iFYEAREIAAYFAkkF+LoACgkQhQBMvHf/WHmX3ADfTosXsJI0wAKl1MT7PCvBpmOm
WnK9GavuuFsptwDgnjD0+tLGkZ2EEXjiXnvN/6wkz+wMWPCXYcHpGQ==
=oDRL
-----END PGP SIGNATURE-----
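
The schedule overlap discussed in this thread can be checked offline. What
follows is a minimal Python sketch, not part of smartmontools. It relies on
the smartd.conf(5) description of -s: the regular expression is compared
against a string of the form T/MM/DD/d/HH, where d runs from 1 (Monday) to
7 (Sunday), so "every day except Sunday" is written [1-6]. The helper name
due_tests, the REVISED and MONTHLY patterns, and the use of a whole-string
match are assumptions made for illustration; Python's re module is only
standing in for the POSIX regex engine smartd uses.

```python
import re
from datetime import datetime


def due_tests(schedule, when):
    """Return the test types (L = long, S = short) whose pattern matches `when`.

    Hypothetical helper: it mimics the T/MM/DD/d/HH string that
    smartd.conf(5) documents for the -s directive.
    """
    due = []
    for test_type in ("L", "S"):
        # isoweekday() uses the same numbering as smartd: 1 = Monday, 7 = Sunday.
        candidate = f"{test_type}/{when:%m/%d}/{when.isoweekday()}/{when:%H}"
        # Assumption: the pattern has to match the whole candidate string.
        if re.fullmatch(schedule, candidate):
            due.append(test_type)
    return due


# Schedule from the original message: short test daily, long test on Sunday.
ORIGINAL = r"(S/../.././03|L/../../7/03)"
# Illustrative revision: short test Monday to Saturday, long test on Sunday.
REVISED = r"(S/../../[1-6]/03|L/../../7/03)"
# Illustrative low-frequency schedule: short test on the first Saturday
# of each month at 07:00.
MONTHLY = r"S/../0[1-7]/6/07"

sunday_3am = datetime(2008, 10, 26, 3)
monday_3am = datetime(2008, 10, 27, 3)

print(due_tests(ORIGINAL, sunday_3am))  # ['L', 'S']: both tests want the same slot
print(due_tests(REVISED, sunday_3am))   # ['L']
print(due_tests(REVISED, monday_3am))   # ['S']
print(due_tests(MONTHLY, datetime(2008, 11, 1, 7)))  # ['S'] (first Saturday)
print(due_tests(MONTHLY, datetime(2008, 11, 8, 7)))  # [] (second Saturday)
```

The sketch only shows which patterns match a given hour; which of two
matching tests smartd actually starts depends on the smartd version.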