From: Dan Nelson
To: Grant Peel
Cc: freebsd-questions@freebsd.org
Date: Fri, 4 May 2007 10:21:05 -0500
Subject: Re: SCSI + camcontrol

In the last episode (May 04), Grant Peel said:
> A few weeks back, I turned on mod_gzip in apache and as a result,
> the /tmp directory filled up with .wrk files causing the root
> filesystem to fill to capacity. When we noticed what was happening,
> on May 1 we had no choice but to cold boot the machine as it was,
> for all purposes locked up.
>
> In the security run, for May 1 and May 3 I am seeing the SCSI errors below.
>
> FreeBSD 4.7 (yes we are going to upgrade soon (migrating to a newly setup machine)),
> Apache 1.3.26
> We do have complete dumps (From may1),
> The machine is a vintage 2003 Dell SC1400
> HD = 1 Fujitsu SCSI that has never had problems before.
>
> Questions:
>
> Do the errors below TRUELY indicate pending doom?
>
> Can camcontrol be used to squash the errors?
>
> Should FSCK be used to fix?
>
> Are these errors (the text below), formatted from the FreeBSD kernel
> or are they shown as reported by the HD subsystem? i.e. where can I
> go to read what the errors actually mean?

Those are errors reported by the drive:

> May 3 03:59:13 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7a df 0 0 80 0
> May 3 03:59:13 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
> May 3 03:59:14 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
>
> May 1 03:29:28 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 3 ab d5 c1 0 0 e 0
> May 1 03:29:31 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:3abd5c1 asc:11,1
> May 1 03:29:31 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f

The drive tried to read the indicated block numbers (0x4217b55 and
0x3abd5c1) and couldn't, even after multiple retries. If it had been
able to recover the data after retrying, it would have reallocated the
block to a spare sector. There isn't an easy way to map a raw block
number to a filename, but if you can determine that the files
belonging to the blocks were old, your drive is probably still okay,
and you happened to trip over some weak spots on the disk that lost
their data over time. If they were recently-generated files, then I'd
start worrying about getting that new system up as soon as possible.

One thing to try would be running "dd if=/dev/da0 of=/dev/null bs=64k"
and seeing how many more errors get generated.
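
Going back to the block numbers in the log for a moment: if you want to
check where each failing block sits inside the READ(10) request that
hit it, the CDB bytes can be decoded with a few lines of Python. This
is only a sketch. It assumes the kernel prints the ten command bytes in
hex without leading zeros, as in the lines quoted above, and the
function names decode_read10 and failing_offset are made up for
illustration. In a READ(10) CDB, bytes 2-5 hold the starting block
(big-endian) and bytes 7-8 hold the transfer length in blocks.

```python
def decode_read10(cdb_text):
    """Decode a READ(10) CDB given as space-separated hex bytes.

    Returns (start_lba, block_count).
    """
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    # READ(10) is a 10-byte command whose opcode is 0x28.
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB: %r" % cdb_text)
    start_lba = int.from_bytes(bytes(cdb[2:6]), "big")    # bytes 2-5
    block_count = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7-8
    return start_lba, block_count


def failing_offset(cdb_text, info_hex):
    """Offset of the failing block within the request, or None if outside it."""
    start, count = decode_read10(cdb_text)
    bad = int(info_hex, 16)  # the "info:" field from the MEDIUM ERROR line
    if start <= bad < start + count:
        return bad - start
    return None


if __name__ == "__main__":
    # CDB and info values copied from the two log entries quoted above.
    entries = (
        ("28 0 4 21 7a df 0 0 80 0", "4217b55"),
        ("28 0 3 ab d5 c1 0 0 e 0", "3abd5c1"),
    )
    for cdb, info in entries:
        start, count = decode_read10(cdb)
        print("start=0x%x count=%d bad=0x%s offset=%s"
              % (start, count, info, failing_offset(cdb, info)))
```

For the two entries above this gives a 128-block read starting at
0x4217adf with the bad block 118 blocks in, and a 14-block read
starting at 0x3abd5c1 with the bad block being the very first one.
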
Installing smartmontools and comparing the output of
"smartctl -a /dev/da0" before and after the dd run will also tell you
how many ECC recoveries and rereads were done.

-- 
	Dan Nelson
	dnelson@allantgroup.com