From owner-freebsd-questions@FreeBSD.ORG  Fri May  4 15:21:06 2007
Date: Fri, 4 May 2007 10:21:05 -0500
From: Dan Nelson <dnelson@allantgroup.com>
To: Grant Peel <gpeel@thenetnow.com>
Message-ID: <20070504152105.GA18612@dan.emsphone.com>
References: <00e301c78e40$652f5380$6501a8c0@GRANT>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <00e301c78e40$652f5380$6501a8c0@GRANT>
X-OS: FreeBSD 6.2-STABLE
User-Agent: Mutt/1.5.15 (2007-04-06)
Cc: freebsd-questions@freebsd.org
Subject: Re: SCSI + camcontrol

In the last episode (May 04), Grant Peel said:
>  A few weeks back, I turned on mod_gzip in apache and as a result,
>  the /tmp directory filled up with .wrk files causing the root
>  filesystem to fill to capacity. When we noticed what was happening,
>  on May 1 we had no choice but to cold boot the machine as it was,
>  for all purposes locked up.
> 
> In the security run, for May 1 and May 3 I am seeing the SCSI errors below.
> 
> FreeBSD 4.7 (yes we are going to upgrade soon (migrating to a newly setup machine)),
> Apache 1.3.26
> We do have complete dumps (from May 1),
> The machine is a vintage 2003 Dell SC1400 
> HD = 1 Fujitsu SCSI that has never had problems before.
> 
> Questions:
> 
> Do the errors below TRULY indicate pending doom?
> 
> Can camcontrol be used to squash the errors?
> 
> Should FSCK be used to fix?
> 
> Are these errors (the text below), formatted from the FreeBSD kernel
> or are they shown as reported by the HD subsystem? i.e. where can I
> go to read what the errors actually mean?

Those are errors reported by the drive:
 
> May  3 03:59:13 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7a df 0 0 80 0
> May  3 03:59:13 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
> May  3 03:59:14 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
>
> May  1 03:29:28 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 3 ab d5 c1 0 0 e 0
> May  1 03:29:31 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:3abd5c1 asc:11,1
> May  1 03:29:31 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f

The drive tried to read the indicated blocks (0x4217b55 and 0x3abd5c1)
and couldn't, even after multiple retries.  If it had been able to
recover the data on a retry, it would have reallocated the block to a
spare sector.
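
To see how the two log lines relate, here is a minimal Python sketch
that decodes the READ(10) CDB bytes from the kernel message and checks
where the "info:" block falls inside the request.  It assumes the
standard READ(10) layout (opcode 0x28, 4-byte big-endian block address
in bytes 2-5, 2-byte transfer length in bytes 7-8); the function names
are made up for illustration.

```python
def parse_read10(cdb_text):
    """Decode a READ(10) CDB printed as space-separated hex bytes.

    Returns (starting block, number of blocks requested).
    """
    cdb = [int(b, 16) for b in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")      # bytes 2-5
    length = int.from_bytes(bytes(cdb[7:9]), "big")   # bytes 7-8
    return lba, length


def locate(cdb_text, info_hex):
    """Relate the failing block ("info:" field) to the read request."""
    lba, length = parse_read10(cdb_text)
    bad = int(info_hex, 16)
    return {
        "start": lba,
        "length": length,
        "bad": bad,
        "offset": bad - lba,                      # blocks into the request
        "in_range": lba <= bad < lba + length,
    }


if __name__ == "__main__":
    # CDB and info values copied from the log lines quoted above.
    for cdb, info in [("28 0 4 21 7a df 0 0 80 0", "4217b55"),
                      ("28 0 3 ab d5 c1 0 0 e 0", "3abd5c1")]:
        print(locate(cdb, info))
```

For the May 3 error the request starts at 0x4217adf and the failing
block is partway into it; for the May 1 error the very first block of
the request is the one that failed.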

There isn't an easy way to map a raw block number to a filename, but if
you can determine that the files those blocks belong to were old, your
drive is probably still okay, and you happened to trip over some weak
spots on the disk that lost their data over time.  If they were
recently-generated files, then I'd start worrying about getting that
new system up as soon as possible.
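
As a first step toward locating the data, the block numbers can be
turned into byte offsets on the raw device.  The sketch below assumes
512-byte sectors (check the drive's real sector size first) and that
the reported number counts from the start of da0, not from the start of
a slice or partition; the helper names are hypothetical.  It also
builds a dd command line that re-reads just the one suspect sector.

```python
SECTOR_SIZE = 512  # assumption: verify against the drive's actual sector size


def sector_to_offset(block_hex, sector_size=SECTOR_SIZE):
    """Return (decimal sector number, byte offset from start of device)."""
    sector = int(block_hex, 16)
    return sector, sector * sector_size


def reread_command(block_hex, device="/dev/da0", sector_size=SECTOR_SIZE):
    """Build a dd command that reads only the given sector."""
    sector, _ = sector_to_offset(block_hex, sector_size)
    return "dd if={} of=/dev/null bs={} skip={} count=1".format(
        device, sector_size, sector)


if __name__ == "__main__":
    # Block numbers taken from the "info:" fields in the log.
    for block in ("4217b55", "3abd5c1"):
        print(block, sector_to_offset(block))
        print(reread_command(block))
```

To go from there to a filesystem you would still have to subtract the
starting sector of the slice and partition the block lands in.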

One thing to try would be "dd if=/dev/da0 of=/dev/null bs=64k
conv=noerror" (conv=noerror keeps dd reading past a bad block instead
of stopping at the first one), and see how many more errors get
generated.  Installing smartmontools and comparing the output of
"smartctl -a /dev/da0" before and after will also tell you how many ECC
recoveries and rereads were done.
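
One way to do the before/after comparison, assuming you saved the
smartctl output to two text files (the file names below are made up),
is a plain line diff.  This sketch doesn't try to parse smartctl's
counters; it only shows which lines changed.

```python
import difflib


def changed_lines(before_text, after_text):
    """Return only the lines that differ, prefixed with '-' or '+'."""
    diff = list(difflib.unified_diff(
        before_text.splitlines(), after_text.splitlines(),
        fromfile="before", tofile="after", lineterm="", n=0))
    # Skip the two file-header lines and the "@@" hunk markers.
    return [line for line in diff[2:] if not line.startswith("@@")]


def changed_lines_from_files(before_path, after_path):
    """Same comparison, reading the two saved outputs from disk."""
    with open(before_path) as f:
        before_text = f.read()
    with open(after_path) as f:
        after_text = f.read()
    return changed_lines(before_text, after_text)
```

For example, after "smartctl -a /dev/da0 > before.txt", the dd run, and
"smartctl -a /dev/da0 > after.txt", calling
changed_lines_from_files("before.txt", "after.txt") lists the lines
whose values moved.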

-- 
	Dan Nelson
	dnelson@allantgroup.com