From: Chuck Swiger
To: Ryan Merrell
Cc: "freebsd-questions@freebsd.org"
Date: Mon, 06 Feb 2012 10:50:57 -0800
Subject: Re: Multiple errors on server -- Where do I start looking?

On Feb 6, 2012, at 8:15 AM, Ryan Merrell wrote:
> We have an Intel modular blade server. The chassis has 2x 3-disk RAID(5) arrays. Volume 1 is what the OS (FreeBSD 7.2) is installed on and Volume 2 is mounted at /usr. These two volumes are da0 and da1.

This doesn't matter directly to your issue, but a 3-disk RAID-5 setup is not a great choice. With six disks available, you'd almost certainly do better with either a single 6-disk-wide RAID-5 or a RAID-10.

> I got email notifications saying the web host I run in a jail hosted on this server was down. I try to SSH into it, but it fails. I ping it and I get a 50% return rate. So I log in to the management blade and start a virtual KVM sessions to get into the blade. Once I'm into the basehost blade, I cat dmesg.today and get a slew of errors. Here we go..
>
> (da3:mpt0:0:6:1): Logical unit not accessible, target port in standby state
> (da3:mpt0:0:6:1): Retrying Command (per Sense Data)
> (da3:mpt0:0:6:1): READ(10). CDB: 28 0 0 0 0 0 0 0 1 0
> (da3:mpt0:0:6:1): CAM Status: SCSI Status Error
> (da3:mpt0:0:6:1): SCSI Status: Check Condition
> (da3:mpt0:0:6:1): ILLEGAL REQUEST asc:4,b
> (da3:mpt0:0:6:1): Logical unit not accessible, target port in standby state
> (da3:mpt0:0:6:1): Retrying Command (per Sense Data)
> (da3:mpt0:0:6:1): READ(10). CDB: 28 0 0 0 0 0 0 0 1 0
> (da3:mpt0:0:6:1): CAM Status: SCSI Status Error
> (da3:mpt0:0:6:1): SCSI Status: Check Condition
> (da3:mpt0:0:6:1): ILLEGAL REQUEST asc:4,b
> (da3:mpt0:0:6:1): Logical unit not accessible, target port in standby state
> (da3:mpt0:0:6:1): Retries Exhausted
>
> As mentioned before, our two volumes are da0 and da1. /dev lists da2 and da3 as well, but I have no idea what they are.
> How do I figure out what da3 is and what do the above error messages say about it? Someone on the forum asked me if the two volumes are on the same controller and the answer is yes, they are.

Check the dmesg after a reboot, or take a look at "camcontrol devlist" (or "atacontrol list" for any ATA disks); that ought to provide more information. Since you're also using GEOM labels, "glabel status" is likely to be informative as well.

> GEOM_LABEL: Label for provider da0s1a is ufsid/4aeb03874c64d9f1.
> GEOM_LABEL: Label for provider da0s1d is ufsid/4aeb038ae8ae24cf.
> GEOM_LABEL: Label for provider da0s1e is ufsid/4aeb0387d999941a.
> GEOM_LABEL: Label for provider da0s1f is ufsid/4aeb038766c4c807.
> Trying to mount root from ufs:/dev/da0s1a
> GEOM_LABEL: Label ufsid/4aeb03874c64d9f1 removed.
> GEOM_LABEL: Label for provider da0s1a is ufsid/4aeb03874c64d9f1.
> GEOM_LABEL: Label ufsid/4aeb0387d999941a removed.
> GEOM_LABEL: Label ufsid/4bd2077f23a6cc93 removed.
> GEOM_LABEL: Label for provider da0s1e is ufsid/4aeb0387d999941a.
> GEOM_LABEL: Label for provider da1s1 is ufsid/4bd2077f23a6cc93.
> GEOM_LABEL: Label ufsid/4aeb038766c4c807 removed.
> GEOM_LABEL: Label for provider da0s1f is ufsid/4aeb038766c4c807.
> GEOM_LABEL: Label ufsid/4aeb038ae8ae24cf removed.
> GEOM_LABEL: Label for provider da0s1d is ufsid/4aeb038ae8ae24cf.
> GEOM_LABEL: Label ufsid/4aeb03874c64d9f1 removed.
> GEOM_LABEL: Label ufsid/4aeb0387d999941a removed.
> GEOM_LABEL: Label ufsid/4aeb038766c4c807 removed.
> GEOM_LABEL: Label ufsid/4aeb038ae8ae24cf removed.
> GEOM_LABEL: Label ufsid/4bd2077f23a6cc93 removed.
>
> Was root unmounted? Whats going on here? Obviously there's some issue with da0, which is mounted at /. The server has been up and running fine, so why am I seeing "Trying to mount root from ufs:/dev/da0s1a"?

These are standard messages from GEOM: it's trying to look at the disk labels and figure out where to mount the various filesystems.
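
Going back to the da3 messages quoted earlier: the CDB bytes in those lines can be decoded by hand. The following is only a rough sketch in Python (my own illustration, not output from any FreeBSD tool; decode_read10 is a made-up helper name). READ(10) carries the opcode in byte 0, the starting block address in bytes 2-5, and the transfer length in bytes 7-8, both big-endian, and dmesg prints each byte in hex:

```python
# Sketch: decode the 10-byte READ(10) CDB printed in the da3 dmesg lines.
# decode_read10 is a hypothetical helper, not part of any FreeBSD utility.

def decode_read10(cdb_text):
    """Decode a READ(10) CDB given as space-separated hex bytes."""
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    if len(cdb) != 10:
        raise ValueError("READ(10) CDB must be exactly 10 bytes")
    if cdb[0] != 0x28:
        raise ValueError("opcode 0x%02x is not READ(10)" % cdb[0])
    return {
        "opcode": cdb[0],
        # Bytes 2-5: logical block address, most significant byte first.
        "lba": int.from_bytes(bytes(cdb[2:6]), "big"),
        # Bytes 7-8: number of blocks to transfer.
        "blocks": int.from_bytes(bytes(cdb[7:9]), "big"),
    }

# The CDB exactly as it appears in the quoted log.
fields = decode_read10("28 0 0 0 0 0 0 0 1 0")
print(fields)
```

That works out to a single-block read at LBA 0. Taken together with the "target port in standby state" text in the same messages, these lines may point at an inactive path to the storage rather than a dying disk, but treat that as a guess until camcontrol shows what da3 actually is.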

> pid 93248 (httpd), uid 80: exited on signal 10
> pid 95624 (httpd), uid 80: exited on signal 10
> pid 97956 (httpd), uid 80: exited on signal 10
> pid 97935 (httpd), uid 80: exited on signal 10
> pid 96603 (httpd), uid 80: exited on signal 10
> pid 93210 (httpd), uid 80: exited on signal 10
> pid 98246 (httpd), uid 80: exited on signal 10
>
> This is apparently whats killing our webserver. Apache receives a signal 10 and quits.. Everything I've read says it's an issue with Apache trying to access RAM that it shouldn't or that doesn't exist.. Is there something else with the above da0 or da3 errors that would cause a SIGBUS on httpd?

That's unclear, but normally a failing disk will cause I/O to block and the httpds will simply hang, not crash. Most likely, you've got a bug lurking in one of the Apache modules you use (mod_php is a likely candidate). Run a test instance of httpd under gdb using the -X flag and see whether you can get better information. Or unlimit coredumpsize, and run gdb against the corefile to see what's causing the crash.

Regards,
--
-Chuck