From: Jason Wolfe
To: freebsd-scsi@freebsd.org
Date: Wed, 2 Nov 2011 11:05:44 -0700
Subject: Re: mps/LSI SAS2008 controller crashes when smartctl is run with upped disk tags

On Tue, Nov 1, 2011 at 11:13 AM, Jason Wolfe wrote:

> Luckily remote syslogging is enabled, so while nothing is kept locally, we
> see messages similar to these transmitted before the server hangs,
> requiring a power cycle:
>
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 510
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 713
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 942
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 356
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 492
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 976
> (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID 339
> (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID 746
> (da5:mps0:0:6:0): SCSI command timeout on device handle 0x000f SMID 74
> (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID 613
> (da2:mps0:0:3:0): SCSI command timeout on device handle 0x000c SMID 16
> (da10:mps0:0:11:0): SCSI command timeout on device handle 0x0014 SMID 305
> (da1:mps0:0:2:0): SCSI command timeout on device handle 0x000b SMID 74
> (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID 594
>
> In some cases that would be followed by this, which would usually be the
> last transmission, though we don't see this in all cases. It may just be
> that the system isn't always alive long enough to transmit it:
>
> kernel: mps0: IOC Fault 0x40006003, Resetting

Hello,

Testing with the LSI-supplied driver, it appears to have a code path that
handles the condition that causes our driver to crash. Here are 2 sets of
messages:

mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff80003fb000 cm 0xffffff800040bdf8
(da0:mpslsi0:0:8:0): WRITE(10). CDB: 2a 0 55 bf 5a 3f 0 1 0 0 length 131072 SMID 97 command timeout cm 0xffffff800040bdf8 ccb 0xffffff00
mpslsi0: mpssas_alloc_tm freezing simq
mpslsi0: timedout cm 0xffffff800040bdf8 allocated tm 0xffffff8000409070
(da0:mpslsi0:0:8:0): READ(10). CDB: 28 0 55 96 48 7f 0 0 80 0 length 65536 SMID 171 completed cm 0xffffff80004105a8 ccb 0xffffff03c3443y
(da0:mpslsi0:0:8:0): READ(10). CDB: 28 0 54 f8 a4 3f 0 0 80 0 length 65536 SMID 762 completed cm 0xffffff8000434230 ccb 0xffffff001317ay
(da0:mpslsi0:0:8:0): WRITE(10). CDB: 2a 0 55 bf 5a 3f 0 1 0 0 length 131072 SMID 97 completed timedout cm 0xffffff800040bdf8 ccb 0xffff1
(noperiph:mpslsi0:0:8:0): SMID 50 finished recovery after aborting TaskMID 97
mpslsi0: mpssas_free_tm releasing simq

mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff80003fb000 cm 0xffffff8000441e18
(da7:mpslsi0:0:15:0): WRITE(10). CDB: 2a 0 33 76 29 ef 0 1 0 0 length 131072 SMID 989 command timeout cm 0xffffff8000441e18 ccb 0xfffff0
mpslsi0: mpssas_alloc_tm freezing simq
mpslsi0: timedout cm 0xffffff8000441e18 allocated tm 0xffffff80004063e0
(da7:mpslsi0:0:15:0): READ(10). CDB: 28 0 71 14 a1 4f 0 1 0 0 length 131072 SMID 857 completed cm 0xffffff8000439e38 ccb 0xffffff001316y
(da7:mpslsi0:0:15:0): READ(10). CDB: 28 0 71 e4 98 57 0 0 80 0 length 65536 SMID 300 completed cm 0xffffff80004182a0 ccb 0xffffff0392f0y
(da7:mpslsi0:0:15:0): WRITE(10). CDB: 2a 0 33 76 29 ef 0 1 0 0 length 131072 SMID 989 completed timedout cm 0xffffff8000441e18 ccb 0xff1
(noperiph:mpslsi0:0:15:0): SMID 4 finished recovery after aborting TaskMID 989
mpslsi0: mpssas_free_tm releasing simq

The server ran for 10 minutes with these happening every 10-30 seconds. With
our community driver, the first instance of commands timing out during this
smartctl storm would cause the server to hang and sometimes the controller
to reset.

Hopefully this is helpful to someone.

Jason