Date: Thu, 27 Dec 2018 16:10:04 +0100
From: Mark Martinec
To: freebsd-stable@freebsd.org
Cc: Terry Kennedy
Subject: Re: mps and LSI SAS2308: controller resets on 12.0 - IOC Fault 0x40000d04, Resetting
Organization: Jozef Stefan Institute

2018-12-26 22:26, Terry Kennedy wrote:
> The earlier LSI P20 releases were pretty flakey in some cases - try
> flashing 20.00.07.00.

Indeed. I upgraded the LSI SAS2308 firmware from 20.00.02.00 to
20.00.07.00 a week ago, left it running for a while with 11.2, then
upgraded again to 12.0, and the controller is stable now, even with
the new mps driver that came with 12.0.

To recap:

- the mps driver from FreeBSD 11.2 and earlier is stable with SAS2308
  firmware 20.00.02.00 _and_ 20.00.07.00

- the mps driver from FreeBSD 12.0 causes frequent controller resets
  with SAS2308 firmware 20.00.02.00 (and ZFS can't cope with that),
  but is stable with 20.00.07.00.

  Mark


2018-12-17 16:52, Mark Martinec wrote:
> One of our servers that was upgraded from 11.2 to 12.0 (to RC2
> initially, then to RC3 and lastly to 12.0-RELEASE) is suffering
> severe instability of its disk controller, which resets itself a
> couple of times a day, usually under high disk usage (like poudriere
> builds, zfs scrub or nightly file system scans).
> The same setup was rock-solid under 11.2 (and still/again is).
>
> The disk controller is an LSI SAS2308. It has four disks attached as
> JBODs, one pair of SSDs and one pair of hard disks, each pair forming
> its own zpool. A controller reset can occur regardless of which pair
> is in heavy use.
>
> The following can be found in the logs just before the machine
> becomes unusable (although it is not always logged, as the disks may
> be dropped before syslog has a chance to write anything):
>
>   xxx kernel: [2382] mps0: IOC Fault 0x40000d04, Resetting
>   xxx kernel: [2382] mps0: Reinitializing controller
>   xxx kernel: [2383] mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
>   xxx kernel: [2383] mps0: IOCCapabilities: 5a85c
>   xxx kernel: [2383] (da0:mps0:0:0:0): Invalidating pack
>
> The IOC Fault location is always the same. Apparently the disk
> controller resets, all disk devices are dropped and ZFS finds itself
> with no disks. The machine still responds to ping, and if one is
> logged in during the event and running zpool status -v 1, zfs reports
> loss of all devices for each pool:
>
>     pool: data0
>    state: UNAVAIL
>   status: One or more devices are faulted in response to IO failures.
>   action: Make sure the affected devices are connected, then run 'zpool clear'.
>      see: http://illumos.org/msg/ZFS-8000-HC
>     scan: scrub repaired 0 in 0 days 03:53:41 with 0 errors on Sat Nov 17 00:22:38 2018
>   config:
>
>     NAME                      STATE     READ WRITE CKSUM
>     data0                     UNAVAIL      0     0     0
>       mirror-0                UNAVAIL      0    24     0
>         2396428274137360341   REMOVED      0     0     0  was /dev/gpt/da2-PN1334PCKAKD4S
>         16738407333921736610  REMOVED      0     0     0  was /dev/gpt/da3-PN2338P4GJ1XYC
>
> (and similar for the other pool)
>
> At this point the machine is unusable and needs to be hard-reset.
>
> My guess is that after the controller resets, the disk devices come
> up again (according to the report seen on the console, stating
> 'periph destroyed' first, then listing full info on each disk) - but
> zfs ignores them.
>
> I don't see any mention of changes to the mps driver in the 12.0
> release notes, although diffing its sources between 11.2 and 12.0
> shows plenty of nontrivial changes.
>
> After suffering this instability for some time, I finally downgraded
> the OS to 11.2, and things are back to normal again!
>
> This downgrade path was nontrivial, as I had foolishly upgraded the
> pool features to what comes with 12.0, so downgrading involved
> dismantling both zfs mirror pools, recreating the pools without the
> two new features, and copying the data with zfs send/receive, all
> while the machine hung during some of these operations. Not something
> for the faint of heart. I know, foolish of me to upgrade pools after
> just one day of uptime with 12.0.
>
> Some info on the controller:
>
>   kernel: mps0: port 0xf000-0xf0ff mem 0xfbe40000-0xfbe4ffff,0xfbe00000-0xfbe3ffff irq 64 at device 0.0 numa-domain 1 on pci11
>   kernel: mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
>
> mpsutil shows:
>
>   mps0 Adapter:
>          Board Name: LSI2308-IT
>      Board Assembly:
>           Chip Name: LSISAS2308
>       Chip Revision: ALL
>       BIOS Revision: 7.39.00.00
>   Firmware Revision: 20.00.02.00
>     Integrated RAID: no
>
> So, what has changed in the mps driver for this to be happening?
> Would it be possible to take the mps driver sources from 11.2,
> transplant them to 12.0, recompile, and use that? Could the new mps
> driver be using some new feature of the controller and hitting a
> firmware bug? I have resisted upgrading the SAS2308 firmware and its
> BIOS, as it is working very well under 11.2.
>
> Has anyone else seen problems with the mps driver and the LSI SAS2308
> controller?
>
> (btw, on another machine the mps driver with LSI SAS2004 is working
> just fine under 12.0)
>
>   Mark
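
For anyone wanting to check other machines against the recap at the top
of this message, the following is a minimal sketch in Python that scans
kernel log lines of the form quoted above. It is not taken from the mps
driver or from mpsutil. The function names, the regular expressions and
the rule "anything older than 20.00.07.00 is suspect" are assumptions
made for illustration only; this thread tested just 20.00.02.00 and
20.00.07.00.

```python
import re

# Assumption: 20.00.07.00 is used as the threshold only because it is the
# release reported stable with the 12.0 mps driver in this thread.
KNOWN_GOOD_FIRMWARE = (20, 0, 7, 0)

# Matches lines such as:
#   mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
FIRMWARE_RE = re.compile(r"(mps\d+): Firmware: (\d+(?:\.\d+){3}), Driver: (\S+)")

# Matches lines such as:
#   mps0: IOC Fault 0x40000d04, Resetting
FAULT_RE = re.compile(r"(mps\d+): IOC Fault (0x[0-9a-fA-F]+), Resetting")


def parse_version(text):
    """Turn '20.00.02.00' into the comparable tuple (20, 0, 2, 0)."""
    return tuple(int(part) for part in text.split("."))


def scan_log(lines):
    """Collect firmware versions and IOC faults per mps device.

    Returns (firmware, faults): firmware maps a device name to its
    version tuple, faults is a list of (device, fault_code) pairs.
    """
    firmware = {}
    faults = []
    for line in lines:
        match = FIRMWARE_RE.search(line)
        if match:
            firmware[match.group(1)] = parse_version(match.group(2))
        match = FAULT_RE.search(line)
        if match:
            faults.append((match.group(1), match.group(2)))
    return firmware, faults


def needs_firmware_update(version, known_good=KNOWN_GOOD_FIRMWARE):
    """True if the firmware is older than the release reported stable here."""
    return version < known_good


if __name__ == "__main__":
    sample = [
        "xxx kernel: [2382] mps0: IOC Fault 0x40000d04, Resetting",
        "xxx kernel: [2383] mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd",
    ]
    firmware, faults = scan_log(sample)
    for device, version in firmware.items():
        if needs_firmware_update(version):
            print(device, "firmware predates 20.00.07.00")
    for device, code in faults:
        print(device, "reported IOC Fault", code)
```

Feeding it the contents of /var/log/messages (one line per list element)
would be the obvious use. Versions are compared as integer tuples, so
20.00.02.00 sorts before 20.00.07.00 regardless of zero padding.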