Date:      Thu, 27 Dec 2018 16:10:04 +0100
From:      Mark Martinec <Mark.Martinec+freebsd@ijs.si>
To:        freebsd-stable@freebsd.org
Cc:        Terry Kennedy <TERRY@glaver.org>
Subject:   Re: mps and LSI SAS2308: controller resets on 12.0 - IOC Fault 0x40000d04, Resetting
Message-ID:  <82fe444636a26d115bed4ba1b31198fc@ijs.si>
In-Reply-To: <01R19S451BX0002B96@glaver.org>
References:  <01R19S451BX0002B96@glaver.org>

2018-12-26 22:26, Terry Kennedy wrote:
> The earlier LSI P20 releases were pretty flakey in some cases - try
> flashing 20.00.07.00.


Indeed.

I upgraded the LSI SAS2308 firmware from 20.00.02.00 to 20.00.07.00
a week ago, left it running for a while on 11.2, then upgraded to 12.0,
and the controller is now stable, even with the new mps driver that
ships with 12.0.

To recap:

  - the mps driver from FreeBSD 11.2 and earlier is stable with SAS2308
    firmware 20.00.02.00 _and_ 20.00.07.00

  - the mps driver from FreeBSD 12.0 causes frequent controller resets
    with SAS2308 firmware 20.00.02.00 (and ZFS can't cope with that),
    but is stable with 20.00.07.00.

Mark




On 2018-12-17 16:52, Mark Martinec wrote:
> One of our servers that was upgraded from 11.2 to 12.0 (to RC2 initially,
> then to RC3, and lastly to 12.0-RELEASE) is suffering severe instability
> of its disk controller, which resets itself a couple of times a day,
> usually under high disk load (poudriere builds, zfs scrub, or nightly
> file system scans). The same setup was rock-solid under 11.2 (and
> still/again is).
> 
> The disk controller is an LSI SAS2308. It has four disks attached as
> JBODs: one pair of SSDs and one pair of hard disks, each pair forming
> its own zpool. A controller reset can occur regardless of which pair
> is under heavy use.
> 
> The following appears in the logs just before the machine becomes
> unusable (though it is not always logged, as the disks may be dropped
> before syslog has a chance to write anything):
> 
>   xxx kernel: [2382] mps0: IOC Fault 0x40000d04, Resetting
>   xxx kernel: [2382] mps0: Reinitializing controller
>   xxx kernel: [2383] mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
>   xxx kernel: [2383] mps0: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
>   xxx kernel: [2383] (da0:mps0:0:0:0): Invalidating pack
> 
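
Counting these resets over time is a quick check; a minimal sketch,
assuming syslog writes to the stock /var/log/messages:

```shell
#!/bin/sh
# Count how many controller resets have been logged so far.
# The fault pattern is the one from the excerpt above; the log path
# is the default syslogd destination - adjust if yours differs.
LOG=${1:-/var/log/messages}
grep -c 'IOC Fault 0x40000d04' "$LOG"
```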
> The IOC Fault code is always the same. Apparently the disk controller
> resets, all disk devices are dropped, and ZFS finds itself with no
> disks. The machine still responds to ping, and when logged in during
> the event and running 'zpool status -v 1', zfs reports the loss of
> all devices in each pool:
> 
>   pool: data0
>  state: UNAVAIL
> status: One or more devices are faulted in response to IO failures.
> action: Make sure the affected devices are connected, then run 'zpool clear'.
>    see: http://illumos.org/msg/ZFS-8000-HC
>   scan: scrub repaired 0 in 0 days 03:53:41 with 0 errors on Sat Nov 17 00:22:38 2018
> config:
> 
>         NAME                      STATE     READ WRITE CKSUM
>         data0                     UNAVAIL      0     0     0
>           mirror-0                UNAVAIL      0    24     0
>             2396428274137360341   REMOVED      0     0     0  was /dev/gpt/da2-PN1334PCKAKD4S
>             16738407333921736610  REMOVED      0     0     0  was /dev/gpt/da3-PN2338P4GJ1XYC
> 
> (and similar for the other pool)
> 
> At this point the machine is unusable and needs to be hard-reset.
> 
> My guess is that after the controller resets, the disk devices do come
> up again (judging by the report on the console, which first states
> 'periph destroyed' and then lists full info on each disk), but zfs
> ignores them.
> 
> I don't see any mention of changes to the mps driver in the 12.0
> release notes, although diffing its sources between 11.2 and 12.0
> shows plenty of nontrivial changes.
> 
> After suffering this instability for some time, I finally downgraded
> the OS to 11.2, and things are back to normal again!
> 
> This downgrade path was nontrivial, as I had foolishly upgraded the
> pool features to those that come with 12.0, so downgrading involved
> dismantling both zfs mirror pools, recreating the pools without the
> two new features, and copying the data with zfs send/receive, all
> while the machine kept hanging during some of these operations. Not
> something for the faint of heart. I know, it was foolish of me to
> upgrade the pools after just one day of uptime on 12.0.
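
For anyone facing the same downgrade, the sequence described above boils
down to roughly the following. This is a dry run that only prints the
commands (the run wrapper is just echo); the pool, device, and feature
names are examples from my setup - check zpool-features(7) for which
features 11.2 actually supports before enabling any:

```shell
#!/bin/sh
# Dry-run sketch: recreate a pool without the 12.0-only features and
# copy the data across with zfs send/receive.  Remove the 'run'
# indirection to actually execute the commands.
run() { echo "$@"; }

run zfs snapshot -r data0@migrate
# 'zpool create -d' disables all features; explicitly enable only
# those the old OS understands (example features shown here)
run zpool create -d \
    -o feature@async_destroy=enabled \
    -o feature@lz4_compress=enabled \
    data0new mirror /dev/gpt/da2-new /dev/gpt/da3-new
run sh -c 'zfs send -R data0@migrate | zfs receive -F data0new'
```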
> 
> Some info on the controller:
> 
> kernel: mps0: <Avago Technologies (LSI) SAS2308> port 0xf000-0xf0ff mem 0xfbe40000-0xfbe4ffff,0xfbe00000-0xfbe3ffff irq 64 at device 0.0 numa-domain 1 on pci11
> kernel: mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
> 
> mpsutil shows:
> 
>   mps0 Adapter:
>     Board Name: LSI2308-IT
>     Board Assembly:
>     Chip Name: LSISAS2308
>     Chip Revision: ALL
>     BIOS Revision: 7.39.00.00
>     Firmware Revision: 20.00.02.00
>     Integrated RAID: no
> 
> 
> So, what has changed in the mps driver for this to be happening?
> Would it be possible to take the mps driver sources from 11.2,
> transplant them into 12.0, recompile, and use that? Could the new mps
> driver be using some new feature of the controller and hitting a
> firmware bug? I have resisted upgrading the SAS2308 firmware and its
> BIOS, as it works very well under 11.2.
> 
> Anyone else seen problems with mps driver and LSI SAS2308 controller?
> 
> (BTW, on another machine the mps driver with an LSI SAS2004 works
> just fine under 12.0.)
> 
>   Mark


