Date:      Thu, 2 Mar 2017 00:16:58 -0500
From:      Zaphod Beeblebrox <zbeeble@gmail.com>
To:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Disk controller heizenbug.
Message-ID:  <CACpH0Mdu7g2YCUphtZ_2P0T7-Ju9XH0QGoL-pSGei6nDQtpnvA@mail.gmail.com>

I have a disk controller.  It works in a modern AMD motherboard at home
(9590 processor), but when connected to a Sun Fire 4140 (Opteron 2345 based
machine, vintage 2008-ish) the disks spontaneously detach just from doing a
"zfs import".

The board has its own mounting for the flash disks (two of them) and
probes as:

ahci0: <Marvell 88SE9230 AHCI SATA controller> port
0x8c00-0x8c07,0x8880-0x8883,0x8800-0x8807,0x8480-0x8483,0x8400-0x841f mem
0xdfbff800-0xdfbfffff irq 16 at device 0.0 numa-domain 0 on pci3

The disks show up as:

ada0 at ahcich0 bus 0 scbus6 target 0 lun 0
ada0: <Samsung SSD 850 EVO mSATA 250GB EMT41B6Q> ACS-2 ATA SATA 3.x device
ada0: Serial Number S248NXAH112465B
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 238475MB (488397168 512 byte sectors)
ada0: quirks=0x3<4K,NCQ_TRIM_BROKEN>

Under heavy bonnie++ load, they work in the AMD 9590 system.  On the Opteron
machine, the following occurs:

ahcich1: Timeout on slot 11 port 0
ahcich1: is ffffffff cs ffffffff ss ffffffff rs 00000800 tfd ffffffff serr
ffffffff cmd ffffffff
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 90 e0 20 a0 40 17 00 00
00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command
ahcich1: stopping AHCI engine failed
ahcich0: ada1 at ahcich1 bus 0 scbus7 target 0 lun 0
Timeout on slot 31 port 0
ada1: ahcich0: <Samsung SSD 850 EVO mSATA 250GB EMT41B6Q>is ffffffff cs
ffffffff ss ffffffff rs 80000000 tfd ffffffff serr ffffffff cmd ffffffff
 s/n S248NXAH112471L detached
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 90 e0 20 a0 40 17 00 00
00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ahcich0: stopping AHCI engine failed
ada0 at ahcich0 bus 0 scbus6 target 0 lun 0
ada0: <Samsung SSD 850 EVO mSATA 250GB EMT41B6Q> s/n S248NXAH112465B
detached

I'm posting here to hackers because this seems to violate layering: on the
AMD machine it runs fine, even under load.  The SATA bus is local to the
card (and so travels with it to the server), yet the error looks like a
SATA bus or drive error.
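
One detail in the dump above may be relevant.  Every value that looks like
a register read (is, cs, ss, tfd, serr, cmd) comes back as ffffffff, while
rs has exactly one bit set, matching the slot named in the timeout line
(00000800 is bit 11, 80000000 is bit 31).  I am assuming rs is the driver's
own software slot mask rather than a hardware read.  All-ones reads are
commonly what a host sees when a PCI/PCIe device has stopped responding,
which would point below the SATA layer, though that is an interpretation
and not something the log proves.  A minimal Python sketch of that check
(the function names and the HW_REGS grouping are hypothetical, not part of
any FreeBSD tool):

```python
import re

# Assumption: these fields are read from controller hardware, while "rs"
# is kept in driver memory.  This grouping is a guess, not from the source.
HW_REGS = ("is", "cs", "ss", "tfd", "serr", "cmd")

_PAIR = re.compile(r"\b(is|cs|ss|rs|tfd|serr|cmd)\s+([0-9a-fA-F]+)\b")


def parse_dump(text):
    """Return {field: int} for the 'name hexvalue' pairs in an ahcich dump.

    Whitespace between a name and its value may include a line break,
    since the mail client wrapped the dump across two lines.
    """
    return {name: int(value, 16) for name, value in _PAIR.findall(text)}


def all_hw_reads_are_ones(regs):
    """True if every assumed hardware register read back as 0xffffffff."""
    return all(regs.get(name) == 0xFFFFFFFF for name in HW_REGS)


def pending_slots(regs):
    """List the bit positions set in the rs mask."""
    mask = regs.get("rs", 0)
    return [bit for bit in range(32) if (mask >> bit) & 1]


if __name__ == "__main__":
    dump = (
        "ahcich1: is ffffffff cs ffffffff ss ffffffff rs 00000800 "
        "tfd ffffffff serr\nffffffff cmd ffffffff"
    )
    regs = parse_dump(dump)
    print(all_hw_reads_are_ones(regs), pending_slots(regs))
```

If the check is true for every timeout in a log, the controller itself
going silent on the host bus seems a more likely suspect than the drives.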

What gives?


