FreeBSD Mail Archives

Date:      Tue, 7 Jun 2016 20:53:10 +0100
From:      Steven Hartland <killing@multiplay.co.uk>
To:        freebsd-scsi@freebsd.org
Subject:   Re: Avago LSI SAS 3008 & Intel SSD Timeouts
Message-ID:  <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk>
In-Reply-To: <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com>
References:  <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com>

CDB: 85 is a TRIM command IIRC, I know you tried it before using BIO 
delete but assuming your running ZFS can you set the following in 
loader.conf and see how you get on.
vfs.zfs.trim.enabled=0

     Regards
     Steve


On 07/06/2016 20:24, list-news wrote:
> I have additional confirmation that it's not faulty hardware.
>
> I moved the 4 disks that carry the postgresql database over to another 
> server (same model - TWIN 2028-DECR).  Mounted the zpool and fired up 
> my application.
>
> This server is using a much earlier firmware on the SAS controller.  
> Different CPU / Memory / etc.
>
> Errors happen within the first couple minutes, and continue every few 
> minutes (notice time-stamps for each drive timeout every few minutes):
>
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 
> 0e 74 79 e0 00 00 08 00 length 4096 SMID 582 terminated ioc 804b scsi 
> 0 state c xfer 0
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 
> 0e 74 79 e8 00 00 08 00 length 4096 SMID 1009 terminated ioc 804b scsi 
> 0 state c xfer 0
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): ATA COMMAND PASS 
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 
> length 512 SMID 315 terminated ioc 804b scsi 0 state c xfer 0
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 
> 33 91 5c 68 00 00 08 00 length 4096 SMID 183 terminated ioc 804b scsi 
> 0 state c xfer 0
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 
> 36 f2 39 40 00 00 10 00 length 8192 SMID 446 terminated ioc 804b scsi 
> 0 state c xfer 0
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 715 terminated ioc 
> 804b scsi 0 state c xfer 0
> Jun  7 13:08:32 s17 kernel: mpr0: Unfreezing devq for target ID 14
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 
> 36 ea dc 60 00 00 08 00
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: Command 
> timeout
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 
> 0e 74 79 e0 00 00 08 00
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: SCSI 
> Status Error
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI status: Check 
> Condition
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI sense: UNIT 
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command (per 
> sense data)
> Jun  7 13:11:08 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 4 
> Aborting command 0xfffffe0000be0140
> Jun  7 13:11:08 s17 kernel: mpr0: Sending reset from mprsas_send_abort 
> for target ID 10
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f6 ee f0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f6 ee d8 00 00 10 00 length 8192 SMID 262 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS 
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 
> length 512 SMID 692 terminated ioc 804b scsi 0 state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 19 
> be 13 a0 00 00 10 00 length 8192 SMID 509 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 
> 3c 00 d8 00 00 08 00 length 4096 SMID 911 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 
> 3c 00 d0 00 00 08 00 length 4096 SMID 918 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 
> 3c 00 c8 00 00 08 00 length 4096 SMID 585 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 297 terminated ioc 
> 804b scsi 0 state c xfer 0
> Jun  7 13:11:08 s17 kernel: mpr0: Unfreezing devq for target ID 10
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 35 
> 26 ca f0 00 00 08 00
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command 
> timeout
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f6 ee f0 00 00 08 00
> Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status 
> Error
> Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check 
> Condition
> Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT 
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): Retrying command (per 
> sense data)
> Jun  7 13:13:04 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 5 
> Aborting command 0xfffffe0000bfcca0
> Jun  7 13:13:04 s17 kernel: mpr0: Sending reset from mprsas_send_abort 
> for target ID 10
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS 
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 
> length 512 SMID 504 terminated ioc 804b scsi 0 state c xfer 0
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b 
> 8d 99 48 00 00 08 00 length 4096 SMID 677 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 
> 6b df b8 00 00 10 00 length 8192 SMID 563 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f7 cd a8 00 00 08 00 length 4096 SMID 723 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f7 cd b0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 478 terminated ioc 
> 804b scsi 0 state c xfer 0
> Jun  7 13:13:04 s17 kernel: mpr0: Unfreezing devq for target ID 10
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e 
> d6 de f0 00 00 08 00
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command 
> timeout
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:13:05 s17 kernel: mpr0: log_info(0x31120440): 
> originator(PL), code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): ATA COMMAND PASS 
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> Jun  7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), 
> code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
> completed with an error
> Jun  7 13:13:05 s17 kernel: mpr0: (da2:log_info(0x31120440): 
> originator(PL), code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: mpr0:0:mpr0: 10:log_info(0x31120440): 
> originator(PL), code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: 0): mpr0: Retrying command
> Jun  7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), 
> code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00
> Jun  7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): CAM status: CCB 
> request completed with an error
> Jun  7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), 
> code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: (da2:mpr0: mpr0:0:log_info(0x31120440): 
> originator(PL), code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: 10:0): Retrying command
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b 
> 8d 99 48 00 00 08 00
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
> completed with an error
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 
> 6b df b8 00 00 10 00
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
> completed with an error
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f7 cd a8 00 00 08 00
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
> completed with an error
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f7 cd b0 00 00 08 00
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
> completed with an error
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e 
> d6 de f0 00 00 08 00
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
> completed with an error
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00
> Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status 
> Error
> Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check 
> Condition
> Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT 
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Error 6, Retries exhausted
> Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Invalidating pack
> Jun  7 13:15:11 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 6 
> Aborting command 0xfffffe0000c1e960
> Jun  7 13:15:11 s17 kernel: mpr0: Sending reset from mprsas_send_abort 
> for target ID 11
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): ATA COMMAND PASS 
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 
> length 512 SMID 942 terminated ioc 804b scsi 0 state c xfer 0
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 23 
> 7f 21 c0 00 00 08 00 length 4096 SMID 359 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 31 
> bb 68 30 00 00 08 00 length 4096 SMID 597 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 
> 80 02 68 00 00 50 00 length 40960 SMID 786 terminated ioc 804b scsi 0 
> state c xfer(da3:mpr0:0:11:0): READ(10). CDB: 28 00 22 02 ea 38 00 00 
> 10 00
> Jun  7 13:15:12 s17 kernel: 0
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: Command 
> timeout
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 
> 7e 0d 30 00 00 10 00 length 8192 SMID 602 terminated ioc 804b scsi 0 
> state c xfer (da3:0
> Jun  7 13:15:12 s17 kernel: mpr0:0:    (da3:mpr0:0:11:0): SYNCHRONIZE 
> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 441 
> terminated ioc 804b scsi 0 sta11:te c xfer 0
> Jun  7 13:15:12 s17 kernel: 0): mpr0: Retrying command
> Jun  7 13:15:12 s17 kernel: Unfreezing devq for target ID 11
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: SCSI Status 
> Error
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI status: Check 
> Condition
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI sense: UNIT 
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): Retrying command (per 
> sense data)
>
> gstat output:
> (I'm guessing I caught this during the da2 error)
>
> #gstat -do
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w d/s kBps   
> ms/d    o/s   ms/o   %busy Name
>    70      0      0      0    0.0      0      0 0.0      0 0    
> 0.0      0    0.0    0.0| da2
>     0      0      0      0    0.0      0      0    0.0      0 0    
> 0.0      0 0.0   0.0| da3
>     0      0      0      0    0.0      0      0    0.0      0 0    
> 0.0      0 0.0   0.0| da10
>     0      0      0      0    0.0      0      0    0.0      0 0    
> 0.0      0 0.0    0.0| da11
>
>
> I then set the tags down to 1 for each device:
>
> #camcontrol tags da2 -N 1
> #camcontrol tags da3 -N 1
> #camcontrol tags da10 -N 1
> #camcontrol tags da11 -N 1
>
> And, no errors for the last hour, system still running at full load.
>
> Everything is feeling like an NCQ firmware issue.  Intel s3610 says it 
> supports NCQ in it's SSDs with 32 tags.  But I've pulled the errors 
> with tags set to 8 plenty of times.
>
> (See NCQ line below.)
>
> # camcontrol identify da2
>
> pass2: <INTEL SSDSC2BX480G4 G2010150> ACS-2 ATA SATA 3.x device
> pass2: 1200.000MB/s transfers, Command Queueing Enabled
> protocol              ATA/ATAPI-9 SATA 3.x
> device model          INTEL SSDSC2BX480G4
> firmware revision     G2010150
> serial number         [redacted]
> WWN [redacted]
> cylinders             16383
> heads                 16
> sectors/track         63
> sector size           logical 512, physical 4096, offset 0
> LBA supported         268435455 sectors
> LBA48 supported       937703088 sectors
> PIO supported         PIO4
> DMA supported         WDMA2 UDMA6
> media RPM             non-rotating
>
> Feature                      Support  Enabled   Value Vendor
> read ahead                     yes    yes
> write cache                    yes    yes
> flush cache                    yes    yes
> overlap                        no
> Tagged Command Queuing (TCQ)   no     no
> Native Command Queuing (NCQ)   yes              32 tags
> NCQ Queue Management           no
> NCQ Streaming                  no
> Receive & Send FPDMA Queued    no
> SMART                          yes    yes
> microcode download             yes    yes
> security                       yes    no
> power management               yes    yes
> advanced power management      no     no
> automatic acoustic management  no     no
> media status notification      no     no
> power-up in Standby            no     no
> write-read-verify              no     no
> unload                         yes    yes
> general purpose logging        yes    yes
> free-fall                      no     no
> Data Set Management (DSM/TRIM) yes
> DSM - max 512byte blocks       yes              4
> DSM - deterministic read       yes              zeroed
> Host Protected Area (HPA)      yes      no 937703088/937703088
> HPA - Security                 no
>
> And it doesn't appear I have any way to deactivate it in firmware.  
> Which would be a nice test.  I did attempt this with no luck:
> # camcontrol negotiate da2 -T disable
> (pass2:mpr0:0:10:0): transfer speed: 1200.000MB/s
> (pass2:mpr0:0:10:0): tagged queueing: enabled
> camcontrol: XPT_SET_TRANS_SETTINGS CCB failed
>
> -Kyle
>
>
> On 6/7/16 12:09 PM, list-news wrote:
>> The system is a Twin.  In the first post I mentioned this but I 
>> probably wasn't clear.
>>
>> The twin unit is this one:
>> https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm
>>
>> I've used all components from twin node A and B (cpu / memory / 
>> mainboard / controller).  I still get the errors.  The backplane was 
>> the original thought of concern, and that has been RMA'd and replaced 
>> - errors continue.  I've even swapped out power supplies with another 
>> identical unit I have here.
>>
>> In every case the errors continue, until I do this:
>> #camcontrol daX -N 1
>> (for each drive in the zpool)
>>
>> Then the errors stop.
>>
>> The system errors every few minutes while my application is running.  
>> Set tags to -N 1, and everything goes quiet.  16 cores at 100% cpu 
>> and drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it 
>> finishes a batch, no errors are reported with -N set to 1.  If I set 
>> tags with -N 255 for each device, errors start again within 5 
>> minutes, and continue every 2-5 minutes, until the batch is finished.
>>
>> -Kyle
>>
>>> I would try, if possible, to swap the controller.
>>>
>>>
>>>
>>>
>>>
>>>
>>> Borja.
>>>
>>>
>>
>> _______________________________________________
>> freebsd-scsi@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
>
>
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?331da785-c88b-d74e-512a-37bdb618d512>

Header And Logo

Peripheral Links

Site Navigation

Header And Logo

Peripheral Links

Search

Site Navigation