Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 7 Jun 2016 17:43:10 -0500
From:      list-news <list-news@mindpackstudios.com>
To:        freebsd-scsi@freebsd.org
Subject:   Re: Avago LSI SAS 3008 & Intel SSD Timeouts
Message-ID:  <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com>
In-Reply-To: <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk>
References:  <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk> <d8c3284c-97aa-7ae0-48e2-2d6b3e5dcf39@mindpackstudios.com> <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk>

next in thread | previous in thread | raw e-mail | index | archive | help
No, it threw errors on both da6 and da7 and then I stopped it.

Your last e-mail gave me thoughts though.  I have a server with 2008 
controllers (entirely different backplane design, cpu, memory, etc).  
I've moved the 4 drives to that and I'm running the test now.

# uname = FreeBSD 10.2-RELEASE-p12 #1 r296215
# sysctl dev.mps.0
dev.mps.0.spinup_wait_time: 3
dev.mps.0.chain_alloc_fail: 0
dev.mps.0.enable_ssu: 1
dev.mps.0.max_chains: 2048
dev.mps.0.chain_free_lowwater: 1176
dev.mps.0.chain_free: 2048
dev.mps.0.io_cmds_highwater: 510
dev.mps.0.io_cmds_active: 0
dev.mps.0.driver_version: 20.00.00.00-fbsd
dev.mps.0.firmware_version: 17.00.01.00
dev.mps.0.disable_msi: 0
dev.mps.0.disable_msix: 0
dev.mps.0.debug_level: 3
dev.mps.0.%parent: pci5
dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 
subdevice=0x3020 class=0x010700
dev.mps.0.%location: slot=0 function=0
dev.mps.0.%driver: mps
dev.mps.0.%desc: Avago Technologies (LSI) SAS2008

About 1.5 hours has passed at full load, no errors.

gstat drive busy% seems to hang out around 30-40 instead of ~60-70.  
Overall throughput seems to be 20-30% less with my rough benchmarks.

I'm not sure if this gets us closer to the answer, if this doesn't 
time-out on the 2008 controller, it looks like one of these:
1) The Intel drive firmware is being overloaded somehow when connected 
to the 3008.
or
2) The 3008 firmware or driver has an issue reading drive responses, 
sporadically thinking the command timed-out (when maybe it really didn't).

Puzzle pieces:
A) Why does setting tags of 1 on drives connected to the 3008 fix the 
problem?
B) With tags of 255.  Why does postgres (and assuming a large fsync 
count), seem to cause the problem within minutes?  While running other 
heavy i/o commands (zpool scrub, bonnie++, fio), all of which show 
similarly high or higher iops take hours to cause the problem (if ever).

I'll let this continue to run to further test.

Thanks again for all the help.

-Kyle

On 6/7/16 4:22 PM, Steven Hartland wrote:
> Always da6?
>
> On 07/06/2016 21:19, list-news wrote:
>> Sure Steve:
>>
>> # cat /boot/loader.conf | grep trim
>> vfs.zfs.trim.enabled=0
>>
>> # sysctl vfs.zfs.trim.enabled
>> vfs.zfs.trim.enabled: 0
>>
>> # uptime
>> 3:14PM  up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07
>>
>> # tail -f /var/log/messages:
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 
>> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm 
>> 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010)
>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
>> connector name (    )
>> Jun  7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 
>> allocated tm 0xfffffe0001322150
>> Jun  7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 
>> Aborting command 0xfffffe0001375580
>> Jun  7 15:13:50 s18 kernel: mpr0: Sending reset from 
>> mprsas_send_abort for target ID 16
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
>> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command timeout 
>> cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, handle(0x0010)
>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
>> connector name (    )
>> Jun  7 15:13:50 s18 kernel: mpr0: queued timedout cm 
>> 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150
>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>> Jun  7 15:13:50 s18 kernel: Flags: 1<InProgress>
>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Started
>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm 
>> 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b 
>> scsi 0 state c xfer 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 
>> 0 state c xfer 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm 
>> 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b 
>> scsi 0 state c xfer 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 
>> 0 state c xfer 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm 
>> 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b 
>> scsi 0 state c xfer 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 
>> 0 state c xfer 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 
>> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm 
>> 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 
>> scsi 0 state c    (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 
>> 00 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm 
>> 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
>> Jun  7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during 
>> recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
>> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 terminated ioc 
>> 804b scsi 0 sta(da6:te c xfer 0
>> Jun  7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort 
>> TaskMID 1016 status 0x0 code 0x0 count 5
>> Jun  7 15:13:50 s18 kernel: 16:    (xpt0:mpr0:0:16:0): SMID 1 
>> finished recovery after aborting TaskMID 1016
>> Jun  7 15:13:50 s18 kernel: 0): mpr0: Retrying command
>> Jun  7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>> Jun  7 15:13:50 s18 kernel: EventDataLength: 4
>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>> Jun  7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>> Jun  7 15:13:50 s18 kernel: EnclosureHandle: 0x2
>> Jun  7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
>> Jun  7 15:13:50 s18 kernel: NumPhys: 31
>> Jun  7 15:13:50 s18 kernel: NumEntries: 1
>> Jun  7 15:13:50 s18 kernel: StartPhyNum: 8
>> Jun  7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>> Jun  7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
>> Jun  7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
>> Jun  7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
>> Jun  7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on  
>> Event: [16]
>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: [16]
>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on  
>> Event: [1c]
>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: [1c]
>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>> Jun  7 15:13:50 s18 kernel: Flags: 0
>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on  
>> Event: [16]
>> Jun  7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: [16]
>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
>> CDB: 35 00 00 00 00 00 00 00 00 00
>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI 
>> Status Error
>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check 
>> Condition
>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT 
>> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per 
>> sense data)
>>
>> -Kyle
>>
>> On 6/7/16 2:53 PM, Steven Hartland wrote:
>>> CDB: 85 is a TRIM command IIRC, I know you tried it before using BIO 
>>> delete but assuming your running ZFS can you set the following in 
>>> loader.conf and see how you get on.
>>> vfs.zfs.trim.enabled=0
>>>
>>>     Regards
>>>     Steve
>>
>>
>> _______________________________________________
>> freebsd-scsi@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
>
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?99b3b075-3158-29aa-3a33-311594fb9270>