Date:      Mon, 8 Dec 2014 11:14:31 +0100
From:      patpro@patpro.net
To:        freebsd-scsi@freebsd.org
Subject:   multipath problem: active provider chosen on passive FC path?
Message-ID:  <23F06C2E-558A-4E68-AD35-B3CD49760DFE@patpro.net>

Hello,

I'm not sure this is the best place to ask about my problem; let me know if another mailing list would be more appropriate.

I've installed FreeBSD 9.3 on two HP blade servers (G6) in an HP C7000 chassis. This chassis uses two Brocade FC switches (active/passive, if I'm not mistaken). The blade servers use QLogic HBAs:

isp0: <Qlogic ISP 2432 PCI FC-AL Adapter> port 0x4000-0x40ff mem 0xfbff0000-0xfbff3fff irq 30 at device 0.0 on pci6
isp1: <Qlogic ISP 2432 PCI FC-AL Adapter> port 0x4400-0x44ff mem 0xfbfe0000-0xfbfe3fff irq 37 at device 0.1 on pci6


A SAN array presents a dedicated logical unit to each FreeBSD server. On a given server I see 4 paths to the presented LU, which I use to create a GEOM_MULTIPATH device:

(from dmesg)
GEOM_MULTIPATH: SPLUNK_1 created
GEOM_MULTIPATH: da2 added to SPLUNK_1
GEOM_MULTIPATH: da2 is now active path in SPLUNK_1
GEOM_MULTIPATH: da3 added to SPLUNK_1
GEOM_MULTIPATH: da6 added to SPLUNK_1
GEOM_MULTIPATH: da7 added to SPLUNK_1

# camcontrol devlist | grep VRAID
<DGC VRAID 0532>                   at scbus0 target 2 lun 0 (pass4,da2)
<DGC VRAID 0532>                   at scbus0 target 3 lun 0 (pass5,da3)
<DGC VRAID 0532>                   at scbus1 target 4 lun 0 (pass12,da6)
<DGC VRAID 0532>                   at scbus1 target 5 lun 0 (pass13,da7)

# gmultipath status
              Name   Status  Components
multipath/SPLUNK_1  OPTIMAL  da2 (ACTIVE)
                             da3 (PASSIVE)
                             da6 (PASSIVE)
                             da7 (PASSIVE)
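
(For reference, the device was labeled with something along these
lines; reconstructed from the output above, not copied from my shell
history:

# gmultipath label SPLUNK_1 /dev/da2 /dev/da3 /dev/da6 /dev/da7

gmultipath label writes on-disk metadata so the device is assembled
automatically at boot; gmultipath create would build a transient,
manually configured device instead.)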

Unfortunately, during boot and during normal operation, the first provider (da2 here) seems faulty:

isp0: Chan 0 Abort Cmd for N-Port 0x0008 @ Port 0x090a00
(da2:isp0:0:2:0): Command Aborted
(da2:isp0:0:2:0): READ(6). CDB: 08 00 03 28 02 00
(da2:isp0:0:2:0): CAM status: CCB request aborted by the host
(da2:isp0:0:2:0): Retrying command
../..
isp0: Chan 0 Abort Cmd for N-Port 0x0008 @ Port 0x090a00
(da2:isp0:0:2:0): Command Aborted
(da2:isp0:0:2:0): WRITE(10). CDB: 2a 00 00 50 20 21 00 00 05 00
(da2:isp0:0:2:0): CAM status: CCB request aborted by the host
(da2:isp0:0:2:0): Retrying command
../..

Those errors make the boot really slow (10-15 minutes), but the device is not deactivated. On both servers it is always the first provider of the multipath device that seems faulty (always the first one on scbus0), so I guess scbus0 is connected to the passive FC switch.

If I use the multipath device extensively, the faulty provider will eventually be marked FAIL and another one, chosen on scbus1, will be marked ACTIVE. As soon as a provider on scbus1 is ACTIVE, read/write throughput comes back to expected values.
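
For what it's worth, I can force that switch by hand instead of
waiting for the errors to accumulate, with plain gmultipath(8) verbs:

# gmultipath fail SPLUNK_1 da2
# gmultipath getactive SPLUNK_1

fail marks da2 as FAIL so GEOM promotes another provider to ACTIVE;
getactive shows which one it picked.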

For example, diskinfo(8) shows horrendous performance (240+ ms seek times...):

# diskinfo -t /dev/multipath/SPLUNK_1
/dev/multipath/SPLUNK_1
	512         	# sectorsize
	107374181888	# mediasize in bytes (100G)
	209715199   	# mediasize in sectors
	0           	# stripesize
	0           	# stripeoffset
	13054       	# Cylinders according to firmware.
	255         	# Heads according to firmware.
	63          	# Sectors according to firmware.
	CKM00114800912	# Disk ident.

Seek times:
	Full stroke:	  250 iter in   1.172849 sec =    4.691 msec
	Half stroke:	  250 iter in   2.499101 sec =    9.996 msec
	Quarter stroke:	  500 iter in 124.113431 sec =  248.227 msec
	Short forward:	  400 iter in  62.483828 sec =  156.210 msec
	Short backward:	  400 iter in  62.844187 sec =  157.110 msec
	Seq outer:	 2048 iter in 240.999614 sec =  117.676 msec
	Seq inner:	 2048 iter in 121.210282 sec =   59.185 msec
	(during this test da2 is marked failed:
	GEOM_MULTIPATH: Error 5, da2 in SPLUNK_1 marked FAIL
	GEOM_MULTIPATH: da7 is now active path in SPLUNK_1
	and the transfer rates test goes well:)
Transfer rates:
	outside:       102400 kbytes in   1.023942 sec =   100006 kbytes/sec
	middle:        102400 kbytes in   1.104299 sec =    92729 kbytes/sec
	inside:        102400 kbytes in   1.137533 sec =    90019 kbytes/sec

# gmultipath status
              Name    Status  Components
multipath/SPLUNK_1  DEGRADED  da2 (FAIL)
                              da3 (PASSIVE)
                              da6 (PASSIVE)
                              da7 (ACTIVE)
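
(Once the scbus0 side is sorted out, the failed provider can
presumably be re-armed with:

# gmultipath restore SPLUNK_1 da2

it should come back as PASSIVE as long as da7 stays ACTIVE.)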

Is there any way I can tell GEOM to use an active provider chosen on scbus1 at boot time? Or is there any chance I'm totally misunderstanding the problem?

(Other blades in the same chassis have been running ESXi VMware production for years without any problem, so I guess the switches and the SAN are correctly configured.)

thanks,

Patrick
-- 

# sysctl -a | grep dev.isp
dev.isp.0.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
dev.isp.0.%driver: isp
dev.isp.0.%location: slot=0 function=0 handle=\_SB_.PCI0.PT07.SLT0
dev.isp.0.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x103c subdevice=0x1705 class=0x0c0400
dev.isp.0.%parent: pci6
dev.isp.0.wwnn: 5764963215108688473
dev.isp.0.wwpn: 5764963215108688472
dev.isp.0.loop_down_limit: 60
dev.isp.0.gone_device_time: 30
dev.isp.1.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
dev.isp.1.%driver: isp
dev.isp.1.%location: slot=0 function=1 handle=\_SB_.PCI0.PT07.SLT1
dev.isp.1.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x103c subdevice=0x1705 class=0x0c0400
dev.isp.1.%parent: pci6
dev.isp.1.wwnn: 5764963215108688475
dev.isp.1.wwpn: 5764963215108688474
dev.isp.1.loop_down_limit: 60
dev.isp.1.gone_device_time: 30
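
(sysctl prints the WWNs in decimal; converting them to hex makes them
comparable with the switch zoning, e.g. for isp0's wwpn:

# printf '0x%016x\n' 5764963215108688472
0x5001438003bf4a58

i.e. WWPN 50:01:43:80:03:bf:4a:58, which looks like an HP-assigned
prefix.)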






