Date:      Wed, 9 Mar 2011 01:30:14 GMT
From:      Neil  Schelly <nschelly@dyn.com>
To:        freebsd-bugs@FreeBSD.org
Subject:   Re: kern/140416: [mfi] mfi driver stuck in timeout
Message-ID:  <201103090130.p291UEBI028424@freefall.freebsd.org>

The following reply was made to PR kern/140416; it has been noted by GNATS.

From: Neil  Schelly <nschelly@dyn.com>
To: bug-followup@FreeBSD.org, Stephane.DAlu@insa-lyon.fr
Cc:  
Subject: Re: kern/140416: [mfi] mfi driver stuck in timeout
Date: Tue, 8 Mar 2011 20:04:56 -0500 (EST)

 ------=_Part_97141_16009456.1299632696194
 Content-Type: text/plain; charset=utf-8
 Content-Transfer-Encoding: 7bit
 
 As posted here: http://lists.freebsd.org/pipermail/freebsd-scsi/2011-March/004832.html
 
 We've got some more information about the mpt testing we've been doing here.  The setup we're testing is Dell PowerEdge r610 servers with PERC H800 SAS/RAID cards connected to MD1200 shelves full of 12 SAS drives.  We've recreated the same problem on other configurations, including combinations of r510s, MD1220 shelves, PERC H700 cards, etc.  We've also ruled out any particular piece of hardware as faulty by running these tests on identical hardware configurations, in mirrored setups, on different physical machines.  We've experienced these issues in FreeBSD 7.3, 8.1, and 8.2.  We've seen them both with RAID10 logical drive configurations formatted with UFS and with 6-disk JBOD configurations set up as a ZFS raidz volume.  We've triggered the problem with both bonnie++ and iozone.  All machines are running the latest firmware on the H700 and H800 cards.
 
 The easiest way to reproduce this problem is with a ZFS configuration and `iozone -a`.  We have a 6-disk raidz pool with a ZFS filesystem on it.  We just run `iozone -a` from within that filesystem, and I'd say 3 out of 4 times, it will eventually pause.  After 45-50 seconds of pausing, you'll start seeing output on the console and in /var/log/messages that looks something like:
   mfi0: COMMAND 0xffffff8000db5fe0 TIMEOUT AFTER 105 SECONDS
 
 If we let it go for a few days, it may actually "finish" and recover, but for practical purposes it is stuck.  At this point the system is responsive and fully operational except for the dead controller.  We cannot kill the iozone process that is hung on these IO operations, even with `kill -9`.  As others have reported, we can run any of the mfiutil commands and the controller immediately begins to respond normally again.  Usually, the iozone test will then complete, but sometimes it will get stuck again during the same run.
 
 We compiled mfiutil with debugging symbols so we could run it with gdb and see exactly what was causing the controller to become responsive again.  It's the ioctl() call that does it.  For example:
 
 `mfiutil show volumes` eventually gets to something like:
   mfi_dcmd_command (fd=7, opcode=50397184, buf=0x7fffffffe4a0, bufsize=1032, mbox=0x0, mboxlen=0, statusp=0x0)
   at /usr/src/usr.sbin/mfiutil/mfi_cmd.c:257
  * fd=7 is /dev/mfi0, where the command will be sent with an ioctl command
  * opcode=50397184 is the MFI_DCMD_LD_GET_LIST command
 
 `mfiutil show battery` eventually gets to something like:
   mfi_dcmd_command (fd=7, opcode=84017152, buf=0x7fffffffea20, bufsize=48, mbox=0x0, mboxlen=0, statusp=0x7fffffffe9cf "")
   at /usr/src/usr.sbin/mfiutil/mfi_cmd.c:257
  * fd=7 is /dev/mfi0, where the command will be sent with an ioctl command
  * opcode=84017152 is the MFI_DCMD_BBU_GET_CAPACITY_INFO command
 
 I wrote a small self-contained C program that can easily be modified to run any ioctl command you'd like and send it to /dev/mfi0 (attached).  Use it if you'd like at your own risk, but it's essentially just running an arbitrary command with ioctl, putting nothing into the memory range normally passed by the *buf pointer.  I did try sending random opcodes, and it didn't work, so it does have to be an opcode that the firmware will recognize at least, but it doesn't seem to matter which one.
 
 I'm not sure where else we should be looking for a fix.  We can reliably reproduce the problem, analyze the system during the issue, and recover the system to a normal state.  If anyone can help us troubleshoot this, we can gather whatever information is needed or even provide a remotely accessible login, and we're open to ideas.
 
 --
 Neil Schelly
 Director of Uptime
 Dynamic Network Services, Inc.
 W: 603-296-1581
 M: 508-410-4776
 http://www.dyndns.com
 
 
 ------=_Part_97141_16009456.1299632696194
 Content-Type: text/x-csrc; name=testperc.c
 Content-Transfer-Encoding: base64
 Content-Disposition: attachment; filename=testperc.c
 
 I2luY2x1ZGUgPGZjbnRsLmg+CiNpbmNsdWRlIDxzdGRpbnQuaD4KI2luY2x1ZGUgPGlvY2NvbS5o
 PgoKI2RlZmluZSBNRklfTUJPWF9TSVpFICAgICAgICAgICAxMgojZGVmaW5lIE1GSV9DTURfRENN
 RCAgICAgICAgICAweDA1CiNkZWZpbmUgTUZJSU9fUEFTU1RIUlUgIF9JT1dSKCdDJywgMTAyLCBz
 dHJ1Y3QgbWZpX2lvY19wYXNzdGhydSkKCi8qIERpcmVjdCBjb21tYW5kcyAqLwp0eXBlZGVmIGVu
 dW0gewogICAgICAgIE1GSV9EQ01EX0NUUkxfR0VUSU5GTyA9ICAgICAgICAgMHgwMTAxMDAwMCwK
 ICAgICAgICBNRklfRENNRF9DVFJMX01GQ19ERUZBVUxUU19HRVQgPTB4MDEwZTAyMDEsCiAgICAg
 ICAgTUZJX0RDTURfQ1RSTF9NRkNfREVGQVVMVFNfU0VUID0weDAxMGUwMjAyLAogICAgICAgIE1G
 SV9EQ01EX0NUUkxfRkxVU0hDQUNIRSA9ICAgICAgMHgwMTEwMTAwMCwKICAgICAgICBNRklfRENN
 RF9DVFJMX1NIVVRET1dOID0gICAgICAgIDB4MDEwNTAwMDAsCiAgICAgICAgTUZJX0RDTURfQ1RS
 TF9FVkVOVF9HRVRJTkZPID0gICAweDAxMDQwMTAwLAogICAgICAgIE1GSV9EQ01EX0NUUkxfRVZF
 TlRfR0VUID0gICAgICAgMHgwMTA0MDMwMCwKICAgICAgICBNRklfRENNRF9DVFJMX0VWRU5UX1dB
 SVQgPSAgICAgIDB4MDEwNDA1MDAsCiAgICAgICAgTUZJX0RDTURfUFJfR0VUX1NUQVRVUyA9ICAg
 ICAgICAweDAxMDcwMTAwLAogICAgICAgIE1GSV9EQ01EX1BSX0dFVF9QUk9QRVJUSUVTID0gICAg
 MHgwMTA3MDIwMCwKICAgICAgICBNRklfRENNRF9QUl9TRVRfUFJPUEVSVElFUyA9ICAgIDB4MDEw
 NzAzMDAsCiAgICAgICAgTUZJX0RDTURfUFJfU1RBUlQgPSAgICAgICAgICAgICAweDAxMDcwNDAw
 LAogICAgICAgIE1GSV9EQ01EX1BSX1NUT1AgPSAgICAgICAgICAgICAgMHgwMTA3MDUwMCwKICAg
 ICAgICBNRklfRENNRF9USU1FX1NFQ1NfR0VUID0gICAgICAgIDB4MDEwODAyMDEsCiAgICAgICAg
 TUZJX0RDTURfRkxBU0hfRldfT1BFTiA9ICAgICAgICAweDAxMGYwMTAwLAogICAgICAgIE1GSV9E
 Q01EX0ZMQVNIX0ZXX0RPV05MT0FEID0gICAgMHgwMTBmMDIwMCwKICAgICAgICBNRklfRENNRF9G
 TEFTSF9GV19GTEFTSCA9ICAgICAgIDB4MDEwZjAzMDAsCiAgICAgICAgTUZJX0RDTURfRkxBU0hf
 RldfQ0xPU0UgPSAgICAgICAweDAxMGYwNDAwLAogICAgICAgIE1GSV9EQ01EX1BEX0dFVF9MSVNU
 ID0gICAgICAgICAgMHgwMjAxMDAwMCwKICAgICAgICBNRklfRENNRF9QRF9HRVRfSU5GTyA9ICAg
 ICAgICAgIDB4MDIwMjAwMDAsCiAgICAgICAgTUZJX0RDTURfUERfU1RBVEVfU0VUID0gICAgICAg
 ICAweDAyMDMwMTAwLAogICAgICAgIE1GSV9EQ01EX1BEX1JFQlVJTERfU1RBUlQgPSAgICAgMHgw
 MjA0MDEwMCwKICAgICAgICBNRklfRENNRF9QRF9SRUJVSUxEX0FCT1JUID0gICAgIDB4MDIwNDAy
 MDAsCiAgICAgICAgTUZJX0RDTURfUERfQ0xFQVJfU1RBUlQgPSAgICAgICAweDAyMDUwMTAwLAog
 ICAgICAgIE1GSV9EQ01EX1BEX0NMRUFSX0FCT1JUID0gICAgICAgMHgwMjA1MDIwMCwKICAgICAg
 ICBNRklfRENNRF9QRF9HRVRfUFJPR1JFU1MgPSAgICAgIDB4MDIwNjAwMDAsCiAgICAgICAgTUZJ
 X0RDTURfUERfTE9DQVRFX1NUQVJUID0gICAgICAweDAyMDcwMTAwLAogICAgICAgIE1GSV9EQ01E
 X1BEX0xPQ0FURV9TVE9QID0gICAgICAgMHgwMjA3MDIwMCwKICAgICAgICBNRklfRENNRF9MRF9H
 RVRfTElTVCA9ICAgICAgICAgIDB4MDMwMTAwMDAsCiAgICAgICAgTUZJX0RDTURfTERfR0VUX0lO
 Rk8gPSAgICAgICAgICAweDAzMDIwMDAwLAogICAgICAgIE1GSV9EQ01EX0xEX0dFVF9QUk9QID0g
 ICAgICAgICAgMHgwMzAzMDAwMCwKICAgICAgICBNRklfRENNRF9MRF9TRVRfUFJPUCA9ICAgICAg
 ICAgIDB4MDMwNDAwMDAsCiAgICAgICAgTUZJX0RDTURfTERfSU5JVF9TVEFSVCA9ICAgICAgICAw
 eDAzMDYwMTAwLAogICAgICAgIE1GSV9EQ01EX0xEX0RFTEVURSA9ICAgICAgICAgICAgMHgwMzA5
 MDAwMCwKICAgICAgICBNRklfRENNRF9DRkdfUkVBRCA9ICAgICAgICAgICAgIDB4MDQwMTAwMDAs
 CiAgICAgICAgTUZJX0RDTURfQ0ZHX0FERCA9ICAgICAgICAgICAgICAweDA0MDIwMDAwLAogICAg
 ICAgIE1GSV9EQ01EX0NGR19DTEVBUiA9ICAgICAgICAgICAgMHgwNDAzMDAwMCwKICAgICAgICBN
 RklfRENNRF9DRkdfTUFLRV9TUEFSRSA9ICAgICAgIDB4MDQwNDAwMDAsCiAgICAgICAgTUZJX0RD
 TURfQ0ZHX1JFTU9WRV9TUEFSRSA9ICAgICAweDA0MDUwMDAwLCAgICAgCiAgICAgICAgTUZJX0RD
 TURfQ0ZHX0ZPUkVJR05fSU1QT1JUID0gICAweDA0MDYwNDAwLAogICAgICAgIE1GSV9EQ01EX0JC
 VV9HRVRfU1RBVFVTID0gICAgICAgMHgwNTAxMDAwMCwKICAgICAgICBNRklfRENNRF9CQlVfR0VU
 X0NBUEFDSVRZX0lORk8gPTB4MDUwMjAwMDAsCiAgICAgICAgTUZJX0RDTURfQkJVX0dFVF9ERVNJ
 R05fSU5GTyA9ICAweDA1MDMwMDAwLAogICAgICAgIE1GSV9EQ01EX0NMVVNURVIgPSAgICAgICAg
 ICAgICAgMHgwODAwMDAwMCwKICAgICAgICBNRklfRENNRF9DTFVTVEVSX1JFU0VUX0FMTCA9ICAg
 IDB4MDgwMTAxMDAsCiAgICAgICAgTUZJX0RDTURfQ0xVU1RFUl9SRVNFVF9MRCA9ICAgICAweDA4
 MDEwMjAwCn0gbWZpX2RjbWRfdDsKCnN0cnVjdCBtZmlfZnJhbWVfaGVhZGVyIHsKICAgICAgICB1
 aW50OF90ICAgICAgICAgY21kOwogICAgICAgIHVpbnQ4X3QgICAgICAgICBzZW5zZV9sZW47CiAg
 ICAgICAgdWludDhfdCAgICAgICAgIGNtZF9zdGF0dXM7CiAgICAgICAgdWludDhfdCAgICAgICAg
 IHNjc2lfc3RhdHVzOwogICAgICAgIHVpbnQ4X3QgICAgICAgICB0YXJnZXRfaWQ7CiAgICAgICAg
 dWludDhfdCAgICAgICAgIGx1bl9pZDsKICAgICAgICB1aW50OF90ICAgICAgICAgY2RiX2xlbjsK
 ICAgICAgICB1aW50OF90ICAgICAgICAgc2dfY291bnQ7CiAgICAgICAgdWludDMyX3QgICAgICAg
 IGNvbnRleHQ7CiAgICAgICAgdWludDMyX3QgICAgICAgIHBhZDA7CiAgICAgICAgdWludDE2X3Qg
 ICAgICAgIGZsYWdzOwojZGVmaW5lIE1GSV9GUkFNRV9EQVRBT1VUICAgICAgIDB4MDgKI2RlZmlu
 ZSBNRklfRlJBTUVfREFUQUlOICAgICAgICAweDEwCiAgICAgICAgdWludDE2X3QgICAgICAgIHRp
 bWVvdXQ7CiAgICAgICAgdWludDMyX3QgICAgICAgIGRhdGFfbGVuOwp9IF9fcGFja2VkOwoKc3Ry
 dWN0IG1maV9zZzMyIHsKICAgICAgICB1aW50MzJfdCAgICAgICAgYWRkcjsKICAgICAgICB1aW50
 MzJfdCAgICAgICAgbGVuOwp9IF9fcGFja2VkOwoKc3RydWN0IG1maV9zZzY0IHsKICAgICAgICB1
 aW50NjRfdCAgICAgICAgYWRkcjsKICAgICAgICB1aW50MzJfdCAgICAgICAgbGVuOwp9IF9fcGFj
 a2VkOwoKdW5pb24gbWZpX3NnbCB7CiAgICAgICAgc3RydWN0IG1maV9zZzMyIHNnMzJbMV07CiAg
 ICAgICAgc3RydWN0IG1maV9zZzY0IHNnNjRbMV07Cn0gX19wYWNrZWQ7CgpzdHJ1Y3QgbWZpX2Rj
 bWRfZnJhbWUgewogICAgICAgIHN0cnVjdCBtZmlfZnJhbWVfaGVhZGVyIGhlYWRlcjsKICAgICAg
 ICB1aW50MzJfdCAgICAgICAgb3Bjb2RlOwogICAgICAgIHVpbnQ4X3QgICAgICAgICBtYm94W01G
 SV9NQk9YX1NJWkVdOwogICAgICAgIHVuaW9uIG1maV9zZ2wgICBzZ2w7Cn0gX19wYWNrZWQ7Cgpz
 dHJ1Y3QgbWZpX2lvY19wYXNzdGhydSB7CiAgICAgICAgc3RydWN0IG1maV9kY21kX2ZyYW1lICAg
 aW9jX2ZyYW1lOwogICAgICAgIHVpbnQzMl90ICAgICAgICAgICAgICAgIGJ1Zl9zaXplOwogICAg
 ICAgIHVpbnQ4X3QgICAgICAgICAgICAgICAgICpidWY7Cn0gX19wYWNrZWQ7CgppbnQgbWFpbigp
 IHsKICAgICAgICBpbnQgZmQgPSBvcGVuKCIvZGV2L21maTAiLCBPX1JEV1IpOwogICAgICAgIHN0
 cnVjdCBtZmlfaW9jX3Bhc3N0aHJ1IGlvYzsKICAgICAgICBzdHJ1Y3QgbWZpX2RjbWRfZnJhbWUg
 KmRjbWQ7CgljaGFyKiBidWY7CgogICAgICAgIGRjbWQgPSAmaW9jLmlvY19mcmFtZTsKICAgICAg
 ICBkY21kLT5oZWFkZXIuY21kID0gTUZJX0NNRF9EQ01EOwogICAgICAgIGRjbWQtPmhlYWRlci50
 aW1lb3V0ID0gMDsKICAgICAgICBkY21kLT5oZWFkZXIuZmxhZ3MgPSAwOwogICAgICAgIGRjbWQt
 PmhlYWRlci5kYXRhX2xlbiA9IDA7CiAgICAgICAgZGNtZC0+b3Bjb2RlID0gTUZJX0RDTURfQ0ZH
 X0FERDsKLy8gICAgICAgIGRjbWQtPm9wY29kZSA9IDA7CiAgICAgICAgCiAgICAgICAgaW9jLmJ1
 ZiA9IGJ1ZjsKICAgICAgICBpb2MuYnVmX3NpemUgPSA4OyAKICAgICAgICBpb2N0bChmZCwgTUZJ
 SU9fUEFTU1RIUlUsICZpb2MpOwoKfQoK
 ------=_Part_97141_16009456.1299632696194--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201103090130.p291UEBI028424>