Date:      Mon, 9 Apr 2012 12:12:22 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        Doug Ambrisko <ambrisko@ambrisko.com>
Cc:        scottl@freebsd.org, Alexander Motin <mav@freebsd.org>, freebsd-stable@freebsd.org
Subject:   Re: [stable-ish 9] Dell R815 ipmi(4) attach failure
Message-ID:  <201204091212.22948.jhb@freebsd.org>
In-Reply-To: <201204061801.q36I1FVw041177@ambrisko.com>
References:  <201204061801.q36I1FVw041177@ambrisko.com>

On Friday, April 06, 2012 2:01:15 pm Doug Ambrisko wrote:
> Alexander Motin writes:
> | On 04/06/12 20:12, Doug Ambrisko wrote:
> | > Alexander Motin writes:
> | > | On 04/04/12 21:47, John Baldwin wrote:
> | > |>  On Wednesday, April 04, 2012 12:24:33 pm Doug Ambrisko wrote:
> | > |>>  John Baldwin writes:
> | > |>>  | On Tuesday, April 03, 2012 12:37:50 pm Doug Ambrisko wrote:
> | > |>>  |>   John Baldwin writes:
> | > |>>  |>   | On Monday, April 02, 2012 7:27:13 pm Doug Ambrisko wrote:
> | > |>>  |>   |>   Doug Ambrisko writes:
> | > |>>  |>   |>   | John Baldwin writes:
> | > |>>  |>   |>   | | On Saturday, March 31, 2012 3:25:48 pm Doug Ambrisko wrote:
> | > |>>  |>   |>   | |>   Sean Bruno writes:
> | > |>>  |>   |>   | |>   | Noting a failure to attach to the onboard IPMI controller with this dell
> | > |>>  |>   |>   | |>   | R815.  Not sure what to start poking at and thought I'd throw this over
> | > |>>  |>   |>   | |>   | here for comment.
> | > |>>  |>   |>   | |>   |
> | > |>>  |>   |>   | |>   | -bash-4.2$ dmesg |grep ipmi
> | > |>>  |>   |>   | |>   | ipmi0: KCS mode found at io 0xca8 on acpi
> | > |>>  |>   |>   | |>   | ipmi1: <IPMI System Interface> on isa0
> | > |>>  |>   |>   | |>   | device_attach: ipmi1 attach returned 16
> | > |>>  |>   |>   | |>   | ipmi1: <IPMI System Interface> on isa0
> | > |>>  |>   |>   | |>   | device_attach: ipmi1 attach returned 16
> | > |>>  |>   |>   | |>   | ipmi0: Timed out waiting for GET_DEVICE_ID
> | > |>>  |>   |>   | |>
> | > |>>  |>   |>   | |>   I've run into this recently.  A quick hack to fix it is:
> | > |>>  |>   |>   | |>
> | > |>>  |>   |>   | |>   Index: ipmi.c
> | > |>>  |>   |>   | |>
> | > 	[snip]
> | > |>>  | If you use "-ct" then you get a file you can feed into schedgraph.
> | > |>>  | However, just reading the log, it seems that IRQ 20 keeps preempting
> | > |>>  | the KCS worker thread preventing it from getting anything done.  Also,
> | > |>>  | there seem to be a lot of threads on CPU 0's runqueue waiting for a
> | > |>>  | chance to run (load average of 12 or 13 the entire time).  You can try
> | > |>>  | just bumping up the max timeout from 3 seconds to higher perhaps.  Not
> | > |>>  | sure why IRQ 20 keeps firing though.  It might be related to USB, so
> | > |>>  | you could try fiddling with USB options in the BIOS perhaps, or disabling
> | > |>>  | the USB drivers to see if that fixes IPMI.
> | > |>>
> | > |>>  Tried without USB in kernel:
> | > |>>  	http://people.freebsd.org/~ambrisko/ipmi_ktr_dump_no_usb.txt
> | > |>
> | > |>  Hmm, it's still just running constantly (note that the idle thread is
> | > |>  _never_ scheduled).  The lion's share of the time seems to be spent in
> | > |>  "xpt_thrd".  Note that there are several places where nothing happens except
> | > |>  that "xpt_thrd" runs constantly (spinning) during 10's of statclock ticks.  I
> | > |>  would maybe start debugging that to see what in the world it is doing.  Maybe
> | > |>  it is polling some hardware down in xpt_action() (i.e., xpt_action() for a
> | > |>  single bus called down into a driver and it is just spinning using polling
> | > |>  instead of sleeping and waiting for an interrupt).
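Purely for illustration, a rough sketch of that polling-versus-sleeping
difference; this is not actual CAM or controller-driver code, and
hw_reset_done() and both wait_for_reset_*() functions are invented names:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>

/* Stand-in for a real controller status-register read. */
static int
hw_reset_done(void *softc)
{

	return (1);
}

/*
 * Busy-polling wait: the thread stays on the CPU for the entire wait,
 * which is the behaviour that would starve the KCS worker (and idle).
 */
static void
wait_for_reset_polling(void *softc)
{

	while (!hw_reset_done(softc))
		DELAY(100);
}

/*
 * Interrupt-driven wait: the thread sleeps, the controller's interrupt
 * handler calls wakeup(softc) when the reset finishes, and other
 * threads get a chance to run in the meantime.
 */
static void
wait_for_reset_sleeping(void *softc, struct mtx *lock)
{

	mtx_lock(lock);
	while (!hw_reset_done(softc))
		msleep(softc, lock, PRIBIO, "busrst", hz / 10);
	mtx_unlock(lock);
}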
> | > |
> | > | "xpt_thrd" is a bus scanner thread. It is scheduled by CAM for every 
bus
> | > | on attach and by controller driver on hot-plug events. For some
> | > | controllers it may be quite CPU-hungry. For example, for legacy ATA
> | > | controllers, where bus reset may take many seconds of hardware 
polling,
> | > | while devices just spinning up. For ahci(4) it was improved about year
> | > | ago to not use polling when possible, but it still may loop for some
> | > | time if controller is not responding on reset. What mfi(4), mentioned 
in
> | > | log, does during scanning, I am not sure.
> | >
> | > I thought that mfi(4) could be an issue.  There are some ata controllers
> | > with nothing attached.  I built a GENERIC with USB and mfi commented out
> | > and then the timeout issue went away:
> | >    ipmi0: KCS mode found at io 0xca8 on acpi
> | >    ipmi1: <IPMI System Interface> on isa0
> | >    device_attach: ipmi1 attach returned 16
> | >    ipmi1: <IPMI System Interface> on isa0
> | >    device_attach: ipmi1 attach returned 16
> | >    ipmi0: DEBUG ipmi_submit_driver_request 551 before msleep 1
> | >    ipmi0: DEBUG ipmi_complete_request 527 before wakeup 2211
> | >    ipmi0: DEBUG ipmi_complete_request 529 after wakeup 2272
> | >    ipmi0: DEBUG ipmi_submit_driver_request 553 after msleep 2332
> | >    ipmi0: IPMI device rev. 0, firmware rev. 1.61, version 2.0
> | >
> | > Without mfi and with USB it had issues:
> | >    ipmi0: KCS mode found at io 0xca8 on acpi
> | >    ipmi1: <IPMI System Interface> on isa0
> | >    device_attach: ipmi1 attach returned 16
> | >    ipmi1: <IPMI System Interface> on isa0
> | >    device_attach: ipmi1 attach returned 16
> | >    ipmi0: DEBUG ipmi_submit_driver_request 551 before msleep 2
> | >    ipmi0: DEBUG ipmi_complete_request 527 before wakeup 3137
> | >    ipmi0: DEBUG ipmi_complete_request 529 after wakeup 3199
> | >    ipmi0: DEBUG ipmi_submit_driver_request 553 after msleep 3259
> | >    ipmi0: Timed out waiting for GET_DEVICE_ID
> | >    ipmi0: IPMI device rev. 0, firmware rev. 1.61, version 2.0
> | >
> | > I can post more ktrdump traces if needed.  A 1U Dell machine without
> | > mfi also has this problem.  As John mentioned, it might be good to
> | > bump up the timeout from 3s to 6s.  I did that with the USB, no-mfi
> | > kernel and that passed:
> | >
> | >    % dmesg | grep ipmi
> | >    ipmi0: KCS mode found at io 0xca8 on acpi
> | >    ipmi1: <IPMI System Interface> on isa0
> | >    device_attach: ipmi1 attach returned 16
> | >    ipmi1: <IPMI System Interface> on isa0
> | >    device_attach: ipmi1 attach returned 16
> | >    ipmi0: DEBUG ipmi_submit_driver_request 551 before msleep 2
> | >    ipmi0: DEBUG ipmi_complete_request 527 before wakeup 3137
> | >    ipmi0: DEBUG ipmi_complete_request 529 after wakeup 3199
> | >    ipmi0: DEBUG ipmi_submit_driver_request 553 after msleep 3259
> | >    ipmi0: IPMI device rev. 0, firmware rev. 1.61, version 2.0
> | >
> | > So maybe we need to aggressively bump up the timeout.  I put a
> | > timeout in since I didn't want the system to hang.  Does anyone
> | > have a good idea for a timeout value?  I thought I tried 6s
> | > initially and it had issues, but the machine I was playing with
> | > at the time had 3 mfi cards and various disks hanging off it.
> | 
> | I have no idea about IPMI timeout to propose, but can't that check be 
> | remade opposite: if response received -- use it, otherwise -- check 
> | error value? Obviously it is not IPMI problem that CPU is busy, but 
> | ability to work in those conditions would be a bonus.
> 
> In theory, that is what it is supposed to be doing.  However, it seems
> like the entire system is so busy that when the IPMI task notifies the
> thread that it is complete, the thread, even though it may have gotten
> the wakeup, fails because it slept for too long.  If this is indeed the
> case I can work around it by setting a "completion" variable and
> checking that instead of erroring out on the timeout failure of msleep.

Well, if the kcs thread truly does run to completion and call wakeup()
before the timeout, the wakeup should put the sleeping thread back on the
run queue and things should still work fine.  The problem here is that the
kcs thread isn't getting a chance to run because of the other bug.  Having
a separate completion token won't do anything to solve this.
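
To make the pattern concrete, here is a rough sketch of the msleep()/wakeup()
exchange with the completion-flag check being proposed; the structure, field
and constant names below are invented for illustration and are not the actual
ipmi(4) code:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>

#define	REQ_TIMEOUT	(3 * hz)	/* the 3 second limit discussed above */

struct fake_request {
	int	ir_done;	/* set by the KCS worker when it finishes */
	int	ir_error;	/* result code filled in by the worker */
};

/* KCS worker side, analogous to ipmi_complete_request(). */
static void
fake_complete_request(struct fake_request *req, struct mtx *lock)
{

	mtx_lock(lock);
	req->ir_done = 1;
	wakeup(req);
	mtx_unlock(lock);
}

/* Submitter side, analogous to ipmi_submit_driver_request(). */
static int
fake_submit_driver_request(struct fake_request *req, struct mtx *lock)
{
	int error = 0;

	mtx_lock(lock);
	/* (the request would be queued to the KCS worker here) */
	while (!req->ir_done && error == 0)
		error = msleep(req, lock, PWAIT, "ipmireq", REQ_TIMEOUT);
	/*
	 * The proposed workaround: if msleep() timed out but the worker
	 * did in fact finish, take the result anyway.  Note that this
	 * only changes anything when the wakeup never arrives within the
	 * timeout; a wakeup that happens before the timeout already makes
	 * msleep() return 0, which is the point made above.
	 */
	if (error == EWOULDBLOCK && req->ir_done)
		error = 0;
	mtx_unlock(lock);
	return (error);
}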

> It just seems strange to put in code that effectively says: if the
> CPU is too slow for things to complete in the required time, but the
> request did complete successfully, don't treat that as an error!
> 
> Increasing the timeout shouldn't have any negative impact on boot times
> and is just a safety measure.  I think we already know the HW should be
> okay by this point.

We can up the timeout; I'm just worried it is papering over another bug (in
USB, it seems) that can cause other problems.
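
For what it's worth, a rough sketch of one way to make that limit a loader
tunable rather than another hard-coded constant; the hw.ipmi.attach_timeout
name and the variable below are invented for this example and are not
something the driver has today:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

/* Default matches the existing 3 second wait; bump to 6 or more if needed. */
static int ipmi_attach_timeout = 3;
TUNABLE_INT("hw.ipmi.attach_timeout", &ipmi_attach_timeout);

SYSCTL_NODE(_hw, OID_AUTO, ipmi, CTLFLAG_RD, 0, "ipmi(4) settings");
SYSCTL_INT(_hw_ipmi, OID_AUTO, attach_timeout, CTLFLAG_RW,
    &ipmi_attach_timeout, 0,
    "seconds to wait for GET_DEVICE_ID during attach");

/*
 * The attach-time msleep() would then wait ipmi_attach_timeout * hz
 * ticks instead of the current fixed 3 second value.
 */

With something like that, setting hw.ipmi.attach_timeout=6 in
/boot/loader.conf would cover slow or heavily loaded boxes without a
kernel rebuild.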

-- 
John Baldwin


