Date:      Fri, 15 Jan 2010 14:48:19 -0800
From:      Pyun YongHyeon <pyunyh@gmail.com>
To:        Floris Bos <info@je-eigen-domein.nl>
Cc:        freebsd-net@freebsd.org
Subject:   Re: kern/92090: [bge] bge: watchdog timeout -- resetting
Message-ID:  <20100115224819.GK1228@michelle.cdnetworks.com>
In-Reply-To: <201001152246.50315.info@je-eigen-domein.nl>
References:  <201001140140.o0E1e5hr072464@freefall.freebsd.org> <201001150333.59107.info@je-eigen-domein.nl> <20100115185424.GG1228@michelle.cdnetworks.com> <201001152246.50315.info@je-eigen-domein.nl>

On Fri, Jan 15, 2010 at 10:46:50PM +0100, Floris Bos wrote:
> On Friday 15 January 2010 07:54:24 pm Pyun YongHyeon wrote:
> > On Fri, Jan 15, 2010 at 03:33:58AM +0100, Floris Bos wrote:
> > > On Friday 15 January 2010 01:53:16 am Pyun YongHyeon wrote:
> > > > On Thu, Jan 14, 2010 at 09:48:56PM +0100, Floris Bos wrote:
> > > > > On Thursday 14 January 2010 09:11:44 pm Pyun YongHyeon wrote:
> > > > > > On Thu, Jan 14, 2010 at 09:08:02PM +0100, Floris Bos wrote:
> > > > > > > On Thursday 14 January 2010 06:56:03 pm Pyun YongHyeon wrote:
> > > > > > > > On Thu, Jan 14, 2010 at 04:33:19AM +0100, Floris Bos wrote:
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > On Thursday 14 January 2010 03:54:52 am Pyun YongHyeon wrote:
> > > > > > > > > > >  ==
> > > > > > > > > > >  bge0: <HP NC107i PCIe Gigabit Server Adapter, ASIC rev. 0x5784100> mem 0xdf900000-0xdf90ffff irq 16 at device 0.0 on pci32
> > > > > > > > > > >  ==
> > > > > > > > > > >  
> > > > > > > > > > > >  After boot, the network works for about 5 seconds, barely enough time to get an IP by DHCP and send a ping or two.
> > > > > > > > > > >  Then network connectivity goes down, and after some time there is a "bge0: watchdog timeout -- resetting" message.
> > > > > > > > > > >  
> > > > > > > > > > >  Then network works again for 5 seconds, and goes down again. All the time, repeatedly.
> > > > > > > > > > >  
> > > > > > > > > > >  The system works fine under Ubuntu. So I assume the hardware is ok.
> > > > > > > > > > >  
> > > > > > > > > > 
> > > > > > > > > > I'm not sure but it looks like you have a BCM5784 controller. What is
> > > > > > > > > > the output of "devinfo -rv | grep phy"?
> > > > > > > > > 
> > > > > > > > > ==
> > > > > > > > > ukphy0 pnpinfo oui=0x50ef model=0x3a rev=0x4 at phyno=1
> > > > > > > > > ukphy1 pnpinfo oui=0x50ef model=0x3a rev=0x4 at phyno=1
> > > > > > > > > ==
> > > > > > > > 
> > > > > > > > Support for the PHY was added in r202269.
> > > > > > > > Please try again after applying the change. Or you can download
> > > > > > > > sys/dev/mii/miidevs and sys/dev/mii/brgphy.c from HEAD and rebuild
> > > > > > > > the kernel.
> > > > > > > 
> > > > > > > Fetched the latest source using CVS on another computer, and transferred it to the system concerned by USB stick.
> > > > > > > Rebuilt the kernel, but the problem is still there.
> > > > > > > 
> > > > > > Would you show me the full dmesg output, including the "watchdog timeout"
> > > > > > messages?
> > > > > 
> > > > > ===
> > > > > Copyright (c) 1992-2010 The FreeBSD Project.
> > > > > Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
> > > > > 	The Regents of the University of California. All rights reserved.
> > > > 
> > > > [...]
> > > > 
> > > > > bge0: <HP NC107i PCIe Gigabit Server Adapter, ASIC rev. 0x5784100> mem 0xdf900000-0xdf90ffff irq 16 at device 0.0 on pci32
> > > > > miibus0: <MII bus> on bge0
> > > > > brgphy0: <BCM5784 10/100/1000baseTX PHY> PHY 1 on miibus0
> > > > > brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
> > > > > bge0: Ethernet address: f4:ce:46:0f:2a:2c
> > > > > bge0: [FILTER]
> > > > > pcib4: <ACPI PCI-PCI bridge> irq 16 at device 28.5 on pci0
> > > > > pci34: <ACPI PCI bus> on pcib4
> > > > > bge1: <HP NC107i PCIe Gigabit Server Adapter, ASIC rev. 0x5784100> mem 0xdfa00000-0xdfa0ffff irq 17 at device 0.0 on pci34
> > > > > miibus1: <MII bus> on bge1
> > > > > brgphy1: <BCM5784 10/100/1000baseTX PHY> PHY 1 on miibus1
> > > > > brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
> > > > > bge1: Ethernet address: f4:ce:46:0f:2a:2d
> > > > > bge1: [FILTER]
> > > > 
> > > > [...]
> > > > 
> > > > Would you give the attached patch a try? I don't know whether it will
> > > > help, though. I couldn't find any information related to this issue
> > > > in the publicly available datasheet.
> > > 
> > > The patch did not make any difference.
> > > 
> > > 
> > > However I did notice something else odd.
> > > The problem only occurs on bge0, the second interface bge1 does work.
> > > 
> > > I grabbed the U57DIAG diagnostic boot CD from the Broadcom site, and noticed that the first interface has ASF enabled, while the second one has not.
> > > I disabled ASF by doing:
> > > 
> > > ==
> > > b57udiag -cmd
> > > setasf -d
> > > ==
> > > 
> > > And now the first interface also works properly.
> > > 
> > 
> > Glad to hear you solved the issue. I totally forgot that CURRENT enables
> > ASF support by default (hw.bge.allow_asf).
> > 
> > > So there is something with the ASF stuff that conflicts with FreeBSD.
> > > The IPMI card of the system is configured to use a dedicated 3rd LAN port, and is NOT sharing bge0.
> > > But perhaps the NIC is initialized differently nevertheless when ASF firmware is enabled, and that is causing issues?
> > > 
> > 
> > Yes, I remember there were a couple of issues related to ASF.
> > Linux seems to have very complex logic to coexist with ASF/IPMI
> > firmware, and I still don't understand its implications at this
> > time. bge(4) may need more robust code to handle that, but the
> > datasheet seems to show very limited information. Not having an
> > ASF/IPMI capable bge(4) controller also makes it hard for me to
> > experiment with code.
> 
> Can understand the difficulty to debug such things, without having the hardware.
> So I did some more research myself, and found the bug.
> 
> You said Linux was complicated, so I took a look at the OpenSolaris bge source instead to see how they handle ASF, and I noticed the following comment ( http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/bge/bge_chip2.c ):
> 
> ==
> /*
>  * The driver is supposed to notify ASF that the OS is still running
>  * every three seconds, otherwise the management server may attempt
>  * to reboot the machine.  If it hasn't actually failed, this is
>  * not a desirable result.  However, this isn't running as a real-time
>  * thread, and even if it were, it might not be able to generate the
>  * heartbeat in a timely manner due to system load.  As it isn't a
>  * significant strain on the machine, we will set the interval to half
>  * of the required value.
>  */
> ==
> 
> What a coincidence: although the entire system is not rebooted, my network link went up and down every 3 seconds according to the switch.
> 
> Seems FreeBSD only notifies ASF every 5 seconds. Attached a patch that reduces it to 2 seconds, and it solves the problem for me, with ASF enabled.
> 

Nice catch! Thanks a lot!
Actually, I suspect there is another bug in the ASF handling. I'll send
a call for testing (CFT) to the list and see how other bge(4)
controllers behave.
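
For illustration only, the timing argument can be sketched in C. This is
not the code in bge(4) nor the attached patch: the macro and function
names are hypothetical, and the 3-second requirement is taken from the
OpenSolaris comment quoted in this message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical name; the 3 s value comes from the OpenSolaris comment. */
#define ASF_REQUIRED_INTERVAL_MS 3000

/*
 * Pick a sending interval with a safety margin: half of the required
 * value, as the OpenSolaris comment describes.
 */
static unsigned heartbeat_interval_ms(unsigned required_ms)
{
    assert(required_ms >= 2);   /* halving must leave a nonzero interval */
    return required_ms / 2;
}

/*
 * True if a driver that sends a heartbeat every send_ms milliseconds
 * would exceed the interval the firmware is assumed to require.
 */
static bool heartbeat_too_slow(unsigned send_ms, unsigned required_ms)
{
    return send_ms > required_ms;
}
```

Under these assumptions, a 5-second heartbeat is too slow for a 3-second
requirement, while the 2-second interval from the patch is within it.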

> 
> Yours sincerely,
> 
> Floris Bos




