Date:      Sun, 26 Apr 2009 14:50:08 +0200
From:      Greg Byshenk <freebsd@byshenk.net>
To:        freebsd-stable@freebsd.org
Subject:   em0 watchdog timeout (and 3ware problems) 7-stable
Message-ID:  <20090426125008.GK1550@core.byshenk.net>

I have one machine that is seeing watchdog timeouts on em0, running 7-STABLE
amd64 as of 2009.04.19, and also some other more perverse errors.

Twice now in the last 48 hours, this machine has become unreachable via the
network, and connecting to the console shows an endless string of 

   [...]
   em0: watchdog timeout -- resetting
   em0: watchdog timeout -- resetting
   em0: watchdog timeout -- resetting

messages. The machine is almost locked up.  That is, I can get a login
prompt, but can go no further than typing in a username; after the
username, no password prompt, and nothing further.  The only option is
to hard reset the machine or to drop to debugger and reboot.

Now the "perverse" part.  After restarting, the system partition is no
more.

Background detail:  the machine is a fileserver, with a 3Ware 9650SE-16ML
SATA controller connected to 16 1TB SATA drives, configured as
a 14-drive RAID10 array (+ 2 hot spares), with a 50GB system partition
and 6.5TB data partition.  The system partition is configured as da0,
with one slice and more or less standard partitions for / /var /tmp, etc.
(the data partition of the array is sliced with gpt).

The issue here is that, upon restart, all partition information on da0
seems to have disappeared, and restarting results in a "no operating
system found" message, and a failure to boot (obviously).

But all of the data is still present.  If I boot into rescue mode,
recreate da0s1, mark it bootable, and restore the bsdlabel, then
everything works again.  I can restart the machine, and it comes back
up normally (it requires an fsck of everything on da0, but after that
everything is back to normal).

I don't know if this is two unrelated problems, or one problem with
two symptoms, or something else.  I think that I can safely say that
it is not a problem with the 3Ware controller itself, as I replaced
the controller with a spare (identical model), and the problem
recurred.  Additionally, I have an almost-identical configuration on
four other machines, none of which are experiencing any problems.
One thing that is different is that the other machines use
Intel PRO/1000 PF (pci-e) NICs.

Is there some known problem with the Intel 2572 fibre NIC?  Or some
potential interaction of it with the 3ware RAID controller?

For the moment, I've set hw.pci.enable_msi=0 (as discussed in the
threads on 7.2/bge), and am building a new kernel/world from sources
csup'd one hour ago, but I'd really like to hear any ideas about this
-- particularly the wiping of the label.

Some information about the system:


# /dev/da0s1:
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a:  2097152        0    4.2BSD        0     0     0 
  b:  8388608  2097152      swap                    
  c: 104856192        0    unused        0     0         # "raw" part, don't edit
  d:  8388608 10485760    4.2BSD        0     0     0 
  e:  2097152 18874368    4.2BSD        0     0     0 
  f: 41943040 20971520    4.2BSD        0     0     0 
  g: 41941632 62914560    4.2BSD        0     0     0 
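
As a sanity check on a label that has to be restored by hand, the layout
above can be verified for gaps and overlaps.  This is only a minimal sketch:
the rows are copied from the bsdlabel output above (sizes and offsets in
sectors), and the helper name check_label is made up for illustration.

```python
# Partition rows copied from the bsdlabel output above: (letter, size, offset),
# all in sectors. 'c' is the raw partition spanning the whole slice.
LABEL = [
    ("a", 2097152, 0),
    ("b", 8388608, 2097152),
    ("c", 104856192, 0),
    ("d", 8388608, 10485760),
    ("e", 2097152, 18874368),
    ("f", 41943040, 20971520),
    ("g", 41941632, 62914560),
]


def check_label(rows):
    """Return a list of problems found (empty if the layout is consistent)."""
    raw_size = next(size for name, size, _ in rows if name == "c")
    # Walk the real partitions in offset order and track where the next
    # one is expected to start.
    parts = sorted((r for r in rows if r[0] != "c"), key=lambda r: r[2])
    problems = []
    expected = 0
    for name, size, offset in parts:
        if offset < expected:
            problems.append(f"{name} overlaps the previous partition")
        elif offset > expected:
            problems.append(f"gap of {offset - expected} sectors before {name}")
        expected = max(expected, offset + size)
    if expected != raw_size:
        problems.append(f"partitions end at {expected}, slice is {raw_size}")
    return problems


if __name__ == "__main__":
    print(check_label(LABEL) or "label is contiguous and fills the slice")
```

For the label shown here the check finds nothing to report: the partitions
run back to back from offset 0 and end exactly at the size of the raw
partition.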


em0@pci0:4:1:0: class=0x020000 card=0x10038086 chip=0x10018086 rev=0x02 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '2572 10/100/1000 Ethernet Controller (Fiber)'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xda000000, size 131072, enabled
    bar   [14] = type Memory, range 32, base 0xda020000, size 65536, enabled
 
twa0@pci0:9:0:0:        class=0x010400 card=0x100413c1 chip=0x100413c1 rev=0x01 hdr=0x00
    device     = '9650SE Series PCI-Express SATA2 Raid Controller'
    class      = mass storage
    subclass   = RAID
    bar   [10] = type Prefetchable Memory, range 64, base 0xd8000000, size 33554432, enabled
    bar   [18] = type Memory, range 64, base 0xda300000, size 4096, enabled
    bar   [20] = type I/O Port, range 32, base 0x3000, size 256, enabled
    cap 01[40] = powerspec 2  supports D0 D1 D2 D3  current D0
    cap 05[50] = MSI supports 32 messages, 64 bit
    cap 10[70] = PCI-Express 1 legacy endpoint
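
For anyone wanting to match these devices against a driver's list of
supported IDs, the chip= values above can be split into vendor and device
IDs.  A small sketch, assuming the usual pciconf -l layout (device ID in the
upper 16 bits, vendor ID in the lower 16); the function name split_pci_id is
made up for illustration.

```python
def split_pci_id(value):
    """Split a pciconf 'chip=' value into (vendor_id, device_id).

    Assumes the low 16 bits hold the vendor ID and the high 16 bits
    hold the device ID.
    """
    raw = int(value, 16)
    return raw & 0xFFFF, raw >> 16


if __name__ == "__main__":
    # chip= values copied from the pciconf output above.
    for name, chip in (("em0", "0x10018086"), ("twa0", "0x100413c1")):
        vendor, device = split_pci_id(chip)
        print(f"{name}: vendor=0x{vendor:04x} device=0x{device:04x}")
```

The same split applies to the card= field, which carries the subsystem
vendor and device IDs.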

-- 
greg byshenk  -  gbyshenk@byshenk.net  -  Leiden, NL


