Date:      Sun, 26 Jul 1998 12:08:36 -0700
From:      Mitch Lichtenberg <mitch@pa.dec.com>
To:        "'current@freebsd.org'" <current@FreeBSD.ORG>
Subject:   Hard hangs of -current under heavy load - how to debug?
Message-ID:  <c=US%a=_%p=DEC%l=SRC-EXCHANGE-980726190836Z-4659@src-exchange.pa.dec.com>


I've been experiencing some random hangs on -current
releases over the past few months (I'm currently at 
3.0-19980723, but I've seen this since last December).
The systems operate under heavy load for about 24 hours, 
then one or two randomly hang.  The hangs are hard (no 
console messages, no dumps/traps, can't escape to 
the debugger).  It looks like interrupts are disabled.   

Generally, how do you debug a hang like this?  Are there
any generic techniques or kernel options that I can 
enable to help me figure this one out?  My
next step is to hook up a button to the NMI line
to see if I can get into DDB that way, but perhaps
there's something easier I can do in the meantime,
or maybe there are known problems with my configuration
that someone can point out to me.

----

Workload / system description, for those that 
are interested:

I've got a network of ten identical machines.  They
netboot from a "master" machine (I did a netboot
driver for the DEC DC21143 Ethernet chip if anyone's
interested).

The workload is a distributed storage application I'm
working on, which generates a huge amount of UDP
traffic and disk I/O.  When the tests are running,
the net and disk are running flat out, near maximum
throughput.  The application is basically I/O 
bound - I seldom see more than 15% CPU utilization.
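Anyone who wants to approximate this kind of load on their own machines could start from something like the following Python sketch. It is not the Petal application. The function names `udp_blast` and `disk_churn`, the datagram and block sizes, and the loopback target are all hypothetical choices for illustration. A single Python process will probably not saturate a 100Mb/s link and a disk at the same time, so several copies may be needed.

```python
import os
import socket
import tempfile


def udp_blast(host, port, count, size):
    """Send `count` UDP datagrams of `size` bytes to (host, port).

    Returns the total number of payload bytes accepted by sendto().
    UDP gives no delivery guarantee; this only measures what was sent.
    """
    payload = b"\xa5" * size
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        for _ in range(count):
            sent += s.sendto(payload, (host, port))
    return sent


def disk_churn(path, count, size):
    """Write `count` blocks of `size` bytes to `path`, fsync'ing each block
    so the data actually reaches the disk instead of sitting in the cache."""
    block = os.urandom(size)
    written = 0
    with open(path, "wb") as f:
        for _ in range(count):
            written += f.write(block)
            f.flush()
            os.fsync(f.fileno())
    return written


if __name__ == "__main__":
    # Hypothetical target: a local receiver. On a real test network,
    # point udp_blast at a peer machine instead.
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))
    port = rx.getsockname()[1]
    print("udp bytes:", udp_blast("127.0.0.1", port, 100, 1024))
    with tempfile.TemporaryDirectory() as d:
        print("disk bytes:", disk_churn(os.path.join(d, "churn.bin"), 16, 64 * 1024))
    rx.close()
```

Running both loops concurrently (for example, in separate processes) comes closer to the mixed network and disk load described above than running either one alone.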

At present, some PCs are servers (lots of disk and net
traffic), and some are clients (only net traffic). 
Both the clients and servers are affected by this
problem, so I'm tempted to believe the disk is OK,
but servers do crash more often than clients.  The
"master" machine, identical to the others, has
never crashed.  

Could there be anything screwy about the hardware
interrupt mechanism, or known problems with the VIA
VP2/97 chipset?

(see
http://www.research.digital.com/SRC/personal/Ed_Lee/Petal/petal.html
if you'd like to know more about the project) 


Basic configuration: 

Motherboard:  FIC PA-2007 (VIA VP2/97 chipset, chosen for ECC)
Processor:    Cyrix 6x86MX
Memory:       64MB
Disk:         Four IBM Deskstar 8.4GB, UltraDMA, all masters
              (Promise Ultra33 IDE controller for drives 3 and 4)
Network:      DEC DE500-BA (DC21143) 100Mb/s, connected to
              a Prominet fast Ethernet switch
              The machines boot via netboot.


Thanks!

Mitch Lichtenberg
COMPAQ Systems Research Center (yes, formerly Digital Equipment Corp.)
Palo Alto, CA.
mitch@pa.dec.com






