Date:      Tue, 13 Mar 2007 14:08:48 +0000
From:      Adrian Wontroba <aw1@stade.co.uk>
To:        freebsd-stable@freebsd.org
Subject:   6.2-STABLE deadlock?
Message-ID:  <20070313140848.GA89182@steerpike.hanley.stade.co.uk>

At work, amongst my stable of old computers running FreeBSD, I have a
Fujitsu M800 - a four-processor Xeon SMP machine with 4 GB of memory. This
primarily runs Nagios and a small and lightly used MySQL database, along
with a few inbound FTP transfers per minute. Its disc subsystem is based
on a Mylex card, which rules out crash dumps.

At some point during 5.5-STABLE this machine started to hang occasionally
while performing its daily "application" housekeeping - closing and
restarting Apache and Nagios, and dumping the database. Upgrading to
6.2-STABLE appeared to solve the problem: 1,000 cycles of the sequence
which seemed to provoke the hang ran without incident.
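
A minimal sketch of how such a cycle test could be driven, assuming the
housekeeping steps can be expressed as a list of commands. The function
name and the example command lines are hypothetical placeholders; the
actual housekeeping script is not shown in this message. Note that a
timeout only catches a single hung command; if the whole machine hangs,
the harness hangs with it.

```python
import subprocess
import sys


def run_cycles(commands, cycles, timeout_s=300):
    """Run every command in order, `cycles` times.

    Returns the number of cycles that completed. Stops early if a
    command exits non-zero or does not finish within timeout_s seconds.
    """
    completed = 0
    for _ in range(cycles):
        for cmd in commands:
            try:
                result = subprocess.run(cmd, timeout=timeout_s)
            except subprocess.TimeoutExpired:
                print("timed out:", cmd, file=sys.stderr)
                return completed
            if result.returncode != 0:
                print("failed:", cmd, file=sys.stderr)
                return completed
        completed += 1
    return completed


# Hypothetical usage; these paths are placeholders, not the real scripts:
#   steps = [["/path/to/stop-services"],
#            ["/path/to/dump-database"],
#            ["/path/to/start-services"]]
#   print(run_cycles(steps, 1000))
```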

The sources for this version of the kernel and userland were fetched with
cvsup at 01:20 GMT on 06 March.

However, shortly after 15:15 last Sunday afternoon the machine hung
again "out of the blue". kdb diagnostics were taken some 12 hours later,
and look somewhat odd. Maybe it was left to fester for too long.

The ps etc. output is at http://www.stade.co.uk/crash/console, which
contains boot-to-boot serial console output, including some output from
test cycles. I'd be grateful for any expert comments on it.

Supporting information:

[root@beastie ~/crash]# df
Filesystem    1K-blocks     Used    Avail Capacity  Mounted on
/dev/mlxd0s1a    507630    70074   396946    15%    /
devfs                 1        1        0   100%    /dev
/dev/mlxd0s1f  63541498 44355014 14103166    76%    /home
/dev/mlxd0s1e  16244334  6784900  8159888    45%    /usr
/dev/mlxd0s1d   1012974   117456   814482    13%    /var
/dev/md0           1646       32     1484     2%    /home/topftp/instances
/dev/md1         253678      132   233252     0%    /tmp

[root@beastie ~]# find /var -inum 23 -ls
    23        4 -rw-r--r--    1 daemon           daemon                 60 Mar 12 20:22 /var/rwho/whod.xjamesfriis

The problem stopped HTTP and FTP logging soon after 15:14 on Sunday 11
March. Diagnostics were taken and the machine was rebooted between 04:30
and 04:43 on Monday 12 March.

172.19.112.92 - - [11/Mar/2007:15:14:53 +0000] "GET / HTTP/1.0" 200 688 "-" "check_http/1.89 (nagios-plugins 1.4.3)"
<time passes>
172.19.112.92 - - [12/Mar/2007:04:44:14 +0000] "GET / HTTP/1.0" 200 688 "-" "check_http/1.89 (nagios-plugins 1.4.3)"
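
The length of the silence in an access log like the one above can be
measured mechanically. The following is only a sketch: the function names
are made up for illustration, and it assumes the bracketed timestamp
format seen in the excerpt (day/month/year:time zone-offset) with English
month abbreviations.

```python
import re
from datetime import datetime

# First bracketed field on the line, e.g. [11/Mar/2007:15:14:53 +0000]
STAMP = re.compile(r"\[([^\]]+)\]")


def parse_stamp(line):
    """Return the timestamp on a log line, or None if there is none."""
    match = STAMP.search(line)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None


def largest_gap(lines):
    """Return (gap, earlier, later) for the widest gap between
    consecutive timestamped lines, or None with fewer than two stamps."""
    stamps = [t for t in map(parse_stamp, lines) if t is not None]
    best = None
    for earlier, later in zip(stamps, stamps[1:]):
        gap = later - earlier
        if best is None or gap > best[0]:
            best = (gap, earlier, later)
    return best
```

Fed the two check_http lines above, this reports a gap of 13:29:21.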

Mar 11 15:15:35 beastie ftpd[91652]: connection from appsupcen (10.208.1.134)
Mar 11 15:15:35 beastie ftpd[91652]: FTP LOGIN FROM appsupcen as topftp
Mar 11 15:15:35 beastie ftpd[91652]: session root changed to /home/topftp/instances
Mar 11 15:15:35 beastie ftpd[91652]: put in.env_status.html.gz = 592 bytes (wd: /topftp/appsupcen; chrooted)
<time passes>
Mar 11 15:15:35 beastie ftpd[91652]: rename in.env_status.html.gz env_status.html.gz (wd: /topftp/appsupcen; chrooted)
Mar 12 04:44:31 beastie ftpd[1161]: connection from appsupcen (10.208.1.134)
Mar 12 04:44:31 beastie ftpd[1161]: FTP LOGIN FROM appsupcen as topftp
Mar 12 04:44:31 beastie ftpd[1161]: session root changed to /home/topftp/instances
Mar 12 04:44:31 beastie ftpd[1161]: mkdir topftp/appsupcen (wd: /; chrooted)

Support diary:

15:20
Beastie seems to have crashed and is down;

16:54
Beastie is no longer pingable by rjmon1;

04:30 - 04:43
(support person quoting from the documentation I'd provided about what
to do after a hang)
Type "return tilde hash" (CR~#) which will make cu send a break signal to beastie, and should cause beastie to drop into the ddb kernel debugger.
In the following, you may see "more" prompts. Type space at each for the next page.
Type these debugger commands
ps
show pcpu
show allpcpu
show locks
show alllocks
show lockedvnods
trace
alltrace
04:43 - beastie is now back up and working, after typing call cpu_reset()
following the above commands to reboot it.

AW: preserved and inspected diagnostic output. It looks very unlike
that for previous crashes (without a serial console) where a noticeable
feature was many ftpd processes in a UFS state. Possibly "things
happened" in the 12 hour period between the onset of the problem on
Sunday afternoon and the diagnostics being taken on Monday morning.

-- 
Adrian Wontroba
Adrian's Birthday Celebration: Crewe Limelight, Saturday 17 March. David
Hughes and Tiny Tin Lady.  Free but ticketed - email me your postal
address if you want to come. No under 18s.