Date:      Fri, 13 Mar 2009 16:01:58 +1100
From:      Nick Withers <nick@nickwithers.com>
To:        freebsd-stable@freebsd.org
Subject:   NICs locking up, "*tcp_sc_h"
Message-ID:  <1236920519.1490.30.camel@localhost>

Hello all,

I recently installed my first amd64 system (currently running RELENG_7
from 2009-03-11) to replace an aged ppc box and have been having dramas
with the network locking up.

Breaking into the debugger manually and running ps shows the network
card's interrupt thread (e.g., "[irq20:  fxp0+]") in state "LL", blocked
on "*tcp_sc_h". The process(es) trying to use the network at the time
appear to be in state "L", blocked on "*tcp".

I thought this might have been something-or-other in the fxp driver, so
I installed an rl card and sadly ran into the issue again.

The console appears unresponsive, but I can still get into the debugger.
As soon as I do, input I'd already sent seems to "go through": e.g., if
I hit "Enter" a couple o' times, nothing happens, but when I
<Ctrl>+<Alt>+<Esc> into the debugger a few login prompts pop up before
the debugger output.

A "where" on the fxp / rl process (thread?) gives (transcribed from the
console):
____

Tracing PID 31 tid 100030 td 0xffffff00012016e0
sched_switch() at sched_switch+0xf1
mi_switch() at mi_switch+0x18f
turnstile_wait() at turnstile_wait+0x1cf
_mtx_lock_sleep() at _mtx_lock_sleep+0x76
syncache_lookup() at syncache_lookup+0x176
syncache_expand() at syncache_expand+0x38
tcp_input() at tcp_input+0xa7d
ip_input() at ip_input+0xa8
ether_demux() at ether_demux+0x1b9
ether_input() at ether_input+0x1bb
fxp_intr() at fxp_intr+0x233
ithread_loop() at ithread_loop+0x17f
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
____

A "where" on a process stuck in "*tcp", in this case "[swi4: clock]",
gave the somewhat similar:
____

sched_switch() at sched_switch+0xf1
mi_switch() at mi_switch+0x18f
turnstile_wait() at turnstile_wait+0x1cf
_rw_rlock() at _rw_rlock+0x8c
ipfw_chk() at ipfw_chk+0x3ab2
ipfw_check_out() at ipfw_check_out+0xb1
pfil_run_hooks() at pfil_run_hooks+0x9c
ip_output() at ip_output+0x367
syncache_respond() at syncache_respond+0x2fd
syncache_timer() at syncache_timer+0x15a
(...)
____

In this particular case, the fxp0 card is in a lagg with rl0, but this
problem can be triggered with either card on its own...

The scheduler is SCHED_ULE.

I'm not too sure how to give more useful information than this, I'm
afraid. It's a custom kernel, too... Do I need to supply information on
what code actually exists at the relevant addresses (I'm not at all
clued in on how to do this... Sorry!)? Should I chuck WITNESS,
INVARIANTS et al. in?
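
For reference, a minimal sketch of the kernel configuration lines that
would turn these checks on, assuming a custom RELENG_7 config file (the
name MYKERNEL is hypothetical, not my actual config). WITNESS tracks lock
acquisition order and reports order reversals, and with it compiled in,
DDB's "show alllocks" lists the locks each thread holds. These options
add noticeable runtime overhead:
____

# Added to a hypothetical /usr/src/sys/amd64/conf/MYKERNEL
makeoptions     DEBUG=-g            # build the kernel with debug symbols
options         KDB                 # kernel debugger framework
options         DDB                 # in-kernel interactive debugger
options         INVARIANTS          # extra run-time sanity checks
options         INVARIANT_SUPPORT   # support code required by INVARIANTS
options         WITNESS             # lock order verification
options         WITNESS_SKIPSPIN    # skip spin mutexes to reduce overhead
____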

I *think* every time this has been triggered there's been a "python2.5"
process in the "*tcp" state. This machine runs net-p2p/deluge and
generally has at least 100 TCP connections on the go at any given time.

Can anyone give me a clue as to what I might do to track this down?
Appreciate any pointers.
-- 
Nick Withers
email: nick@nickwithers.com
Web: http://www.nickwithers.com
Mobile: +61 414 397 446
