Date:      Mon, 20 Jun 2005 09:38:27 +0100 (BST)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        Eirik Øverby <ltning@anduin.net>
Cc:        stable@freebsd.org, mlaier@FreeBSD.org
Subject:   Re: NFS-related hang in 5.4?
Message-ID:  <20050620092829.E19830@fledge.watson.org>
In-Reply-To: <CF3CB334-ACF4-4DA5-9CE5-D2C7466DCD10@anduin.net>
References:  <8149D7F8-3FA2-48F5-BF03-9AF813448BF0@anduin.net> <20050619185338.J6413@fledge.watson.org> <CF3CB334-ACF4-4DA5-9CE5-D2C7466DCD10@anduin.net>


On Mon, 20 Jun 2005, Eirik Øverby wrote:

>> Hmm.  Looks like a bug in dummynet.  ipfw should not be directly
>> re-injecting UDP traffic back into the input path from an outbound path,
>> or it risks re-entering, generating lock order problems, etc. It should
>> be getting dropped into the netisr queue to be processed from the
>> netisr context.
>
> This problem would exist across all 5.4 installations, both i386 and
> amd64? Would it depend on heavy load, or could it theoretically happen
> at any time when there's traffic? All three of my fbsd5 servers (dual
> opteron, dual p3-1ghz, dual p3-700mhz) are experiencing random hangs
> with ~a few weeks between, impression is that if running single-cpu mode
> they are all stable. All using dummynet in a comparable manner. Ideas?

Yes.  Basically, the network stack avoids recursion in processing for
"complicated" packets by deferring processing an offending packet to a
thread called the 'netisr'.  Whenever the stack reaches a possible
recursion point on a packet, it's supposed to queue the packet for
processing 'later' in a per-protocol queue, unwind, and then when the
netisr runs, pick up and continue processing.  In the stack trace you
provide, dummynet appears to immediately invoke the in-bound
network path from the out-bound network path, walking back into the
network stack from the outbound path.  This is generally forbidden, for a
variety of reasons:

- We do allow the in-bound path to call the out-bound path, so that
   protocols like TCP, and services like NFS can turn around packets
   without a context switch.  If further recursion is permitted, the stack
   may overflow.

- Both paths may hold network stack locks over calls in either direction
   -- specifically, we allow protocol locks to be held over calls into the
   socket layer, as the protocol layer drives operation; if a recursive
   call is made, deadlocks can occur due to violating the lock order.  This
   is what is happening in your case.
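
The queue-and-unwind pattern described above can be sketched roughly as
follows.  This is a toy user-space model in Python, not FreeBSD code: the
names ToyStack, ip_input, ip_output, netisr_run and simulate are all
illustrative assumptions, and a plain non-recursive lock stands in for a
protocol lock.  A non-blocking acquire is used so that the model reports
the re-entry instead of hanging the way the real thing would.

```python
import threading
from collections import deque


class ReentryError(RuntimeError):
    """Raised when the toy input path is entered while its lock is held."""


class ToyStack:
    """Toy model: one protocol lock plus a netisr-style deferral queue."""

    def __init__(self, defer):
        self.defer = defer                   # True: queue for later; False: re-inject directly
        self.proto_lock = threading.Lock()   # non-recursive, stands in for a protocol lock
        self.netisr_queue = deque()
        self.delivered = []

    def ip_input(self, pkt):
        # Non-blocking acquire: report the problem rather than deadlock.
        if not self.proto_lock.acquire(blocking=False):
            raise ReentryError("input path re-entered with protocol lock held")
        try:
            self.delivered.append(pkt)
            if pkt.get("reply"):
                # The in-bound path may call the out-bound path with the lock held.
                self.ip_output(pkt["reply"])
        finally:
            self.proto_lock.release()

    def ip_output(self, pkt):
        if pkt.get("loop_back"):
            if self.defer:
                self.netisr_queue.append(pkt)   # queue the packet and unwind
            else:
                self.ip_input(pkt)              # direct re-injection (the forbidden case)

    def netisr_run(self):
        # Runs after the original call chain has unwound and dropped its lock.
        while self.netisr_queue:
            self.ip_input(self.netisr_queue.popleft())


def simulate(defer):
    """Deliver one request whose reply loops back into the input path."""
    stack = ToyStack(defer)
    reply = {"name": "reply", "loop_back": True}
    request = {"name": "request", "reply": reply}
    try:
        stack.ip_input(request)
        stack.netisr_run()
        return [p["name"] for p in stack.delivered]
    except ReentryError as exc:
        return str(exc)


if __name__ == "__main__":
    print("deferred:", simulate(defer=True))
    print("direct:  ", simulate(defer=False))
```

With defer=True the looped-back packet is only processed once the first
ip_input call has returned and released its lock; with defer=False the
second ip_input call finds the lock already held by its own caller.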

Pretty much all network code is entirely architecture-independent, so bugs
typically span architectures, although race conditions can sometimes be
hard to reproduce if they require precise timing and multiple processors.
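
As a rough illustration of the timing point: a lock order reversal exists
in the code on every run, but it only becomes a hang when two threads
interleave at the wrong moment.  The sketch below is a toy Python model,
loosely in the spirit of the order checking that WITNESS does in the
kernel; OrderChecker and the lock names "proto" and "socket" are
hypothetical and chosen only for illustration.  It records which locks
were held when another was acquired and flags a reversal even though the
two "threads" here run one after the other and never actually deadlock.

```python
class OrderChecker:
    """Toy lock-order recorder; flags a reversal without needing a real deadlock."""

    def __init__(self):
        self.before = set()    # (a, b): lock a was held when lock b was acquired
        self.held = {}         # thread name -> list of locks currently held
        self.reversals = []    # (held, wanted) pairs that contradict an earlier order

    def acquire(self, thread, lock):
        stack = self.held.setdefault(thread, [])
        for held_lock in stack:
            # If the opposite order was seen before, this is a reversal.
            if (lock, held_lock) in self.before:
                self.reversals.append((held_lock, lock))
            self.before.add((held_lock, lock))
        stack.append(lock)

    def release(self, thread, lock):
        self.held[thread].remove(lock)


def demo():
    checker = OrderChecker()
    # Thread A: protocol lock first, then socket lock.
    checker.acquire("A", "proto")
    checker.acquire("A", "socket")
    checker.release("A", "socket")
    checker.release("A", "proto")
    # Thread B: the opposite order, as a recursive call into the stack might produce.
    checker.acquire("B", "socket")
    checker.acquire("B", "proto")
    return checker.reversals


if __name__ == "__main__":
    print(demo())
```

Neither sequence blocks when run back to back like this; the hang needs
both threads to hold their first lock at the same instant, which is the
kind of interleaving that may take a long time to show up.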

>> Is it possible to configure dummynet out of your configuration, and see
>> if the problem goes away?
>
> I'm running a test right now, will let you know in the morning.

Thanks.

Robert N M Watson


