Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 20 Jun 2010 15:15:54 +0200
From:      Fabian Keil <freebsd-listen@fabiankeil.de>
To:        Lawrence Stewart <lstewart@freebsd.org>
Cc:        freebsd-current@freebsd.org
Subject:   Re: [CFT] SIFTR - Statistical Information For TCP Research: Uncle Lawrence needs YOU!
Message-ID:  <20100620151554.58e486db@r500.local>
In-Reply-To: <4C1E0E14.3090506@freebsd.org>
References:  <4C1492D0.6020704@freebsd.org> <4C1C3922.2050102@freebsd.org> <20100619195823.53a7baaa@r500.local> <4C1DED16.8020209@freebsd.org> <20100620131544.495ddecd@r500.local> <4C1E019F.6060802@freebsd.org> <20100620142841.4803dac3@r500.local> <4C1E0E14.3090506@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
--Sig_/KOfakMhdKJkZMtuiyHRGJ_N
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

Lawrence Stewart <lstewart@freebsd.org> wrote:

> On 06/20/10 22:28, Fabian Keil wrote:
> > Lawrence Stewart<lstewart@freebsd.org>  wrote:
> >
> >> On 06/20/10 21:15, Fabian Keil wrote:
> >>> Lawrence Stewart<lstewart@freebsd.org>   wrote:
> >>>
> >>>> On 06/20/10 03:58, Fabian Keil wrote:
> >>>>> Lawrence Stewart<lstewart@freebsd.org>    wrote:
> >>>>>
> >>>>>> On 06/13/10 18:12, Lawrence Stewart wrote:
> >>>>>
> >>>>>>> The time has come to solicit some external testing for my SIFTR t=
ool.
> >>>>>>> I'm hoping to commit it within a week or so unless problems are d=
iscovered.
> >>>>>
> >>>>>>> I'm interested in all feedback and reports of success/failure, al=
ong
> >>>>>>> with details of the architecture tested and number of CPUs if you=
 would
> >>>>>>> be so kind.
> >>>>>
> >>>>> I got the following hand-transcribed panic maybe a second after
> >>>>> sysctl net.inet.siftr.enabled=3D1
> >>>>>
> >>>>> Fatal trap 12: page fault while in kernel mode
> >>>>> cpuid =3D 1; apic id =3D 01
> >>>>> [...]
> >>>>> current process =3D 12 (swi4: clock)
> >>>>> [ thread pid 12 tid 100006 ]
> >>>>> Stopped at	siftr_chkpkt+0xd0:	addq	$0x1,0x8(%r14)
> >>>>> db>    where
> >>>>> Tracing pid 12 tid 100006 td 0xffffff00034037e0
> >>>>> siftr_chkpt() at siftr_chkpkt+0xd0
> >>>>> pfil_run_hooks() at pfil_run_hooks+0xb4
> >>>>> ip_output() at ip_output+0x382
> >>>>> tcp_output() tcp_output+0xa41
> >>>>> tcp_timer_rexmt() at tcp_timer_rexmt+0x251
> >>>>> softclock() at softclock+0x291
> >>>>> intr_event_execute_handlers() at intr_event_execute_handlers+0x66
> >>>>> ithread_loop at ithread_loop+0x8e
> >>>>> fork_exit() at fork_exit+0x112
> >>>>> fork_trampoline() at fork_trampoline+0xe
> >>>>> --- trap 0, rip =3D 0, rsp =3D 0xffffff800003ad30, rbp =3D 0 ---
> >>>>
> >>>> So I've tracked down the line of code where the page fault is occurr=
ing:
> >>>>
> >>>>            if (dir =3D=3D PFIL_IN)
> >>>>                    ss->n_in++;
> >>>>            else
> >>>>                    ss->n_out++;
> >>>>
> >>>> ss is a DPCPU (dynamic per-cpu) variable used to keep a set of stats
> >>>> per-cpu and is initialised at the start of the function like so:
> >>>>
> >>>>            ss =3D DPCPU_PTR(ss);
> >>>>
> >>>> So for ss to be NULL, that implies DPCPU_PTR() is returning NULL on =
your
> >>>> machine. I know very little about the inner workings of the DPCPU_*
> >>>> macros, but I'm pretty sure the way I use them in SIFTR is correct o=
r at
> >>>> least as intended.
> >>>
> >>> siftr_chkpkt() passes ss to siftr_chkreinject() before dereferencing
> >>> it itself. I think if ss was NULL, the panic should already occur in
> >>> siftr_chkreinject().
> >>
> >> Yes but siftr_chkreinject() only dereferences ss in the exceptional ca=
se
> >> of a malloc failure or duplicate pkt. It's unlikely either case happens
> >> for you and so wouldn't trigger the panic.
> >>
> >>> To be sure I added:
> >>>
> >>> diff --git a/sys/netinet/siftr.c b/sys/netinet/siftr.c
> >>> index 8bc3498..b9fdfe4 100644
> >>> --- a/sys/netinet/siftr.c
> >>> +++ b/sys/netinet/siftr.c
> >>> @@ -788,6 +788,16 @@ siftr_chkpkt(void *arg, struct mbuf **m, struct =
ifnet *ifp, int dir,
> >>>           if (siftr_chkreinject(*m, dir, ss))
> >>>                   goto ret;
> >>>
> >>> +       if (ss =3D=3D NULL) {
> >>> +           printf("ss is NULL");
> >>> +           ss =3D DPCPU_PTR(ss);
> >>> +           if (ss =3D=3D NULL) {
> >>> +              printf("ss is still NULL");
> >>> +              goto ret;
> >>> +           }
> >>> +        }
> >>> +
> >>> +
> >>>           if (dir =3D=3D PFIL_IN)
> >>>                   ss->n_in++;
> >>>           else
> >>>
> >>> which doesn't seem to affect the problem.
> >>
> >> As in it still panics and the "ss is NULL" message is not printed? I
> >> would have expected to at least see "ss is NULL" printed if my
> >> hypothesis was correct... hmm.
> >
> > Yes, it still panics, but no message is printed.
>=20
> It was just pointed out to me that ss doesn't have to be NULL in order=20
> to cause the page fault (duh). It could also just be a garbage ptr which=
=20
> is why your print statement isn't firing.
>=20
> Can you trigger the panic again and look for some information along the=20
> lines of "fault virtual address =3D ..." as part of the panic info.=20
> Knowing the faulting address would be useful and may help further diagnos=
is.

Fatal trap 12: page fault while in kernel mode
cpuid =3D 1; apic id =3D 01
fault virtual address	=3D 0xffffff7f808f9de8
fault code		=3D supervisor write data, page not present
instruction pointer	=3D 0x20:0xffffffff8241f800
stack pointer		=3D 0x28:0xffffff800003a7d0
frame pointer		=3D 0x28:0xffffff800003a840
code segment		=3D base 0x0, limit 0xfffff, type 0x1b
			=3D DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	=3D interrupt enabled, resume, IOPL =3D 0
current process =3D 12 (swi4: clock)
[ thread pid 12 tid 100006 ]
Stopped at	siftr_chkpkt+0xd0:	addq	$0x1,0x8(%r14)
db> where =20
Tracing pid 12 tid 100006 td 0xffffff00034037e0
siftr_chkpt() at siftr_chkpkt+0xd0
pfil_run_hooks() at pfil_run_hooks+0xb4
ip_output() at ip_output+0x382
tcp_output() tcp_output+0xa41
tcp_timer_rexmt() at tcp_timer_rexmt+0x251
softclock() at softclock+0x291
intr_event_execute_handlers() at intr_event_execute_handlers+0x66
ithread_loop at ithread_loop+0x8e
fork_exit() at fork_exit+0x112
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip =3D 0, rsp =3D 0xffffff800003ad30, rbp =3D 0 ---

> >>>> Could you please go ahead and retest using a GENERIC kernel and see =
if
> >>>> you can reproduce? There could be something in your custom kernel
> >>>> causing the offsets or linker set magic used by the DPCPU bits to br=
eak
> >>>> which in turn is triggering this panic in SIFTR.
> >>>
> >>> I'll retry without pf first, and with GENERIC afterwards.
> >>
> >> Sounds good, thanks.
> >
> > Taking pf (and altq) out of the picture doesn't seem to make
> > a difference.
>=20
> Wouldn't have expected it to. Will be very curious to know if the panic=20
> is triggered in GENERIC.

It's not. I, too, get pfil.c related LORs though:

lock order reversal:
 1st 0xffffffff80e5c568 PFil hook read/write mutex (PFil hook read/write mu=
tex) @ /usr/src/sys/net/pfil.c:77
 2nd 0xffffffff80e5dd68 udp (udp) @ /usr/src/sys/modules/pf/../../contrib/p=
f/net/pf.c:3035
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x81e
_rw_rlock() at _rw_rlock+0x5f
pf_socket_lookup() at pf_socket_lookup+0x1c5
pf_test_udp() at pf_test_udp+0x8b0
pf_test() at pf_test+0x1089
pf_check_in() at pf_check_in+0x39
pfil_run_hooks() at pfil_run_hooks+0xcf
ip_input() at ip_input+0x2ae
swi_net() at swi_net+0x151
intr_event_execute_handlers() at intr_event_execute_handlers+0x66
ithread_loop() at ithread_loop+0xb2
fork_exit() at fork_exit+0x12a
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip =3D 0, rsp =3D 0xffffff8000044d30, rbp =3D 0 ---
lock order reversal:
 1st 0xffffffff80e5c568 PFil hook read/write mutex (PFil hook read/write mu=
tex) @ /usr/src/sys/net/pfil.c:77
 2nd 0xffffffff80e5d788 tcp (tcp) @ /usr/src/sys/modules/siftr/../../netine=
t/siftr.c:698
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x81e
_rw_rlock() at _rw_rlock+0x5f
siftr_chkpkt() at siftr_chkpkt+0x3c4
pfil_run_hooks() at pfil_run_hooks+0xcf
ip_input() at ip_input+0x2ae
swi_net() at swi_net+0x151
intr_event_execute_handlers() at intr_event_execute_handlers+0x66
ithread_loop() at ithread_loop+0xb2
fork_exit() at fork_exit+0x12a
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip =3D 0, rsp =3D 0xffffff8000044d30, rbp =3D 0 ---

My custom kernel normally doesn't have INVARIANTS and WITNESS
enabled, so I'll try to enable them next.

Fabian

--Sig_/KOfakMhdKJkZMtuiyHRGJ_N
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (FreeBSD)

iEYEARECAAYFAkweFI8ACgkQBYqIVf93VJ2cagCfSbFY5gfh6GeDW3kKiCQBv3+n
OesAn33UJRR0XanMCLSSddSzKN8Tg6v4
=DAmj
-----END PGP SIGNATURE-----

--Sig_/KOfakMhdKJkZMtuiyHRGJ_N--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20100620151554.58e486db>