Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 20 Jun 2010 14:28:41 +0200
From:      Fabian Keil <freebsd-listen@fabiankeil.de>
To:        Lawrence Stewart <lstewart@freebsd.org>
Cc:        freebsd-current@freebsd.org
Subject:   Re: [CFT] SIFTR - Statistical Information For TCP Research: Uncle Lawrence needs YOU!
Message-ID:  <20100620142841.4803dac3@r500.local>
In-Reply-To: <4C1E019F.6060802@freebsd.org>
References:  <4C1492D0.6020704@freebsd.org> <4C1C3922.2050102@freebsd.org> <20100619195823.53a7baaa@r500.local> <4C1DED16.8020209@freebsd.org> <20100620131544.495ddecd@r500.local> <4C1E019F.6060802@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
--Sig_/Y6Nt./LTWzcbybArOHJ5VM7
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

Lawrence Stewart <lstewart@freebsd.org> wrote:

> On 06/20/10 21:15, Fabian Keil wrote:
> > Lawrence Stewart<lstewart@freebsd.org>  wrote:
> >
> >> On 06/20/10 03:58, Fabian Keil wrote:
> >>> Lawrence Stewart<lstewart@freebsd.org>   wrote:
> >>>
> >>>> On 06/13/10 18:12, Lawrence Stewart wrote:
> >>>
> >>>>> The time has come to solicit some external testing for my SIFTR too=
l.
> >>>>> I'm hoping to commit it within a week or so unless problems are dis=
covered.
> >>>
> >>>>> I'm interested in all feedback and reports of success/failure, along
> >>>>> with details of the architecture tested and number of CPUs if you w=
ould
> >>>>> be so kind.
> >>>
> >>> I got the following hand-transcribed panic maybe a second after
> >>> sysctl net.inet.siftr.enabled=3D1
> >>>
> >>> Fatal trap 12: page fault while in kernel mode
> >>> cpuid =3D 1; apic id =3D 01
> >>> [...]
> >>> current process =3D 12 (swi4: clock)
> >>> [ thread pid 12 tid 100006 ]
> >>> Stopped at	siftr_chkpkt+0xd0:	addq	$0x1,0x8(%r14)
> >>> db>   where
> >>> Tracing pid 12 tid 100006 td 0xffffff00034037e0
> >>> siftr_chkpt() at siftr_chkpkt+0xd0
> >>> pfil_run_hooks() at pfil_run_hooks+0xb4
> >>> ip_output() at ip_output+0x382
> >>> tcp_output() tcp_output+0xa41
> >>> tcp_timer_rexmt() at tcp_timer_rexmt+0x251
> >>> softclock() at softclock+0x291
> >>> intr_event_execute_handlers() at intr_event_execute_handlers+0x66
> >>> ithread_loop at ithread_loop+0x8e
> >>> fork_exit() at fork_exit+0x112
> >>> fork_trampoline() at fork_trampoline+0xe
> >>> --- trap 0, rip =3D 0, rsp =3D 0xffffff800003ad30, rbp =3D 0 ---
> >>
> >> So I've tracked down the line of code where the page fault is occurrin=
g:
> >>
> >>           if (dir =3D=3D PFIL_IN)
> >>                   ss->n_in++;
> >>           else
> >>                   ss->n_out++;
> >>
> >> ss is a DPCPU (dynamic per-cpu) variable used to keep a set of stats
> >> per-cpu and is initialised at the start of the function like so:
> >>
> >>           ss =3D DPCPU_PTR(ss);
> >>
> >> So for ss to be NULL, that implies DPCPU_PTR() is returning NULL on yo=
ur
> >> machine. I know very little about the inner workings of the DPCPU_*
> >> macros, but I'm pretty sure the way I use them in SIFTR is correct or =
at
> >> least as intended.
> >
> > siftr_chkpkt() passes ss to siftr_chkreinject() before dereferencing
> > it itself. I think if ss was NULL, the panic should already occur in
> > siftr_chkreinject().
>=20
> Yes but siftr_chkreinject() only dereferences ss in the exceptional case=
=20
> of a malloc failure or duplicate pkt. It's unlikely either case happens=20
> for you and so wouldn't trigger the panic.
>=20
> > To be sure I added:
> >
> > diff --git a/sys/netinet/siftr.c b/sys/netinet/siftr.c
> > index 8bc3498..b9fdfe4 100644
> > --- a/sys/netinet/siftr.c
> > +++ b/sys/netinet/siftr.c
> > @@ -788,6 +788,16 @@ siftr_chkpkt(void *arg, struct mbuf **m, struct if=
net *ifp, int dir,
> >          if (siftr_chkreinject(*m, dir, ss))
> >                  goto ret;
> >
> > +       if (ss =3D=3D NULL) {
> > +           printf("ss is NULL");
> > +           ss =3D DPCPU_PTR(ss);
> > +           if (ss =3D=3D NULL) {
> > +              printf("ss is still NULL");
> > +              goto ret;
> > +           }
> > +        }
> > +
> > +
> >          if (dir =3D=3D PFIL_IN)
> >                  ss->n_in++;
> >          else
> >
> > which doesn't seem to affect the problem.
>=20
> As in it still panics and the "ss is NULL" message is not printed? I=20
> would have expected to at least see "ss is NULL" printed if my=20
> hypothesis was correct... hmm.

Yes, it still panics, but no message is printed.

> Perhaps the way I discovered the line number at which the panic occurred=
=20
> was wrong. I compiled SIFTR on my amd64 dev server with "CFLAGS+=3D-g" in=
=20
> the SIFTR Makefile to get debug symbols, ran "objdump -Sd siftr.ko | vim=
=20
> -", searched for the instruction reported in the panic message i.e.=20
> "addq $0x1,0x8(%r14)" and then with a bit of trial and error, recompiled=
=20
> SIFTR with the line of code "volatile int blah =3D 0; blah =3D 2;" at=20
> various points in the function and looking at the change in the objdump=20
> output to pinpoint which line of C code corresponded with the "addq"=20
> instruction.
>=20
> The "volatile int blah =3D 0; blah =3D 2;" compiles to "movl=20
> $0x0,0xffffffffffffffd4(%rbp)" followed immediately by "movl=20
> $0x2,0xffffffffffffffd4(%rbp)". When I put that code above the "if (dir=20
> =3D=3D PFIL_IN)" statement I see the objdump output show the assembly cod=
e=20
> before the "addq" instruction and when I move it after the if statement=20
> the assembly code moves after the "addq" instruction.

That's a neat trick.
=20
> Perhaps you could reproduce the above procedure and see if you identify=20
> the same point in the siftr_chkpkt function I did for the instruction=20
> referenced by the panic message?

I do. Using:

diff --git a/sys/netinet/siftr.c b/sys/netinet/siftr.c
index b9fdfe4..fc6bd9a 100644
--- a/sys/netinet/siftr.c
+++ b/sys/netinet/siftr.c
@@ -797,12 +797,15 @@ siftr_chkpkt(void *arg, struct mbuf **m, struct ifnet=
 *ifp, int dir,
            }
         }
=20
+        volatile int blah =3D 0; blah =3D 2;
=20
        if (dir =3D=3D PFIL_IN)
                ss->n_in++;
        else
                ss->n_out++;
=20
+        volatile int foo =3D 0; foo =3D 3;
+
        /*
         * Create a tcphdr struct starting at the correct offset
         * in the IP packet. ip->ip_hl gives the ip header length

I get:

     803:       c7 45 d4 00 00 00 00    movl   $0x0,0xffffffffffffffd4(%rbp)
     80a:       c7 45 d4 02 00 00 00    movl   $0x2,0xffffffffffffffd4(%rbp)
     811:       0f 84 8a 02 00 00       je     aa1 <siftr_chkpkt+0x371>
     817:       49 83 46 08 01          addq   $0x1,0x8(%r14)
     81c:       c7 45 d0 00 00 00 00    movl   $0x0,0xffffffffffffffd0(%rbp)
     823:       c7 45 d0 03 00 00 00    movl   $0x3,0xffffffffffffffd0(%rbp)

> >> Could you please go ahead and retest using a GENERIC kernel and see if
> >> you can reproduce? There could be something in your custom kernel
> >> causing the offsets or linker set magic used by the DPCPU bits to break
> >> which in turn is triggering this panic in SIFTR.
> >
> > I'll retry without pf first, and with GENERIC afterwards.
>=20
> Sounds good, thanks.

Taking pf (and altq) out of the picture doesn't seem to make
a difference.

Fabian

--Sig_/Y6Nt./LTWzcbybArOHJ5VM7
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (FreeBSD)

iEYEARECAAYFAkweCXsACgkQBYqIVf93VJ1qcwCeMBy8mvXtUBqp3BUkD9K8fK6x
9GYAn3enB+kNR4h0jO8BS8YEYJA2z/dD
=ifET
-----END PGP SIGNATURE-----

--Sig_/Y6Nt./LTWzcbybArOHJ5VM7--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20100620142841.4803dac3>