Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 21 Jun 2014 17:01:29 +0100
From:      Andrew Turner <andrew@fubar.geek.nz>
To:        John-Mark Gurney <jmg@funkthat.com>
Cc:        arm@FreeBSD.org
Subject:   Re: AVILA getting close!
Message-ID:  <20140621170129.76e62c27@bender.Home>
In-Reply-To: <20140621070827.GJ31367@funkthat.com>
References:  <20140618225808.GG31367@funkthat.com> <20140620151023.GZ31367@funkthat.com> <20140620200827.1c33c7da@bender.Home> <20140621010804.GD31367@funkthat.com> <20140621070827.GJ31367@funkthat.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 21 Jun 2014 00:08:28 -0700
John-Mark Gurney <jmg@funkthat.com> wrote:

> John-Mark Gurney wrote this message on Fri, Jun 20, 2014 at 18:08
> -0700:
> > Andrew Turner wrote this message on Fri, Jun 20, 2014 at 20:08
> > +0100:
> > > On Fri, 20 Jun 2014 08:10:24 -0700
> > > John-Mark Gurney <jmg@funkthat.com> wrote:
> > > 
> > > > John-Mark Gurney wrote this message on Wed, Jun 18, 2014 at
> > > > 15:58 -0700:
> > > > > So, w/ the recent couple of patches that alc has provided, I
> > > > > no longer receive kernel panics on my AVILA board!
> > > > > 
> > > > > $ uname -a
> > > > > FreeBSD avila.funkthat.com 11.0-CURRENT FreeBSD 11.0-CURRENT
> > > > > #27 r267333:267349M: Wed Jun 11 09:57:58 PDT 2014
> > > > > jmg@carbon.funkthat.com:/usr/obj/arm.armeb/usr/src.avila/sys/AVILA
> > > > > arm $ uptime 12:15AM  up 1 day, 15 mins, 2 users, load
> > > > > averages: 0.13, 0.11, 0.08
> > > > > 
> > > > > This survived a portsnap extract...  This is all over NFS...
> > > > > 
> > > > > Though the issue that I'm now having is that some binaries
> > > > > (newsyslog) and sometimes other binaries (awk, grep) core
> > > > > dump...
> > > > > 
> > > > > I believe this is an issue w/ rtld, or related...  If I
> > > > > compile newsyslog -static, it works fine...  Otherwise I get
> > > > > a SIGILL, and that is because it jumps off into the weeds..
> > > > > Though gdb on arm isn't very useful..
> > > > 
> > > > ok, so the SIGILL only occures under gdb, and this is because
> > > > single stepping into a RAS sequence doesn't work very well...
> > > > If you set a break point on the return (after the RAS
> > > > sequence), you can get past this...
> > > > 
> > > > I got to the point in rtld.c code:
> > > >     if (obj->pltrel)
> > > >         rel = (const Elf_Rel *) ((caddr_t) obj->pltrel +
> > > > reloff); else
> > > >         rel = (const Elf_Rel *) ((caddr_t) obj->pltrela +
> > > > reloffand was seeing gdb try to execute the pltrela line, but:
> > > > i;
> > > > 
> > > > and was seeing gdb try to execute the pltrela line, but:
> > > > (gdb) print * (const Elf_Rel *) ((caddr_t) obj->pltrela +
> > > > reloff) Error accessing memory address 0x118: Bad address.
> > > > (gdb) print/x obj->pltrela
> > > > $4 = 0x0
> > > > (gdb) print /x reloff
> > > > $5 = 0x118
> > > > (gdb) print obj->pltrel
> > > > $6 = (const Elf_Rel *) 0x94e8
> > > Based on my copy of newsyslog I built for armeb this looks
> > > correct. To verify it could you dump the .dynamic section from
> > > the binary? Something like 'objdump -s newsyslog' will get it.
> > 
> > ok, available at:
> > https://www.funkthat.com/~jmg/20140619/objdump.newsyslog
> > 
> > > > Hun? obj->pltrel is non-zero, so it should have executed the
> > > > other line...
> > > > 
> > > > I recompiled rtld w/ -O0, and sure enough, newsyslog runs
> > > > fine... If I compile w/o -O, or w/ -O1, it fails...
> > > > 
> > > > Comments or suggestions?
> > > 
> > > What is the value of rel after the if statement? In the -O/-O1
> > > case the asm looks like:
> > > 
> > > ldr     r2, [sp, #20]  ; Load obj to r2
> > > ldr     r3, [r2, #124] ; Load obj->pltrel to r3
> > > cmp     r3, #0  ; 0x0  ; if obj->pltrel:
> > > ldrne   r2, [sp, #16]  ;  != NULL: Load reloff to r2
> > > addne   r4, r3, r2     ;  != NULL: Add obj->pltrel + reloff to r4
> > > ldreq   r2, [sp, #20]  ;  == NULL: Load obj to r2
> > > ldreq   r3, [r2, #132] ;  == NULL: Load obj->pltrela to r3
> > > ldreq   r2, [sp, #16]  ;  == NULL: Load reloff to r2
> > > addeq   r4, r2, r3     ;  == NULL: Add obj->pltrela + reloff to r4
> > > 
> > > Given this I could see how gdb gets confused.
> > > 
> > > It may also pay to get the registers from gdb at this point.
> > 
> > Arg! This is frustrating, I'm getting such different behaviors from
> > time to time..  now it isn't having that fault..  but it's getting
> > farther, but...  I this is because our in tree gdb is messed up..
> > 
> > But, I am getting farther...  now the last break at rtld.c:3651
> > looks like it's returning a bogus pointer:
> > (gdb) print *req
> > $12 = {name = 0x9190 "__aeabi_read_tp", hash = 0xf008a80, 
> >   hash_gnu = 0x52432dd3, ventry = 0x2003b1d0, flags = 0x1, 
> >   defobj_out = 0x2003c400, sym_out = 0x20062454, lockstate =
> > 0xbfffeda0}
> > 
> > defobj_out looks bogus to me...  We don't have any object mapped
> > there:
> > (gdb) info shared  
> > >From        To          Syms Read   Shared Object Library
> > 0x200427c8  0x20048814  Yes         /lib/libgcc_s.so.1
> > 0x2007a4e8  0x2017f320  Yes         /lib/libc.so.7
> > 0x20018f14  0x2002c99c  Yes         /libexec/ld-elf.so.1
> > 
> > the data at 0x2003c400 doesn't look like code:
> > (gdb) x/32x 0x2003c400
> > 0x2003c400:     0xd550b87a      0x00000001      0x00000000
> > 0x2003a080 0x2003c410:     0x00000000      0x00000001
> > 0x00000000      0x20051000 0x2003c420:     0x0016a000
> > 0x00143000      0x00000000      0x20051000 0x2003c430:
> > 0x2019dfd0      0x2007a4e8      0x20051034      0x000000a0
> > 0x2003c440:     0x00000000      0x00000007      0x00000002
> > 0x2019bcf0 0x2003c450:     0x00000004      0x00000058
> > 0x00000000      0x00000008 0x2003c460:     0x20051000
> > 0x00000000      0x2019e0b8      0x200713e0 0x2003c470:
> > 0x000040c0      0x00000000      0x00000000      0x200754a0
> > 
> > and then as I stepi out of symlook_global:
> > (gdb) x/6i $pc
> > 0x2001f0b4 <symlook_global+348>:        cmp     r0, #0  ; 0x0
> > 0x2001f0b8 <symlook_global+352>:        moveq   r0, #3  ; 0x3
> > 0x2001f0bc <symlook_global+356>:        movne   r0, #0  ; 0x0
> > 0x2001f0c0 <symlook_global+360>:        add     sp, sp, #36     ;
> > 0x24 0x2001f0c4 <symlook_global+364>:        pop     {r4, r5, r6,
> > r7, lr} 0x2001f0c8 <symlook_global+368>:        bx      lr
> > (gdb) info registers
> > r0             0x20062454       0x20062454
> > r1             0x933b   0x933b
> > r2             0x0      0x0
> > r3             0xa4     0xa4
> > r4             0x0      0x0
> > r5             0xbfffed3c       0xbfffed3c
> > r6             0xbfffed08       0xbfffed08
> > r7             0x20037af4       0x20037af4
> > r8             0x0      0x0
> > r9             0x1      0x1
> > r10            0x8a2c   0x8a2c
> > r11            0xbfffed30       0xbfffed30
> > r12            0x23de   0x23de
> > sp             0xbfffec94       0xbfffec94
The stack is bogus here, the stack should be 8-byte aligned in non-leaf
functions.

> > lr             0x2001efb0       0x2001efb0
> > pc             0x2001f0b4       0x2001f0b4
> > fps            0x0      0x0
> > cpsr           0x60000010       0x60000010
> > 
> > Then stepi till 0x2001f0c8:
> > (gdb) info registers
> > r0             0x0      0x0
> > r1             0x933b   0x933b
> > r2             0x0      0x0
> > r3             0xa4     0xa4
> > r4             0x2003c000       0x2003c000
> > r5             0xbfffed3c       0xbfffed3c
> > r6             0x20037af4       0x20037af4
> > r7             0xbfffece8       0xbfffece8
> > r8             0x0      0x0
> > r9             0x1      0x1
> > r10            0x8a2c   0x8a2c
> > r11            0xbfffed30       0xbfffed30
> > r12            0x23de   0x23de
> > sp             0xbfffeccc       0xbfffeccc
It's still bogus suggesting someone in the call chain failed at
aligning it correctly.

> > lr             0x2003c000       0x2003c000
> > pc             0x2001f0c8       0x2001f0c8
> > fps            0x0      0x0
> > cpsr           0x20000010       0x20000010
> > 
> > and now the lr is bogus...  it transfers control to 0x2003c000 which
> > is before the fault at 0x2003c0f4...  And again, this looks like
> > data, not code:
When the stack is incorrectly aligned all bets are off, I've seen
strange errors when it's not 8-byte aligned. The compiler relies on this
assumption to load/store stack based data.

> > (gdb) x/64x 0x2003c000
> > 0x2003c000:     0xd550b87a      0x00000001      0x2003c200
> > 0xbfffffb8 0x2003c010:     0x00000000      0x00000001
> > 0x00000000      0x00008000 0x2003c020:     0x00012000
> > 0x00009000      0x00008000      0x00000000 0x2003c030:
> > 0x00018724      0x00009ca8      0x00008034      0x000000e0
> > 0x2003c040:     0x00008114      0x00000007      0x00000000
> > 0x00000000 0x2003c050:     0x00000000      0x00000000
> > 0x00000000      0x00000000 0x2003c060:     0x00000000
> > 0x00000000      0x0001881c      0x000094a0 0x2003c070:
> > 0x00000048      0x00000000      0x00000000      0x000094e8
> > 0x2003c080:     0x00000308      0x00000000      0x00000000
> > 0x0000888c 0x2003c090:     0x00008f9c      0x000003b1
> > 0x00009430      0x00000002 0x2003c0a0:     0x00000000
> > 0x00000000      0x0000934e      0x00008180 0x2003c0b0:
> > 0x00000061      0x00008304      0x00000071      0x00000061
> > 0x2003c0c0:     0x00000005      0x0000001f      0x0000000a
> > 0x00000071 0x2003c0d0:     0x000084d8      0x00008558
> > 0x000086c8      0x00000000 0x2003c0e0:     0x00000000
> > 0x2003d000      0x00000000      0x00000000 0x2003c0f0:
> > 0x00000000      0x2003c0f0      0x2003b180      0x00000007
> > 
> > If I continue to stepi from here, it will fault at f4...
> > 
> > This looks like a stack smash issue as the lr we pop off the stack
> > is incorrect..
> 
> So, forgot I could set a watchpoint on the stack address, so I did:
> (gdb) c
> Continuing.
> Watchpoint 9: *0xbfffecc8
> 
> Old value = 0x2001f584
> New value = 0x2003c000
> 0x20019960 in donelist_check (dlp=0xbfffed08, obj=0x2003c000)
>     at /usr/src.avila/libexec/rtld-elf/rtld.c:1380
> 1380            dlp->objs[dlp->num_used++] = obj;
> 
> I tracked dlp, and it's the donelist pointer, and it is allocated
> from symlook_default via donelist_init at the top of the function:
> #define donelist_init(dlp)                                      \
>     ((dlp)->objs = alloca(obj_count * sizeof (dlp)->objs[0]),   \
>     assert((dlp)->objs != NULL),                                \
>     (dlp)->num_alloc = obj_count,                               \
>     (dlp)->num_used = 0)
> 
> Hmm...  since obj_count is a global, and we are in a possibly
> threaded environment, we should probably move assigning num_alloc
> before the alloca, and then use num_alloc instead of obj_count just
> to be safe...  but taht won't be the issue here, as num_used is 0
> in my case...
> 
> I've looked at the assembly for donelist_init for both the working
> and non-working case, and besides an extra store in the good version,
> which I think is just because of the unoptimized code, things look
> the same to me...
I suspect it's more likely the working version is not making
assumptions on the stack alignment, where the broken one is.

Andrew



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20140621170129.76e62c27>