Date:      Sun, 3 Nov 2013 15:24:28 -0500
From:      Diane Bruce <db@db.net>
To:        Ian Lepore <ian@FreeBSD.org>
Cc:        Tim Kientzle <tim@kientzle.com>, freebsd-arm@FreeBSD.org, Jason Evans <jasone@FreeBSD.org>, Howard Su <howard0su@gmail.com>
Subject:   Re: sshd crash
Message-ID:  <20131103202428.GB61596@night.db.net>
In-Reply-To: <1383501978.31172.127.camel@revolution.hippie.lan>
References:  <1383313834.31172.65.camel@revolution.hippie.lan> <CAHNYxxMMF_GJv10drYuQFO%2Bav%2BTdp8OBvJfFZObEZ=tgaBovSA@mail.gmail.com> <1383328423.31172.92.camel@revolution.hippie.lan> <CAHNYxxNiuKP8wfTaZuL%2BBXiLcYA9eU3LBb-659ZBYr-WBSmZeQ@mail.gmail.com> <1383343354.31172.102.camel@revolution.hippie.lan> <EB18203F-C516-4917-9AA4-DBA6E66DAAB6@kientzle.com> <1383399220.31172.116.camel@revolution.hippie.lan> <20131102153953.GA39106@night.db.net> <2F2E1775-A459-4D0F-A464-F41B8A7EAB9B@freebsd.org> <1383501978.31172.127.camel@revolution.hippie.lan>

On Sun, Nov 03, 2013 at 11:06:18AM -0700, Ian Lepore wrote:
> On Sun, 2013-11-03 at 08:51 -0800, Jason Evans wrote:
> > On Nov 2, 2013, at 8:39 AM, Diane Bruce <db@db.net> wrote:
> > > On Sat, Nov 02, 2013 at 07:33:40AM -0600, Ian Lepore wrote:
> > >> 
> > >> I'm not sure it's a mundane stray-write either.  The routine that's
> > >> asserting is checking to see if the contents of a page are all-zero
> > >> because a jemalloc internal flag is set that says it should be.  I had
> > >> the routine print the non-zero data it found, and it looks like this:
> > >> 
> > >> not-zero at 0 0x20c99000 = 0x20800a00
> > >> not-zero at 1 0x20c99004 = 0x00000001
> > >> not-zero at 2 0x20c99008 = 0x0000002f
> > >> not-zero at 3 0x20c9900c = 0xffffffff
> > >> not-zero at 4 0x20c99010 = 0x00007fff
> > >> not-zero at 5 0x20c99014 = 0x00000003
> > >> not-zero at 96 0x20c99180 = 0x5a5a5a5a
> > >> not-zero at 97 0x20c99184 = 0x5a5a5a5a
> > >> not-zero at 98 0x20c99188 = 0x5a5a5a5a
> > >> 
> > >> The 0x5a continues to the end of the page.  So jemalloc has metadata
> > >> that says it thinks the page is all-zeroes, and the page is a mix of
> > >> data and some zeroes and the 5a junk-fill byte.  It seems more like the
> > >> metadata is in error somehow.  (Maybe a stray write hit the metadata.)
> > 
> > This looks to me like the sort of thing that would happen if the chunk page map were corrupted.  This could happen due to a double free, freeing an interior pointer of a multi-page allocation, or a variety of more complicated errors.  The page is filled with 0x5a bytes, yet jemalloc thinks the page should contain 0x00 bytes, and that implies that the chunk page table claims this is the first use of the page since it was mapped.
> > 
> > Does this problem reproduce on amd64?  If so, I'll dig in and figure out if jemalloc is to blame.  If not on amd64, given enough hand holding re: hardware acquisition and configuration I can probably be convinced to set up an ARM system.
> > 

That's what has us confounded. It's 100% repeatable but has not
been seen on amd64.


> 
> FWIW, I noticed when re-examining that data yesterday that the 0x5a
> doesn't continue to the end of the page, it continues until word 328,
> then the rest of the page is zeroes.  I assume that's still consistent
> with a double-free and other such usage errors.

That's inconsistent with what I remember seeing here. I will
look at my dump as well. The entire page was wrong.
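
To settle whether the junk fill covers the whole page or stops partway,
a small helper could summarize a dumped page as runs of zero, junk, and
other words. This is only a sketch: classify_word and summarize_page are
hypothetical names, not jemalloc functions, and it assumes 32-bit words
and the 0x5a5a5a5a fill value seen in the dump above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not part of jemalloc: classify one 32-bit word. */
enum word_kind { WORD_ZERO, WORD_JUNK, WORD_OTHER };

enum word_kind classify_word(uint32_t w, uint32_t junk)
{
    if (w == 0)
        return WORD_ZERO;
    if (w == junk)
        return WORD_JUNK;
    return WORD_OTHER;
}

/*
 * Walk a page of nwords 32-bit words and report each run of words that
 * share a kind (zero, junk, other). Prints one line per run to 'out'
 * when 'out' is non-NULL and returns the number of runs found.
 */
size_t summarize_page(const uint32_t *page, size_t nwords, uint32_t junk,
    FILE *out)
{
    static const char *const names[] = { "zero", "junk", "other" };
    size_t runs = 0, start = 0;

    assert(page != NULL);
    for (size_t i = 1; i <= nwords; i++) {
        /* Close the current run at the end of the page or on a change. */
        if (i == nwords ||
            classify_word(page[i], junk) != classify_word(page[start], junk)) {
            if (out != NULL)
                fprintf(out, "words %zu-%zu: %s\n", start, i - 1,
                    names[classify_word(page[start], junk)]);
            runs++;
            start = i;
        }
    }
    return runs;
}
```

Called as summarize_page(page, 1024, 0x5a5a5a5a, stderr) on a 4 KB page,
it would show at a glance where the junk run ends, which is the point on
which the two dumps seem to disagree.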

What I have been doing is replacing your various memfills with
a different pattern than 0xa5 in the hope of catching who is doing
what. I'll dig up my various diffs and dumps and ship them off to
you if you wish. The path I saw showed the pattern came from
something tcache did but I could not seem to turn off tcache
using ln -s "tcache:false" /etc/malloc.conf

> 
> An interesting part of this problem is that the changeset that
> introduced this problem is the one that makes the malloc-related symbols
> in libc weak references to the jemalloc implementation.  Diane sees some
> evidence in gdb that there is a non-jemalloc implementation of malloc
> present in the process.  I wonder if we've got something like a mix of
> statically and dynamically linked code and thus two mallocs somehow?


I can confirm this. An older libc.so cures the problem on ARM.

The version of malloc seen is not jemalloc in this case.
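
One way to check from inside the process which malloc was actually
resolved is to ask the dynamic linker which object provides the symbol.
A minimal sketch, assuming dladdr() is available (it is on FreeBSD and
glibc); symbol_provider, malloc_provider, free_provider and
report_malloc_providers are hypothetical helpers made up for this
example. Treat the output as a hint only: in a non-PIC executable the
address of malloc may resolve to the executable's own PLT entry.

```c
#define _GNU_SOURCE   /* glibc only declares dladdr() when this is set */
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return the path of the object that contains
 * 'addr', or "(unknown)" if the dynamic linker cannot tell.
 */
const char *symbol_provider(void *addr)
{
    Dl_info info;

    if (addr != NULL && dladdr(addr, &info) != 0 && info.dli_fname != NULL)
        return info.dli_fname;
    return "(unknown)";
}

/* Which object does this process's malloc / free come from? */
const char *malloc_provider(void) { return symbol_provider((void *)malloc); }
const char *free_provider(void)   { return symbol_provider((void *)free); }

/* Print both providers so a mismatch between them stands out. */
void report_malloc_providers(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "malloc from: %s\n", malloc_provider());
    fprintf(out, "free   from: %s\n", free_provider());
}
```

If malloc and free were reported as coming from different objects, that
would support the two-mallocs theory; identical results would not rule
it out, since a statically linked copy elsewhere would not show up here.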

> 
> Would allocating a block from one malloc implementation then freeing it
> to the other be consistent with that asserted data above?
> 
> I think if this happened on x86 we'd be hearing from a LOT of folks
> about it.  I wonder if it reproduces in an arm emulation environment?  I
> don't know anything about using emulation, but others here do.

Agreed.

Ian and I just discussed this on IRC.  It would be great if this
bug also reproduces under emulation. Otherwise, we will get you
hardware and help.



> 
> -- Ian
> 
> 

- Diane
-- 
- db@FreeBSD.org db@db.net http://www.db.net/~db


