Date:      Wed, 7 Dec 2016 23:49:50 +0000
From:      Steven Hartland <killing@multiplay.co.uk>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject:   Re: Help needed to identify golang fork / memory corruption issue on FreeBSD
Message-ID:  <37d0e944-5104-3db0-2884-be6fa80bc95d@multiplay.co.uk>
In-Reply-To: <20161207143051.GX54029@kib.kiev.ua>
References:  <27e1a828-5cd9-0755-50ca-d7143e7df117@multiplay.co.uk> <20161206125919.GQ54029@kib.kiev.ua> <8b502580-4d2d-1e1f-9e05-61d46d5ac3b1@multiplay.co.uk> <20161206143532.GR54029@kib.kiev.ua> <9b40c93a-871f-bb32-668c-39bc3e31e385@multiplay.co.uk> <20161207121449.GV54029@kib.kiev.ua> <20161207143051.GX54029@kib.kiev.ua>

On 07/12/2016 14:30, Konstantin Belousov wrote:
> On Wed, Dec 07, 2016 at 02:14:49PM +0200, Konstantin Belousov wrote:
>> On Tue, Dec 06, 2016 at 08:35:04PM +0000, Steven Hartland wrote:
>>> On 06/12/2016 14:35, Konstantin Belousov wrote:
>>>> On Tue, Dec 06, 2016 at 01:53:52PM +0000, Steven Hartland wrote:
>>>>> On 06/12/2016 12:59, Konstantin Belousov wrote:
>>>>>> On Tue, Dec 06, 2016 at 12:31:47PM +0000, Steven Hartland wrote:
>>>>>>> Hi guys, I'm trying to help identify / fix an issue with golang
>>>>>>> whereby fork results in memory corruption.
>>>>>>>
>>>>>>> Details of the issue can be found here:
>>>>>>> https://github.com/golang/go/issues/15658
>>>>>>>
>>>>>>> In summary, when a fork is done in golang it has a chance of causing
>>>>>>> memory corruption in the parent, resulting in a process crash once
>>>>>>> detected.
>>>>>>>
>>>>>>> It's believed that this only affects FreeBSD.
>>>>>>>
>>>>>>> This has similarities to other reported issues such as this one which
>>>>>>> impacted perl during 10.x:
>>>>>>> https://rt.perl.org/Public/Bug/Display.html?id=122199
>>>>>> I cannot judge any similarities when all the description provided
>>>>>> is 'memory corruption'. BTW, the perl issue described, where the child
>>>>>> segfaults after the fork, is more likely to be caused by the set of
>>>>>> problems referenced in FreeBSD-EN-16:17.vm.
>>>>>>
>>>>>>> And more recently the issue with nginx on 11.x:
>>>>>>> https://lists.freebsd.org/pipermail/freebsd-stable/2016-September/085540.html
>>>>>> Which does not affect anything unless aio is used on Sandy/Ivy.
>>>>>>
>>>>>>> It's possible, some believe likely, that this is a kernel bug around
>>>>>>> fork / vm that golang stresses, but I've not been able to confirm it.
>>>>>>>
>>>>>>> I can reproduce the issue at will; it takes between 5 minutes and 1
>>>>>>> hour using 16 threads, and it definitely seems like an interaction
>>>>>>> between fork and other memory operations.
>>>>>> What arch are the kernel and the process that demonstrates the
>>>>>> behaviour? I mean i386/amd64.
>>>>> amd64
>>>> How large is the machine, how many cores, what is the physical memory size?
>> I was able to reproduce that as well, reliably, on two desktop-size
>> machines. One is SandyBridge, the same core microarchitecture as your
>> crashbox; another is Haswell. I see the error both with PCID enabled
>> and disabled on both machines (Haswell does implement INVPCID, so the
>> original aio/PCID bug never affected this microarchitecture).
>>
>> I believe this clears the PCID changes from the accusations.
>>
>>>>>>> I've tried reproducing the issue in C but also no joy (captured in the bug).
>>>>>>>
>>>>>>> For reference, I'm currently testing on 11.0-RELEASE-p3 + kib's PCID
>>>>>>> fix (#306350).
>>>>>> Switch to a HEAD kernel, for a start.
>>>>>> Show the memory map of the failed process.
>>> No sign of zeroed memory as far as I can tell.
>>>
>>> This error was caused by hitting the following validation in gc:
>>> func (list *mSpanList) remove(span *mspan) {
>>>         if span.prev == nil || span.list != list {
>>>                 println("runtime: failed MSpanList_Remove", span,
>>>                         span.prev, span.list, list)
>>>                 throw("MSpanList_Remove")
>>>         }
>>>
>>> runtime: failed MSpanList_Remove 0x80052e580 0x80052e300 0x53e9c0 0x53e9b0
>>> fatal error: MSpanList_Remove
>>>
>>> (gdb) print list
>>> $4 = (runtime.mSpanList *) 0x53e9b0 <runtime.mheap_+4944>
>>> (gdb) print span.list
>>> $5 = (runtime.mSpanList *) 0x53e9c0 <runtime.mheap_+4960>
>> The difference which triggered the exception is quite curious:
>> list is 0x53e9b0, and span.list == list + 0x10. Moreover, this is not
>> a single-bit error: the bit pattern is 1011 for 0xb and 1100 for 0xc.
>>
>> It is highly unlikely that the cause is memory corruption due to the
>> OS mis-managing pages or the TLB. Typically, you get a page or cache
>> line of complete garbage, instead of almost identical but slightly
>> modified data.
>>
>>> (gdb) print span.prev
>>> $6 = (struct runtime.mspan **) 0x80052e300
>>> (gdb) print *list
>>> $7 = {first = 0x80052e580, last = 0x8008aa180}
>>> (gdb) print *span.list
>>> $8 = {first = 0x8007ea7e0, last = 0x80052e580}
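As an aside on the +0x10 offset noted above: per the gdb output an
mSpanList is just {first, last}, i.e. two pointers or 0x10 bytes on amd64,
and in the runtime the nonempty and empty span lists of an mcentral are
adjacent fields. So list and span.list look like two neighbouring, valid
lists rather than a pointer into garbage. A quick layout sketch, using
simplified stand-in types rather than the real runtime definitions:

package main

import (
    "fmt"
    "unsafe"
)

// Simplified stand-ins for the runtime types; the layout is an assumption.
type mspan struct{}

type mSpanList struct {
    first *mspan
    last  *mspan
}

type mcentral struct {
    lock      uintptr // stand-in for runtime.mutex
    sizeclass int32
    _         int32 // padding
    nonempty  mSpanList
    empty     mSpanList
}

func main() {
    var c mcentral
    base := uintptr(unsafe.Pointer(&c))
    fmt.Printf("sizeof(mSpanList) = %#x\n", unsafe.Sizeof(c.nonempty)) // 0x10
    fmt.Printf("nonempty at +%#x, empty at +%#x\n",
        uintptr(unsafe.Pointer(&c.nonempty))-base,
        uintptr(unsafe.Pointer(&c.empty))-base) // exactly 0x10 apart
}

If that reading is right, the span believes it is on the list adjacent to
the one remove() was called on.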
>>>
>>> procstat -v test.core.1481054183
>>>   PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
>>>  1178           0x400000           0x49b000 r-x  115  223   3   1 CN-- vn /root/test
>>>  1178           0x49b000           0x528000 r--   97  223   3   1 CN-- vn /root/test
>>>  1178           0x528000           0x539000 rw-   10    0   1   0 C--- vn /root/test
>>>  1178           0x539000           0x55a000 rw-   16   16   1   0 C--- df
>>>  1178        0x800528000        0x800a28000 rw-  118  118   1   0 C--- df
>>>  1178        0x800a28000        0x800a68000 rw-    1    1   1   0 CN-- df
>>>  1178        0x800a68000        0x800aa8000 rw-    2    2   1   0 CN-- df
>>>  1178        0x800aa8000        0x800c08000 rw-   50   50   1   0 CN-- df
>>>  1178        0x800c08000        0x800c48000 rw-    2    2   1   0 CN-- df
>>>  1178        0x800c48000        0x800c88000 rw-    1    1   1   0 CN-- df
>>>  1178        0x800c88000        0x800cc8000 rw-    1    1   1   0 CN-- df
>>>  1178       0xc000000000       0xc000001000 rw-    1    1   1   0 CN-- df
>>>  1178       0xc41ffe0000       0xc41ffe8000 rw-    8    8   1   0 CN-- df
>>>  1178       0xc41ffe8000       0xc41fff0000 rw-    8    8   1   0 CN-- df
>>>  1178       0xc41fff0000       0xc41fff8000 rw-    8    8   1   0 C--- df
>>>  1178       0xc41fff8000       0xc420300000 rw-  553  553   1   0 C--- df
>>>  1178       0xc420300000       0xc420400000 rw-  234  234   1   0 C--- df
>>>  1178     0x7ffffffdf000     0x7ffffffff000 rwx    2    2   1   0 C--D df
>>>  1178     0x7ffffffff000     0x800000000000 r-x    1    1  33   0 ---- ph
>>>
>>> This is from FreeBSD 12.0-CURRENT #36 r309618M
>>>
>>> ktrace on 11.0-RELEASE is still running; 6 hours so far.
> One thing that I noted: in my later attempt to reproduce the issue, I
> got the following output:
>
> sandy% GOGC=2 ./1.go                                                     /mnt/1
> Starting 16 forking goroutines...
> GOMAXPROCS: 8
> runtime: failed MSpanList_Remove 0x8006a9d60 0x53f2e0 0x53f2f0 0x53f2e0
> fatal error: MSpanList_Remove
>
> runtime stack:
> runtime.throw(0x4cca4d, 0x10)
>          /usr/local/go/src/runtime/panic.go:566 +0x95
> runtime.(*mSpanList).remove(0x53f2e0, 0x8006a9d60)
>          /usr/local/go/src/runtime/mheap.go:1001 +0x19d
> runtime.(*mcentral).cacheSpan(0x53f2d0, 0x44abfb)
>          /usr/local/go/src/runtime/mcentral.go:55 +0x3d0
> runtime.(*mcache).refill(0x80052a0d0, 0xc400000016, 0xc420200d38)
>          /usr/local/go/src/runtime/mcache.go:121 +0xae
> runtime.(*mcache).nextFree.func1()
>          /usr/local/go/src/runtime/malloc.go:505 +0x33
> runtime.systemstack(0xc420018000)
>          /usr/local/go/src/runtime/asm_amd64.s:298 +0x79
> runtime.mstart()
>          /usr/local/go/src/runtime/proc.go:1079
>
> goroutine 8797810 [running]:
> runtime.systemstack_switch()
>          /usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc420200be0 sp=0xc420200bd8
> runtime.(*mcache).nextFree(0x80052a0d0, 0xc420200c16, 0x40ec95, 0xc42011c050, 0x10)
>          /usr/local/go/src/runtime/malloc.go:506 +0xb2 fp=0xc420200c38 sp=0xc420200be0
> runtime.mallocgc(0x1a0, 0x4aaf60, 0xc420200d01, 0x44ab40)
>          /usr/local/go/src/runtime/malloc.go:658 +0x809 fp=0xc420200cd8 sp=0xc420200c38
> runtime.makeslice(0x4aaf60, 0x0, 0x19, 0xc4200b8688, 0x8, 0x8)
>          /usr/local/go/src/runtime/slice.go:57 +0x7b fp=0xc420200d30 sp=0xc420200cd8
> syscall.Environ(0x0, 0x0, 0x0)
>          /usr/local/go/src/syscall/env_unix.go:142 +0xd0 fp=0xc420200dc0 sp=0xc420200d30
> os.Environ(0x0, 0x0, 0x0)
>          /usr/local/go/src/os/env.go:116 +0x22 fp=0xc420200de8 sp=0xc420200dc0
> os/exec.(*Cmd).envv(0xc4201422c0, 0xc4200b8680, 0x0, 0x1)
>          /usr/local/go/src/os/exec/exec.go:171 +0x38 fp=0xc420200e10 sp=0xc420200de8
> os/exec.(*Cmd).Start(0xc4201422c0, 0x6, 0x0)
>
> There are two traces for goroutines, and note that both the first and the
> second are in malloc. Is go malloc fine-grained locked? E.g. I know that
> the JVM uses per-thread arenas.
I'm not particularly familiar with the golang memory allocator, as I
only started digging into this a few days ago, but malloc is pool-based.

One article I just found which covers some details of the allocator's
internals is:
https://blog.altoros.com/golang-internals-part-6-bootstrapping-and-memory-allocator-initialization.html
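
To illustrate the shape of it (a toy sketch only, not the runtime's actual
code): each P gets a lock-free local cache of spans, which refills from a
per-size-class central list guarded by its own lock; that matches the
mcache.refill -> mcentral.cacheSpan path in the traces above. Roughly:

package main

import (
    "fmt"
    "sync"
)

type span struct{ id int }

// central is analogous to mcentral: one per size class, each with its own
// lock, so malloc is fine-grained locked rather than globally locked.
type central struct {
    mu    sync.Mutex
    spans []*span
}

func (c *central) cacheSpan() *span {
    c.mu.Lock()
    defer c.mu.Unlock()
    if len(c.spans) == 0 {
        return nil // the real runtime would grow from the page heap here
    }
    s := c.spans[len(c.spans)-1]
    c.spans = c.spans[:len(c.spans)-1]
    return s
}

// cache is analogous to mcache: per-P, so it is accessed without locks.
type cache struct{ current *span }

func (mc *cache) refill(c *central) { mc.current = c.cacheSpan() }

func main() {
    c := &central{spans: []*span{{id: 1}, {id: 2}}}
    var mc cache
    mc.refill(c)
    fmt.Println("cached span:", mc.current.id)
}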
> In fact, are there stress-tests for the go mutual exclusion primitives ?
go has a full test suite built in, which is run by default when it is
built; components of it also include benchmarks.

You can run the tests for the current directory with verbose output using:
go test -v
or for all directories below the current one with:
go test -v ./...

To also run all the benchmarks, add:
-bench=.

For more information about go test, which has loads of nice options
including profiling, see:
go test -h
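
For example, given a (hypothetical) file env_test.go:

package foo

import (
    "os"
    "testing"
)

func TestEnviron(t *testing.T) {
    if len(os.Environ()) == 0 {
        t.Fatal("expected a non-empty environment")
    }
}

func BenchmarkEnviron(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = os.Environ()
    }
}

go test -v -bench=. then runs both the test and the benchmark.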

> The runtime seems to try to use thr and umtx syscalls directly, which
> could be the source of bugs.
Yes it could well be.

Looking through the runtime code I spotted that it uses the sigprocmask
syscall directly in src/runtime/os_freebsd.go, which may also have the
potential to cause problems, given that libthr does additional work on
top of the syscall and that sigprocmask(2) explicitly warns against the
use of sigprocmask in threaded apps:
https://github.com/golang/go/blob/master/src/runtime/os_freebsd.go#L135
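
To show what "directly" means here, a rough sketch of the raw-syscall style
the runtime uses (FreeBSD-only; the sigset layout and SIG_BLOCK value are
taken from FreeBSD's headers, and this bypasses whatever bookkeeping
libthr's pthread_sigmask(3) layers on top of the syscall):

package main

import (
    "syscall"
    "unsafe"
)

// FreeBSD sigset_t: four 32-bit words.
type sigset struct{ bits [4]uint32 }

const sigBlock = 1 // SIG_BLOCK

func main() {
    var all, old sigset
    for i := range all.bits {
        all.bits[i] = ^uint32(0)
    }
    // Trap straight into sigprocmask(2), as the runtime does, instead of
    // going through libthr's pthread_sigmask(3).
    _, _, errno := syscall.Syscall(syscall.SYS_SIGPROCMASK, sigBlock,
        uintptr(unsafe.Pointer(&all)), uintptr(unsafe.Pointer(&old)))
    if errno != 0 {
        panic(errno)
    }
}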

I've now got a failure with ktrace running; it took over 24 hours, so the
trace is quite large. I've uploaded it alongside the log and the exact
binary build to:
https://blog.multiplay.co.uk/dropzone/freebsd/golang-panic.tar.gz

The crash was the same as before:
runtime: failed MSpanList_Remove 0x80052d860 0x8007abbc0 0x53ea30 0x53ea20
fatal error: MSpanList_Remove

Given the offset is the same (+0x10) again, this doesn't really feel like
random corruption.
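
For reference, the reproducer is shaped roughly like this (a minimal
sketch along the lines of the test in the golang issue; the path and
counts are illustrative):

package main

import (
    "fmt"
    "os/exec"
    "runtime"
)

func main() {
    fmt.Println("Starting 16 forking goroutines...")
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
    for i := 0; i < 16; i++ {
        go func() {
            for {
                // fork/exec repeatedly while the rest of the process
                // keeps allocating
                if err := exec.Command("/usr/bin/true").Run(); err != nil {
                    panic(err)
                }
            }
        }()
    }
    // Keep the allocator and GC busy in the parent (GOGC=2 makes the GC
    // run far more often, which speeds up hitting the crash).
    for {
        _ = make([]byte, 1<<16)
    }
}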

     Regards
     Steve


