Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 28 Mar 2017 09:23:24 +0100
From:      Steven Hartland <killing@multiplay.co.uk>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        "K. Macy" <kmacy@freebsd.org>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject:   Re: Help needed to identify golang fork / memory corruption issue on FreeBSD
Message-ID:  <85f86a20-948f-025a-0d09-92ee2a815136@multiplay.co.uk>
In-Reply-To: <20170328075859.GQ43712@kib.kiev.ua>
References:  <CAHM0Q_Mg662u9D0KJ9knEWWqi9Ydy38qKDnjLt6XaS0ks%2B9-iw@mail.gmail.com> <18b40a69-4460-faf2-c0ce-7491eca92782@multiplay.co.uk> <20170317082333.GP16105@kib.kiev.ua> <180a601b-5481-bb41-f7fc-67976aabe451@multiplay.co.uk> <20170317124437.GR16105@kib.kiev.ua> <5ba92447-945e-6fea-ad4f-f58ac2a0012e@multiplay.co.uk> <20170327161833.GL43712@kib.kiev.ua> <3ec35a46-ae70-35cd-29f8-82e7cebb0eb6@multiplay.co.uk> <20170327164905.GN43712@kib.kiev.ua> <17f29342-f3c0-5940-d012-1a698e59a384@multiplay.co.uk> <20170328075859.GQ43712@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On 28/03/2017 08:58, Konstantin Belousov wrote:
> On Tue, Mar 28, 2017 at 12:32:37AM +0100, Steven Hartland wrote:
>> On 27/03/2017 17:49, Konstantin Belousov wrote:
>>> On Mon, Mar 27, 2017 at 05:33:49PM +0100, Steven Hartland wrote:
>>>> On 27/03/2017 17:18, Konstantin Belousov wrote:
>>>>> On Mon, Mar 27, 2017 at 12:47:11PM +0100, Steven Hartland wrote:
>>>>>> OK now the similar but unrelated issue with signal stacks is solved I've
>>>>>> moved back to the initial issue.
>>>>>>
>>>>>> I've made some progress with a reproduction case as detailed here:
>>>>>> https://github.com/golang/go/issues/15658#issuecomment-288747812
>>>>>>
>>>>>> In short it seems that having a running child, while the parent runs GC,
>>>>>> is some how responsible for memory corruption in the parent.
>>>>>>
>>>>>> The reason I believe this is if I run the same GC in the parent after
>>>>>> the child exits instead of while its running, I've been unable to
>>>>>> reproduce the issue.
>>>>>>
>>>>>> As the memory segments are COW then the issue might be in VM subsystem.
>>>>> Well, it might be, but it is a strange corruption mode to believe.
>>>> Indeed, but would you agree the evidence seems to indicate that this may
>>>> be the case, as otherwise I would have expected that running the GC
>>>> after the child process has exited would have zero impact on the issue.
>>>>>> In order to confirm / deny this I was wondering if there was a way to
>>>>>> force a full copy of all segments for the child instead of using the COW
>>>>>> optimisation.
>>>>> No, there is no. By design, copying only occurs on faults, when VM
>>>>> detects that the map entry needs copying. Doing the actual copy at fork
>>>>> time would require writing a lot of new code.
>>>> I noticed in vm_map_copy_entry the following:
>>>>                    /*
>>>>                     * We don't want to make writeable wired pages
>>>> copy-on-write.
>>>>                     * Immediately copy these pages into the new map by
>>>> simulating
>>>>                     * page faults.  The new pages are pageable.
>>>>                     */
>>>>                    vm_fault_copy_entry(dst_map, src_map, dst_entry, src_entry,
>>>>                        fork_charge);
>>>>
>>>> I wondered if I could use vm_fault_copy_entry to force the copy on fork?
>>> No, the vm_fault_copy_entry() only works with wired entries, e.g. it cannot
>>> page in not yet touched page, and the result is also wired.
>>>
>>>>> Does go have FreeBSD/i386 port ?  If yes, is the issue reproducable there ?
>>>> Yes it does, I don't currently have i386 machine to test with, I'm
>>>> assuming testing i386 on amd64 kernel, would likely not have any effect.
>>> Only if the bug is in kernel and not in the go runtime.  I am still not
>>> convinced that the kernel is the culprit.
>>>
>>>>> Another blind experiment to try is to comment out call to
>>>>> vm_object_collapse() in sys/vm/vm_map.c:vm_map_copy_entry() and see if
>>>>> it changes anything.
>>>> I'll do that shortly.
>>>>> What could be quite interesting is to look at the parent and possibly
>>>>> child address map after the error occured, using procstat -v. At
>>>>> least for parent, this should be relatively easy to set up, just make
>>>>> go runtime spin or pause on panic, instead of exiting, and then use
>>>>> procstat.
>> Here's both parent and child after a failure in the parent, which I
>> obtained by putting the child in a nanosleep loop and only after
>> successful GC call I send SIGTERM the child and reap it.
>>
>> procstat -v 53832 61121
>>     PID              START                END PRT  RES PRES REF SHD FLAG
>> TP PATH
>> 53832           0x400000           0x70e000 r-x  308  601 5   1 CN-- vn
>> /root/golang/src/test5/test5
>> 53832           0x70e000           0x951000 r--  261  601 5   1 CN-- vn
>> /root/golang/src/test5/test5
>> 53832           0x951000           0x988000 rw-   31    0 1   0 C--- vn
>> /root/golang/src/test5/test5
>> 53832           0x988000           0x9ab000 rw-   18    0 1   0 C--- df
>> 53832        0x800951000        0x800b51000 rw-   41    0 1   0 C--- df
>> 53832        0x800b51000        0x800c21000 rw-   26    0 1   0 C--- df
>> 53832        0x800c21000        0x800c71000 rw-   18    0 1   0 C--- df
>> 53832        0x800c71000        0x800cb1000 rw-    1    0 1   0 C--- df
>> 53832        0x800cb1000        0x800cf1000 rw-    2    0 1   0 C--- df
>> 53832        0x800cf1000        0x800d71000 rw-    3    0 1   0 C--- df
>> 53832        0x800d71000        0x800db1000 rw-    1    0 1   0 C--- df
>> 53832        0x800db1000        0x800e71000 rw-    3    0 1   0 C--- df
>> 53832        0x800e71000        0x800eb1000 rw-    1    1 1   0 ---- df
>> 53832       0xc000000000       0xc000001000 rw-    1    1 2   0 CN-- df
>> 53832       0xc41fff0000       0xc41fff8000 rw-    3    3 2   0 CN-- df
>> 53832       0xc41fff8000       0xc420200000 rw-  251    0 1   0 C--- df
>> 53832     0x7ffffffdf000     0x7ffffffff000 rwx    2    0 1   0 C--D df
>> 53832     0x7ffffffff000     0x800000000000 r-x    1    1 28   0 ---- ph
>> 61121           0x400000           0x70e000 r-x  308  601 5   1 CN-- vn
>> /root/golang/src/test5/test5
>> 61121           0x70e000           0x951000 r--  261  601 5   1 CN-- vn
>> /root/golang/src/test5/test5
>> 61121           0x951000           0x988000 rw-   31    0 2   1 CN-- vn
>> /root/golang/src/test5/test5
>> 61121           0x988000           0x9ab000 rw-   18   18 2   1 CN-- df
>> 61121        0x800951000        0x800b51000 rw-   41   41 2   1 CN-- df
>> 61121        0x800b51000        0x800c21000 rw-   26   26 2   1 CN-- df
>> 61121        0x800c21000        0x800c71000 rw-   18   18 2   1 CN-- df
>> 61121        0x800c71000        0x800cb1000 rw-    1    1 2   1 CN-- df
>> 61121        0x800cb1000        0x800cf1000 rw-    2    2 2   1 CN-- df
>> 61121        0x800cf1000        0x800d71000 rw-    3    3 2   1 CN-- df
>> 61121        0x800d71000        0x800db1000 rw-    1    1 2   1 CN-- df
>> 61121        0x800db1000        0x800e71000 rw-    3    3 2   1 CN-- df
>> 61121       0xc000000000       0xc000001000 rw-    1    1 2   0 CN-- df
>> 61121       0xc41fff0000       0xc41fff8000 rw-    3    3 2   0 CN-- df
>> 61121       0xc41fff8000       0xc420200000 rw-  251    0 1   0 C--- df
>> 61121     0x7ffffffdf000     0x7ffffffff000 rwx    2    2 2   1 CN-D df
>> 61121     0x7ffffffff000     0x800000000000 r-x    1    1 28   0 ---- ph
>>
>> Should the parent have lost the COW flag to the region starting at
>> 0x800e71000?
> Note that the region is absent in the child, so most likely this is a new
> mmaping occured in the parent after fork.  Again, what is the address
> where an invalid value was detected ?
As I stopped the panic before that I couldn't tell so I've re-run with 
some debug added just before the panic to capture the addresses of the 
workbuf structure that the issue was detected in, here goes (parent: 
62620, child: 98756):

workbuf: 0x800b51800
fatal error: workbuf is not empty
workbuf: 0x800a72000
fatal error: workbuf is empty
workbuf: 0x800a72000
fatal error: workbuf is not empty

procstat -v 62620 98756
   PID              START                END PRT  RES PRES REF SHD FLAG 
TP PATH
62620           0x400000           0x70e000 r-x  309  595 5   1 CN-- vn 
/root/golang/src/test5/test5
62620           0x70e000           0x951000 r--  254  595 5   1 CN-- vn 
/root/golang/src/test5/test5
62620           0x951000           0x988000 rw-   31    0 1   0 C--- vn 
/root/golang/src/test5/test5
62620           0x988000           0x9ab000 rw-   18    0 1   0 C--- df
62620        0x800951000        0x8009f1000 rw-   36    0 1   0 C--- df
62620        0x8009f1000        0x800ac1000 rw-   28    0 1   0 C--- df
62620        0x800ac1000        0x800ad1000 rw-   16    0 1   0 C--- df
62620        0x800ad1000        0x800b91000 rw-    5    0 1   0 C--- df
62620        0x800b91000        0x800bd1000 rw-    2    0 1   0 C--- df
62620        0x800bd1000        0x800c11000 rw-    2    0 1   0 C--- df
62620        0x800c11000        0x800c51000 rw-    1    1 2   0 CN-- df
62620        0x800c51000        0x800c91000 rw-    1    0 1   0 C--- df
62620        0x800c91000        0x800cd1000 rw-    1    0 1   0 C--- df
62620       0xc000000000       0xc000001000 rw-    1    1 2   0 CN-- df
62620       0xc41fff0000       0xc41fff8000 rw-    3    3 2   0 CN-- df
62620       0xc41fff8000       0xc420200000 rw-  251    0 1   0 C--- df
62620     0x7ffffffdf000     0x7ffffffff000 rwx    2    0 1   0 C--D df
62620     0x7ffffffff000     0x800000000000 r-x    1    1 28   0 ---- ph
98756           0x400000           0x70e000 r-x  309  595 5   1 CN-- vn 
/root/golang/src/test5/test5
98756           0x70e000           0x951000 r--  254  595 5   1 CN-- vn 
/root/golang/src/test5/test5
98756           0x951000           0x988000 rw-   31    0 2   1 CN-- vn 
/root/golang/src/test5/test5
98756           0x988000           0x9ab000 rw-   18   18 2   1 CN-- df
98756        0x800951000        0x8009f1000 rw-   36   36 2   1 CN-- df
98756        0x8009f1000        0x800ac1000 rw-   28   28 2   1 CN-- df
98756        0x800ac1000        0x800ad1000 rw-   16   16 2   1 CN-- df
98756        0x800ad1000        0x800b91000 rw-    5    5 2   1 CN-- df
98756        0x800b91000        0x800bd1000 rw-    2    2 2   1 CN-- df
98756        0x800bd1000        0x800c11000 rw-    2    2 2   1 CN-- df
98756        0x800c11000        0x800c51000 rw-    1    1 2   0 CN-- df
98756        0x800c51000        0x800c91000 rw-    1    1 2   1 CN-- df
98756        0x800c91000        0x800cd1000 rw-    1    1 2   1 CN-- df
98756       0xc000000000       0xc000001000 rw-    1    1 2   0 CN-- df
98756       0xc41fff0000       0xc41fff8000 rw-    3    3 2   0 CN-- df
98756       0xc41fff8000       0xc420200000 rw-  251    0 1   0 C--- df
98756     0x7ffffffdf000     0x7ffffffff000 rwx    2    2 2   1 CN-D df
98756     0x7ffffffff000     0x800000000000 r-x    1    1 28   0 ---- ph

     Regards
     Steve



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?85f86a20-948f-025a-0d09-92ee2a815136>