From owner-freebsd-hackers@freebsd.org Tue Mar 28 08:23:27 2017 Return-Path: Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5C08AD1BFC4 for ; Tue, 28 Mar 2017 08:23:27 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: from mail-wr0-x232.google.com (mail-wr0-x232.google.com [IPv6:2a00:1450:400c:c0c::232]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id CCF1B95F for ; Tue, 28 Mar 2017 08:23:26 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: by mail-wr0-x232.google.com with SMTP id w43so79022947wrb.0 for ; Tue, 28 Mar 2017 01:23:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623; h=subject:to:references:cc:from:message-id:date:user-agent :mime-version:in-reply-to; bh=1x9iC+woiHK49a0TwycHH5fU2kiKmpYCKwBII7rsNDU=; b=VJR5ZfhhXRB0R/P2Co+k1ozosj0en4XBWHr+TcsOq4p69xYMmn3UujSOHnUBXvUFh4 SVN1qENXH77dgJ36ZjI/Voc4WXfMa9DtLVXPjthR/GAhk/wd1YgQ8snSzneSyCqu6fWe SOdFlRiXSvQ5Lg4h7bRk7OAfQw7n9irmjX9uXiL2nVfdI9/euphJJrL4B27EnNxLTkK1 VKVX5hTIzWWlZOa4VrRyf5EUC7cmz/XwmhIW6bvafnIS3NG7Ph9N05OnfOd8KaEiBW40 rtmoINn+08CY3ClOpQvSiMcBDK+wqM2FB5BmPQWbItJn39+m5eYrXEyXl8T5yPqRZ5pH ZSlw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:references:cc:from:message-id:date :user-agent:mime-version:in-reply-to; bh=1x9iC+woiHK49a0TwycHH5fU2kiKmpYCKwBII7rsNDU=; b=G1RwtFxTmbwr37P6TgpFb0aVPUClqhb9mDIvcLGqztdckcgyMkIuVswoCnOyIm3rgQ UhCL4LcQxh6Fc0u61IQIsENXgRCDbugfzEL8I7kYf41T9VjI/3N+B4n3h/VJG83CKRDm uc1mYKBQ7ns9GT9AF80BrSEkexwjgzdcNnHyPhJH4CEvlHDNKPKe88Dpxf7QYsGtJe1M A4SIS3WFYTTUwaMClZ3OTlH00OOS1ENOQQ4vN52kwEQNpBjJSDsU4jwBJgbFvWcJsBSO bOJqoEFHG7QFljx+8mztkKDRE2Bnl9S7jENpmlhrLKyeMdDAa9Vrzn24BF/Bn5kF+roS 9gPg== X-Gm-Message-State: AFeK/H15a4KRig3KcQHiNWfKN8HqCdN1U6U3EHRn6njl9PdEfegr7PLaTvpJx+b89UW66jb7 X-Received: by 10.28.203.204 with SMTP id b195mr13480600wmg.51.1490689403957; Tue, 28 Mar 2017 01:23:23 -0700 (PDT) Received: from [10.10.1.58] ([185.97.61.26]) by smtp.gmail.com with ESMTPSA id w2sm2574974wmd.19.2017.03.28.01.23.22 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Mar 2017 01:23:23 -0700 (PDT) Subject: Re: Help needed to identify golang fork / memory corruption issue on FreeBSD To: Konstantin Belousov References: <18b40a69-4460-faf2-c0ce-7491eca92782@multiplay.co.uk> <20170317082333.GP16105@kib.kiev.ua> <180a601b-5481-bb41-f7fc-67976aabe451@multiplay.co.uk> <20170317124437.GR16105@kib.kiev.ua> <5ba92447-945e-6fea-ad4f-f58ac2a0012e@multiplay.co.uk> <20170327161833.GL43712@kib.kiev.ua> <3ec35a46-ae70-35cd-29f8-82e7cebb0eb6@multiplay.co.uk> <20170327164905.GN43712@kib.kiev.ua> <17f29342-f3c0-5940-d012-1a698e59a384@multiplay.co.uk> <20170328075859.GQ43712@kib.kiev.ua> Cc: "K. Macy" , "freebsd-hackers@freebsd.org" From: Steven Hartland Message-ID: <85f86a20-948f-025a-0d09-92ee2a815136@multiplay.co.uk> Date: Tue, 28 Mar 2017 09:23:24 +0100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 MIME-Version: 1.0 In-Reply-To: <20170328075859.GQ43712@kib.kiev.ua> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.23 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Mar 2017 08:23:27 -0000 On 28/03/2017 08:58, Konstantin Belousov wrote: > On Tue, Mar 28, 2017 at 12:32:37AM +0100, Steven Hartland wrote: >> On 27/03/2017 17:49, Konstantin Belousov wrote: >>> On Mon, Mar 27, 2017 at 05:33:49PM +0100, Steven Hartland wrote: >>>> On 27/03/2017 17:18, Konstantin Belousov wrote: >>>>> On Mon, Mar 27, 2017 at 12:47:11PM +0100, Steven Hartland wrote: >>>>>> OK now the similar but unrelated issue with signal stacks is solved I've >>>>>> moved back to the initial issue. >>>>>> >>>>>> I've made some progress with a reproduction case as detailed here: >>>>>> https://github.com/golang/go/issues/15658#issuecomment-288747812 >>>>>> >>>>>> In short it seems that having a running child, while the parent runs GC, >>>>>> is some how responsible for memory corruption in the parent. >>>>>> >>>>>> The reason I believe this is if I run the same GC in the parent after >>>>>> the child exits instead of while its running, I've been unable to >>>>>> reproduce the issue. >>>>>> >>>>>> As the memory segments are COW then the issue might be in VM subsystem. >>>>> Well, it might be, but it is a strange corruption mode to believe. >>>> Indeed, but would you agree the evidence seems to indicate that this may >>>> be the case, as otherwise I would have expected that running the GC >>>> after the child process has exited would have zero impact on the issue. >>>>>> In order to confirm / deny this I was wondering if there was a way to >>>>>> force a full copy of all segments for the child instead of using the COW >>>>>> optimisation. >>>>> No, there is no. By design, copying only occurs on faults, when VM >>>>> detects that the map entry needs copying. Doing the actual copy at fork >>>>> time would require writing a lot of new code. >>>> I noticed in vm_map_copy_entry the following: >>>> /* >>>> * We don't want to make writeable wired pages >>>> copy-on-write. >>>> * Immediately copy these pages into the new map by >>>> simulating >>>> * page faults. The new pages are pageable. >>>> */ >>>> vm_fault_copy_entry(dst_map, src_map, dst_entry, src_entry, >>>> fork_charge); >>>> >>>> I wondered if I could use vm_fault_copy_entry to force the copy on fork? >>> No, the vm_fault_copy_entry() only works with wired entries, e.g. it cannot >>> page in not yet touched page, and the result is also wired. >>> >>>>> Does go have FreeBSD/i386 port ? If yes, is the issue reproducable there ? >>>> Yes it does, I don't currently have i386 machine to test with, I'm >>>> assuming testing i386 on amd64 kernel, would likely not have any effect. >>> Only if the bug is in kernel and not in the go runtime. I am still not >>> convinced that the kernel is the culprit. >>> >>>>> Another blind experiment to try is to comment out call to >>>>> vm_object_collapse() in sys/vm/vm_map.c:vm_map_copy_entry() and see if >>>>> it changes anything. >>>> I'll do that shortly. >>>>> What could be quite interesting is to look at the parent and possibly >>>>> child address map after the error occured, using procstat -v. At >>>>> least for parent, this should be relatively easy to set up, just make >>>>> go runtime spin or pause on panic, instead of exiting, and then use >>>>> procstat. >> Here's both parent and child after a failure in the parent, which I >> obtained by putting the child in a nanosleep loop and only after >> successful GC call I send SIGTERM the child and reap it. >> >> procstat -v 53832 61121 >> PID START END PRT RES PRES REF SHD FLAG >> TP PATH >> 53832 0x400000 0x70e000 r-x 308 601 5 1 CN-- vn >> /root/golang/src/test5/test5 >> 53832 0x70e000 0x951000 r-- 261 601 5 1 CN-- vn >> /root/golang/src/test5/test5 >> 53832 0x951000 0x988000 rw- 31 0 1 0 C--- vn >> /root/golang/src/test5/test5 >> 53832 0x988000 0x9ab000 rw- 18 0 1 0 C--- df >> 53832 0x800951000 0x800b51000 rw- 41 0 1 0 C--- df >> 53832 0x800b51000 0x800c21000 rw- 26 0 1 0 C--- df >> 53832 0x800c21000 0x800c71000 rw- 18 0 1 0 C--- df >> 53832 0x800c71000 0x800cb1000 rw- 1 0 1 0 C--- df >> 53832 0x800cb1000 0x800cf1000 rw- 2 0 1 0 C--- df >> 53832 0x800cf1000 0x800d71000 rw- 3 0 1 0 C--- df >> 53832 0x800d71000 0x800db1000 rw- 1 0 1 0 C--- df >> 53832 0x800db1000 0x800e71000 rw- 3 0 1 0 C--- df >> 53832 0x800e71000 0x800eb1000 rw- 1 1 1 0 ---- df >> 53832 0xc000000000 0xc000001000 rw- 1 1 2 0 CN-- df >> 53832 0xc41fff0000 0xc41fff8000 rw- 3 3 2 0 CN-- df >> 53832 0xc41fff8000 0xc420200000 rw- 251 0 1 0 C--- df >> 53832 0x7ffffffdf000 0x7ffffffff000 rwx 2 0 1 0 C--D df >> 53832 0x7ffffffff000 0x800000000000 r-x 1 1 28 0 ---- ph >> 61121 0x400000 0x70e000 r-x 308 601 5 1 CN-- vn >> /root/golang/src/test5/test5 >> 61121 0x70e000 0x951000 r-- 261 601 5 1 CN-- vn >> /root/golang/src/test5/test5 >> 61121 0x951000 0x988000 rw- 31 0 2 1 CN-- vn >> /root/golang/src/test5/test5 >> 61121 0x988000 0x9ab000 rw- 18 18 2 1 CN-- df >> 61121 0x800951000 0x800b51000 rw- 41 41 2 1 CN-- df >> 61121 0x800b51000 0x800c21000 rw- 26 26 2 1 CN-- df >> 61121 0x800c21000 0x800c71000 rw- 18 18 2 1 CN-- df >> 61121 0x800c71000 0x800cb1000 rw- 1 1 2 1 CN-- df >> 61121 0x800cb1000 0x800cf1000 rw- 2 2 2 1 CN-- df >> 61121 0x800cf1000 0x800d71000 rw- 3 3 2 1 CN-- df >> 61121 0x800d71000 0x800db1000 rw- 1 1 2 1 CN-- df >> 61121 0x800db1000 0x800e71000 rw- 3 3 2 1 CN-- df >> 61121 0xc000000000 0xc000001000 rw- 1 1 2 0 CN-- df >> 61121 0xc41fff0000 0xc41fff8000 rw- 3 3 2 0 CN-- df >> 61121 0xc41fff8000 0xc420200000 rw- 251 0 1 0 C--- df >> 61121 0x7ffffffdf000 0x7ffffffff000 rwx 2 2 2 1 CN-D df >> 61121 0x7ffffffff000 0x800000000000 r-x 1 1 28 0 ---- ph >> >> Should the parent have lost the COW flag to the region starting at >> 0x800e71000? > Note that the region is absent in the child, so most likely this is a new > mmaping occured in the parent after fork. Again, what is the address > where an invalid value was detected ? As I stopped the panic before that I couldn't tell so I've re-run with some debug added just before the panic to capture the addresses of the workbuf structure that the issue was detected in, here goes (parent: 62620, child: 98756): workbuf: 0x800b51800 fatal error: workbuf is not empty workbuf: 0x800a72000 fatal error: workbuf is empty workbuf: 0x800a72000 fatal error: workbuf is not empty procstat -v 62620 98756 PID START END PRT RES PRES REF SHD FLAG TP PATH 62620 0x400000 0x70e000 r-x 309 595 5 1 CN-- vn /root/golang/src/test5/test5 62620 0x70e000 0x951000 r-- 254 595 5 1 CN-- vn /root/golang/src/test5/test5 62620 0x951000 0x988000 rw- 31 0 1 0 C--- vn /root/golang/src/test5/test5 62620 0x988000 0x9ab000 rw- 18 0 1 0 C--- df 62620 0x800951000 0x8009f1000 rw- 36 0 1 0 C--- df 62620 0x8009f1000 0x800ac1000 rw- 28 0 1 0 C--- df 62620 0x800ac1000 0x800ad1000 rw- 16 0 1 0 C--- df 62620 0x800ad1000 0x800b91000 rw- 5 0 1 0 C--- df 62620 0x800b91000 0x800bd1000 rw- 2 0 1 0 C--- df 62620 0x800bd1000 0x800c11000 rw- 2 0 1 0 C--- df 62620 0x800c11000 0x800c51000 rw- 1 1 2 0 CN-- df 62620 0x800c51000 0x800c91000 rw- 1 0 1 0 C--- df 62620 0x800c91000 0x800cd1000 rw- 1 0 1 0 C--- df 62620 0xc000000000 0xc000001000 rw- 1 1 2 0 CN-- df 62620 0xc41fff0000 0xc41fff8000 rw- 3 3 2 0 CN-- df 62620 0xc41fff8000 0xc420200000 rw- 251 0 1 0 C--- df 62620 0x7ffffffdf000 0x7ffffffff000 rwx 2 0 1 0 C--D df 62620 0x7ffffffff000 0x800000000000 r-x 1 1 28 0 ---- ph 98756 0x400000 0x70e000 r-x 309 595 5 1 CN-- vn /root/golang/src/test5/test5 98756 0x70e000 0x951000 r-- 254 595 5 1 CN-- vn /root/golang/src/test5/test5 98756 0x951000 0x988000 rw- 31 0 2 1 CN-- vn /root/golang/src/test5/test5 98756 0x988000 0x9ab000 rw- 18 18 2 1 CN-- df 98756 0x800951000 0x8009f1000 rw- 36 36 2 1 CN-- df 98756 0x8009f1000 0x800ac1000 rw- 28 28 2 1 CN-- df 98756 0x800ac1000 0x800ad1000 rw- 16 16 2 1 CN-- df 98756 0x800ad1000 0x800b91000 rw- 5 5 2 1 CN-- df 98756 0x800b91000 0x800bd1000 rw- 2 2 2 1 CN-- df 98756 0x800bd1000 0x800c11000 rw- 2 2 2 1 CN-- df 98756 0x800c11000 0x800c51000 rw- 1 1 2 0 CN-- df 98756 0x800c51000 0x800c91000 rw- 1 1 2 1 CN-- df 98756 0x800c91000 0x800cd1000 rw- 1 1 2 1 CN-- df 98756 0xc000000000 0xc000001000 rw- 1 1 2 0 CN-- df 98756 0xc41fff0000 0xc41fff8000 rw- 3 3 2 0 CN-- df 98756 0xc41fff8000 0xc420200000 rw- 251 0 1 0 C--- df 98756 0x7ffffffdf000 0x7ffffffff000 rwx 2 2 2 1 CN-D df 98756 0x7ffffffff000 0x800000000000 r-x 1 1 28 0 ---- ph Regards Steve