Date: Fri, 17 Mar 2017 14:44:37 +0200
From: Konstantin Belousov
To: Steven Hartland
Cc: "K. Macy", "freebsd-hackers@freebsd.org"
Subject: Re: Help needed to identify golang fork / memory corruption issue on FreeBSD
Message-ID: <20170317124437.GR16105@kib.kiev.ua>
References: <27e1a828-5cd9-0755-50ca-d7143e7df117@multiplay.co.uk>
 <20161206125919.GQ54029@kib.kiev.ua>
 <8b502580-4d2d-1e1f-9e05-61d46d5ac3b1@multiplay.co.uk>
 <20161206143532.GR54029@kib.kiev.ua>
 <18b40a69-4460-faf2-c0ce-7491eca92782@multiplay.co.uk>
 <20170317082333.GP16105@kib.kiev.ua>
 <180a601b-5481-bb41-f7fc-67976aabe451@multiplay.co.uk>
In-Reply-To: <180a601b-5481-bb41-f7fc-67976aabe451@multiplay.co.uk>
List-Id: Technical Discussions relating to FreeBSD

On Fri, Mar 17, 2017 at 11:27:52AM +0000, Steven Hartland wrote:
> On 17/03/2017 08:23, Konstantin Belousov wrote:
> > On Fri, Mar 17, 2017 at 06:30:49AM +0000, Steven Hartland wrote:
> >> Ok I think I've identified the cause.
> >>
> >> If an alternative signal stack is applied to a non-main thread and that
> >> thread calls execve then the signal stack is not cleared.
> >>
> >> This results in all sorts of badness.
> >>
> >> Full details, including a small C reproduction case can be found here:
> >> https://github.com/golang/go/issues/15658#issuecomment-287276856
> >>
> >> So looks like its kernel bug. If anyone has an ideas about that before I
> >> look tomorrow that would be appreciated.
> > Yes, there is definitely a kernel bug, which should be fixed by the patch
> > below.
> >
> > Still, what I saw when I looked at the issue, is not quite resembling
> > potential consequences of the bug. Using wrong memory for signal stack
> > would result either in much more significant memory corruption if the
> > alt stack range is mapped and used for something unrelated, or in killed
> > process on signal delivery, if the range is not mapped. While I saw a
> > systematic 'off by 0x10' in some gc structures.
> >
> > Anyway, patch for the issue you identified:
> >
> > diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c
> > index 29d5dd4b132..9bf3ba66f5c 100644
> > --- a/sys/kern/kern_sig.c
> > +++ b/sys/kern/kern_sig.c
> > @@ -976,7 +976,6 @@ execsigs(struct proc *p)
> >  	 * and are now ignored by default).
> >  	 */
> >  	PROC_LOCK_ASSERT(p, MA_OWNED);
> > -	td = FIRST_THREAD_IN_PROC(p);
> >  	ps = p->p_sigacts;
> >  	mtx_lock(&ps->ps_mtx);
> >  	while (SIGNOTEMPTY(ps->ps_sigcatch)) {
> > @@ -1007,6 +1006,8 @@ execsigs(struct proc *p)
> >  	 * Reset stack state to the user stack.
> >  	 * Clear set of signals caught on the signal stack.
> >  	 */
> > +	td = curthread;
> > +	MPASS(td->td_proc == p);
> >  	td->td_sigstk.ss_flags = SS_DISABLE;
> >  	td->td_sigstk.ss_size = 0;
> >  	td->td_sigstk.ss_sp = 0;
> Thanks Kostik, pretty obvious now looking at :)
>
> Testing here we've seen all sorts of corruption looking things, mainly
> around random signals from SIGILL to SIGSEGV but also random kernel
> messages including:
> pid 4603 (test): sigreturn copying xfpustate failed
> pid 5013 (test): sigreturn xfpusave_len = 0x44d9bb
>
> I'm currently running a test, but its looking good as the test case
> usually crashes in a matter of seconds.
>
> Would you mind if I committed it?
I am capable of committing the patches.

>
> I'm guessing given its nature this is something we'd want MFC'ed and
> Errata's issued for all supported versions?
MFC will be done for sure.
I am not so sure about an EN; this is a routine bugfix. For some reasons,
a 10.3 erratum might indeed be the only way to get this to 10.x users, but
I do not see why we should bother re/so with 11.0.
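
For readers following the thread: the patch resets td_sigstk on curthread rather than on the first thread in the process, which matters because the alternate signal stack is a per-thread attribute. The following is only a minimal userland sketch of that per-thread behaviour, assuming POSIX threads; it is not the reproduction case from the linked GitHub issue, it does not call execve, and the helper names (altstack_flags, worker, probe_altstack) are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>

/*
 * Hypothetical helper: ss_flags of the calling thread's alternate
 * signal stack, or -1 if the query fails.
 */
static int
altstack_flags(void)
{
	stack_t cur;

	if (sigaltstack(NULL, &cur) != 0)
		return (-1);
	return (cur.ss_flags);
}

/* Install an alternate stack on this thread only and report its state. */
static void *
worker(void *arg)
{
	int *flags = arg;
	stack_t ss;

	*flags = -1;
	ss.ss_sp = malloc(SIGSTKSZ);
	if (ss.ss_sp == NULL)
		return (NULL);
	ss.ss_size = SIGSTKSZ;
	ss.ss_flags = 0;
	if (sigaltstack(&ss, NULL) == 0) {
		*flags = altstack_flags();	/* enabled, not in use */
		ss.ss_flags = SS_DISABLE;	/* disable before freeing */
		(void)sigaltstack(&ss, NULL);
	}
	free(ss.ss_sp);
	return (NULL);
}

/*
 * Hypothetical driver: sample the main thread's state before and after
 * a worker thread installs its own alternate stack. Returns 0 on success.
 */
static int
probe_altstack(int *main_before, int *worker_flags, int *main_after)
{
	pthread_t tid;

	*main_before = altstack_flags();
	if (pthread_create(&tid, NULL, worker, worker_flags) != 0)
		return (-1);
	if (pthread_join(tid, NULL) != 0)
		return (-1);
	*main_after = altstack_flags();
	return (0);
}
```

Calling probe_altstack() from main() (built with -pthread) should show the worker's stack as enabled while the main thread's state is unchanged, which is the reason resetting some other thread's td_sigstk at exec time leaves the calling thread's stale alternate stack in place.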