From owner-freebsd-hackers@FreeBSD.ORG Mon Sep 7 10:59:58 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B04D11065679 for ; Mon, 7 Sep 2009 10:59:58 +0000 (UTC) (envelope-from rivanr@gmail.com) Received: from mail-bw0-f206.google.com (mail-bw0-f206.google.com [209.85.218.206]) by mx1.freebsd.org (Postfix) with ESMTP id 3C9498FC1C for ; Mon, 7 Sep 2009 10:59:58 +0000 (UTC) Received: by bwz2 with SMTP id 2so278940bwz.43 for ; Mon, 07 Sep 2009 03:59:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from :user-agent:mime-version:to:subject:content-type :content-transfer-encoding; bh=VTpENWKrHuY+cvqqBVH0oyIJUs0WYio5TdfOsBbe7ds=; b=lzFmTN6GDKausJdDWnNzxnd52ga98lx7W+yceOrhY73ipjnpRpy3MTIJmzFQm4tX7R f9i8FcJyIr0Y1e/rff+Y6t39G4UmKwxEFmYyP7+lUBL6L/wN4/TcW2NJhgs8GvqKVZKh doLurhHmKIEHdHP3UI+xkcqzpaEMMtlfUjsbs= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:user-agent:mime-version:to:subject :content-type:content-transfer-encoding; b=KOTxQFgiMTzyx2QQ5Hx9PyXMOVKSD/jiHZPgS2a3bcURAZE/kmS80iiZPpEfNHrmrp qDCfchmzl5jZqvzYuaaVV1jUUeWRAdLGhqu6ZAAQobfzM+gO7wyZ+4ngwoPn55U0kCyO zHe5I4oTECN40UF2uQ6jJKDH/bpZQUIVgMn+8= Received: by 10.204.8.13 with SMTP id f13mr11940352bkf.150.1252321197018; Mon, 07 Sep 2009 03:59:57 -0700 (PDT) Received: from azdaja.softwarehood.com ([95.180.33.218]) by mx.google.com with ESMTPS id p9sm6949840fkb.37.2009.09.07.03.59.52 (version=TLSv1/SSLv3 cipher=RC4-MD5); Mon, 07 Sep 2009 03:59:52 -0700 (PDT) Message-ID: <4AA4E7A7.60503@gmail.com> Date: Mon, 07 Sep 2009 12:59:51 +0200 From: Ivan Radovanovic User-Agent: Thunderbird 2.0.0.22 (X11/20090708) MIME-Version: 1.0 To: freebsd-hackers@FreeBSD.org Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: Subject: Kernel 
panic caused by fork X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 07 Sep 2009 10:59:58 -0000

I was testing FreeBSD's behavior when running many threads at the same time (and I find it performs excellently) when I wanted to see how the system behaves towards a program that spawns itself too many times. I wrote a very simple program:

    #include <sys/types.h>
    #include <unistd.h>

    int main() {
        /* fork forever; every child keeps running this same loop */
        while (1)
            fork();
        return 0;
    }

After running this program I got a kernel panic with the message
"get_pv_entry: increase vm.pmap.shpgperproc"

IMHO it is not a very good idea to bring the entire system down if one process misbehaves in this way; it would be much better to kill the offending process and send this message to the system log. I am not sure whether the panic is actually caused by the process forking forever, or by the system trying to create a new process when the maxproc limit has already been reached (the system only prints a warning that the maxproc limit is reached, and it panics only when I try to start a new process, such as ps).

The system is FreeBSD 7.2-STABLE.

kernel backtrace:
(kgdb) bt
#0 doadump () at pcpu.h:196
#1 0xc05fc477 in boot (howto=260) at ../../../kern/kern_shutdown.c:418
#2 0xc05fc782 in panic (fmt=Variable "fmt" is not available.
) at ../../../kern/kern_shutdown.c:574
#3 0xc087bccf in get_pv_entry (pmap=0xca0cb43c, try=0) at ../../../i386/i386/pmap.c:2067
#4 0xc087c0db in pmap_insert_entry (pmap=Variable "pmap" is not available.
) at ../../../i386/i386/pmap.c:2203
#5 0xc087f08e in pmap_enter (pmap=0xca0cb43c, va=671973376, access=1 '\001', m=Variable "m" is not available.
) at ../../../i386/i386/pmap.c:3114 #6 0xc082a947 in vm_fault (map=0xca0cb3b0, vaddr=671973376, fault_type=1 '\001', fault_flags=0) at ../../../vm/vm_fault.c:891 #7 0xc0881acb in trap_pfault (frame=0xefc1bd38, usermode=1, eva=671975739) at ../../../i386/i386/trap.c:828 #8 0xc0882420 in trap (frame=0xefc1bd38) at ../../../i386/i386/trap.c:396 #9 0xc086724b in calltrap () at ../../../i386/i386/exception.s:166 #10 0x280d893b in ?? () Previous frame inner to this frame (corrupt stack?) From owner-freebsd-hackers@FreeBSD.ORG Tue Sep 8 09:09:17 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id ED459106568D for ; Tue, 8 Sep 2009 09:09:17 +0000 (UTC) (envelope-from rivanr@gmail.com) Received: from mail-bw0-f206.google.com (mail-bw0-f206.google.com [209.85.218.206]) by mx1.freebsd.org (Postfix) with ESMTP id 76C698FC27 for ; Tue, 8 Sep 2009 09:09:17 +0000 (UTC) Received: by bwz2 with SMTP id 2so796133bwz.43 for ; Tue, 08 Sep 2009 02:09:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from :user-agent:mime-version:to:cc:subject:references:in-reply-to :content-type:content-transfer-encoding; bh=hxTcQgRU1X7ndqKzHO671baC+88AFv9jCbHhJqyRr/s=; b=SbLurKoJcbcEHPaDzbc2LnfxfxdqXlXputOU02jpXXNfMTuAHpNocLQpnouLKEzIkq d8MuLZjha1TT9IMe1UgHtvNexVT0zJD32nnsJBKR7E+Nrl4Pt7hByySpU2yeXCuSotY/ 9ARa0CrYN8oxNHHM3YmBRGw67FmFkyIkS7mgg= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; b=u6oOSazMB20bE1RH+uEByZWpkpKRZQW6Jp86hUl+ZzvvoHz/D2mPTKxlHSxH8NXsdU CtQUqv3weVpvS9jeMzsSiwTbbgoaoIPuut3auanxasRbJz5W+B9qDAVuiEA9rnIZX2Ld w0V8Q1NNcxqzXnnO3L5JmTl5ZjwiGrf0PF8kk= Received: by 10.103.78.35 with SMTP id f35mr6526804mul.89.1252400956551; Tue, 08 Sep 
2009 02:09:16 -0700 (PDT) Received: from azdaja.softwarehood.com ([95.180.33.218]) by mx.google.com with ESMTPS id w5sm170879mue.4.2009.09.08.02.09.15 (version=TLSv1/SSLv3 cipher=RC4-MD5); Tue, 08 Sep 2009 02:09:16 -0700 (PDT) Message-ID: <4AA61F3A.3040802@gmail.com> Date: Tue, 08 Sep 2009 11:09:14 +0200 From: Ivan Radovanovic User-Agent: Thunderbird 2.0.0.22 (X11/20090708) MIME-Version: 1.0 To: Jan Mikkelsen References: <4AA4E7A7.60503@gmail.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-hackers@FreeBSD.org Subject: Re: Kernel panic caused by fork X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 08 Sep 2009 09:09:18 -0000

Jan Mikkelsen wrote:
> A quick observation: This is not "one process misbehaving", it is a
> large number of processes misbehaving. From an administrative point
> of view, I think the response is "call setrlimit(RLIMIT_NPROC, ...)",
> otherwise the expected behaviour is for your machine to stop making
> forward progress.
>
> Having said that, I agree that panics are bad and it would be nice if
> fork() returned EAGAIN, again and again and again. Or perhaps the
> machine should just panic ...

From the fork(2) manual page, about errors:

[EAGAIN] The system-imposed limit on the total number of processes under execution would be exceeded. The limit is given by the sysctl(3) MIB variable KERN_MAXPROC. (The limit is actually ten less than this except for the super user).
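The limit named in the fork(2) excerpt can also be read from a program. The following is only a minimal sketch, assuming a FreeBSD system where CTL_KERN and KERN_MAXPROC are available through sysctl(3); on other systems the query simply reports failure. The function names system_maxproc and unprivileged_maxproc are made up for this illustration, and the "ten less" rule is taken from the manual page text quoted above, not checked against the kernel source.

```c
#include <stddef.h>
#include <sys/types.h>
#ifdef __FreeBSD__
#include <sys/sysctl.h>
#endif

/* Hypothetical helper: system-wide process limit, or -1 if unavailable. */
long system_maxproc(void) {
#ifdef __FreeBSD__
    int mib[2] = { CTL_KERN, KERN_MAXPROC };  /* same value as kern.maxproc */
    int maxproc = 0;
    size_t len = sizeof(maxproc);

    if (sysctl(mib, 2, &maxproc, &len, NULL, 0) == -1)
        return -1;
    return maxproc;
#else
    return -1;  /* assumption: no KERN_MAXPROC MIB on this platform */
#endif
}

/* Hypothetical helper: slots open to non-root users, per the fork(2) text
 * ("ten less than this except for the super user"). */
long unprivileged_maxproc(long maxproc) {
    return maxproc > 10 ? maxproc - 10 : 0;
}
```

With the kern.maxproc value of 6164 reported in this thread, unprivileged_maxproc(6164) gives 6154.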
It seems the idea is to leave room for 10 more processes so that root can kill the offending process. The limits on my system (I am running a pretty much generic kernel) are

kern.maxproc: 6164
kern.maxprocperuid: 5547

so if only two users are active on the system at the same time (as was the case when I did this testing), there is room for more than 500 processes (6164 - 5547 = 617) after one user hits his limit. It shouldn't panic, I think.

Regards,
Ivan

From owner-freebsd-hackers@FreeBSD.ORG Tue Sep 8 09:19:57 2009 Return-Path: Delivered-To: freebsd-hackers@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AB1E6106566B for ; Tue, 8 Sep 2009 09:19:56 +0000 (UTC) (envelope-from janm-freebsd-hackers@transactionware.com) Received: from mail.transactionware.com (mail.transactionware.com [203.14.245.7]) by mx1.freebsd.org (Postfix) with SMTP id EA38B8FC16 for ; Tue, 8 Sep 2009 09:19:55 +0000 (UTC) Received: (qmail 20344 invoked from network); 8 Sep 2009 08:53:23 -0000 Received: from midgard.transactionware.com (192.168.1.55) by dm.transactionware.com with SMTP; 8 Sep 2009 08:53:23 -0000 Received: (qmail 24315 invoked by uid 907); 8 Sep 2009 08:53:13 -0000 Received: from jmmacpro.transactionware.com (HELO jmmacpro.transactionware.com) (192.168.1.33) by midgard.transactionware.com (qpsmtpd/0.82) with ESMTP; Tue, 08 Sep 2009 18:53:13 +1000 Mime-Version: 1.0 (Apple Message framework v1075.2) Content-Type: text/plain; charset=us-ascii; format=flowed; delsp=yes From: Jan Mikkelsen In-Reply-To: <4AA4E7A7.60503@gmail.com> Date: Tue, 8 Sep 2009 18:53:13 +1000 Content-Transfer-Encoding: 7bit Message-Id: References: <4AA4E7A7.60503@gmail.com> To: Ivan Radovanovic X-Mailer: Apple Mail (2.1075.2) Cc: freebsd-hackers@FreeBSD.org Subject: Re: Kernel panic caused by fork X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: 
List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 08 Sep 2009 09:19:57 -0000 Hi, On 07/09/2009, at 8:59 PM, Ivan Radovanovic wrote: ... > After running this program I got kernel panic with message > "get_pv_entry: increase vm.pmap.shpgperproc" > IMHO it is not very good idea to bring entire system down if one > process misbehaves in this way, it is maybe much better to kill > offending process and to send this message to system log. I am not > sure whether the panic is actually caused by process forking forever > or when the system tries to create new process when maxproc limit is > already reached (since system is only printing warning message that > maxproc limit is reached and it only panics when I try to start new > process (like ps)). A quick observation: This is not "one process misbehaving", it is a large number of processes misbehaving. From an administrative point of view, I think the response is "call setrlimit(RLIMIT_NPROC, ...)", otherwise the expected behaviour is for your machine to stop making forward progress. Having said that, I agree that panics are bad and it would be nice if fork() returned EAGAIN, again and again and again. Or perhaps the machine should just panic ... Regards, Jan. 
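The setrlimit(RLIMIT_NPROC, ...) suggestion, together with fork() returning EAGAIN, can be sketched roughly as follows. This is an illustration only, not code from the thread: spawn_bounded and MAX_TRACKED are invented names, the children exit immediately, and the loop is hard-capped so the sketch cannot turn into a fork bomb itself. Note that privileged processes may be exempt from RLIMIT_NPROC on some systems, in which case no fork() here will fail.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_TRACKED 64  /* hard cap so the demo can never become a fork bomb */

/*
 * Hypothetical helper: lower the soft RLIMIT_NPROC to nproc_limit, try to
 * fork up to max_children short-lived children, reap them, and restore the
 * old limit. Returns the number of children created, or -1 on error.
 */
int spawn_bounded(rlim_t nproc_limit, int max_children) {
    struct rlimit old, lim;
    pid_t pids[MAX_TRACKED];
    int created = 0;

    if (max_children > MAX_TRACKED)
        max_children = MAX_TRACKED;
    if (getrlimit(RLIMIT_NPROC, &old) == -1)
        return -1;

    /* The soft limit may not exceed the hard limit. */
    lim = old;
    lim.rlim_cur = nproc_limit < old.rlim_max ? nproc_limit : old.rlim_max;
    if (setrlimit(RLIMIT_NPROC, &lim) == -1)
        return -1;

    for (int i = 0; i < max_children; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);               /* child: do nothing and leave */
        if (pid == -1) {            /* typically EAGAIN once the limit is hit */
            fprintf(stderr, "fork: %s\n", strerror(errno));
            break;
        }
        pids[created++] = pid;
    }

    for (int i = 0; i < created; i++)
        waitpid(pids[i], NULL, 0);  /* reap only the children we started */
    setrlimit(RLIMIT_NPROC, &old);  /* put the previous limit back */
    return created;
}
```

Calling spawn_bounded(1, 8) as an ordinary user who already owns at least one process would normally report a failed fork() straight away and return 0; a process that is exempt from the limit would create all eight children.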
From owner-freebsd-hackers@FreeBSD.ORG Tue Sep 8 10:49:01 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0D04C106566B for ; Tue, 8 Sep 2009 10:49:01 +0000 (UTC) (envelope-from crquan@gmail.com) Received: from mail-vw0-f189.google.com (mail-vw0-f189.google.com [209.85.212.189]) by mx1.freebsd.org (Postfix) with ESMTP id B66378FC12 for ; Tue, 8 Sep 2009 10:49:00 +0000 (UTC) Received: by vws27 with SMTP id 27so2164930vws.3 for ; Tue, 08 Sep 2009 03:48:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=szK4piwxZJOSD7eJUZr4kFjPU0HWd8qRGAGMkWcrb/o=; b=VDnvYXDZm2rdgiUdRDJ9HTRPR66aF9ZZj98USIh6oyQ+LIA58gp+tNCQeggjEOIJlk L3cB7SXdZ0djPSidE4nEtmvqYlx/a4Ov0TWqeYTyDK6cwjWttiWSfkYylumolYQcQrZb loG96ZVObKetxHpOx49Uq3nM7m/tFtAOOweRY= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=JjMqDL05wZvRrDva1EMOuNoPykUAHYsIpYy2mXb7S7plv28g/KBAj0KOQwq6CTimP+ IyZqXWOj3FYrRJpFxyalduEbUDO82/cU0IUSgLsloBc+KuRzjAwIquSxvh9Y8NlHlVFx I1IAuxpXlt+zkd1hDCY7dXeRjcB7x5HhUYE6E= MIME-Version: 1.0 Received: by 10.220.111.80 with SMTP id r16mr14808551vcp.76.1252405323791; Tue, 08 Sep 2009 03:22:03 -0700 (PDT) In-Reply-To: <4AA4E7A7.60503@gmail.com> References: <4AA4E7A7.60503@gmail.com> Date: Tue, 8 Sep 2009 18:22:03 +0800 Message-ID: <91b13c310909080322s21e0fb02o423434206e5f96f6@mail.gmail.com> From: Cheng Renquan To: Ivan Radovanovic Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: freebsd-hackers@freebsd.org Subject: Re: Kernel panic caused by fork X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: 
list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 08 Sep 2009 10:49:01 -0000

On Mon, Sep 7, 2009 at 6:59 PM, Ivan Radovanovic wrote:
> I was testing FreeBSD's behavior when running many threads at the same time
> (and I find it performs excellent) when I wanted to test how system will
> behave towards program that spawns itself too many times. I wrote a very
> simple program
>
> #include <sys/types.h>
> #include <unistd.h>
>
> int main() {
>   while(1)
>     fork();
>   return 0;
> }
>
> After running this program I got kernel panic with message
> "get_pv_entry: increase vm.pmap.shpgperproc"
> IMHO it is not very good idea to bring entire system down if one process
> misbehaves in this way, it is maybe much better to kill offending process
> and to send this message to system log. I am not sure whether the panic is
> actually caused by process forking forever or when the system tries to
> create new process when maxproc limit is already reached (since system is
> only printing warning message that maxproc limit is reached and it only
> panics when I try to start new process (like ps)).
> System is FreeBSD 7.2-STABLE

It's just the "fork bomb" problem, all operating system kernels cannot deal with it well,

http://en.wikipedia.org/wiki/Fork_bomb

And it's really a system administration problem rather than a kernel problem,

-- 
Cheng Renquan (程任全), from Shenzhen, China

From owner-freebsd-hackers@FreeBSD.ORG Tue Sep 8 11:12:31 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9CE8B1065692 for ; Tue, 8 Sep 2009 11:12:31 +0000 (UTC) (envelope-from joachim.kuebart@gmx.net) Received: from mail.gmx.net (mail.gmx.net [213.165.64.20]) by mx1.freebsd.org (Postfix) with SMTP id E998B8FC0C for ; Tue, 8 Sep 2009 11:12:30 +0000 (UTC) Received: (qmail invoked by alias); 08 Sep 2009 10:45:47 -0000 Received: from cpc2-oxfd10-0-0-cust569.oxfd.cable.ntl.com (EHLO localhost.localdomain) [81.110.34.58] by mail.gmx.net (mp023) with SMTP; 08 Sep 2009 12:45:47 +0200 X-Authenticated: #31053830 X-Provags-ID: V01U2FsdGVkX18EDvR4tg8ESiTFil7raplmwee9qQCr8Epi6XruYd NmbfAGbmsP1HiO From: Joachim Kuebart To: freebsd-hackers@freebsd.org Content-Type: text/plain Date: Tue, 08 Sep 2009 11:45:45 +0100 Message-Id: <1252406745.778.22.camel@yacht> Mime-Version: 1.0 X-Mailer: Evolution 2.26.3nb1 Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 X-FuHaFi: 0.71 Subject: License change X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 08 Sep 2009 11:12:31 -0000

Hi,

much to my embarrassment, I noticed recently that there is a file authored by me using the 4-clause BSD license in the FreeBSD tree.
The file src/sys/dev/sound/pci/es137x.c uses the 4-clause BSD license while the accompanying .h file uses a kind of 3-clause BSD license that I apparently made up at the time. I would like to change the license of es137x.c to the 3-clause BSD license. Unfortunately I cannot prove that I'm in fact the original author because the e-mail address given in the file is no longer active. If this means that the license cannot be changed anymore, that's unfortunate, but I guess it's the way it has to be... Best regards, Joachim From owner-freebsd-hackers@FreeBSD.ORG Tue Sep 8 16:24:38 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 949C9106568F for ; Tue, 8 Sep 2009 16:24:38 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outR.internet-mail-service.net (outr.internet-mail-service.net [216.240.47.241]) by mx1.freebsd.org (Postfix) with ESMTP id 54DDB8FC1B for ; Tue, 8 Sep 2009 16:24:38 +0000 (UTC) Received: from idiom.com (mx0.idiom.com [216.240.32.160]) by out.internet-mail-service.net (Postfix) with ESMTP id 68924B3F80; Tue, 8 Sep 2009 09:24:38 -0700 (PDT) X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e Received: from julian-mac.elischer.org (home.elischer.org [216.240.48.38]) by idiom.com (Postfix) with ESMTP id 8CF042D6010; Tue, 8 Sep 2009 09:24:37 -0700 (PDT) Message-ID: <4AA68544.8050102@elischer.org> Date: Tue, 08 Sep 2009 09:24:36 -0700 From: Julian Elischer User-Agent: Thunderbird 2.0.0.23 (Macintosh/20090812) MIME-Version: 1.0 To: Cheng Renquan References: <4AA4E7A7.60503@gmail.com> <91b13c310909080322s21e0fb02o423434206e5f96f6@mail.gmail.com> In-Reply-To: <91b13c310909080322s21e0fb02o423434206e5f96f6@mail.gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-hackers@freebsd.org, Ivan Radovanovic Subject: Re: Kernel 
panic caused by fork X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 08 Sep 2009 16:24:38 -0000

Cheng Renquan wrote:
> On Mon, Sep 7, 2009 at 6:59 PM, Ivan Radovanovic wrote:
>> I was testing FreeBSD's behavior when running many threads at the same time
>> (and I find it performs excellent) when I wanted to test how system will
>> behave towards program that spawns itself too many times. I wrote a very
>> simple program
>>
>> #include <sys/types.h>
>> #include <unistd.h>
>>
>> int main() {
>>   while(1)
>>     fork();
>>   return 0;
>> }
>>
>> After running this program I got kernel panic with message
>> "get_pv_entry: increase vm.pmap.shpgperproc"
>> IMHO it is not very good idea to bring entire system down if one process
>> misbehaves in this way, it is maybe much better to kill offending process
>> and to send this message to system log. I am not sure whether the panic is
>> actually caused by process forking forever or when the system tries to
>> create new process when maxproc limit is already reached (since system is
>> only printing warning message that maxproc limit is reached and it only
>> panics when I try to start new process (like ps)).
>> System is FreeBSD 7.2-STABLE
>
> It's just the "fork bomb" problem, all operating system kernels cannot
> deal with it well,
>
> http://en.wikipedia.org/wiki/Fork_bomb

It's more a tuning problem, I think. The system should tune itself so that MAXPROC is hit before critical resources are exhausted. Having said that, there are a lot of resources that need to be watched.
> > And it's really a system administration problem rather than a kernel problem, > From owner-freebsd-hackers@FreeBSD.ORG Tue Sep 8 16:42:05 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0621E1065693 for ; Tue, 8 Sep 2009 16:42:05 +0000 (UTC) (envelope-from rivanr@gmail.com) Received: from mail-fx0-f210.google.com (mail-fx0-f210.google.com [209.85.220.210]) by mx1.freebsd.org (Postfix) with ESMTP id 8937F8FC19 for ; Tue, 8 Sep 2009 16:42:04 +0000 (UTC) Received: by fxm6 with SMTP id 6so2654763fxm.43 for ; Tue, 08 Sep 2009 09:42:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from :user-agent:mime-version:to:cc:subject:references:in-reply-to :content-type:content-transfer-encoding; bh=KAAXQlCGfvZBk5/xQSnFbGnJGbnW2VRqjAgYP3vPNp4=; b=TXye/zWSz+ixau724D8GUpclawstnCCDQuiTGetPwFOQdjRMmRXQ8TNC+FCUMuP7Az 5ysGFOyg3tUTl4JmCGE0yBv+nNr+a7RWEAi+zn1V5gkxnFijyuzWKC6SHW29VebhK54o 8PLaL8VpHrptIvHj7iaRPkkBISc9inMRaptNE= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; b=dWL/g8dwoc8yvUjGwVZBwAYQmPctSqPHYI8MJ1qT1cqKko/0gul1bruDtPiG7nBDum fZLkTuYJc+zU/xbMhNyExxo4EEApZxbfpVl0zeesnYpM1BzgBVzB2ktx/a5JX/WilFX5 L3lIuqJm1bvsf/OvTb0w+Du8Vva+mVQXdvs0o= Received: by 10.102.14.4 with SMTP id 4mr6747394mun.2.1252428123402; Tue, 08 Sep 2009 09:42:03 -0700 (PDT) Received: from azdaja.softwarehood.com ([95.180.33.218]) by mx.google.com with ESMTPS id i7sm208783mue.48.2009.09.08.09.42.02 (version=TLSv1/SSLv3 cipher=RC4-MD5); Tue, 08 Sep 2009 09:42:02 -0700 (PDT) Message-ID: <4AA68959.6000808@gmail.com> Date: Tue, 08 Sep 2009 18:42:01 +0200 From: Ivan Radovanovic User-Agent: Thunderbird 2.0.0.22 (X11/20090708) MIME-Version: 1.0 To: 
Julian Elischer References: <4AA4E7A7.60503@gmail.com> <91b13c310909080322s21e0fb02o423434206e5f96f6@mail.gmail.com> <4AA68544.8050102@elischer.org> In-Reply-To: <4AA68544.8050102@elischer.org> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-hackers@freebsd.org, Cheng Renquan Subject: Re: Kernel panic caused by fork X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 08 Sep 2009 16:42:05 -0000

Julian Elischer wrote:
> Cheng Renquan wrote:
>> On Mon, Sep 7, 2009 at 6:59 PM, Ivan Radovanovic wrote:
>>> I was testing FreeBSD's behavior when running many threads at the same time
>>> (and I find it performs excellent) when I wanted to test how system will
>>> behave towards program that spawns itself too many times. I wrote a very
>>> simple program
>> It's just the "fork bomb" problem, all operating system kernels cannot
>> deal with it well,
>>
>> http://en.wikipedia.org/wiki/Fork_bomb
> It's more a tuning problem I think. The system should tune itself so
> that MAXPROC is hit before critical resources are exhausted I think.
> Having said that, there are a lot of resources that need to be watched.

After reading this nice article on Wikipedia and learning about that bash one-liner, I wanted to check whether it really works, but I didn't want to bring the system down again (and create a crash dump and so on). So I limited the number of processes for a single user by running

sysctl kern.maxprocperuid=1000

as root, and after that I started bash and typed

:(){ :|:& };:

as a normal user.

The first thing to notice: there were more than 4000 spawned bash processes (why, if I set the limit to 1000 per user ID?). However, the system didn't crash, and I was eventually able to recover with

/bin/kill -9 -- -1234

(1234 being the process group ID of the bash process).

Regards,
Ivan

From owner-freebsd-hackers@FreeBSD.ORG Tue Sep 8 21:01:49 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3CEB4106568F for ; Tue, 8 Sep 2009 21:01:49 +0000 (UTC) (envelope-from freebsd-hackers@m.gmane.org) Received: from lo.gmane.org (lo.gmane.org [80.91.229.12]) by mx1.freebsd.org (Postfix) with ESMTP id BDFCE8FC13 for ; Tue, 8 Sep 2009 21:01:48 +0000 (UTC) Received: from list by lo.gmane.org with local (Exim 4.50) id 1Ml7ox-00056T-Fz for freebsd-hackers@freebsd.org; Tue, 08 Sep 2009 23:01:47 +0200 Received: from 93-138-19-116.adsl.net.t-com.hr ([93.138.19.116]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Tue, 08 Sep 2009 23:01:47 +0200 Received: from ivoras by 93-138-19-116.adsl.net.t-com.hr with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Tue, 08 Sep 2009 23:01:47 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-hackers@freebsd.org From: Ivan Voras Date: Tue, 08 Sep 2009 23:00:58 +0200 Lines: 54 Message-ID: References: <4AA4E7A7.60503@gmail.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enigF560D7BCDCFD4CBFF39C050C" X-Complaints-To: usenet@ger.gmane.org 
X-Gmane-NNTP-Posting-Host: 93-138-19-116.adsl.net.t-com.hr User-Agent: Thunderbird 2.0.0.23 (Windows/20090812) In-Reply-To: <4AA4E7A7.60503@gmail.com> X-Enigmail-Version: 0.96.0 Sender: news Subject: Re: Kernel panic caused by fork X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 08 Sep 2009 21:01:49 -0000

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enigF560D7BCDCFD4CBFF39C050C Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

Ivan Radovanovic wrote:
> I was testing FreeBSD's behavior when running many threads at the same
> time (and I find it performs excellent) when I wanted to test how system
> will behave towards program that spawns itself too many times. I wrote a
> very simple program
>
> #include <sys/types.h>
> #include <unistd.h>
>
> int main() {
>   while(1)
>     fork();
>   return 0;
> }

A simple fork bomb. Hmm, it shouldn't just crash, and if it does crash it's a regression. I've "tested" fork bombs on 7-STABLE and early 8-CURRENT and they were behaving as expected: they stopped at the maxproc limit.

I don't currently have a spare 7.x-STABLE machine, but I have just run it on an 8-BETA2 one and the maxproc limit still works, though as expected the console is almost unusable for anything except switching (i.e. processes don't get to receive input very often). A lot of them are in the "locked" state with "*vm ob" as the state/channel name. I couldn't clean the fork bomb off the system with "killall" as root.

Can you describe your machine? Mine is an Atom-based (slow) netbook with 1 GB RAM.
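On cleaning up: Ivan Radovanovic's message in this thread recovered with /bin/kill -9 -- -<pgid>, i.e. by signalling a whole process group at once. The same thing can be done from C with kill(2) and a negative pid. What follows is only a small, self-contained sketch of that idea, not a recovery tool: kill_group_demo is an invented name, the group it kills is one it creates itself, and the number of idle children is capped at 16.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: start a leader in its own process group, let it fork
 * a few idle children, then kill the whole group with a single signal.
 * Returns the signal that terminated the leader, or -1 on error.
 */
int kill_group_demo(int nchildren) {
    int ready[2];
    char c;

    if (nchildren > 16)
        nchildren = 16;             /* keep the demo small */
    if (pipe(ready) == -1)
        return -1;

    pid_t leader = fork();
    if (leader == -1) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }
    if (leader == 0) {
        if (setpgid(0, 0) == -1)    /* become leader of a new group */
            _exit(1);
        close(ready[0]);
        for (int i = 0; i < nchildren; i++) {
            if (fork() == 0) {      /* children inherit the new group */
                close(ready[1]);
                for (;;)
                    pause();
            }
        }
        if (write(ready[1], "x", 1) != 1)   /* tell the parent we are set up */
            _exit(1);
        for (;;)
            pause();
    }

    close(ready[1]);
    ssize_t got = read(ready[0], &c, 1);    /* wait until the group exists */
    close(ready[0]);

    /* Negative pid: signal every member of process group `leader`,
     * the C equivalent of `/bin/kill -9 -- -<pgid>`. */
    int rc = kill(-leader, SIGKILL);

    int status = 0;
    if (waitpid(leader, &status, 0) == -1 || rc == -1 || got != 1)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

A call such as kill_group_demo(3) is expected to return SIGKILL when the group was created and killed normally. Whether this approach helps against a real fork bomb depends on all offending processes still sharing one process group, which is an assumption here.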
--------------enigF560D7BCDCFD4CBFF39C050C Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkqmxhAACgkQldnAQVacBcjzUwCfeBvJ/Kd6zFakn6qP9BNBH9TS 1i4An09wFsbLJ7vgoyQjZ4n+sx6oBGZG =uppB -----END PGP SIGNATURE----- --------------enigF560D7BCDCFD4CBFF39C050C-- From owner-freebsd-hackers@FreeBSD.ORG Wed Sep 9 17:01:34 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 95A6F1065676 for ; Wed, 9 Sep 2009 17:01:34 +0000 (UTC) (envelope-from a_best01@uni-muenster.de) Received: from zivm-out3.uni-muenster.de (ZIVM-OUT3.UNI-MUENSTER.DE [128.176.192.18]) by mx1.freebsd.org (Postfix) with ESMTP id BE95E8FC1E for ; Wed, 9 Sep 2009 17:01:33 +0000 (UTC) X-IronPort-AV: E=Sophos;i="4.44,359,1249250400"; d="scan'208";a="12815829" Received: from zivmaildisp2.uni-muenster.de (HELO ZIVMAILUSER04.UNI-MUENSTER.DE) ([128.176.188.143]) by zivm-relay3.uni-muenster.de with ESMTP; 09 Sep 2009 19:01:31 +0200 Received: by ZIVMAILUSER04.UNI-MUENSTER.DE (Postfix, from userid 149459) id CE9971B0096; Wed, 9 Sep 2009 19:01:31 +0200 (CEST) Date: Wed, 09 Sep 2009 19:01:31 +0200 (CEST) From: Alexander Best Sender: Organization: Westfaelische Wilhelms-Universitaet Muenster To: Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: Subject: Buffer overflow detected by REDZONE with linuxulator X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 09 Sep 2009 17:01:34 -0000 hi there, i've installed 
emulators/linux_dist-gentoo-stage3 and grabbed a snapshot from the ltp git repository (http://ltp.sourceforge.net/). as expected some tests failed because i'm using compat.linux.osrelease: 2.6.16 which is still missing a few linux syscalls, ipcs and ioctls. however i also noticed REDZONE reporting buffer overflows. i'm only a user and not a developer so i don't know if the ltp is to be blamed or if the problem lies within the linuxulator. i'm running 9.0-CURRENT (r196879). as i mentioned before i'm using 2.6 linux kernel emulation. here are the buffer overflow reports: Sep 9 14:12:42 otaku kernel: REDZONE: Buffer overflow detected. 9 bytes corrupted after 0xcc28c483 (3 bytes allocated). Sep 9 14:12:42 otaku kernel: Allocation backtrace: Sep 9 14:12:42 otaku kernel: #0 0xc0709aaa at redzone_setup+0x3a Sep 9 14:12:42 otaku kernel: #1 0xc05bc673 at malloc+0x1c3 Sep 9 14:12:42 otaku kernel: #2 0xc07428b8 at linux_getsockaddr+0x48 Sep 9 14:12:42 otaku kernel: #3 0xc0742eb8 at linux_socketcall+0x178 Sep 9 14:12:42 otaku kernel: #4 0xc0772f56 at syscall+0x2a6 Sep 9 14:12:42 otaku kernel: #5 0xc07568b0 at Xint0x80_syscall+0x20 Sep 9 14:12:42 otaku kernel: Free backtrace: Sep 9 14:12:42 otaku kernel: #0 0xc0709a3a at redzone_check+0x17a Sep 9 14:12:42 otaku kernel: #1 0xc05bc32d at free+0x5d Sep 9 14:12:42 otaku kernel: #2 0xc0742ef0 at linux_socketcall+0x1b0 Sep 9 14:12:42 otaku kernel: #3 0xc0772f56 at syscall+0x2a6 Sep 9 14:12:42 otaku kernel: #4 0xc07568b0 at Xint0x80_syscall+0x20 Sep 9 14:20:08 otaku kernel: REDZONE: Buffer overflow detected. 4 bytes corrupted after 0xcc2538ea (106 bytes allocated). 
Sep 9 14:20:08 otaku kernel: Allocation backtrace:
Sep 9 14:20:08 otaku kernel: #0 0xc0709aaa at redzone_setup+0x3a
Sep 9 14:20:08 otaku kernel: #1 0xc05bc673 at malloc+0x1c3
Sep 9 14:20:08 otaku kernel: #2 0xc063a902 at unp_connect+0x162
Sep 9 14:20:08 otaku kernel: #3 0xc063d6c9 at uipc_connect+0x49
Sep 9 14:20:08 otaku kernel: #4 0xc062fde2 at soconnect+0x52
Sep 9 14:20:08 otaku kernel: #5 0xc0638eb6 at kern_connect+0x96
Sep 9 14:20:08 otaku kernel: #6 0xc0742c7b at linux_connect+0x3b
Sep 9 14:20:08 otaku kernel: #7 0xc0742f22 at linux_socketcall+0x1e2
Sep 9 14:20:08 otaku kernel: #8 0xc0772f56 at syscall+0x2a6
Sep 9 14:20:08 otaku kernel: #9 0xc07568b0 at Xint0x80_syscall+0x20
Sep 9 14:20:08 otaku kernel: Free backtrace:
Sep 9 14:20:08 otaku kernel: #0 0xc0709a3a at redzone_check+0x17a
Sep 9 14:20:08 otaku kernel: #1 0xc05bc32d at free+0x5d
Sep 9 14:20:08 otaku kernel: #2 0xc063bfb2 at uipc_detach+0x242
Sep 9 14:20:08 otaku kernel: #3 0xc0632a7e at sofree+0x22e
Sep 9 14:20:08 otaku kernel: #4 0xc0632f26 at soclose+0x386
Sep 9 14:20:08 otaku kernel: #5 0xc0617c49 at soo_close+0x29
Sep 9 14:20:08 otaku kernel: #6 0xc0598b13 at _fdrop+0x43
Sep 9 14:20:08 otaku kernel: #7 0xc059ab90 at closef+0x290
Sep 9 14:20:08 otaku kernel: #8 0xc059af22 at kern_close+0x102
Sep 9 14:20:08 otaku kernel: #9 0xc059b09a at close+0x1a
Sep 9 14:20:08 otaku kernel: #10 0xc0772f56 at syscall+0x2a6
Sep 9 14:20:08 otaku kernel: #11 0xc07568b0 at Xint0x80_syscall+0x20
Sep 9 14:20:09 otaku kernel: REDZONE: Buffer overflow detected. 4 bytes corrupted after 0xccc653ea (106 bytes allocated).
Sep 9 14:20:09 otaku kernel: Allocation backtrace:
Sep 9 14:20:09 otaku kernel: #0 0xc0709aaa at redzone_setup+0x3a
Sep 9 14:20:09 otaku kernel: #1 0xc05bc673 at malloc+0x1c3
Sep 9 14:20:09 otaku kernel: #2 0xc063a902 at unp_connect+0x162
Sep 9 14:20:09 otaku kernel: #3 0xc063d6c9 at uipc_connect+0x49
Sep 9 14:20:09 otaku kernel: #4 0xc062fde2 at soconnect+0x52
Sep 9 14:20:09 otaku kernel: #5 0xc0638eb6 at kern_connect+0x96
Sep 9 14:20:09 otaku kernel: #6 0xc0742c7b at linux_connect+0x3b
Sep 9 14:20:09 otaku kernel: #7 0xc0742f22 at linux_socketcall+0x1e2
Sep 9 14:20:09 otaku kernel: #8 0xc0772f56 at syscall+0x2a6
Sep 9 14:20:09 otaku kernel: #9 0xc07568b0 at Xint0x80_syscall+0x20
Sep 9 14:20:09 otaku kernel: Free backtrace:
Sep 9 14:20:09 otaku kernel: #0 0xc0709a3a at redzone_check+0x17a
Sep 9 14:20:09 otaku kernel: #1 0xc05bc32d at free+0x5d
Sep 9 14:20:09 otaku kernel: #2 0xc063bfb2 at uipc_detach+0x242
Sep 9 14:20:09 otaku kernel: #3 0xc0632a7e at sofree+0x22e
Sep 9 14:20:09 otaku kernel: #4 0xc0632f26 at soclose+0x386
Sep 9 14:20:09 otaku kernel: #5 0xc0617c49 at soo_close+0x29
Sep 9 14:20:09 otaku kernel: #6 0xc0598b13 at _fdrop+0x43
Sep 9 14:20:09 otaku kernel: #7 0xc059ab90 at closef+0x290
Sep 9 14:20:09 otaku kernel: #8 0xc059af22 at kern_close+0x102
Sep 9 14:20:09 otaku kernel: #9 0xc059b09a at close+0x1a
Sep 9 14:20:09 otaku kernel: #10 0xc0772f56 at syscall+0x2a6
Sep 9 14:20:09 otaku kernel: #11 0xc07568b0 at Xint0x80_syscall+0x20
Sep 9 14:20:09 otaku kernel: REDZONE: Buffer overflow detected. 4 bytes corrupted after 0xcf45a9ea (106 bytes allocated).
Sep 9 14:20:09 otaku kernel: Allocation backtrace:
Sep 9 14:20:09 otaku kernel: #0 0xc0709aaa at redzone_setup+0x3a
Sep 9 14:20:09 otaku kernel: #1 0xc05bc673 at malloc+0x1c3
Sep 9 14:20:09 otaku kernel: #2 0xc063a902 at unp_connect+0x162
Sep 9 14:20:09 otaku kernel: #3 0xc063d6c9 at uipc_connect+0x49
Sep 9 14:20:09 otaku kernel: #4 0xc062fde2 at soconnect+0x52
Sep 9 14:20:09 otaku kernel: #5 0xc0638eb6 at kern_connect+0x96
Sep 9 14:20:09 otaku kernel: #6 0xc0742c7b at linux_connect+0x3b
Sep 9 14:20:09 otaku kernel: #7 0xc0742f22 at linux_socketcall+0x1e2
Sep 9 14:20:09 otaku kernel: #8 0xc0772f56 at syscall+0x2a6
Sep 9 14:20:09 otaku kernel: #9 0xc07568b0 at Xint0x80_syscall+0x20
Sep 9 14:20:09 otaku kernel: Free backtrace:
Sep 9 14:20:09 otaku kernel: #0 0xc0709a3a at redzone_check+0x17a
Sep 9 14:20:09 otaku kernel: #1 0xc05bc32d at free+0x5d
Sep 9 14:20:09 otaku kernel: #2 0xc063bfb2 at uipc_detach+0x242
Sep 9 14:20:09 otaku kernel: #3 0xc0632a7e at sofree+0x22e
Sep 9 14:20:09 otaku kernel: #4 0xc0632f26 at soclose+0x386
Sep 9 14:20:09 otaku kernel: #5 0xc0617c49 at soo_close+0x29
Sep 9 14:20:09 otaku kernel: #6 0xc0598b13 at _fdrop+0x43
Sep 9 14:20:09 otaku kernel: #7 0xc059ab90 at closef+0x290
Sep 9 14:20:09 otaku kernel: #8 0xc059b55a at fdfree+0x3ea
Sep 9 14:20:09 otaku kernel: #9 0xc05a57b3 at exit1+0x513
Sep 9 14:20:09 otaku kernel: #10 0xc05d17f4 at sigexit+0xa14
Sep 9 14:20:09 otaku kernel: #11 0xc05d19fd at postsig+0x1dd
Sep 9 14:20:09 otaku kernel: #12 0xc0608fca at ast+0x35a
Sep 9 14:20:09 otaku kernel: #13 0xc0757174 at doreti_ast+0x17
cheers.
alex From owner-freebsd-hackers@FreeBSD.ORG Thu Sep 10 06:55:57 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 77FB1106566B for ; Thu, 10 Sep 2009 06:55:57 +0000 (UTC) (envelope-from guomingyan@gmail.com) Received: from mail-ew0-f208.google.com (mail-ew0-f208.google.com [209.85.219.208]) by mx1.freebsd.org (Postfix) with ESMTP id 0F45A8FC15 for ; Thu, 10 Sep 2009 06:55:56 +0000 (UTC) Received: by ewy4 with SMTP id 4so46332ewy.36 for ; Wed, 09 Sep 2009 23:55:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:date:message-id:subject :from:to:cc:content-type; bh=/iep6HpKpJcqErenzwyaOadoxp17hjbJkHPLZhI57PU=; b=Yd/Am6pxWv8xeQwhEoKnKO0oDX61OakPa4r1QpKBgZ86s0YR7dnIJJZNlqd+1sRVqG a04VmQdYw+C/64rOdI9HFqgy55P/P5yArD1mzDEam80iUZtoEa/0OBwl2JvX1sm260sb k9vGm8BJAX+UH/L0vxLPlXyVLQegaePYjrQm8= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:date:message-id:subject:from:to:cc:content-type; b=LrHvrh9UOcDwCUSlYCO8kP1MeBPlCn+19K9efzhExAgFJRa2mhA5QG5AA++9Vx3AkQ TOdz4uyN0hiLoFWsf8wUE3Lfqju1GdAx0EKASbYveGdTJdx5cFSulriA7UDvgMsbfGUE Tx4pY40mfsbnO8v8/dSq+dQTFeGK/TlHvtu+I= MIME-Version: 1.0 Received: by 10.210.9.5 with SMTP id 5mr432788ebi.78.1252564013540; Wed, 09 Sep 2009 23:26:53 -0700 (PDT) Date: Wed, 9 Sep 2009 23:26:53 -0700 Message-ID: <1fa17f810909092326l1271df94t1dea5ac9d5deba1b@mail.gmail.com> From: MingyanGuo To: freebsd-hackers@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Cc: LI Xin Subject: How to prevent other CPU from accessing a set of pages before calling pmap_remove_all function X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , X-List-Received-Date: Thu, 10 Sep 2009 06:55:57 -0000 Hi all, I find that function pmap_remove_all for arch amd64 works with a time window between reading & clearing the PTE flags(access flag and dirty flag) and invalidating its TLB entry on other CPU. After some discussion with Li Xin(cced), I think all the processes that are using the PTE being removed should be blocked before calling pmap_remove_all, or other CPU may dirty the page but does not set the dirty flag before the TLB entry is flushed. But I can not find how to block them to call the function. I read the function vm_pageout_scan in file vm/vm_pageout.c but can not find the exact method it used. Or I just misunderstood the semantics of function pmap_remove_all ? Thanks in advance. Regards, MingyanGuo From owner-freebsd-hackers@FreeBSD.ORG Thu Sep 10 06:57:25 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BA420106568B for ; Thu, 10 Sep 2009 06:57:25 +0000 (UTC) (envelope-from guomingyan@gmail.com) Received: from mail-ew0-f208.google.com (mail-ew0-f208.google.com [209.85.219.208]) by mx1.freebsd.org (Postfix) with ESMTP id 4735F8FC0A for ; Thu, 10 Sep 2009 06:57:25 +0000 (UTC) Received: by ewy4 with SMTP id 4so47106ewy.36 for ; Wed, 09 Sep 2009 23:57:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type; bh=Db1QftQ0KNHlMqva0vqufJQ8othqJc8dYGIiu2VPh8E=; b=VxbpdtfzHib8DGridxuqW2L/r0KecgGW5iolzuDcQ/eNaPxVJB8CvSXoJKIWphKK77 RoBuVpVR2JyqeQPyfZEolvQxQ1RclYYsQg4kX793DlrBRctPYFyBrVch4NWGFjLtKVIu awV+IuBkxaMMscHnzrGn+FtdNNHJnkUhGv3BU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; 
b=ssmvhIECriRTZQTmIJ2aXm4f/tOJM8sG6NN1DrTXSqqox9NRspj5yb+o+yi0zWd32j WCLQl7pl6Dy7vJmPnlBIY6TusOoP1b23TdhSNu35EewgRxA5/2siAGfOlDqee0fxe5uL GaKiH2p9fY77luVelIfWDw0iEHDoPDWwd7LxM= MIME-Version: 1.0 Received: by 10.211.172.8 with SMTP id z8mr1256202ebo.92.1252565844410; Wed, 09 Sep 2009 23:57:24 -0700 (PDT) In-Reply-To: <1fa17f810909092326l1271df94t1dea5ac9d5deba1b@mail.gmail.com> References: <1fa17f810909092326l1271df94t1dea5ac9d5deba1b@mail.gmail.com> Date: Wed, 9 Sep 2009 23:57:24 -0700 Message-ID: <1fa17f810909092357x8625182q970f8fb6aa76e7a9@mail.gmail.com> From: MingyanGuo To: freebsd-hackers@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Cc: LI Xin Subject: Re: How to prevent other CPU from accessing a set of pages before calling pmap_remove_all function X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 Sep 2009 06:57:25 -0000 On Wed, Sep 9, 2009 at 11:26 PM, MingyanGuo wrote: > Hi all, > > I find that function pmap_remove_all for arch amd64 works with a time > window between reading & clearing the PTE flags(access flag and dirty flag) > and invalidating its TLB entry on other CPU. After some discussion with Li > Xin(cced), I think all the processes that are using the PTE being removed > should be blocked before calling pmap_remove_all, or other CPU may dirty the > page but does not set the dirty flag before the TLB entry is flushed. But I > can not find how to block them to call the function. I read the function > vm_pageout_scan in file vm/vm_pageout.c but can not find the exact method it > used. Or I just misunderstood the semantics of function pmap_remove_all ? > > Thanks in advance. > > Regards, > MingyanGuo > Sorry for the noise. I understand the logic now. 
There is no time window problem between reading & clearing the PTE and invalidating it on other CPU, even if other CPU is using the PTE. I misunderstood the logic. Regards, MingyanGuo From owner-freebsd-hackers@FreeBSD.ORG Thu Sep 10 12:08:51 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9F43D106566C for ; Thu, 10 Sep 2009 12:08:51 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (skuns.zoral.com.ua [91.193.166.194]) by mx1.freebsd.org (Postfix) with ESMTP id 3839B8FC14 for ; Thu, 10 Sep 2009 12:08:50 +0000 (UTC) Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id n8AC8Cv8004405 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 10 Sep 2009 15:08:12 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.3/8.14.3) with ESMTP id n8AC8CYQ077994; Thu, 10 Sep 2009 15:08:12 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.3/8.14.3/Submit) id n8AC8BU6077993; Thu, 10 Sep 2009 15:08:11 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 10 Sep 2009 15:08:11 +0300 From: Kostik Belousov To: MingyanGuo Message-ID: <20090910120811.GH47688@deviant.kiev.zoral.com.ua> References: <1fa17f810909092326l1271df94t1dea5ac9d5deba1b@mail.gmail.com> <1fa17f810909092357x8625182q970f8fb6aa76e7a9@mail.gmail.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="spzFwWfYKRzjK1rH" Content-Disposition: inline In-Reply-To: <1fa17f810909092357x8625182q970f8fb6aa76e7a9@mail.gmail.com> 
User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: freebsd-hackers@freebsd.org, LI Xin Subject: Re: How to prevent other CPU from accessing a set of pages before calling pmap_remove_all function X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 Sep 2009 12:08:51 -0000 --spzFwWfYKRzjK1rH Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Sep 09, 2009 at 11:57:24PM -0700, MingyanGuo wrote: > On Wed, Sep 9, 2009 at 11:26 PM, MingyanGuo wrote: > > > Hi all, > > > > I find that function pmap_remove_all for arch amd64 works with a time > > window between reading & clearing the PTE flags(access flag and dirty flag) > > and invalidating its TLB entry on other CPU. After some discussion with Li > > Xin(cced), I think all the processes that are using the PTE being removed > > should be blocked before calling pmap_remove_all, or other CPU may dirty the > > page but does not set the dirty flag before the TLB entry is flushed. But I > > can not find how to block them to call the function. I read the function > > vm_pageout_scan in file vm/vm_pageout.c but can not find the exact method it > > used. Or I just misunderstood the semantics of function pmap_remove_all ? > > > > Thanks in advance. > > > > Regards, > > MingyanGuo > > > > Sorry for the noise. I understand the logic now. There is no time window > problem between reading & clearing the PTE and invalidating it on other CPU, > even if other CPU is using the PTE. I misunderstood the logic. Hmm. 
What would happen in the following scenario? Assume that the page m is mapped by a vm map active on CPU1, and that CPU1 has a cached TLB entry for some writable mapping of this page, but neither the TLB entry nor the PTE has the dirty bit set. Then assume that the following sequence of events occurs:

CPU1:                           CPU2:
                                call pmap_remove_all(m)
                                clear pte
write to the address mapped
by m [*]
                                invalidate the TLB,
                                possibly making IPI to CPU1

I assume that at the point marked [*], we can
- either lose the dirty bit, while CPU1 (atomically) sets the dirty bit in the cleared pte. Besides not properly tracking the modification status of the page, it could also cause the page table page to be modified, which would create a non-zero page with the PG_ZERO flag set.
- or CPU1 re-reads the PTE entry when setting the dirty bit, and generates #pf since the valid bit in the PTE is zero.

Intel documentation mentions that dirty or accessed bit updates are done with a locked cycle, which definitely means that the PTE is re-read, but I cannot find whether the valid bit is rechecked.
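The two outcomes described at [*] can be made concrete with a toy model. This is only a sketch, not kernel code: the PTE is a plain integer in a dictionary, the function names are hypothetical, and only the bit values PG_V, PG_RW and PG_M follow the x86 PTE layout. It does not claim which outcome real hardware produces; it just shows what each hypothesis would leave behind.

```python
# Toy model of the race at [*]; not kernel code. Bit values follow the
# x86 PTE layout; all function names here are hypothetical.
PG_V, PG_RW, PG_M = 0x001, 0x002, 0x040


def load_and_clear(memory):
    """CPU2: fetch the old PTE and zero it (stand-in for the PTE clear)."""
    old = memory["pte"]
    memory["pte"] = 0
    return old


def stale_write_blind(memory):
    """Hypothesis 1: CPU1 sets PG_M in memory without rechecking PG_V."""
    memory["pte"] |= PG_M
    return "write completes"


def stale_write_recheck(memory):
    """Hypothesis 2: CPU1 re-reads the PTE and faults if PG_V is clear."""
    if not memory["pte"] & PG_V:
        return "#pf"
    memory["pte"] |= PG_M
    return "write completes"


def run(cpu1_write):
    memory = {"pte": PG_V | PG_RW}      # mapped, writable, clean
    old = load_and_clear(memory)        # CPU2 snapshots and clears the PTE
    result = cpu1_write(memory)         # CPU1 writes via its stale TLB entry [*]
    dirty_seen_by_cpu2 = bool(old & PG_M)
    return result, dirty_seen_by_cpu2, memory["pte"]


blind = run(stale_write_blind)
recheck = run(stale_write_recheck)
print("no recheck:", blind)
print("recheck:   ", recheck)
```

Under the first hypothesis CPU2 never sees the dirty bit and the cleared PTE slot ends up non-zero; under the second the write faults and the slot stays zero.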
--spzFwWfYKRzjK1rH Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (FreeBSD) iEYEARECAAYFAkqo7CoACgkQC3+MBN1Mb4gRGgCgscvKZFeh4uPhTADH2tERZtVh Y98AnR/9HAbNm6DqTmKYv+LtC/FaJGMW =gKPs -----END PGP SIGNATURE----- --spzFwWfYKRzjK1rH-- From owner-freebsd-hackers@FreeBSD.ORG Thu Sep 10 16:46:46 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BBC7E106566C for ; Thu, 10 Sep 2009 16:46:46 +0000 (UTC) (envelope-from linda.messerschmidt@gmail.com) Received: from mail-qy0-f200.google.com (mail-qy0-f200.google.com [209.85.221.200]) by mx1.freebsd.org (Postfix) with ESMTP id 72E278FC0A for ; Thu, 10 Sep 2009 16:46:46 +0000 (UTC) Received: by qyk38 with SMTP id 38so243758qyk.27 for ; Thu, 10 Sep 2009 09:46:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type :content-transfer-encoding; bh=qNNycoBbxJH+ZU6thx3OtaGneU72NfUzfypTEP61b7M=; b=oRtvXL05FO1Dr8oUhpD7S6gq/x8zP2zwXR9zSSxKIfyDM8+5K+ka0XdmLodX68r2wc x+ieA+OS9ZP6j2vT3ksLghif121SDK6JALfyQx1QLlPRbl9L6rvrGUoGFsfmWSpTwHB4 pc9V3kpjpV6xVjvCSB1hp45yQugqSQEJdTKd0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; b=a5bl9pVhf/TFuUu8EynmSRq//JYFyjkMh23gjhd4zfpx2zjrjx+ZAfRXbA0qs59jjq RSrGrMHA5MgyYSjLSKIW0j/ME8uI3etXF+1NRn4wzZZpwgP55UzqkngafeLgJwEdEFU7 rGy+4SLKbDgtSLo+pQbLQ/JTbAW6VZeJMeuS8= MIME-Version: 1.0 Received: by 10.229.106.83 with SMTP id w19mr973707qco.72.1252601205724; Thu, 10 Sep 2009 09:46:45 -0700 (PDT) In-Reply-To: <200908271729.55213.jhb@freebsd.org> References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> 
<200908261642.59419.jhb@freebsd.org> <237c27100908271237y66219ef4o4b1b8a6e13ab2f6c@mail.gmail.com> <200908271729.55213.jhb@freebsd.org> Date: Thu, 10 Sep 2009 12:46:45 -0400 Message-ID: <237c27100909100946q3d186af3h66757e0efff307a5@mail.gmail.com> From: Linda Messerschmidt To: freebsd-hackers@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 Sep 2009 16:46:46 -0000 On Thu, Aug 27, 2009 at 5:29 PM, John Baldwin wrote: > Ah, cool, what you want to do is use KTR with KTR_SCHED and then use > schedgraph.py (src/tools/sched) to get a visual picture of what the box does > during a hang. The timestamps in KTR are TSC cycle counts rather than an > actual wall time which is why they look off. If you have a lot of events you > may want to use a larger KTR_ENTRIES size btw (I use 1048576 (2 ^ 20) here at > work to get large (multiple-second) traces). I'm still working on this. I enabled KTR and set it up to log KTR_SCHED events. Then, I wrote a script to exercise the HTTP server that actually ran on that machine, and set it to issue "sysctl debug.ktr.cpumask=0" and abort if a request took over 2 seconds. 28,613 requests later, it tripped over one that took 2007ms. (Just a refresher: this is a static file being served by an Apache process that has nothing else to do but serve this file on a relatively unloaded machine.) I don't have access to any machines that can run X, so I did the best I could to examine it from the shell. First, this machine has two CPU's so I split up the KTR results on a per-CPU basis so I could look at each individually. 
With KTR_ENTRIES set to 1048576, I got about 53 seconds of data with just KTR_SCHED enabled. Since I was interested in a 2.007 second period of time right at the end, I hacked it down to the last 3.795 seconds. In the 3.795 seconds captured in the trace period on CPU 0 that includes the entire 2.007 second stall, CPU 0 was idle for 3.175 seconds. In the same period, CPU 1 was idle for 3.2589 seconds. I did the best I could to manually page through all the scheduling activity on both CPUs during that 3.7 second time, and I didn't see anything really disruptive. Mainly idle, with jumps into the clock and ethernet kernel threads, as well as httpd. If I understand that correctly and have done everything right, that means that whatever happened, it wasn't related to CPU contention or scheduling issues of any sort. So, a couple of follow-up questions: First, what else should I be looking at? I built the kernel with kind of a lot of KTR flags (KTR_LOCK|KTR_SCHED|KTR_PROC|KTR_INTR|KTR_CALLOUT|KTR_UMA|KTR_SYSC) but enabling them all produces enough output that even 1048576 entries doesn't always go back two seconds; the volume of data is all but unmanageable. Second, is there any way to correlate the process address reported by the KTR scheduler entries back to a PID? It'd be nice to be able to view the scheduler activity just for the process I'm interested in, but I can't figure out which one it is. :) Thanks! 
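The idle figures quoted above can be turned into a quick bound. The sketch below uses only the numbers from this message (a 3.795 s window, a 2.007 s stall, and per-CPU idle times of 3.175 s and 3.2589 s); the function and variable names are made up for illustration. It assumes the worst case, that every busy moment of a CPU fell inside the stall, and asks how much of the stall that CPU must still have spent idle.

```python
# Numbers taken from the message above; names are illustrative only.
WINDOW_S = 3.795                 # length of the trimmed trace
STALL_S = 2.007                  # duration of the slow request
IDLE_S = {0: 3.175, 1: 3.2589}   # idle time per CPU within the window


def summarize(window, stall, idle):
    """Idle share of the window and a lower bound on idle time in the stall."""
    busy = window - idle
    return {
        "idle_fraction": idle / window,
        "busy_s": busy,
        # Worst case: all busy time is packed inside the stall.
        "min_idle_during_stall_s": max(0.0, stall - busy),
    }


summary = {cpu: summarize(WINDOW_S, STALL_S, idle) for cpu, idle in IDLE_S.items()}
for cpu, s in sorted(summary.items()):
    print(f"CPU {cpu}: idle {s['idle_fraction']:.1%}, busy {s['busy_s']:.4f}s, "
          f"at least {s['min_idle_during_stall_s']:.4f}s idle inside the stall")
```

Even under that worst-case assumption each CPU was idle for well over half of the stall, which is consistent with the conclusion that CPU contention was not the cause.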
From owner-freebsd-hackers@FreeBSD.ORG Thu Sep 10 17:21:02 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BA2BF1065670 for ; Thu, 10 Sep 2009 17:21:02 +0000 (UTC) (envelope-from guomingyan@gmail.com) Received: from mail-pz0-f235.google.com (mail-pz0-f235.google.com [209.85.222.235]) by mx1.freebsd.org (Postfix) with ESMTP id 99E648FC12 for ; Thu, 10 Sep 2009 17:21:02 +0000 (UTC) Received: by pzk24 with SMTP id 24so2571pzk.3 for ; Thu, 10 Sep 2009 10:21:02 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20090910120811.GH47688@deviant.kiev.zoral.com.ua> Received: by 10.114.54.8 with SMTP id c8mr909442waa.1.1252603261728; Thu, 10 Sep 2009 10:21:01 -0700 (PDT) Message-ID: <001636b149b575c79204733c6c1c@google.com> Date: Thu, 10 Sep 2009 17:21:01 +0000 From: guomingyan@gmail.com To: Kostik Belousov , MingyanGuo Content-Type: text/plain; charset=ISO-8859-1; format=flowed; delsp=yes X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Cc: freebsd-hackers@freebsd.org, LI Xin Subject: Re: Re: How to prevent other CPU from accessing a set of pages before calling pmap_remove_all functi X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 Sep 2009 17:21:02 -0000 On Sep 10, 2009 5:08am, Kostik Belousov wrote: > On Wed, Sep 09, 2009 at 11:57:24PM -0700, MingyanGuo wrote: > > On Wed, Sep 9, 2009 at 11:26 PM, MingyanGuo guomingyan@gmail.com> wrote: > > > > > Hi all, > > > > > > I find that function pmap_remove_all for arch amd64 works with a time > > > window between reading & clearing the PTE flags(access flag and dirty > flag) > > > and invalidating its TLB entry on other CPU. 
After some discussion > with Li > > > Xin(cced), I think all the processes that are using the PTE being > removed > > > should be blocked before calling pmap_remove_all, or other CPU may > dirty the > > > page but does not set the dirty flag before the TLB entry is flushed. > But I > > > can not find how to block them to call the function. I read the > function > > > vm_pageout_scan in file vm/vm_pageout.c but can not find the exact > method it > > > used. Or I just misunderstood the semantics of function > pmap_remove_all ? > > > > > > Thanks in advance. > > > > > > Regards, > > > MingyanGuo > > > > > > > Sorry for the noise. I understand the logic now. There is no time window > > problem between reading & clearing the PTE and invalidating it on other > CPU, > > even if other CPU is using the PTE. I misunderstood the logic. > Hmm. What would happen for the following scenario. > Assume that the page m is mapped by vm map active on CPU1, and that > CPU1 has cached TLB entry for some writable mapping of this page, > but neither TLB entry not PTE has dirty bit set. > Then, assume that the following sequence of events occur: > CPU1: CPU2: > call pmap_remove_all(m) > clear pte > write to the address mapped > by m [*] > invalidate the TLB, > possibly making IPI to CPU1 > I assume that at the point marked [*], we can > - either loose the dirty bit, while CPU1 (atomically) sets the dirty bit > in the cleared pte. > Besides not properly tracking the modification status of the page, > it could also cause the page table page to be modified, that would > create non-zero page with PG_ZERO flag set. > - or CPU1 re-reads the PTE entry when setting the dirty bit, and generates > #pf since valid bit in PTE is zero. > Intel documentation mentions that dirty or accessed bits updates are done > with locked cycle, that definitely means that PTE is re-read, but I cannot > find whether valid bit is rechecked. 
I am not an architecture expert, but from a programmer's view, I *think* using the 'in memory' PTE structure for the first write to that PTE is more reasonable. To set the dirty bit, a CPU has to access memory with locked cycles, so using the 'in memory' PTE structure should add little performance overhead while being friendlier to software. However, this is just my guess; I am reading the manuals to see whether they describe this behavior. Regards, MingyanGuo From owner-freebsd-hackers@FreeBSD.ORG Thu Sep 10 17:30:33 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5725F1065676 for ; Thu, 10 Sep 2009 17:30:33 +0000 (UTC) (envelope-from rysto32@gmail.com) Received: from ey-out-2122.google.com (ey-out-2122.google.com [74.125.78.26]) by mx1.freebsd.org (Postfix) with ESMTP id DD1E08FC15 for ; Thu, 10 Sep 2009 17:30:32 +0000 (UTC) Received: by ey-out-2122.google.com with SMTP id 4so109081eyf.9 for ; Thu, 10 Sep 2009 10:30:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type; bh=qfC62rcGOhakT2L1Uk3XZaRRqpTlPu9kZfRokq4KVbc=; b=JVC1lE1YiDAQwVXuqACyAmSj+FdFtlCG8bl91DBXA1xdQkN5Dy6IrlIMNbj4RoKhLv 4UV+txk4oPRlKjt8ItD1DGo/DApMdEAU7We6/gbA9E1AuV6rV3pBSbz5aco0ShdRWDNd hv369YrYclMGtG5Z5HJq/xxRt3lUk0lmm0+sU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; b=JKDXuAJ+wr7TflaEA5X4P9K/2ZIp50XUJXIL2Mp7JEISJMQBDbSP5tiJbqsIkNCCP8 15ktA+EuZAodzb91p0nsonacfWRNxTl9ATyP4VOBZoVm8/2MrUhcVBuftaSoLQDzDBZv RvxGPHm+g6SX9PZoqnA0kEi11odXxYpAhFGxg= MIME-Version: 1.0 Received: by 10.210.101.1 with SMTP id y1mr2004016ebb.67.1252601836555; Thu, 10 Sep 2009 09:57:16 -0700 (PDT) In-Reply-To: 
<237c27100909100946q3d186af3h66757e0efff307a5@mail.gmail.com> References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200908261642.59419.jhb@freebsd.org> <237c27100908271237y66219ef4o4b1b8a6e13ab2f6c@mail.gmail.com> <200908271729.55213.jhb@freebsd.org> <237c27100909100946q3d186af3h66757e0efff307a5@mail.gmail.com> Date: Thu, 10 Sep 2009 12:57:16 -0400 Message-ID: From: Ryan Stone To: Linda Messerschmidt Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 Sep 2009 17:30:33 -0000 You should be able to run schedgraph.py on a windows machine with python installed. It works just fine for me on XP. From owner-freebsd-hackers@FreeBSD.ORG Thu Sep 10 18:36:53 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4D1311065676 for ; Thu, 10 Sep 2009 18:36:53 +0000 (UTC) (envelope-from linda.messerschmidt@gmail.com) Received: from mail-qy0-f195.google.com (mail-qy0-f195.google.com [209.85.221.195]) by mx1.freebsd.org (Postfix) with ESMTP id 02A4B8FC13 for ; Thu, 10 Sep 2009 18:36:52 +0000 (UTC) Received: by qyk33 with SMTP id 33so4139qyk.14 for ; Thu, 10 Sep 2009 11:36:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type :content-transfer-encoding; bh=0y6GXBQfPRSM1p1wkRQwAVsAH8YmHNuWOwsX3BEG/WI=; b=ePreWNDHWFINBiLRHPtluE9R/oqDMRYHYqkUOLhUvbbaLNsyrfqVZh+1p5icGNiHsr 5xk0XrusKD1r3Ap5zDvna4ex/mYa/7CyvKJRNQ1bDtEPlSbfGhYXqQ44nTYP2B5ie7Ot 
ftQzIb3r5GskmK12tjibjCGJ09+ULEJ3cHV1A= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; b=iI0nOwUwBydW70PEh046n3uUiE/5zp9JJSEfBAjVcZig2CGGkf4gftp7ylEiGMO0W5 1YfWwRfQYGDyYLzPc0kFj/mKdBJiwkRVSkFBIpKTct9U8VzSAy/giYespRTEBI/wC6Q9 QQyVzd3qs8DdUtDZ+T42WH7HywMW/xE/4F0qw= MIME-Version: 1.0 Received: by 10.229.39.69 with SMTP id f5mr1039615qce.107.1252607363953; Thu, 10 Sep 2009 11:29:23 -0700 (PDT) In-Reply-To: References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200908261642.59419.jhb@freebsd.org> <237c27100908271237y66219ef4o4b1b8a6e13ab2f6c@mail.gmail.com> <200908271729.55213.jhb@freebsd.org> <237c27100909100946q3d186af3h66757e0efff307a5@mail.gmail.com> Date: Thu, 10 Sep 2009 14:29:23 -0400 Message-ID: <237c27100909101129y28771061o86db3c6a50a640eb@mail.gmail.com> From: Linda Messerschmidt To: freebsd-hackers@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 Sep 2009 18:36:53 -0000 On Thu, Sep 10, 2009 at 12:57 PM, Ryan Stone wrote: > You should be able to run schedgraph.py on a windows machine with python > installed. It works just fine for me on XP. Don't have any of those either, but I *did* get it working on a Mac right out of the box. Should have thought of that sooner. :) The output looks pretty straightforward, but there are a couple of things I find odd. First, there's a point right around what I estimate to be the problem time where schedgraph.py indicates gmond (the Ganglia monitor) was running uninterrupted for a period of exactly 1 second. 
However, it also indicates that both CPU's idle tasks were *also* running almost continuously during that time (subject to clock/net interrupts), and that the run queue on both CPU's was zero for most of that second while gmond was allegedly running. Second, the interval I graphed was about nine seconds. During that time, the PHP command line script made a whole lot of requests: it usleeps 50ms between requests, and non-broken requests average about 1.4ms. So even with the stalled request chopping 2 seconds off the end, there should be somewhere in the neighborhood of 130 requests during the graphed period. But that php process doesn't appear in the schedgraph output at all. So that doesn't make a whole lot of sense to me. I'll try to get another trace and see if that happens the same way again. From owner-freebsd-hackers@FreeBSD.ORG Thu Sep 10 18:46:47 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7B6B3106568F for ; Thu, 10 Sep 2009 18:46:47 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outN.internet-mail-service.net (outn.internet-mail-service.net [216.240.47.237]) by mx1.freebsd.org (Postfix) with ESMTP id 5C9A58FC1C for ; Thu, 10 Sep 2009 18:46:47 +0000 (UTC) Received: from idiom.com (mx0.idiom.com [216.240.32.160]) by out.internet-mail-service.net (Postfix) with ESMTP id 34EDEB9888; Thu, 10 Sep 2009 11:46:47 -0700 (PDT) X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e Received: from julian-mac.elischer.org (home.elischer.org [216.240.48.38]) by idiom.com (Postfix) with ESMTP id B3EFB2D6021; Thu, 10 Sep 2009 11:46:46 -0700 (PDT) Message-ID: <4AA94995.6030700@elischer.org> Date: Thu, 10 Sep 2009 11:46:45 -0700 From: Julian Elischer User-Agent: Thunderbird 2.0.0.23 (Macintosh/20090812) MIME-Version: 1.0 To: Linda Messerschmidt References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> 
<200908261642.59419.jhb@freebsd.org> <237c27100908271237y66219ef4o4b1b8a6e13ab2f6c@mail.gmail.com> <200908271729.55213.jhb@freebsd.org> <237c27100909100946q3d186af3h66757e0efff307a5@mail.gmail.com> <237c27100909101129y28771061o86db3c6a50a640eb@mail.gmail.com> In-Reply-To: <237c27100909101129y28771061o86db3c6a50a640eb@mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 Sep 2009 18:46:47 -0000 Linda Messerschmidt wrote: > On Thu, Sep 10, 2009 at 12:57 PM, Ryan Stone wrote: >> You should be able to run schedgraph.py on a windows machine with python >> installed. It works just fine for me on XP. > > Don't have any of those either, but I *did* get it working on a Mac > right out of the box. Should have thought of that sooner. :) > > The output looks pretty straightforward, but there are a couple of > things I find odd. > > First, there's a point right around what I estimate to be the problem > time where schedgraph.py indicates gmond (the Ganglia monitor) was > running uninterrupted for a period of exactly 1 second. However, it > also indicates that both CPU's idle tasks were *also* running almost > continuously during that time (subject to clock/net interrupts), and > that the run queue on both CPU's was zero for most of that second > while gmond was allegedly running. I've noticed that schedgraph tends to show the idle threads slightly skewed one way or the other. I think there is a cumulative rounding error in the way they are drawn due to the fact that they are run so often. 
Check the raw data and I think you will find that you just need to imagine the idle threads slightly to the left or right a bit. The longer the trace and the further to the right you are looking the more "out" the idle threads appear to be. I saw this on both Linux and Mac python implementations. > > Second, the interval I graphed was about nine seconds. During that > time, the PHP command line script made a whole lot of requests: it > usleeps 50ms between requests, and non-broken requests average about > 1.4ms. So even with the stalled request chopping 2 seconds off the > end, there should be somewhere in the neighborhood of 130 requests > during the graphed period. But that php process doesn't appear in the > schedgraph output at all. > > So that doesn't make a whole lot of sense to me. > > I'll try to get another trace and see if that happens the same way again. > _______________________________________________ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org" From owner-freebsd-hackers@FreeBSD.ORG Thu Sep 10 19:12:37 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7AD3E106568D for ; Thu, 10 Sep 2009 19:12:37 +0000 (UTC) (envelope-from linda.messerschmidt@gmail.com) Received: from qw-out-2122.google.com (qw-out-2122.google.com [74.125.92.26]) by mx1.freebsd.org (Postfix) with ESMTP id 232A98FC16 for ; Thu, 10 Sep 2009 19:12:36 +0000 (UTC) Received: by qw-out-2122.google.com with SMTP id 3so135323qwe.7 for ; Thu, 10 Sep 2009 12:12:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=sVbC4Lh/vy1vEFmT5CDa2TzOaYb0hOzpY+MQ/76B9/o=; 
b=bXy5eGDEziznh0qxMQHOITc7/6oY067mEczlDrGNHzpXI5H5/8uzVobA9IQVlsu8PF yTpbqEhyup07vyGbD4nDiWduOJVKPWBWYvjIHvtP2YM3IN316ZRJ4vrvGgslfyTiqpR9 mRkLfv676H3PDECL5xxoRH1KaehQ8BAyVA+uE= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=hAbRRNHIbN5FNsWXrFkokNuriZ2GDaghcxj09vCcujsynDtzirliPNJKhE6WaL3CyY isbRNo7xaU+egxKj1xUc34Ko+jgWYvTRxbYu6C5aDVk/rnNGqz1Jg6MD3yeEK4FW9Kwx SsXgSSGTdd1yTDkaCf0jPIbM7V7TgAQyW5KvM= MIME-Version: 1.0 Received: by 10.229.118.135 with SMTP id v7mr1052007qcq.62.1252609655432; Thu, 10 Sep 2009 12:07:35 -0700 (PDT) In-Reply-To: <4AA94995.6030700@elischer.org> References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200908261642.59419.jhb@freebsd.org> <237c27100908271237y66219ef4o4b1b8a6e13ab2f6c@mail.gmail.com> <200908271729.55213.jhb@freebsd.org> <237c27100909100946q3d186af3h66757e0efff307a5@mail.gmail.com> <237c27100909101129y28771061o86db3c6a50a640eb@mail.gmail.com> <4AA94995.6030700@elischer.org> Date: Thu, 10 Sep 2009 15:07:35 -0400 Message-ID: <237c27100909101207q73f0c513r60dd5ab83fdfd083@mail.gmail.com> From: Linda Messerschmidt To: Julian Elischer Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 Sep 2009 19:12:37 -0000 On Thu, Sep 10, 2009 at 2:46 PM, Julian Elischer wrote: > I've noticed that schedgraph tends to show the idle threads slightly > skewed one way or the other. I think there is a cumulative rounding > error in the way they are drawn due to the fact that they are run so > often.
Check the raw data and I think you will find that you just > need to imagine the idle threads slightly to the left or right a bit. No, there's no period anywhere in the trace where either idle thread didn't run for an entire second. I'm pretty sure schedgraph is throwing in some nonsense results. I did capture a second, larger, dataset after a 2.1s stall, and schedgraph includes an httpd process that supposedly spent 58 seconds on the run queue. I don't know if it's a dropped record or a parsing error or what. I do think on this second graph I can kind of see the *end* of the stall, because all of a sudden a ton of processes... everything from sshd to httpd to gmond to sh to vnlru to bufdaemon to fdc0... comes off of whatever it's waiting on and hits the run queue. The combined run queues for both processors spike up to 32 tasks at one point and then rapidly tail off as things return to normal. That pretty much matches the behavior shown by ktrace in my initial post, where everything goes to sleep on something-or-other in the kernel, and then at the end of the stall, everything wakes up at the same time. I think this means the problem is somehow related to locking, rather than scheduling.
From owner-freebsd-hackers@FreeBSD.ORG Fri Sep 11 01:34:32 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 03A9C1065679 for ; Fri, 11 Sep 2009 01:34:32 +0000 (UTC) (envelope-from linda.messerschmidt@gmail.com) Received: from qw-out-2122.google.com (qw-out-2122.google.com [74.125.92.25]) by mx1.freebsd.org (Postfix) with ESMTP id AE13F8FC12 for ; Fri, 11 Sep 2009 01:34:31 +0000 (UTC) Received: by qw-out-2122.google.com with SMTP id 3so224843qwe.7 for ; Thu, 10 Sep 2009 18:34:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type; bh=AcOA6yWvOzZohD7QvXPRhGb02MXgGs1sLdcUkXvI6w0=; b=ILyy6PH3CMZKLolRDXFlBsWrPcISlN0qfaTR8nPGeSvtx/C79R9wLcUDlJpecn2Yv8 LlcoRl1/mSORdDstPfnKx6vOL0SpOfznsUAR0qP36qyAWQoEf19b2RnUkK+Z72Yrl1MB W+yMpCpHbrDJxaHdCYpo6lKheRCG45EWlK4iQ= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=ACqCSRvS+vFOMHOGRcC9ZqKV4rJqtTuMqyvEOOOLYiICoGzF4zeZyddMoBRqVwGWda uOUSiMkgk1WK8BzdTIBMIHVQvtPyyh7U4TAyeDXGLGNOsYS2AgwEAYuDkT8HGYpFGqoz 7iQLoe0DmjhCOpRX5bw+2/XDTvAuLbfDI7t0E= MIME-Version: 1.0 Received: by 10.229.9.147 with SMTP id l19mr1146347qcl.65.1252632870963; Thu, 10 Sep 2009 18:34:30 -0700 (PDT) In-Reply-To: <237c27100909101207q73f0c513r60dd5ab83fdfd083@mail.gmail.com> References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200908261642.59419.jhb@freebsd.org> <237c27100908271237y66219ef4o4b1b8a6e13ab2f6c@mail.gmail.com> <200908271729.55213.jhb@freebsd.org> <237c27100909100946q3d186af3h66757e0efff307a5@mail.gmail.com> <237c27100909101129y28771061o86db3c6a50a640eb@mail.gmail.com> <4AA94995.6030700@elischer.org> <237c27100909101207q73f0c513r60dd5ab83fdfd083@mail.gmail.com> 
Date: Thu, 10 Sep 2009 21:34:30 -0400 Message-ID: <237c27100909101834g49438707l96fa58df5f717945@mail.gmail.com> From: Linda Messerschmidt To: freebsd-hackers@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 11 Sep 2009 01:34:32 -0000 Just to follow up, I've been doing some testing with masking for KTR_LOCK rather than KTR_SCHED. I'm having trouble with this because I have the KTR buffer size set to 1048576 entries, and with only KTR_LOCK enabled, this isn't enough for even a full second of tracing; the sample I'm working with now is just under 0.9s. It's an average of one entry every 2001 TSC ticks. That *seems* like a lot of locking activity, but some of the lock points are only a couple of lines apart, so maybe it's just incredibly verbose. Since it's so much data and I'm still working on a way to correlate it (lockgraph.py?), all I've got so far is a list of what trace points are coming up the most:
51927 src/sys/kern/kern_lock.c:215 (_lockmgr UNLOCK mtx_unlock() when flags & LK_INTERLOCK)
48033 src/sys/kern/vfs_subr.c:2284 (vdropl UNLOCK)
41548 src/sys/kern/vfs_subr.c:2187 (vput VI_LOCK)
29359 src/sys/kern/vfs_subr.c:2067 (vget VI_LOCK)
29358 src/sys/kern/vfs_subr.c:2079 (vget VI_UNLOCK)
23799 src/sys/nfsclient/nfs_subs.c:755 (nfs_getattrcache mtx_lock)
23460 src/sys/nfsclient/nfs_vnops.c:645 (nfs_getattr mtx_unlock)
23460 src/sys/nfsclient/nfs_vnops.c:642 (nfs_getattr mtx_lock)
23460 src/sys/nfsclient/nfs_subs.c:815 (nfs_getattrcache mtx_unlock)
23138 src/sys/kern/vfs_cache.c:345 (cache_lookup CACHE_LOCK)
Unfortunately, it kind of sounds like I'm on my way to answering "why is this system slow?" even though it really isn't slow.
(And I rush to point out that the Apache process in question doesn't at any point in its life touch NFS, though some of the other ones on the machine do.) In order to be the cause of my Apache problem, all this goobering around with NFS would have to be relatively infrequent but so intense that it shoves everything else out of the way. I'm skeptical, but I'm sure one of you guys can offer a more informed opinion. The only other thing I can think of is maybe all this is running me out of something I need (vnodes?) so everybody else blocks until it finishes and lets go of whatever finite resource it's using up? But that doesn't make a ton of sense either, because why would a lack of vnodes cause stalls in accept() or select() in unrelated processes? Not sure if I'm going in the right direction here or not. From owner-freebsd-hackers@FreeBSD.ORG Fri Sep 11 15:19:28 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 00660106566C for ; Fri, 11 Sep 2009 15:19:27 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id B04AB8FC1A for ; Fri, 11 Sep 2009 15:19:27 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 40AC646B03; Fri, 11 Sep 2009 11:19:27 -0400 (EDT) Received: from jhbbsd.hudson-trading.com (unknown [209.249.190.8]) by bigwig.baldwin.cx (Postfix) with ESMTPA id 772D48A01B; Fri, 11 Sep 2009 11:19:26 -0400 (EDT) From: John Baldwin To: freebsd-hackers@freebsd.org Date: Fri, 11 Sep 2009 11:02:14 -0400 User-Agent: KMail/1.9.7 References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <237c27100909101207q73f0c513r60dd5ab83fdfd083@mail.gmail.com> <237c27100909101834g49438707l96fa58df5f717945@mail.gmail.com> In-Reply-To: 
<237c27100909101834g49438707l96fa58df5f717945@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200909111102.14503.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Fri, 11 Sep 2009 11:19:26 -0400 (EDT) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.5 required=4.2 tests=AWL,BAYES_00,RDNS_NONE autolearn=no version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: Linda Messerschmidt Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 11 Sep 2009 15:19:28 -0000 On Thursday 10 September 2009 9:34:30 pm Linda Messerschmidt wrote: > Just to follow up, I've been doing some testing with masking for > KTR_LOCK rather than KTR_SCHED. > > I'm having trouble with this because I have the KTR buffer size set to > 1048576 entries, and with only KTR_LOCK enabled, this isn't enough for > even a full second of tracing; the sample I'm working with now is just > under 0.9s. It's an average of one entry every 2001 TSC ticks. That > *seems* like a lot of locking activity, but some of the lock points > are only a couple of lines apart, so maybe it's just incredibly > verbose. 
> > Since it's so much data and I'm still working on a way to correlate it > (lockgraph.py?), all I've got so far is a list of what trace points > are coming up the most: > > 51927 src/sys/kern/kern_lock.c:215 (_lockmgr UNLOCK mtx_unlock() when > flags & LK_INTERLOCK) > 48033 src/sys/kern/vfs_subr.c:2284 (vdropl UNLOCK) > 41548 src/sys/kern/vfs_subr.c:2187 (vput VI_LOCK) > 29359 src/sys/kern/vfs_subr.c:2067 (vget VI_LOCK) > 29358 src/sys/kern/vfs_subr.c:2079 (vget VI_UNLOCK) > 23799 src/sys/nfsclient/nfs_subs.c:755 (nfs_getattrcache mtx_lock) > 23460 src/sys/nfsclient/nfs_vnops.c:645 (nfs_getattr mtx_unlock) > 23460 src/sys/nfsclient/nfs_vnops.c:642 (nfs_getattr mtx_lock) > 23460 src/sys/nfsclient/nfs_subs.c:815 (nfs_getattrcache mtx_unlock) > 23138 src/sys/kern/vfs_cache.c:345 (cache_lookup CACHE_LOCK) > > Unfortunately, it kind of sounds like I'm on my way to answering "why > is this system slow?" even though it really isn't slow. (And I rush > to point out that the Apache process in question doesn't at any point > in its life touch NFS, though some of the other ones on the machine > do.) > > In order to be the cause of my Apache problem, all this goobering > around with NFS would have to be relatively infrequent but so intense > that it shoves everything else out of the way. I'm skeptical, but I'm > sure one of you guys can offer a more informed opinion. > > The only other thing I can think of is maybe all this is running me > out of something I need (vnodes?) so everybody else blocks until it > finishes and lets go of whatever finite resource it's using up? But > that doesn't make a ton of sense either, because why would a lack of > vnodes cause stalls in accept() or select() in unrelated processes? > > Not sure if I'm going in the right direction here or not. Try turning off KTR_LOCK for spin mutexes (just force LO_QUIET on in mtx_init() if MTX_SPIN is set) and use a schedgraph.py from the latest RELENG_7. 
It knows how to parse KTR_LOCK events and drop event "bars" for locks showing when they are held. A more recent schedgraph.py might also fix the bugs you were seeing with the idle threads looking too long (esp. at the start and end of graphs). -- John Baldwin From owner-freebsd-hackers@FreeBSD.ORG Fri Sep 11 15:35:17 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 21EB41065672 for ; Fri, 11 Sep 2009 15:35:17 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outY.internet-mail-service.net (outy.internet-mail-service.net [216.240.47.248]) by mx1.freebsd.org (Postfix) with ESMTP id 086C48FC14 for ; Fri, 11 Sep 2009 15:35:16 +0000 (UTC) Received: from idiom.com (mx0.idiom.com [216.240.32.160]) by out.internet-mail-service.net (Postfix) with ESMTP id C6CC99DA80; Fri, 11 Sep 2009 08:35:16 -0700 (PDT) X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e Received: from julian-mac.elischer.org (home.elischer.org [216.240.48.38]) by idiom.com (Postfix) with ESMTP id 322E02D6026; Fri, 11 Sep 2009 08:35:16 -0700 (PDT) Message-ID: <4AAA6E32.2080609@elischer.org> Date: Fri, 11 Sep 2009 08:35:14 -0700 From: Julian Elischer User-Agent: Thunderbird 2.0.0.23 (Macintosh/20090812) MIME-Version: 1.0 To: John Baldwin References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <237c27100909101207q73f0c513r60dd5ab83fdfd083@mail.gmail.com> <237c27100909101834g49438707l96fa58df5f717945@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> In-Reply-To: <200909111102.14503.jhb@freebsd.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-hackers@freebsd.org, Linda Messerschmidt Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical
Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 11 Sep 2009 15:35:17 -0000 John Baldwin wrote: > > > A more recently schedgraph.py might also > fix the bugs you were seeing with the idle threads looking too long (esp. at > the start and end of graphs). not unless something has been fixed in the last week or so. From owner-freebsd-hackers@FreeBSD.ORG Fri Sep 11 17:35:02 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5A7AC1065672; Fri, 11 Sep 2009 17:35:02 +0000 (UTC) (envelope-from linda.messerschmidt@gmail.com) Received: from qw-out-2122.google.com (qw-out-2122.google.com [74.125.92.24]) by mx1.freebsd.org (Postfix) with ESMTP id 027A98FC22; Fri, 11 Sep 2009 17:35:01 +0000 (UTC) Received: by qw-out-2122.google.com with SMTP id 3so398647qwe.7 for ; Fri, 11 Sep 2009 10:35:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type; bh=om33xvDen/rMS5cuPwK8tF3AE5NDy4Tb7L9akpe400E=; b=qO5DYVTFj2b6ubn+WG4fZs0iHebWu1ZWo8V2Uhq0E6IvOgwR716xZyhgVwd5u2LMq9 FRyLN7ny7LLUs0hPiWu2bJUqi+h29KoPTqYhjltQ5j0gwpZ5pfoTC68eTLT1uU4yjaha QW2UF0exwPbMEyEJFiLORz3gGkQGZbGAgzoYk= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; b=TiWTjKD54Xr8FvMjJ+I0QAFRmsfDmvfBlVosAdz/hffLpWh9U6z1AX2VyOtk9PNSop a3bg1uU1rtv8qMjRc5ib+O772Kl6bPp9w8cgNkr7KwSuOII2vkPBAV/sAGkM8I9Fz2ut TxKwP4UgHV67t0jOf5Q2gDXd1J3YB70gb0V7A= MIME-Version: 1.0 Received: by 10.229.23.212 with SMTP id s20mr1355284qcb.71.1252690501044; Fri, 11 Sep 2009 10:35:01 -0700 (PDT) In-Reply-To: <200909111102.14503.jhb@freebsd.org> References: 
<237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <237c27100909101207q73f0c513r60dd5ab83fdfd083@mail.gmail.com> <237c27100909101834g49438707l96fa58df5f717945@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> Date: Fri, 11 Sep 2009 13:35:00 -0400 Message-ID: <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> From: Linda Messerschmidt To: John Baldwin Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 11 Sep 2009 17:35:02 -0000 On Fri, Sep 11, 2009 at 11:02 AM, John Baldwin wrote: > Try turning off KTR_LOCK for spin mutexes (just force LO_QUIET on in > mtx_init() if MTX_SPIN is set) I have *no* idea what you just said. :) Which is fine. But more to the point, I have no idea how to do it. :) > A more recently schedgraph.py might also > fix the bugs you were seeing with the idle threads looking too long (esp. at > the start and end of graphs). We are already on RELENG_7 due to the KTR-enabling rebuild, so that'd be the version we're using unless, as Julian observed, it's been fixed in the past week or so. Thanks! 
From owner-freebsd-hackers@FreeBSD.ORG Fri Sep 11 19:14:46 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4239A1065693 for ; Fri, 11 Sep 2009 19:14:46 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 120668FC1C for ; Fri, 11 Sep 2009 19:14:46 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id A101346B35; Fri, 11 Sep 2009 15:14:45 -0400 (EDT) Received: from jhbbsd.hudson-trading.com (unknown [209.249.190.8]) by bigwig.baldwin.cx (Postfix) with ESMTPA id E42868A026; Fri, 11 Sep 2009 15:14:44 -0400 (EDT) From: John Baldwin To: Julian Elischer Date: Fri, 11 Sep 2009 13:00:37 -0400 User-Agent: KMail/1.9.7 References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> <4AAA6E32.2080609@elischer.org> In-Reply-To: <4AAA6E32.2080609@elischer.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200909111300.37599.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Fri, 11 Sep 2009 15:14:44 -0400 (EDT) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.5 required=4.2 tests=AWL,BAYES_00,RDNS_NONE autolearn=no version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: freebsd-hackers@freebsd.org, Linda Messerschmidt Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , X-List-Received-Date: Fri, 11 Sep 2009 19:14:46 -0000 On Friday 11 September 2009 11:35:14 am Julian Elischer wrote: > John Baldwin wrote: > > > > > > A more recently schedgraph.py might also > > fix the bugs you were seeing with the idle threads looking too long (esp. at > > the start and end of graphs). > > not unless something has been fixed in the last week or so. Well, I wasn't sure how old of a schedgraph.py is being used. 7.1 would have the bugs, but I think 7.2 should be fine. -- John Baldwin From owner-freebsd-hackers@FreeBSD.ORG Fri Sep 11 19:14:47 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E574C1065694 for ; Fri, 11 Sep 2009 19:14:47 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id B5E9D8FC13 for ; Fri, 11 Sep 2009 19:14:47 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 66A3946B52; Fri, 11 Sep 2009 15:14:47 -0400 (EDT) Received: from jhbbsd.hudson-trading.com (unknown [209.249.190.8]) by bigwig.baldwin.cx (Postfix) with ESMTPA id 7E1348A01B; Fri, 11 Sep 2009 15:14:46 -0400 (EDT) From: John Baldwin To: Linda Messerschmidt Date: Fri, 11 Sep 2009 15:06:47 -0400 User-Agent: KMail/1.9.7 References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> In-Reply-To: <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200909111506.47309.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Fri, 11 Sep 2009 
15:14:46 -0400 (EDT) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.5 required=4.2 tests=AWL,BAYES_00,RDNS_NONE autolearn=no version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 11 Sep 2009 19:14:48 -0000 On Friday 11 September 2009 1:35:00 pm Linda Messerschmidt wrote: > On Fri, Sep 11, 2009 at 11:02 AM, John Baldwin wrote: > > Try turning off KTR_LOCK for spin mutexes (just force LO_QUIET on in > > mtx_init() if MTX_SPIN is set) > > I have *no* idea what you just said. :) > > Which is fine. But more to the point, I have no idea how to do it. :) Something like this:
Index: sys/kern/kern_mutex.c
===================================================================
--- sys/kern/kern_mutex.c (.../mirror/FreeBSD/stable/7) (revision 195943)
+++ sys/kern/kern_mutex.c (.../stable/7) (revision 195943)
@@ -747,6 +747,10 @@
 	if (opts & MTX_NOPROFILE)
 		flags |= LO_NOPROFILE;
 
+	/* XXX: Only log for regular mutexes. */
+	if (opts & MTX_SPIN)
+		flags |= LO_QUIET;
+
 	/* Initialize mutex. */
 	m->mtx_lock = MTX_UNOWNED;
 	m->mtx_recurse = 0;
> > A more recently schedgraph.py might also > > fix the bugs you were seeing with the idle threads looking too long (esp. at > > the start and end of graphs). > > We are already on RELENG_7 due to the KTR-enabling rebuild, so that'd > be the version we're using unless, as Julian observed, it's been fixed > in the past week or so. Hmm. It works well for me for doing traces.
-- John Baldwin From owner-freebsd-hackers@FreeBSD.ORG Fri Sep 11 23:14:23 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EF84A106566B; Fri, 11 Sep 2009 23:14:23 +0000 (UTC) (envelope-from jilles@stack.nl) Received: from mx1.stack.nl (relay02.stack.nl [IPv6:2001:610:1108:5010::104]) by mx1.freebsd.org (Postfix) with ESMTP id B7CB48FC0A; Fri, 11 Sep 2009 23:14:23 +0000 (UTC) Received: from snail.stack.nl (snail.stack.nl [IPv6:2001:610:1108:5010::131]) by mx1.stack.nl (Postfix) with ESMTP id CCBB335A829; Sat, 12 Sep 2009 01:14:22 +0200 (CEST) Received: by snail.stack.nl (Postfix, from userid 1677) id B0850228CD; Sat, 12 Sep 2009 01:14:22 +0200 (CEST) Date: Sat, 12 Sep 2009 01:14:22 +0200 From: Jilles Tjoelker To: Eygene Ryabinkin Message-ID: <20090911231422.GA41683@stack.nl> References: <4A7B1DB0.1040602@FreeBSD.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.18 (2008-05-17) Cc: freebsd-hackers@freebsd.org, Doug Barton Subject: Re: Problem in bin/sh stripping the * character through ${expansion%} X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 11 Sep 2009 23:14:24 -0000 On Fri, Aug 07, 2009 at 03:26:50AM +0400, Eygene Ryabinkin wrote: > Thu, Aug 06, 2009 at 11:15:12AM -0700, Doug Barton wrote: > > I came across this problem during a recent portmaster update. When > > trying to strip off the * character using variable expansion in bin/sh > > it doesn't work. Other "special" characters do work if they are > > properly escaped. 
> > The attached mini-script clearly shows the problem: > > $ sh sh-strip-problem > > var before stripping: foo\* > > var after stripping: foo\* > > var before stripping: foo\$ > > var after stripping: foo\ > According to the sh(1), it is not a problem. Namely, > - \* being unquoted at all will produce a lone '*'; > - '*' when treated as the smallest pattern, will result in a stripping > of a zero-length string -- it is the smallest pattern in the case of > '*' that matches anything. That is indeed an explanation why it works that way, but I think it is wrong. Generally, the shell command language avoids unnecessary levels of quoting. In the POSIX spec, "Shell Command Language", note the part about "${x#*}" (pattern) and ${x#"*"} (literal asterisk). Also compare with case $something in \*) echo asterisk;; esac which matches a literal asterisk. Two PRs already exist for aspects of stripping: bin/57554 (double quotes) and bin/117748 (trying to match pattern matching characters literally). > In order to strip the trailing star you should use > ----- > var=${var%[*]} > ----- > This gives you the pattern of '[*]' that is properly treated as the > single star -- it's a weird way to escape the star in the patterns. This is indeed a good workaround. 
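For readers following along, here is a minimal sketch of the bracket-expression workaround discussed above. The function name strip_star and the sample values are hypothetical and not taken from Doug's script. It relies only on the behavior described in this thread: inside a pattern, [*] is a bracket expression holding a single character, so it matches one literal asterisk instead of acting as a wildcard.

```shell
#!/bin/sh
# Sketch only: strip_star is a hypothetical helper, not part of portmaster
# or of the mini-script attached to the original report.
strip_star() {
    # ${1%[*]} removes the shortest suffix matching the pattern [*],
    # i.e. exactly one trailing literal '*' if there is one.
    printf '%s\n' "${1%[*]}"
}

with_star=$(strip_star 'foo*')      # value ends in a literal asterisk
without_star=$(strip_star 'foo')    # nothing to strip, value is unchanged

printf 'with star:    %s\n' "$with_star"
printf 'without star: %s\n' "$without_star"
```

Only one asterisk is removed per expansion, so a value such as 'a**' comes back as 'a*'. The quoted form ${x%"*"} mentioned above is the other way to ask for a literal asterisk, but the open PRs cited in this message suggest it may not behave that way in the sh under discussion.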
-- Jilles Tjoelker From owner-freebsd-hackers@FreeBSD.ORG Sat Sep 12 02:05:16 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D1C4F106568B; Sat, 12 Sep 2009 02:05:16 +0000 (UTC) (envelope-from linda.messerschmidt@gmail.com) Received: from qw-out-2122.google.com (qw-out-2122.google.com [74.125.92.26]) by mx1.freebsd.org (Postfix) with ESMTP id 740DE8FC1F; Sat, 12 Sep 2009 02:05:16 +0000 (UTC) Received: by qw-out-2122.google.com with SMTP id 3so510164qwe.7 for ; Fri, 11 Sep 2009 19:05:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=DB9dYl4Xb52O8ZSJQSPaSURbA1jWuHf6B81Kc2pwfoc=; b=ippOQgOWPszWVgs106f9osi5Cocz5c3jgPEs4LJKgF9GEfwxbZ5lG6URpYEUDOYNIT rEtiCMevutbV7eTvLq5HiDiYbehfMiVl6c3rKltug7PT/RuBNcoBv3BB+hz+EPUrcqDN 3c1paR9+nv62uKXi/U7xLDeumVNiIvBCB7cGk= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=P1WCcfYrqNxK2SBukrZ1+Sckvf+7kC8uv/uO3MBMmszLVuL7V+JCEx8qR0GCGaVnhs EgTlhV9auh57UQbzkVanQF8DwUstFAjic8/eDGVWSddrMTiPQKhkcx5IQfJWqL1BfwYq t/mgIGDq3xOXBCBNjDFnjYe6gdfT61OB8xQqk= MIME-Version: 1.0 Received: by 10.229.10.13 with SMTP id n13mr1443531qcn.103.1252721115838; Fri, 11 Sep 2009 19:05:15 -0700 (PDT) In-Reply-To: <200909111506.47309.jhb@freebsd.org> References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> <200909111506.47309.jhb@freebsd.org> Date: Fri, 11 Sep 2009 22:05:15 -0400 Message-ID: <237c27100909111905y244924c1n93b4e4d9ceda44be@mail.gmail.com> From: Linda Messerschmidt To: John Baldwin Content-Type: 
text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 12 Sep 2009 02:05:16 -0000 On Fri, Sep 11, 2009 at 3:06 PM, John Baldwin wrote: > Something like this: Ah, I understand now. :) Got up to 17 seconds of trace with that change. > Hmm. It works well for me for doing traces. It definitely works, it just always seems to have some-or-another weird artifact. But, with the lock info added, the locks that show big ugly gaping multi-second "lock acquire" bars are: unp_mtx and so_rcv_sx. I'm not 100% confident in this data yet, so I will try to get more data to confirm, but if that offers any clues about where to look, I'm all ears. I'm also a bit hazy on what the dark grey vs. light grey background is about. Thanks!
From owner-freebsd-hackers@FreeBSD.ORG Sat Sep 12 03:55:36 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 530541065670; Sat, 12 Sep 2009 03:55:36 +0000 (UTC) (envelope-from linda.messerschmidt@gmail.com) Received: from qw-out-2122.google.com (qw-out-2122.google.com [74.125.92.27]) by mx1.freebsd.org (Postfix) with ESMTP id EC23E8FC14; Sat, 12 Sep 2009 03:55:35 +0000 (UTC) Received: by qw-out-2122.google.com with SMTP id 3so522095qwe.7 for ; Fri, 11 Sep 2009 20:55:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type; bh=G01ABb/Ddx21tXgaeEnog8YW96nc8mzkPuBBMY514z8=; b=CRWDThaOrl+AKs0iLZiQ7daVKoGaSuUg2plrUrFgmR4RHQk08z3zib8jGur13nou8R osmlV7fiPd2XCei+vX6DXjfSu5Y8Uwyjsmis09f0NHDb3Mzn5+l8vG6W1hox3dA3EjH9 iVn5RHJUM4xsBBtzI+mTBEaY6IB1T7raBCPTU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; b=bQdWyuiZJAhuOx1PxoSrOvGCnpK1LJp3bQGhBG/sih+wv8t4XLxLmi+DiRGKBcNTnM el5RlrB1l0e8sM8/YLDzkUDuCh0RhYPAqo6r+xdhYQEMv1qYgIjqNsWC2qki0hvAIgHv OM31pptO9iACY+oEAI4ZDvSwslMi5FZQ8B84c= MIME-Version: 1.0 Received: by 10.229.29.85 with SMTP id p21mr1488496qcc.101.1252727735381; Fri, 11 Sep 2009 20:55:35 -0700 (PDT) In-Reply-To: <237c27100909111905y244924c1n93b4e4d9ceda44be@mail.gmail.com> References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> <200909111506.47309.jhb@freebsd.org> <237c27100909111905y244924c1n93b4e4d9ceda44be@mail.gmail.com> Date: Fri, 11 Sep 2009 23:55:35 -0400 Message-ID: <237c27100909112055i35612b4btbfbecb8b5dd1568c@mail.gmail.com> From: Linda Messerschmidt To: John Baldwin 
Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 12 Sep 2009 03:55:36 -0000 OK, I have learned that ktrdump looks up the name of the process associated with a particular KSE at the time of the dump, so if it's changed since tracing stopped, it will blissfully blame the wrong process. I understand why that's the case, but it still sucks for troubleshooting. :( This time, "pf task mtx" and "vnode_free_list" are the locks getting the blame. The processes fingered are an httpd (the root "parent" of the one doing the work, which does nothing but select() for 1s and wait to see if its children died), and vnlru. No correlation at all to the previous results, and this machine is now utterly quiescent except for the httpd process and the PHP exerciser. Hard to imagine vnlru has 1s worth of running to do on a machine with 949 total vnodes in use. A third run produced a 997ms "lock acquire" for "buffer daemon lock," a 497ms one for ip6qlock (no, there's no IPv6 in use on this machine), and an 8s (!!!) one on unp_mtx. bufdaemon had a 997s "running" bar, but according to the raw TSC values, that happened on the same CPU 1.999s *after* the 997ms buffer daemon lock acquire. I really don't know where to go from here. There's so little consistency that I'm just not sure if the data is bad, the tool is bad, the operator is bad, or there's some problem so fundamentally horrible that all I'm seeing is random side effects.
From owner-freebsd-hackers@FreeBSD.ORG Sat Sep 12 04:06:15 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 29CC41065670 for ; Sat, 12 Sep 2009 04:06:15 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outQ.internet-mail-service.net (outq.internet-mail-service.net [216.240.47.240]) by mx1.freebsd.org (Postfix) with ESMTP id 0B3508FC23 for ; Sat, 12 Sep 2009 04:06:14 +0000 (UTC) Received: from idiom.com (mx0.idiom.com [216.240.32.160]) by out.internet-mail-service.net (Postfix) with ESMTP id D6174B9872; Fri, 11 Sep 2009 21:06:14 -0700 (PDT) X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e Received: from julian-mac.elischer.org (home.elischer.org [216.240.48.38]) by idiom.com (Postfix) with ESMTP id 160F42D6011; Fri, 11 Sep 2009 21:06:14 -0700 (PDT) Message-ID: <4AAB1E34.2060908@elischer.org> Date: Fri, 11 Sep 2009 21:06:12 -0700 From: Julian Elischer User-Agent: Thunderbird 2.0.0.23 (Macintosh/20090812) MIME-Version: 1.0 To: Linda Messerschmidt References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> <200909111506.47309.jhb@freebsd.org> <237c27100909111905y244924c1n93b4e4d9ceda44be@mail.gmail.com> <237c27100909112055i35612b4btbfbecb8b5dd1568c@mail.gmail.com> In-Reply-To: <237c27100909112055i35612b4btbfbecb8b5dd1568c@mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 12 Sep 2009 04:06:15 
-0000

Linda Messerschmidt wrote:
> OK, I have learned that ktrdump looks up the name of the process
> associated with a particular KSE at the time of the dump, so if
> it's changed since tracing stopped, it will blissfully blame the wrong
> process. I understand why that's the case, but it still sucks for
> troubleshooting. :(
>
> This time, "pf task mtx" and "vnode_free_list" are the locks getting
> the blame. The processes fingered are an httpd (the root "parent"
> of the one doing the work, which does nothing but select() for 1s and
> wait to see if its children died), and vnlru. No correlation at all
> to the previous results, and this machine is now utterly quiescent
> except for the httpd process and the PHP exerciser. Hard to imagine
> vnlru has 1s worth of running to do on a machine with 949 total vnodes
> in use.
>
> A third run produced a 997ms "lock acquire" for "buffer daemon lock,"
> a 497ms one for ip6qlock (no, there's no IPv6 in use on this machine),
> and an 8s (!!!) one on unp_mtx. bufdaemon had a 997s "running" bar,
> but according to the raw TSC values, that happened on the same CPU
> 1.999s *after* the 997ms buffer daemon lock acquire.
>
> I really don't know where to go from here. There's so little
> consistency that I'm just not sure if the data is bad, the tool is
> bad, the operator is bad, or there's some problem so fundamentally
> horrible that all I'm seeing is random side effects.

Does the system have a serial console? How about a normal console/keyboard?

How often does it hang, and for how long?

Is there a chance that you could notice when it is hung, hit the break sequence, and drop it into the debugger IN the hung state?
It is possible, if you have a serial port, to make a program that sends a char back and forth and, when the machine hangs, sends the magic sequence. (I think it's CR for serial debugger break, but I'm sure you can look up the kernel options and the chars in google.)

If you can drop the machine into DDB (the kernel debugger) in the hung state, then there are lots of commands you can use to find out what is wrong. jhb actually gave a short talk that I videoed and put on youtube on the topic.

ps will show you what is actually running on which CPU and you can see what locks all the other processes are waiting on. Then you can examine those locks and see who owns them.

From owner-freebsd-hackers@FreeBSD.ORG Sat Sep 12 04:47:33 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1A12C1065670; Sat, 12 Sep 2009 04:47:33 +0000 (UTC) (envelope-from linda.messerschmidt@gmail.com) Received: from mail-qy0-f200.google.com (mail-qy0-f200.google.com [209.85.221.200]) by mx1.freebsd.org (Postfix) with ESMTP id ADB998FC19; Sat, 12 Sep 2009 04:47:32 +0000 (UTC) Received: by qyk38 with SMTP id 38so1416699qyk.27 for ; Fri, 11 Sep 2009 21:47:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=E0ES0qzX1s/QQHqzNRUrEmLteYbJ5VeX2hoOHPWWUg8=; b=stI2a9NCwc458GiVAL7onTljaIwWDcQLtpWXJN74tA/5qGt/7WEVou34mtg2CgHZHB QOovx+6zhUOd+76ukA1gWp1FvyV//3aq0pczq6fXcdIYupB0PKyj2Av71yr2QY3qjWzp WCptlpP9YO4L5W6mXdjQdPJQlTJeHs5MqyFkg= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=u6UEqjLwxIaCAl8xDIukGAj/ufoEowGSIX73EPaV9QWW3Hth0EgSsHTatd3UbmZ2zf
mMtKtRExEiHzs0PR4fQsP0ZFCyhuqCuDbhuwXYBpoqMQAEaCOKI2U5taEURdDLPGP+35 ALsZuPH/FPED64my6nLoETbjTuD5Cg2pYrCQ8= MIME-Version: 1.0 Received: by 10.229.119.69 with SMTP id y5mr1469532qcq.100.1252730851756; Fri, 11 Sep 2009 21:47:31 -0700 (PDT) In-Reply-To: <4AAB1E34.2060908@elischer.org> References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> <200909111506.47309.jhb@freebsd.org> <237c27100909111905y244924c1n93b4e4d9ceda44be@mail.gmail.com> <237c27100909112055i35612b4btbfbecb8b5dd1568c@mail.gmail.com> <4AAB1E34.2060908@elischer.org> Date: Sat, 12 Sep 2009 00:47:31 -0400 Message-ID: <237c27100909112147h64f71585p2a97f2b48a510985@mail.gmail.com> From: Linda Messerschmidt To: Julian Elischer Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 12 Sep 2009 04:47:33 -0000

On Sat, Sep 12, 2009 at 12:06 AM, Julian Elischer wrote:
> does the system have a serial console? how about a normal console /keyboard?

It has an IP KVM.

> how often does it hang? and for how long?

Well, this is interesting. I got really frustrated with the other approach, so I thought I'd thin a machine down absolutely as far as I could, eliminate every possible source of delay, and see what happens. I killed everything... cron, RPC, NFS, devd, gmon, nrpe, everything. The Apache and its exerciser are now the only things running on the machine, and the Apache is only touching an md0 swap device mounted on /mnt. I *still* get the hangs.
It hangs for all sorts of different periods, but the duration of the stall is approximately inversely proportional to the chance of seeing it. To get a short delay, you need wait only a little bit. If you want a 2-3 second delay, you may have to wait 15-20 minutes.

*However* in order to answer your question, I changed up the test program, which up til now has been cycling requests every 50 ms until it gets one >2s, at which point it sysctls to stop ktr and aborts.

Now it prints the timestamp of all "too long" requests. But I also dropped the threshold for "too long" from 2s to 100ms, since with everything on RAM disk, there's no longer any reason to expect a request to take more than 1-2ms in the worst case.

The results are pretty profound:

1252729876: request 82 131ms
1252729883: request 210 388ms
1252729890: request 338 380ms
1252729897: request 466 388ms
1252729904: request 594 404ms
1252729919: request 849 810ms
1252729926: request 977 386ms
1252729933: request 1105 370ms
1252729940: request 1233 366ms
1252729947: request 1361 400ms
1252729961: request 1617 746ms
1252729968: request 1744 477ms
1252729975: request 1872 388ms
1252729982: request 2000 380ms
1252729989: request 2128 384ms
1252729996: request 2256 395ms

It goes on and on like this, I get a 380-400ms stall every seven seconds. I have had a few come back higher, in the 750-850ms range, usually after missing a beat:

1252729897: request 466 388ms
1252729904: request 594 404ms
1252729919: request 849 810ms
1252729926: request 977 386ms

1252730010: request 2512 416ms
1252730017: request 2640 390ms
1252730031: request 2896 774ms
1252730038: request 3023 431ms

1252730454: request 10568 378ms
1252730461: request 10696 397ms
1252730475: request 10952 733ms
1252730482: request 11080 366ms

So far, nothing over 1s.

So what happens every seven seconds??
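
As a rough cross-check of the seven-second period, the gaps between successive stall timestamps can be computed straight from the log above. This is only an illustrative sketch: the function names `stall_gaps` and `print_gaps` are made up for this example and are not part of the test program described in the message.

```c
#include <stddef.h>
#include <stdio.h>

/* Write the gap (in seconds) between consecutive timestamps into gaps[].
 * Returns the number of gaps written: n - 1 for n >= 2, otherwise 0.
 * (Hypothetical helper, named for this example only.) */
size_t stall_gaps(const long *ts, size_t n, long *gaps)
{
    if (n < 2)
        return 0;
    for (size_t i = 1; i < n; i++)
        gaps[i - 1] = ts[i] - ts[i - 1];
    return n - 1;
}

/* Print each gap on its own line so a periodic pattern is easy to spot. */
void print_gaps(const long *gaps, size_t n)
{
    for (size_t i = 0; i < n; i++)
        printf("gap %zu: %lds\n", i, gaps[i]);
}
```

Fed the first six timestamps from the log (1252729876, 1252729883, 1252729890, 1252729897, 1252729904, 1252729919), `stall_gaps` yields 7, 7, 7, 7 and 15 seconds. The 15-second gap is the one leading up to the 810ms entry, which fits the "missing a beat" description.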
From owner-freebsd-hackers@FreeBSD.ORG Sat Sep 12 05:47:15 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4F52C106566C for ; Sat, 12 Sep 2009 05:47:15 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outW.internet-mail-service.net (outw.internet-mail-service.net [216.240.47.246]) by mx1.freebsd.org (Postfix) with ESMTP id E79738FC16 for ; Sat, 12 Sep 2009 05:47:14 +0000 (UTC) Received: from idiom.com (mx0.idiom.com [216.240.32.160]) by out.internet-mail-service.net (Postfix) with ESMTP id C3B71D1DCD; Fri, 11 Sep 2009 22:47:14 -0700 (PDT) X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e Received: from julian-mac.elischer.org (home.elischer.org [216.240.48.38]) by idiom.com (Postfix) with ESMTP id 1C0E12D6018; Fri, 11 Sep 2009 22:47:14 -0700 (PDT) Message-ID: <4AAB35E0.3000908@elischer.org> Date: Fri, 11 Sep 2009 22:47:12 -0700 From: Julian Elischer User-Agent: Thunderbird 2.0.0.23 (Macintosh/20090812) MIME-Version: 1.0 To: Linda Messerschmidt References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> <200909111506.47309.jhb@freebsd.org> <237c27100909111905y244924c1n93b4e4d9ceda44be@mail.gmail.com> <237c27100909112055i35612b4btbfbecb8b5dd1568c@mail.gmail.com> <4AAB1E34.2060908@elischer.org> <237c27100909112147h64f71585p2a97f2b48a510985@mail.gmail.com> In-Reply-To: <237c27100909112147h64f71585p2a97f2b48a510985@mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 12 Sep 2009 05:47:15 -0000

Linda Messerschmidt wrote:
> On Sat, Sep 12, 2009 at 12:06 AM, Julian Elischer wrote:
>> does the system have a serial console? how about a normal console /keyboard?
>
> It has an IP KVM.
>
>> how often does it hang? and for how long?
>
> Well, this is interesting. I got really frustrated with the other
> approach, so I thought I'd thin a machine down absolutely as far as I
> could, eliminate every possible source of delay, and see what happens.
> I killed everything... cron, RPC, NFS, devd, gmon, nrpe, everything.
> The Apache and its exerciser are now the only things running on the
> machine, and the Apache is only touching an md0 swap device mounted on
> /mnt. I *still* get the hangs.

OK, now we need to describe the hang. If you can predictably get a hang every 7 seconds, does this mean that it doesn't respond to keyboard for a moment every 7 seconds? Or that it doesn't accept packets every 7 seconds? If you lean on the A key, do you see echo stop every 7 seconds for a moment?

Or is it just the apache process that hangs?

Does the watching process that you refer to below also hang? Would it hang if it tried to access the disk?

If the watching process is on the same machine, does it only trigger AFTER the request has taken a long time, or could it time out with a select DURING the delayed response? (Another way of asking "how hung is 'hung'?")

>
> It hangs for all sorts of different periods, but the duration of the
> stall is approximately inversely proportional to the chance of seeing
> it. To get a short delay, you need wait only a little bit. If you
> want a 2-3 second delay, you may have to wait 15-20 minutes.
>
> *However* in order to answer your question, I changed up the test
> program, which up til now has been cycling requests every 50 ms until
> it gets one >2s, at which point it sysctls to stop ktr and aborts.
>
> Now it prints the timestamp of all "too long" requests. But I also
> dropped the threshold for "too long" from 2s to 100ms, since with
> everything on RAM disk, there's no longer any reason to expect a
> request to take more than 1-2ms in the worst case.
>
> The results are pretty profound:
>
> 1252729876: request 82 131ms
> 1252729883: request 210 388ms
> 1252729890: request 338 380ms
> 1252729897: request 466 388ms
> 1252729904: request 594 404ms
> 1252729919: request 849 810ms
> 1252729926: request 977 386ms
> 1252729933: request 1105 370ms
> 1252729940: request 1233 366ms
> 1252729947: request 1361 400ms
> 1252729961: request 1617 746ms
> 1252729968: request 1744 477ms
> 1252729975: request 1872 388ms
> 1252729982: request 2000 380ms
> 1252729989: request 2128 384ms
> 1252729996: request 2256 395ms
>
> It goes on and on like this, I get a 380-400ms stall every seven
> seconds. I have had a few come back higher, in the 750-850ms range,
> usually after missing a beat:
>
> 1252729897: request 466 388ms
> 1252729904: request 594 404ms
> 1252729919: request 849 810ms
> 1252729926: request 977 386ms
>
> 1252730010: request 2512 416ms
> 1252730017: request 2640 390ms
> 1252730031: request 2896 774ms
> 1252730038: request 3023 431ms
>
> 1252730454: request 10568 378ms
> 1252730461: request 10696 397ms
> 1252730475: request 10952 733ms
> 1252730482: request 11080 366ms
>
> So far, nothing over 1s.
>
> So what happens every seven seconds??
From owner-freebsd-hackers@FreeBSD.ORG Sat Sep 12 06:52:52 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B033C1065676; Sat, 12 Sep 2009 06:52:52 +0000 (UTC) (envelope-from linda.messerschmidt@gmail.com) Received: from mail-qy0-f200.google.com (mail-qy0-f200.google.com [209.85.221.200]) by mx1.freebsd.org (Postfix) with ESMTP id 50A588FC15; Sat, 12 Sep 2009 06:52:52 +0000 (UTC) Received: by qyk38 with SMTP id 38so1442528qyk.27 for ; Fri, 11 Sep 2009 23:52:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=t/N0lDkppVuny6+YPau70tbzO/7VQqp55IgXZcc6wIs=; b=gQWtXNqzFf/JV/izYVgfiPz3rI8ixeC7LMGI11POKgvyYtVKfzajHuOxnYmZWoCAfe I/Ec0rBIBsUv3AkUwqxS2kHMJmg0o5Jdyu8XF9U36U+P3OiKqv2ORr5PAwZ4lBBq9umo 4hxcAnzA1AnFiZQzVdKnSaJ7SRVo4SM/eW6sU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=bk24Dba7Zudem1d7T0d5vQ6hBdESwNmefkaITp0Ct2+NYwFF1cOYs/jSHR4nUJJnUn I/C08RAeKbzG6tRAVIsBE9Kyb+CrthFjY6gvuJFmB2lpktz7g6gCKGo8UHFDz5Jcrbx+ Mu3RNpFxt+vkN70GIBnLURStyPgwI9SsV21jw= MIME-Version: 1.0 Received: by 10.229.106.83 with SMTP id w19mr1573556qco.72.1252738371579; Fri, 11 Sep 2009 23:52:51 -0700 (PDT) In-Reply-To: <4AAB35E0.3000908@elischer.org> References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> <200909111506.47309.jhb@freebsd.org> <237c27100909111905y244924c1n93b4e4d9ceda44be@mail.gmail.com> <237c27100909112055i35612b4btbfbecb8b5dd1568c@mail.gmail.com> <4AAB1E34.2060908@elischer.org> 
<237c27100909112147h64f71585p2a97f2b48a510985@mail.gmail.com> <4AAB35E0.3000908@elischer.org> Date: Sat, 12 Sep 2009 02:52:51 -0400 Message-ID: <237c27100909112352k5504357dge725c8f905ee650a@mail.gmail.com> From: Linda Messerschmidt To: Julian Elischer Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 12 Sep 2009 06:52:52 -0000

On Sat, Sep 12, 2009 at 1:47 AM, Julian Elischer wrote:
> ok now we need to describe the hang.. if you can predictably get a hang
> every 7 seconds does this mean that it doesn't respond to keyboard for a
> moment every 7 seconds?

It's possible.

> or that it doesn't accept packets every 7 seconds?

It appears that it accepts & responds to at least pings; I was able to do an every-0.1-seconds ping through a bevy of 300-1900ms stalls with:

    2323 packets transmitted, 2323 packets received, 0% packet loss
    round-trip min/avg/max/stddev = 0.120/1.019/5.979/0.288 ms

As best as I could tell, schedgraph also showed that the clock interrupt and the em0 interrupt always got serviced on time. Pretty much seems like it's userspace that's getting put on hold.

> Or is it just the apache process that hangs?

This is where I started from. In the original post (way long ago now), I described how pretty much every process on the system went into the kernel for something and stalled there, and then when the stall ends, they all unblock at once. I posted some examples via ktrace that I sadly no longer have the source data for.

> Does the watching process that you refer to below also hang?

I don't think I can say for sure.
I observe visual stalls from time to time in the output if I have it show every request, even where no stall is reported, which could indicate either that a stall occurred outside the request or that my shoddy Internet connection has 100ms latency and consistent 1% packet loss, which it does.

I did write a short C program that just select()s on stdin for 100ms over and over and aborts if it takes more than 125ms to go through the loop; it never aborts, even through 1s+ stalls, and the loop times it reports are consistently 110ms regardless of what else is going on, which I don't think is unexpected.

However, I'm not sure why that differs from the behavior of the "master" Apache processes, which select() for 1 second all day long, but do appear to be affected. Maybe because they are selecting a network socket instead of a tty? I don't know.

Also, if I disable NTP, the system does not appear to lose time during the stalls, which fits with the consistent clock interrupts I saw.

> would it hang if it tried to access the disk?

By using the md device, I believe I have removed the disk from the equation; neither process is accessing it. Even without doing that, if I leave iostat -w 1 running alongside the test, there's no correlation between the tiny amount of disk activity there is and observed stalls.

> if the watching process is on the same machine, does it only trigger AFTER
> the request has taken a long time or could it time out with a select DURING
> the delayed response? (another way of asking "how hung
> is 'hung'?")

It's just a PHP script using libcurl to request the file. I only moved it to the same machine in order to have it be able to write the sysctl to stop the KTR traces I was doing.

If you're asking could the check script be modified to time out after, say, 1 second, and if so, would it return during the hang or after it? I don't know. My guess based on the earlier ktrace output is that it would time out, but not return until the hang ended.
I'll see if the curl lib exposes a configurable timeout and try it.

From owner-freebsd-hackers@FreeBSD.ORG Sat Sep 12 06:55:09 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 403E41065676; Sat, 12 Sep 2009 06:55:09 +0000 (UTC) (envelope-from linda.messerschmidt@gmail.com) Received: from mail-qy0-f200.google.com (mail-qy0-f200.google.com [209.85.221.200]) by mx1.freebsd.org (Postfix) with ESMTP id D7AC58FC0A; Sat, 12 Sep 2009 06:55:08 +0000 (UTC) Received: by qyk38 with SMTP id 38so1442943qyk.27 for ; Fri, 11 Sep 2009 23:55:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=2+ZgbRPsxYeTS0xEmoVSWlqq+uFDF/XB4YoF8preiS0=; b=r+W7DlJAgpX4QS+7SJT+3m/l1Hfz6kL9wxw5ulRnwBGhfubYP9O256v2jE0btLYv2t xBDR6GIb2B063eXLI1hcPaqItkqARGp+AImXJ9x0FRqQTwfGbyfIi4oTX9j08A0VIjTM rkpgod/ESvrXjuQHRUabswJbN26eUmm/0VxUQ= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=CMq6b8nfMjN9eOYHAc0ChgP6WesfQjQOOGUNkNl1VOYkbdgcrhNIp/zC2mcAPk/r3Q Knpeyh/v+Kzd9njgh0x2luf0bTH45GsqX0vVndxUnUy3BzJ1Nux0IQTWMS8qDwoSL5bf 1iXAjNIYl1P0Cm8FuaKVkZ3nGoYBMwLYoEYmM= MIME-Version: 1.0 Received: by 10.229.119.69 with SMTP id y5mr1480758qcq.100.1252738508390; Fri, 11 Sep 2009 23:55:08 -0700 (PDT) In-Reply-To: <237c27100909112352k5504357dge725c8f905ee650a@mail.gmail.com> References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> <200909111506.47309.jhb@freebsd.org> <237c27100909111905y244924c1n93b4e4d9ceda44be@mail.gmail.com>
<237c27100909112055i35612b4btbfbecb8b5dd1568c@mail.gmail.com> <4AAB1E34.2060908@elischer.org> <237c27100909112147h64f71585p2a97f2b48a510985@mail.gmail.com> <4AAB35E0.3000908@elischer.org> <237c27100909112352k5504357dge725c8f905ee650a@mail.gmail.com> Date: Sat, 12 Sep 2009 02:55:08 -0400 Message-ID: <237c27100909112355xbf1354djfe0b562195546bca@mail.gmail.com> From: Linda Messerschmidt To: Julian Elischer Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 12 Sep 2009 06:55:09 -0000

On Sat, Sep 12, 2009 at 2:52 AM, Linda Messerschmidt wrote:
> On Sat, Sep 12, 2009 at 1:47 AM, Julian Elischer wrote:
>> ok now we need to describe the hang.. if you can predictably get a hang
>> every 7 seconds does this mean that it doesn't respond to keyboard for a
>> moment every 7 seconds?
>
> It's possible.

Oops, I meant to explain that my ISP connection and personal sense of time are probably not good enough to say one way or the other for sure. I do see stalls, but I can't say whether they are the same stall or just a dropped packet somewhere along the way.
From owner-freebsd-hackers@FreeBSD.ORG Sat Sep 12 07:52:23 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E81FA1065676; Sat, 12 Sep 2009 07:52:23 +0000 (UTC) (envelope-from linda.messerschmidt@gmail.com) Received: from mail-qy0-f195.google.com (mail-qy0-f195.google.com [209.85.221.195]) by mx1.freebsd.org (Postfix) with ESMTP id 88D448FC12; Sat, 12 Sep 2009 07:52:23 +0000 (UTC) Received: by qyk33 with SMTP id 33so252696qyk.14 for ; Sat, 12 Sep 2009 00:52:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=zMOGUudmOKhc8QWBvSLLYus+/xijt/SOWjTny66tLDY=; b=qnV0WdMjnL6DIUf1GAaNEdGuPnhXNX8sF/+K2q2U25pBs8TSWiUGlZ8ibjNSnEjb2L Rkyz+EbzNn4FOWdmCU4mh/M/Xyc50yvTnb80qgObAJT5DDqZrMYJEoqBjRI0QMTeijaT qvneSQRMPnoHxjW23gNwkvVfhZPJDqQBUZ+YI= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=sVWbpmSmsmdN9GKph/qP7/ZKJ7pl0UANg8yEuocBHDEsPuf3kBF4XgRFXY1OEbk7DR jeSXf7cSeFZo835MLK3qDvTs+SSt7gnHrA0+w3JVTAdzR63HAmkS/GRv8REwVE9hRYFP X6z+3EqWSz5nj6wsliVDgGqkDPTBQN9ePGYP8= MIME-Version: 1.0 Received: by 10.229.9.147 with SMTP id l19mr1536685qcl.65.1252741942766; Sat, 12 Sep 2009 00:52:22 -0700 (PDT) In-Reply-To: <237c27100909112352k5504357dge725c8f905ee650a@mail.gmail.com> References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> <200909111506.47309.jhb@freebsd.org> <237c27100909111905y244924c1n93b4e4d9ceda44be@mail.gmail.com> <237c27100909112055i35612b4btbfbecb8b5dd1568c@mail.gmail.com> <4AAB1E34.2060908@elischer.org> 
<237c27100909112147h64f71585p2a97f2b48a510985@mail.gmail.com> <4AAB35E0.3000908@elischer.org> <237c27100909112352k5504357dge725c8f905ee650a@mail.gmail.com> Date: Sat, 12 Sep 2009 03:52:22 -0400 Message-ID: <237c27100909120052k1db7e029xcf36e075865d29d8@mail.gmail.com> From: Linda Messerschmidt To: Julian Elischer Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 12 Sep 2009 07:52:24 -0000

OK, first, I figured out the seven second thing. I actually had already found that particular issue earlier in the troubleshooting process, but forgot all about it when I pulled in a second machine to test with. It was simply a case of setting Apache's MaxRequestsPerChild to a very low value (128) in combination with only allowing 1 access at a time. 128 requests * (50ms sleep + 2ms request + overhead) ~= 7s.

So that was just noise masking the real problem, which is less frequent and less predictable. Sorry for the red herring. :(

On Sat, Sep 12, 2009 at 2:52 AM, Linda Messerschmidt wrote:
> If you're asking could the check script be modified to time out after,
> say, 1 second, and if so, would it return during the hang or after it?
> I don't know. My guess based on the earlier ktrace output is that it
> would time out, but not return until the hang ended. I'll see if
> the curl lib exposes a configurable timeout and try it.

This proved to be quite easy to do. I ran the script twice, once with the timeout and once without.
Without timeout:

1252741492: request 910 101ms
1252741567: request 2133 1429ms
1252741603: request 2722 146ms

With 1s timeout:

1252741492: request 1078 106ms
1252741567: request 2302 1010ms (<--- Timeout)
1252741567: request 2303 273ms (<--- after 50ms sleep, goes back to end of stall)
1252741603: request 2892 136ms

As you can see, the two scripts experience stalls in pretty much lockstep, but the script itself does not appear affected, so it's just on the Apache side.

From owner-freebsd-hackers@FreeBSD.ORG Sat Sep 12 13:54:00 2009 Return-Path: Delivered-To: hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EC0A3106568D for ; Sat, 12 Sep 2009 13:54:00 +0000 (UTC) (envelope-from des@des.no) Received: from tim.des.no (tim.des.no [194.63.250.121]) by mx1.freebsd.org (Postfix) with ESMTP id B277B8FC0C for ; Sat, 12 Sep 2009 13:54:00 +0000 (UTC) Received: from ds4.des.no (des.no [84.49.246.2]) by smtp.des.no (Postfix) with ESMTP id BFC2E6D418 for ; Sat, 12 Sep 2009 13:53:59 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id 96406844A5; Sat, 12 Sep 2009 15:53:59 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: hackers@freebsd.org Date: Sat, 12 Sep 2009 15:53:59 +0200 Message-ID: <86fxasl154.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.92 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: Subject: DDB capture buffer X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 12 Sep 2009 13:54:01 -0000

The default maximum size of the DDB capture buffer is 48 kB.
This is ridiculously low; it's not even nearly enough to capture the output from the first example in textdump(4):

    script kdb.enter.panic=textdump set; capture on; show allpcpu; bt; ps; alltrace; show alllocks; call doadump; reset

Would anyone object to increasing it to 1 MB? DDB is opt-in, so it will only affect people who compile it into their kernel (or -CURRENT users who don't compile it out; they have it coming).

DES
-- 
Dag-Erling Smørgrav - des@des.no