From: Bruce Evans
To: Robert Watson
cc: current@freebsd.org
Date: Fri, 9 Apr 2004 00:30:58 +1000 (EST)
Subject: Re: panic on one cpu leaves others running...
Message-ID: <20040409000110.J16339@gamplex.bde.org>
On Thu, 8 Apr 2004, Robert Watson wrote:

> panic: m 0 so->so_rcv.sb_cc 17
> at line 860 in file ../../../kern/uipc_socket.c
> cpuid = 1;
> Debugger("panic")
> Stopped at Debugger+0x46: xchgl %ebx,in_Debugger.0
> db> trace
> Debugger(c07c3990) at Debugger+0x46
> __panic(c07c98f1,35c,c07c997d,0,11) at __panic+0x13d
> soreceive(c6664618,e9891c0c,e9891c38,0,e9891c10) at soreceive+0x20c
> recvit(c6561e70,3,e9891cc0,0,bfbfe410) at recvit+0x1a2
> recvmsg(c6561e70,e9891d14,3,4,296) at recvmsg+0x9a
> syscall(808002f,bfbf002f,bfbf002f,bfbfe44c,8079a70) at syscall+0x217
> Xint0x80_syscall() at Xint0x80_syscall+0x1d
> --- syscall (27, FreeBSD ELF32, recvmsg), eip = 0x282afff7, esp =
> 0xbfbfe3fc, ebp = 0xbfbfe458 ---
> db> Apr 8 04:09:29 sm-mta[3550]: i3831Ija003419: SYSERR(root): hash map
> "Alias0": missing map file /etc/mail/aliases.db: No such file or directory
> Apr 8 04:09:29 sm-mta[3550]: i3831Ija003419: SYSERR(root): cannot
> flock(/etc/mail/aliases, fd=5, type=1, omode=40000, euid=0): Operation not
> supported
>
> Funky, eh? I thought we used to have code to ipi the other cpu's and halt
> them until the cpu in ddb was out agian. I guess I mis-remember, or that
> code is broken...

ddb stops the other CPUs (at least on i386's, unless you have edited
smptests.h to comment out the option CPUSTOP_ON_DDBBREAK which should be
non-optional (always enabled)), but plain panic() doesn't stop them
immediately, so much may happen on other CPUs if ddb is not called from
panic() or if ddb has problems stopping the CPUs. ddb does have problems
stopping the CPUs, but I don't see how it can reach the db> prompt before
stopping them.
The main problem is that stopping all the other CPUs may be impossible
because one of them is looping with IPIs disabled, perhaps because it is
trying to enter ddb (and stop other CPUs) too. All CPUs entering ddb
should hang in this case.

Half-baked fixes:

%%%
Index: db_interface.c
===================================================================
RCS file: /home/ncvs/src/sys/i386/i386/db_interface.c,v
retrieving revision 1.81
diff -u -2 -r1.81 db_interface.c
--- db_interface.c	3 Apr 2004 22:23:36 -0000	1.81
+++ db_interface.c	4 Apr 2004 05:37:38 -0000
@@ -35,4 +35,5 @@
 #include
 #include
+#include
 #include
 #include
@@ -41,4 +42,5 @@
 #include
 #ifdef SMP
+#include
 #include /** CPUSTOP_ON_DDBBREAK */
 #endif
@@ -61,4 +63,33 @@
 static jmp_buf	db_global_jmpbuf;
 
+#ifdef SMP
+/* XXX this is cloned from stop_cpus() since that function can hang. */
+static int
+attempt_to_stop_cpus(u_int map)
+{
+	int i;
+
+	if (!smp_started)
+		return 0;
+
+	CTR1(KTR_SMP, "attempt_to_stop_cpus(%x)", map);
+
+	/* send the stop IPI to all CPUs in map */
+	ipi_selected(map, IPI_STOP);
+
+	i = 0;
+	while ((atomic_load_acq_int(&stopped_cpus) & map) != map) {
+		/* spin */
+		i++;
+		if (i == 100000000) {
+			printf("timeout stopping cpus\n");
+			break;
+		}
+	}
+
+	return 1;
+}
+#endif /* SMP */
+
 /*
  * kdb_trap - field a TRACE or BPT trap
@@ -69,4 +100,8 @@
 	u_int ef;
 	volatile int ddb_mode = !(boothowto & RB_GDB);
+#ifdef SMP
+	static u_int kdb_trap_lock = NOCPU;
+	static u_int output_lock;
+#endif
 
 	/*
@@ -91,16 +126,48 @@
 
 #ifdef SMP
+	if (atomic_cmpset_int(&kdb_trap_lock, NOCPU, PCPU_GET(cpuid)) == 0 &&
+	    kdb_trap_lock != PCPU_GET(cpuid)) {
+		while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
+			;
+		db_printf(
+		    "concurrent ddb entry: type %d trap, code=%x cpu=%d\n",
+		    type, code, PCPU_GET(cpuid));
+		atomic_store_rel_int(&output_lock, 0);
+		if (type == T_BPTFLT)
+			regs->tf_eip--;
+		else {
+			while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
+				;
+			db_printf(
+"concurrent ddb entry on non-breakpoint: too hard to handle properly\n");
+			atomic_store_rel_int(&output_lock, 0);
+		}
+		while (atomic_load_acq_int(&kdb_trap_lock) != NOCPU)
+			;
+		write_eflags(ef);
+		return (1);
+	}
+#endif
+
+#ifdef SMP
 #ifdef CPUSTOP_ON_DDBBREAK
+#define	VERBOSE_CPUSTOP_ON_DDBBREAK_NOT
 
 #if defined(VERBOSE_CPUSTOP_ON_DDBBREAK)
+	while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
+		;
 	db_printf("\nCPU%d stopping CPUs: 0x%08x...", PCPU_GET(cpuid),
 	    PCPU_GET(other_cpus));
+	atomic_store_rel_int(&output_lock, 0);
 #endif /* VERBOSE_CPUSTOP_ON_DDBBREAK */
 
 	/* We stop all CPUs except ourselves (obviously) */
-	stop_cpus(PCPU_GET(other_cpus));
+	attempt_to_stop_cpus(PCPU_GET(other_cpus));
 
 #if defined(VERBOSE_CPUSTOP_ON_DDBBREAK)
+	while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
+		;
 	db_printf(" stopped.\n");
+	atomic_store_rel_int(&output_lock, 0);
 #endif /* VERBOSE_CPUSTOP_ON_DDBBREAK */
 
@@ -192,22 +259,37 @@
 
 #if defined(VERBOSE_CPUSTOP_ON_DDBBREAK)
+	while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
+		;
 	db_printf("\nCPU%d restarting CPUs: 0x%08x...", PCPU_GET(cpuid),
 	    stopped_cpus);
+	atomic_store_rel_int(&output_lock, 0);
 #endif /* VERBOSE_CPUSTOP_ON_DDBBREAK */
 
 	/* Restart all the CPUs we previously stopped */
 	if (stopped_cpus != PCPU_GET(other_cpus) && smp_started != 0) {
+		while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
+			;
 		db_printf("whoa, other_cpus: 0x%08x, stopped_cpus: 0x%08x\n",
 			  PCPU_GET(other_cpus), stopped_cpus);
+		atomic_store_rel_int(&output_lock, 0);
+#if 0
 		panic("stop_cpus() failed");
+#endif
 	}
 	restart_cpus(stopped_cpus);
 
 #if defined(VERBOSE_CPUSTOP_ON_DDBBREAK)
+	while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
+		;
 	db_printf(" restarted.\n");
+	atomic_store_rel_int(&output_lock, 0);
 #endif /* VERBOSE_CPUSTOP_ON_DDBBREAK */
 
 #endif /* CPUSTOP_ON_DDBBREAK */
 #endif /* SMP */
+
+#ifdef SMP
+	atomic_store_rel_int(&kdb_trap_lock, NOCPU);
+#endif
 
 	write_eflags(ef);
%%%

This is supposed to wait for the other CPUs to either stop or enter ddb.
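
For readers who want to experiment with the two ideas in the patch outside
the kernel, here is a rough userland sketch using C11 atomics. It is not
the kernel code: the names wait_for_stopped(), try_enter_debugger() and
leave_debugger() are hypothetical, the NOCPU value is an assumption made
for this sketch, and nothing here sends an IPI. It only models the bounded
spin on a "stopped" bitmask and the first-CPU-wins ownership word.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed "no owner" marker for this sketch only. */
#define NOCPU 0xffu

/* Bitmask of CPUs that have acknowledged the stop request. */
static atomic_uint stopped_cpus;

/* Id of the CPU that currently owns the debugger, or NOCPU. */
static atomic_uint trap_lock = NOCPU;

/*
 * Poll until every CPU in map has set its bit in stopped_cpus.
 * Gives up after max_spins polls instead of hanging forever.
 * Returns true if all CPUs in map stopped, false on timeout.
 */
static bool
wait_for_stopped(unsigned map, unsigned long max_spins)
{
	unsigned long i;

	for (i = 0; i < max_spins; i++) {
		if ((atomic_load_explicit(&stopped_cpus,
		    memory_order_acquire) & map) == map)
			return true;
	}
	return false;
}

/*
 * Try to become the debugger owner. Succeeds if the lock was free,
 * or if this CPU already holds it (recursive entry). A false return
 * means another CPU got there first.
 */
static bool
try_enter_debugger(unsigned cpu)
{
	unsigned expected = NOCPU;

	if (atomic_compare_exchange_strong(&trap_lock, &expected, cpu))
		return true;
	/* On failure, expected holds the current owner. */
	return expected == cpu;
}

/* Release ownership so that waiting CPUs can proceed. */
static void
leave_debugger(void)
{
	atomic_store_explicit(&trap_lock, NOCPU, memory_order_release);
}
```

A CPU for which try_enter_debugger() fails would, as in the patch, spin
until the owner word goes back to NOCPU and then return from the trap. The
timeout in wait_for_stopped() trades a possible hang for the risk of
entering the debugger while some CPU is still running.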
The other CPUs shouldn't loop with interrupts disabled anywhere else. The
output_lock stuff here is especially half-baked. The
VERBOSE_CPUSTOP_ON_DDBBREAK option should be non-optional (always
disabled), but I needed something to debug concurrent entry, and
interleaved output is hard to read.

Bruce