Date: Fri, 27 May 2011 14:06:59 +0200
From: Marius Strobl <marius@alchemy.franken.de>
To: Peter Jeremy <peterjeremy@acm.org>
Cc: freebsd-sparc64@freebsd.org
Subject: Re: 'make -j16 universe' gives SIReset
Message-ID: <20110527120659.GA78000@alchemy.franken.de>
In-Reply-To: <20110526234728.GA69750@server.vk2pj.dyndns.org>
References: <20110526234728.GA69750@server.vk2pj.dyndns.org>
On Fri, May 27, 2011 at 09:47:28AM +1000, Peter Jeremy wrote:
> I tried a "make -j16 universe" using a recent 8-stable on a 16-CPU
> V890 and after about 11 minutes, I got the following. This box
> had been running Solaris without problem for several years so I'm
> inclined to suspect a software issue.

It probably doesn't hurt to check the hardware with SunVTS, though.

> Any suggestions?
>
> ERROR: CPU4 SIReset
>
>
> System State (CPU4 reporting)
>
> BBC Devices: 0000.0000.0000.000f 0000.0000.0000.000f
> BBC Arb: 0000.0000.0000.000f 0000.0000.0000.000f
> BBC Quiesce: 0000.0000.0000.0003 0000.0000.0000.0003
> BBC WDogAct: 0000.0000.0000.0000 0000.0000.0000.0000
> BBC POR Gen: 0000.0000.0000.0000 0000.0000.0000.0000
> BBC XIR Gen: 0000.0000.0000.0000 0000.0000.0000.0000
> BBC POR Src: 0000.0000.0000.0000 0000.0000.0000.0000
> BBC XIR Src: 0000.0000.0000.000f 0000.0000.0000.000f
> BBC EBus TC: 014f.99fd.a7e6.3f29 014f.99fd.a7e6.3f29
>
> CMP0 Core Config/Control registers:
>
> CoreAvail: 0000.0000.0000.0003 0 1
> CoreEnabled: 0000.0000.0000.0003 0 1
> CoreRunning: 0000.0000.0000.0003 0 1
> XIRSteering: 0000.0000.0000.0003 0 1
> ErrSteering: 0000.0000.0000.0000
>
> CPU0 Config/Control/Status registers:
>
> CPUVersion: 003e.0018.3100.0507
> SafConfig: 0caa.01bc.2000.8002 9:1 ID:0 HBM TOL:15
> SafBaseAdr: 0000.0400.0000.0000
> DispatchCtl: 0000.0000.0000.0009 MS SI
> DCacheCtl: 0000.0200.0000.0010 WE
> ECacheCtl: 0000.0000.01c5.5000 5:1 8MB mode=5-5-5(2) R/W-turn:2 Late-Sel ECC:off
> ErrorEnable: 0000.0000.0000.000b CEEN NCEEN UCEEN
>
> AFAR: 0000.0000.0000.0000
> AFSR: 0000.0000.0000.0000 (no errors set)
> AFAR 2: 0000.0000.8000.0000
> AFSR 2: 0000.0000.0000.0000 (no errors set)
>
> DMMU SFAR: 0000.0000.f3f8.c300
> DMMU SFSR: 0000.0000.0000.0000 (no status set)
> IMMU SFSR: 0000.0000.0080.8000 TM
>

This doesn't indicate much, and in particular not the address of the
instruction causing the SIR, except that there was an i-TLB miss, which
seems innocuous.
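The DispatchCtl line in the dump shows the value 0x9 decoded as "MS SI", which places MS at bit 0 and SI at bit 3. The C sketch below decodes such a value and applies the `val &= ~...` mask idiom used in cheetah.c in three variants: clearing only DCR_DTPE, clearing DCR_DTPE and DCR_ITPE together, and clearing DCR_SI. It is an illustration under stated assumptions, not kernel code: the bit positions given for DCR_DTPE and DCR_ITPE are hypothetical placeholders (the real ones live in the FreeBSD sparc64 headers), and the helper names dcr_decode, dcr_current, dcr_try_itpe and dcr_try_si are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * MS and SI positions follow from the dump: DispatchCtl 0x9 is
 * decoded as "MS SI", i.e. bits 0 and 3.
 */
#define	DCR_MS		(UINT64_C(1) << 0)
#define	DCR_SI		(UINT64_C(1) << 3)
/*
 * HYPOTHETICAL placeholder positions, for illustration only; use the
 * real definitions from the FreeBSD sparc64 headers instead.
 */
#define	DCR_DTPE	(UINT64_C(1) << 15)
#define	DCR_ITPE	(UINT64_C(1) << 16)

/* Write the space-separated names of the known bits set in val to buf. */
static void
dcr_decode(uint64_t val, char *buf, size_t len)
{
	static const struct {
		uint64_t bit;
		const char *name;
	} bits[] = {
		{ DCR_MS, "MS" },
		{ DCR_SI, "SI" },
		{ DCR_DTPE, "DTPE" },
		{ DCR_ITPE, "ITPE" },
	};
	size_t i;

	assert(len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
		if ((val & bits[i].bit) == 0)
			continue;
		/* Separate names with a single space; never overflow buf. */
		if (buf[0] != '\0')
			strncat(buf, " ", len - strlen(buf) - 1);
		strncat(buf, bits[i].name, len - strlen(buf) - 1);
	}
}

/* Current code: clear only DCR_DTPE. */
static uint64_t
dcr_current(uint64_t val)
{
	return (val & ~DCR_DTPE);
}

/* First experiment: clear DCR_DTPE and DCR_ITPE together. */
static uint64_t
dcr_try_itpe(uint64_t val)
{
	return (val & ~(DCR_DTPE | DCR_ITPE));
}

/* Second experiment: clear DCR_SI instead of DCR_DTPE. */
static uint64_t
dcr_try_si(uint64_t val)
{
	return (val & ~DCR_SI);
}
```

Decoding 0x9 with dcr_decode yields "MS SI", and dcr_try_si(0x9) leaves only MS set (0x1). Each variant only clears bits and leaves all other bits of the value untouched, which is why the experiments can be tried one at a time by editing a single line.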
Generally, FreeBSD only triggers a SIR when something really unexpected
happens in an environment where we can't (or at least can't easily)
trigger a panic. The only exception, which isn't really fatal from the
OS point of view, is stray vector interrupts (IIRC even OpenSolaris just
ignores a certain number of these). You could check whether the
following patch makes any difference to the SIR you're seeing:

http://people.freebsd.org/~marius/sparc64_intr_vector_stray.diff

Generally, both the USIV and the V880 with USIII (which should be quite
close to a V890) are rather quirky hardware; I've already hit two CPU
bugs which are not documented in the publicly available errata.

Two other things to try are replacing the following in cheetah.c:

    val &= ~DCR_DTPE;

once with:

    val &= ~(DCR_DTPE | DCR_ITPE);

and once with:

    val &= ~DCR_SI;

Besides that, IIRC I haven't added a workaround for the USIV+ erratum #4
so far, though that seems unlikely to be the cause of this problem.

Marius