From owner-freebsd-current@FreeBSD.ORG Thu Jul 17 22:20:56 2003
Return-Path:
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
    by hub.freebsd.org (Postfix) with ESMTP id 34F3A37B401
    for ; Thu, 17 Jul 2003 22:20:56 -0700 (PDT)
Received: from lightpro1.lightpro.de (lightpro1.lightpro.de [213.133.98.202])
    by mx1.FreeBSD.org (Postfix) with ESMTP id 758EE43F85
    for ; Thu, 17 Jul 2003 22:20:54 -0700 (PDT)
    (envelope-from h@schmalzbauer.de)
Received: from akima (ppp-62-245-163-14.mnet-online.de [62.245.163.14])
    (authenticated bits=0)h6I5KqS5003771;
    Fri, 18 Jul 2003 07:20:53 +0200
From: "Harald Schmalzbauer"
To: "Harald Schmalzbauer" ,
Date: Fri, 18 Jul 2003 07:20:33 +0200
Message-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
In-Reply-To:
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106
Importance: Normal
Subject: RE: HPT372 bug summary [was: RE: escalation stage 2]
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 18 Jul 2003 05:20:56 -0000

Harald Schmalzbauer wrote:
> Ok, like I thought, the disk was not defective. There seems to be a bug in
> ata regarding HPT372.
>
> First: With BIOS version 2.342 the secondary master disk ID is incorrectly
> detected (something like "X X X X X X X X X X X X X X X" instead of
> "IC25N030ATCS04-0").

Please forget that. It happened because, for convenience, I had turned the
80-conductor ATA cables upside down, so the black connector was at the
controller and the blue one at the drive.
I can't imagine that this makes any technical difference (as long as no
slave drive is connected and there's no open end), but it seems the
individual connectors are electrically coded (again, I can't imagine how?!?).

I tested the following BIOS versions, which all gave the same result: the
machine panics if one drive fails, and there's no possibility to rebuild
the failed array (under FreeBSD):

2.34  (original Dawicontrol)
2.341 (372N2341.p5e from Highpoint)
2.343 (3XXV2343.p4e from Highpoint)
2.2   (from Highpoint)

The rest can be considered confirmed.

>
> I downgraded the BIOS to 2.2.
>
> Now I did the following test:
> 1. Created a RAID1 with the controller's BIOS (two Hitachi 2.5" notebook
>    drives).
> 2. Installed DOS.
> 3. While DOS was running, I unplugged the (5 V only) power supply from one
>    disk.
> 4. After powering off, I reconnected the power supply to the disk.
> 5. After switching on, the controller's BIOS told me that the array has to
>    be rebuilt.
>
> So far it seems the hardware is fine and working as designed.
>
> Now I installed FreeBSD 5.1 on the controller-generated RAID1 ar0 (its name
> in the BIOS is read as "RAID1_1"; I don't know what exactly this name
> reflects).
> When I unplug one drive the same way as before (or even do an "atacontrol
> detach 3" (the secondary channel of the controller)), FreeBSD warns me that
> ar0 is degraded. In the atacontrol list the disk on channel 3 (ad6) has
> vanished.
> Now after some time, the machine panics with the dump I already supplied
> below this message (at least last time I didn't really unplug the power,
> but instead issued an "atacontrol detach 3").
>
> Now when the machine is powered up again after the disk connections have
> been corrected, the BIOS doesn't allow me to rebuild the array, but gives me
> the option to select a replacement disk and rebuild. But this doesn't work;
> the error is that there are not enough spare disks. In the status I can see
> the array named "RAID1_1" which was established via the controller's BIOS.
> When I choose "continue to boot" I can see another array named "FreeBSD"
> which I never established.
> When I again continue booting, the kernel boots and then the machine panics.
> I have to delete the array.
> After deleting the mirror, FreeBSD boots correctly with a degraded ar0, but I
> have no chance to rebuild the array. "atacontrol addspare ar0 ad4" gives an
> error like (can't remember exactly) "sioctl (ATASPAREADD) not configured".
> Also no detach/reinit/attach helps.
>
> I also think the RAID configuration is stored on the disks, since when I
> create a non-DOS-compatible slice (starting at 0, not 63) the RAID
> configuration vanishes.
>
> Now I assume that there are two different RAID configurations, one stored on
> disk by the controller's BIOS and another one which FreeBSD stores elsewhere
> (e.g. with the sil0680 I can well create slices starting at 0).
> Now when one drive fails, both configurations are marked degraded but in a
> different manner (because there is one array named "RAID1_1" and a second
> which is named "FreeBSD").
> And that's why FreeBSD panics until I delete the mirror relationship.
>
> This has nothing to do with the initial crash coming from "sysinstall" or
> "sysctl -a" but is also ugly, since the controller doesn't do its job
> correctly under FreeBSD.
>
> So I hope Soren can have a look at it or at least correct me if I'm wrong.
>
> Since this is my most important server, I can't help you during the next
> weeks. On Sunday I'll buy a SIL0680-based controller because I did the same
> test with it and it's working.
> Now I'm currently setting up FreeBSD and building a kernel with DDB.
>
> Please let me know what I can do; I'm no programmer. I only know that
> something like a backtrace is usually useful. But I don't know what a
> backtrace is, so if you need information from me please tell me exactly
> what to do.
>
> Best regards,
>
> -Harry
>
>
>
> >
> > Now after resetting the machine, which was hung by "sysinstall", it claims
> > that ad4 (one of two mirrored 30GB 2.5" disks) was absent (see dmesg
> > below).
> > Now the controller warns me that one drive is bad (which in fact it
> > definitely is not) and allows me to select "continue boot".
> > That's what I do, and after kernel probing the machine reboots with the
> > following error (well, it takes some time to type it in from my
> > monochrome screen):
> >
> > Fatal trap 12: page fault while in kernel mode
> > fault virtual address = 0x10
> > fault code            = supervisor read, page not present
> > instruction pointer   = 0x8:0xc014a0a6
> > stack pointer         = 0x10:0xcce65bd8
> > frame pointer         = 0x10:0xcce65c58
> > code segment          = base 0x0, limit 0xfffff type 0x1b
> >                       = DPL 0, pres 1, def32 1, gran 1
> > processor eflags      = interrupt enabled, resume, IOPL=0
> > current process       = 4(g_down)
> > trap number           = 12
> > panic: page fault
> >
> > Then it reboots!
> >
> > Now please give me a hint what to do. This is my brand new fileserver,
> > which collected all important data from the last decade, and since it's
> > brand new I didn't set up any backup.
> > When testing the hardware (unplugging one drive while the machine was
> > running) I had the same error, but I thought that would never happen under
> > normal circumstances.
> >
> > If sysinstall breaks a RAID1 server, 5.1-RELEASE should be immediately
> > replaced by a corrected version!!!!!
> >
> > (Controller is a Dawicontrol DC-100 with HPT372 chipset and 2.343 BIOS; the
> > original 2.34 BIOS didn't work at all with FreeBSD (while it did with
> > Windows98))
> > The machine is the ######## VIA Fileserver ######## as dmesg'ed below
> >
> > Best regards,
> >
> > -Harry
> >
> > P.S: Now it has not only ad6 in the following message but also ad4 (and
> > that always has been the reason for the panic during my testings!)
> > (watch out for the four ad4 and only two ad6)
> > Opened disk ad4 -> 1
> > Opened disk ad4 -> 1
> > Opened disk ad4 -> 1
> > Opened disk ad4 -> 1
> > Opened disk ad6 -> 1
> > Opened disk ad6 -> 1
> >
> > > Dear all,
> > >
> > > I've been experimenting with 5.1-REL for some weeks, and during that time
> > > I had some mysterious hangs which I didn't take seriously because I had
> > > modified /usr/src/sys/cam/scsi/scsi_da.c to support my CF card reader.
> > > But now I saw exactly the same problem on my brand new (and, in terms of
> > > hardware, extremely different) fileserver.
> > >
> > > The machine freezes for about one minute and then reboots itself without
> > > any error message.
> > > It happens when I do a "/stand/sysinstall" or a "sysctl -a".
> > >
> > > This is VERY ugly because when my fileserver dies my workstation also
> > > dies (home is NFS-mounted).
> > >
> > > I'm no developer, but if someone tells me what to do I'll help solve
> > > that BUG.
> > >
> > > Here is some info about my two machines (which are running 5.1-release
> > > and showed the same bug):
> > >
> > > ######## VIA Fileserver ############
> > >
> > > FreeBSD 5.1-RELEASE #2: Fri Jul 4 14:02:06 CEST 2003
> > > root@tek.flintsbach.schmalzbauer.de:/usr/obj/usr/src/sys/EPIA
> > > Preloaded elf kernel "/boot/kernel/kernel" at 0xc04c0000.
> > > Preloaded elf module "/boot/kernel/acpi.ko" at 0xc04c01f4.
> > > Timecounter "i8254" frequency 1192944 Hz
> > > Timecounter "TSC" frequency 800032401 Hz
> > > CPU: VIA C3 Samuel 2 (800.03-MHz 686-class CPU)
> > > Origin = "CentaurHauls" Id = 0x67a Stepping = 10
> > > Features=0x803035
> > > real memory = 266272768 (253 MB)
> > > avail memory = 253374464 (241 MB)
> > > VESA: v2.0, 2048k memory, flags:0x0, mode table:0xc00c8ac8 (c0008ac8)
> > > VESA: Copyright 1998 TRIDENT MICROSYSTEMS INC.
> > > npx0: on motherboard
> > > npx0: INT 16 interface
> > > acpi0: on motherboard
> > > pcibios: BIOS version 2.10
> > > Using $PIR table, 5 entries at 0xc00fdc70
> > > acpi0: power button is handled as a fixed feature programming model.
> > > Timecounter "ACPI-safe" frequency 3579545 Hz
> > > acpi_timer0: <24-bit timer at 3.579545MHz> port 0x4008-0x400b on acpi0
> > > acpi_cpu0: port 0x530-0x537 on acpi0
> > > acpi_tz0: port 0x530-0x537 on acpi0
> > > acpi_button0: on acpi0
> > > pcib0: port 0x6000-0x607f,0x5000-0x500f,0x4080-0x40ff,0x4000-0x407f,0xcf8-0xcff on acpi0
> > > pci0: on pcib0
> > > agp0: mem 0xd0000000-0xd3ffffff at device 0.0 on pci0
> > > pcib1: at device 1.0 on pci0
> > > pci1: on pcib1
> > > pci1: at device 0.0 (no driver attached)
> > > isab0: at device 17.0 on pci0
> > > isa0: on isab0
> > > atapci0: port 0xc000-0xc00f at device 17.1 on pci0
> > > ata0: at 0x1f0 irq 14 on atapci0
> > > ata1: at 0x170 irq 15 on atapci0
> > > uhci0: port 0xc400-0xc41f irq 5 at device 17.2 on pci0
> > > usb0: on uhci0
> > > usb0: USB revision 1.0
> > > uhub0: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
> > > uhub0: 2 ports with 2 removable, self powered
> > > uhci1: port 0xc800-0xc81f irq 5 at device 17.3 on pci0
> > > usb1: on uhci1
> > > usb1: USB revision 1.0
> > > uhub1: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
> > > uhub1: 2 ports with 2 removable, self powered
> > > pci0: at device 17.4 (no driver attached)
> > > pcm0: port 0xd400-0xd403,0xd000-0xd003,0xcc00-0xccff irq 12 at device 17.5 on pci0
> > > pcm0:
> > > vr0: port 0xd800-0xd8ff mem 0xd8000000-0xd80000ff irq 10 at device 18.0 on pci0
> > > vr0: Ethernet address: 00:40:63:c2:9d:af
> > > miibus0: on vr0
> > > ukphy0: on miibus0
> > > ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
> > > atapci1: port 0xec00-0xecff,0xe800-0xe803,0xe400-0xe407,0xe000-0xe003,0xdc00-0xdc07 irq 11 at
> > > device 20.0 on pci0
> > > ata2: at 0xdc00 on atapci1
> > > ata3: at 0xe400 on atapci1
> > > sio0 port 0x3f8-0x3ff irq 4 on acpi0
> > > sio0: type 16550A
> > > orm0: