From owner-freebsd-stable@FreeBSD.ORG Wed Feb 15 23:41:59 2006 Return-Path: X-Original-To: stable@freebsd.org Delivered-To: freebsd-stable@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 96D8616A420 for ; Wed, 15 Feb 2006 23:41:59 +0000 (GMT) (envelope-from paolo@euresis.it) Received: from agnus.ngi.it (ns.virtuo.it [88.149.128.9]) by mx1.FreeBSD.org (Postfix) with ESMTP id C968543D66 for ; Wed, 15 Feb 2006 23:41:50 +0000 (GMT) (envelope-from paolo@euresis.it) Received: from [192.168.2.16] (81-174-13-252.f5.ngi.it [81.174.13.252]) by agnus.ngi.it (8.13.5/8.13.5) with ESMTP id k1FNfX68019088; Thu, 16 Feb 2006 00:41:35 +0100 In-Reply-To: <20060210080350.GA5978@freebie.xs4all.nl> References: <43EA5C50.5020804@deepcore.dk> <20060208213704.GA703@freebie.xs4all.nl> <43EA6625.2070106@deepcore.dk> <20060208221056.GA1299@freebie.xs4all.nl> <43EB5393.5090502@deepcore.dk> <20060209144250.GB4874@freebie.xs4all.nl> <43EB55A1.9040405@deepcore.dk> <20060209201912.GA680@freebie.xs4all.nl> <43EBA4F7.7040407@deepcore.dk> <20060209220824.GA1499@freebie.xs4all.nl> <20060210080350.GA5978@freebie.xs4all.nl> Mime-Version: 1.0 (Apple Message framework v746.2) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <66854B6E-6D86-47BB-998F-6955401718BF@euresis.it> Content-Transfer-Encoding: 7bit From: Paolo Maero Date: Thu, 16 Feb 2006 00:41:32 +0100 To: Wilko Bulte X-Mailer: Apple Mail (2.746.2) Cc: stable@freebsd.org, =?ISO-8859-1?Q?S=F8ren_Schmidt?= Subject: Re: Showstopper ATA bug in 6.1-PRE? (Still happening) X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 15 Feb 2006 23:41:59 -0000 On Feb 10, 2006, at 9:03 AM, Wilko Bulte wrote: > On Thu, Feb 09, 2006 at 11:08:24PM +0100, Wilko Bulte wrote.. >> On Thu, Feb 09, 2006 at 09:24:23PM +0100, Sren Schmidt wrote.. >>> Wilko Bulte wrote: >>>> On Thu, Feb 09, 2006 at 03:45:53PM +0100, Sren Schmidt wrote.. >>>>> Wilko Bulte wrote: >>>>>> On Thu, Feb 09, 2006 at 03:37:07PM +0100, Sren Schmidt wrote.. >>>>>>> Wilko Bulte wrote: >>>>>>>> On Wed, Feb 08, 2006 at 10:44:05PM +0100, Sren Schmidt wrote.. >>>>>>>>> Wilko Bulte wrote: >>>>>>>>>> On Wed, Feb 08, 2006 at 10:02:08PM +0100, Sren Schmidt >>>>>>>>>> wrote.. >>>>>>>>>>> Wilko Bulte wrote: >>>>>>>>>>>> Hi Soren, >>>>>>>>>>>> >>>>>>>>>>>> I just went to 6.1-PRE on my main machine, coming from >>>>>>>>>>>> 6.0-STABLE >>>>>>>>>>>> of roughly end of december. >>>>>>>>>>>> >>>>>>>>>>>> And I hit some stuff that really worries me: >>>>>>>>>>>> >>>>>>>>>>>> - the freshly built kernel keels over with (hand >>>>>>>>>>>> transcribed): >>>>>>>>>>>> >>>>>>>>>>>> ata3: reiniting channel SATA connect ... >>>>>>>>>>>> SATA connected >>>>>>>>>>>> sata_connect_devices 0x1 >>>>>>>>>>>> >>>>>>>>>>>> ad6: req=0xC35ba0c8 SETFEATURES SETTRANSFERMODE >>>>>>>>>>>> semaphore timeout >>>>>>>>>>>> !! DANGER Will RObinson !! >>>>>>>>>>>> >>>>>>>>>>>> (... is where I cannot read my own handwriting, it >>>>>>>>>>>> scrolled quite >>>>>>>>>>>> fast on >>>>>>>>>>>> the screen..) >>>>>>>>>>>> >>>>>>>>>>>> Boot device is a SATA RAID1 on a Promise 2300. >>>>>>>>>>> Hmm, that should not happen. Could you try to backstep >>>>>>>>>>> just ATA to >>>>>>>>>>> before the MFC, that is 24/1/06 and let me know if that >>>>>>>>>>> helps >>>>>>>>>>> please ? >>>>>>>>>> First impression is that the problem is gone. None of the >>>>>>>>>> previously reported errors are seen. I am running a level >>>>>>>>>> 0 dump >>>>>>>>>> from disk to disk >>>>>>>>>> to see if the box remains stable. Given that this is my >>>>>>>>>> primary >>>>>>>>>> machine >>>>>>>>>> I sure hope it will be :-) >>>>>>>>>> >>>>>>>>>>>> Another snag is that my ad10 disk on 6.0-STABLE suddenly >>>>>>>>>>>> became >>>>>>>>>>>> ad12 on >>>>>>>>>>>> 6.1-PRE >>>>>>>>>>> Hmm that is because there is only 2 ports on your promise >>>>>>>>>>> which is >>>>>>>>>>> now correctly identified, before it was errounsly found >>>>>>>>>>> as 3 ports. >>>>>>>>>> Ah, OK. I would suggest a note to the Release Note >>>>>>>>>> writers would be >>>>>>>>>> a good >>>>>>>>>> thing, devices changing location after an upgrade in the - >>>>>>>>>> stable >>>>>>>>>> branch >>>>>>>>>> is unnerving ;-) >>>>>>>>> Well, the good thing is that I can reproduce the error >>>>>>>>> here, the bad >>>>>>>>> thing is that it slipped through testing on -current... >>>>>>>>> Oh, well, I'll look into it ASAP... >>>>>>>> Thank you Soren! >>>>>>> OK, had a few this afternoon, could you try this patch and >>>>>>> let me know >>>>>>> if it helps, at least it makes the problem go away on my >>>>>>> testbed.. >>>>>> Is this relative to HEAD or RELENG_6? I cannot / will not go >>>>>> to HEAD >>>>>> with this machine (my main production box.. :-) >>>>> Doesn't matter, ATA is the same on both... >>>> >>>> OK, I was not sure if they were 100% identical. >>>> >>>> The patch at first impression seems to have eliminated the problem. >>> >>> Good seems I'm on the right track at least. >>> >>>> Interestingly enough ad10 remained ad10 with the patch applied? >>> >>> Yeah, thats intentional, I though we better not break POLA here.. >> >> I agree :-) >> >>>> I'll put some load on to see what happens. >>> >>> Let me know how that turns out, I'll clean things up a bit and >>> get it >>> committed to -current, then get permission to MFC when we are >>> sure it >>> fixes the problem... >> >> I ran a 44GB disk-to-disk dump without incidents (source on the >> RAID1, >> target on the JBOD). No problems whatsoever. >> >> Looks like things behave much better now. Tonight the machine will >> run a daily full dump to DLT tape, I'll know how that turns out >> tomorrow. > > Backup ran without problems. > > -- > Wilko Bulte wilko@FreeBSD.org > _______________________________________________ > freebsd-stable@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-stable > To unsubscribe, send any mail to "freebsd-stable- > unsubscribe@freebsd.org" > Soren, I am still getting this error. I cvsup'd to RELENG_6 today (Feb. 15, 2006 at 6:00 PM CET) and I got the following files from HEAD: ata-all.c (1.264) ata-all.h (1.113) ata-chipset.c (1.158) ata-pci.c (1.115) I have then rebuilt the kernel and world and I tested the system. I have a Promise TX2300 with two 250 GB Maxtor in RAID1 with gmirror, plus two 9 GB SCSI drives in RAID1 with gmirror where I put the OS and applications and a 160 GB UATA Seagate (attached to the motherboard UATA) where I backup sensitive data. The system was able to rebuild the mirror (I always got the error with RELENG_6_0), but then I made a backup tar from the two Maxtor's mirror to the Seagate and I got the following error after having copied approx 16 GB: Feb 15 20:56:31 zoe kernel: GEOM_MIRROR: Device data1: provider ad4 disconnected. Feb 15 20:56:31 zoe kernel: GEOM_MIRROR: Device data1: provider ad4 detected. Feb 15 20:56:31 zoe kernel: GEOM_MIRROR: Device data1: rebuilding provider ad4. ... Feb 15 22:33:15 zoe kernel: GEOM_MIRROR: Device data1: rebuilding provider ad4 finished. Feb 15 22:33:15 zoe kernel: GEOM_MIRROR: Device data1: provider ad4 activated. ... Feb 15 23:04:12 zoe kernel: ad4: req=0xc2422578 SETFEATURES SET TRANSFER MODE semaphore timeout !! DANGER Will Robinson !! Feb 15 23:04:28 zoe last message repeated 4 times Feb 15 23:04:32 zoe kernel: ad4: req=0xc2422578 SETFEATURES SET TRANSFER MODE semaphore timeout !! DANGER Will Robinson !! Feb 15 23:05:04 zoe last message repeated 8 times Feb 15 23:07:08 zoe last message repeated 31 times Feb 15 23:08:56 zoe last message repeated 27 times Feb 15 23:09:00 zoe kernel: ad4: req=0xc2422578 SETFEATURES SET TRANSFER MODE semaphore timeout !! DANGER Will Robinson !! Feb 15 23:09:32 zoe last message repeated 8 times Feb 15 23:11:36 zoe last message repeated 31 times Feb 15 23:21:40 zoe last message repeated 151 times Feb 15 23:31:40 zoe last message repeated 150 times Feb 15 23:34:28 zoe last message repeated 42 times Feb 15 23:34:32 zoe kernel: ad4: req=0xc2422578 SETFEATURES SET TRANSFER MODE semaphore timeout !! DANGER Will Robinson !! Feb 15 23:35:04 zoe last message repeated 8 times Feb 15 23:37:08 zoe last message repeated 31 times Feb 15 23:40:52 zoe last message repeated 56 times Feb 15 23:40:53 zoe shutdown: power-down by paolo: Feb 15 23:40:56 zoe syslogd: exiting on signal 15 I was not able to shutdown the system cleanly. I also noticed a small problem in the code (probably not related to this error): At line 1011 of src/sys/dev/ata/ata-all.c a free() is used but ata_delayed_attach is not reset to NULL Actual (1.264): if (config_intrhook_establish(ata_delayed_attach) != 0) { printf("ata: config_intrhook_establish failed\n"); free(ata_delayed_attach, M_TEMP); } Fix: if (config_intrhook_establish(ata_delayed_attach) != 0) { printf("ata: config_intrhook_establish failed\n"); free(ata_delayed_attach, M_TEMP); + ata_delayed_attach = NULL; } This is my dmesg: Copyright (c) 1992-2006 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD 6.1-PRERELEASE #0: Wed Feb 15 20:27:17 CET 2006 root@zoe.paolo.maero.net:/usr/obj/usr/src/sys/ZOE_022 ACPI APIC Table: Timecounter "i8254" frequency 1193182 Hz quality 0 CPU: Intel Pentium III (696.41-MHz 686-class CPU) Origin = "GenuineIntel" Id = 0x683 Stepping = 3 Features=0x387fbff real memory = 536805376 (511 MB) avail memory = 516100096 (492 MB) FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs cpu0 (BSP): APIC ID: 1 cpu1 (AP): APIC ID: 0 ioapic0 irqs 0-23 on motherboard npx0: [FAST] npx0: on motherboard npx0: INT 16 interface acpi0: on motherboard acpi0: Power Button (fixed) Timecounter "ACPI-safe" frequency 3579545 Hz quality 1000 acpi_timer0: <24-bit timer at 3.579545MHz> port 0xc08-0xc0b on acpi0 cpu0: on acpi0 cpu1: on acpi0 pcib0: port 0xcf8-0xcff on acpi0 pci0: on pcib0 agp0: mem 0xf8000000-0xfbffffff at device 0.0 on pci0 pcib1: at device 1.0 on pci0 pci1: on pcib1 pcib2: at device 15.0 on pci1 pci2: on pcib2 atapci0: port 0x3400-0x347f, 0x3000-0x30ff mem 0xf4220000-0xf4220fff,0xf4200000-0xf421ffff irq 23 at device 7.0 on pci2 ata2: on atapci0 ata3: on atapci0 ata4: on atapci0 ahc0: port 0x2400-0x24ff mem 0xf4101000-0xf4101fff irq 19 at device 12.0 on pci0 ahc0: [GIANT-LOCKED] aic7896/97: Ultra2 Wide Channel A, SCSI Id=7, 32/253 SCBs ahc1: port 0x2800-0x28ff mem 0xf4102000-0xf4102fff irq 19 at device 12.1 on pci0 ahc1: [GIANT-LOCKED] aic7896/97: Ultra2 Wide Channel B, SCSI Id=7, 32/253 SCBs fxp0: port 0x2000-0x203f mem 0xf4100000-0xf4100fff,0xf4000000-0xf40fffff irq 21 at device 14.0 on pci0 miibus0: on fxp0 inphy0: on miibus0 inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto fxp0: Ethernet address: 00:d0:b7:88:6c:c5 isab0: at device 18.0 on pci0 isa0: on isab0 atapci1: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x2040-0x204f at device 18.1 on pci0 ata0: on atapci1 ata1: on atapci1 uhci0: port 0x2060-0x207f irq 21 at device 18.2 on pci0 uhci0: [GIANT-LOCKED] usb0: on uhci0 usb0: USB revision 1.0 uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1 uhub0: 2 ports with 2 removable, self powered intpm0: port 0x1040-0x104f irq 9 at device 18.3 on pci0 intpm0: I/O mapped 1040 intpm0: intr IRQ 9 enabled revision 0 intpm0: [GIANT-LOCKED] intsmb0: on intpm0 smbus1: on intsmb0 smb0: on smbus1 intpm0: PM I/O mapped c00 pci0: at device 20.0 (no driver attached) atkbdc0: port 0x60,0x64 irq 1 on acpi0 atkbd0: irq 1 on atkbdc0 kbd0 at atkbd0 atkbd0: [GIANT-LOCKED] fdc0: port 0x3f4-0x3f7 irq 6 drq 2 on acpi0 fdc0: No FDOUT register! device_attach: fdc0 attach returned 6 ppc0: port 0x378-0x37f,0x778-0x77f irq 7 drq 3 on acpi0 ppc0: Generic chipset (ECP/PS2/NIBBLE) in COMPATIBLE mode ppc0: FIFO with 16/16/8 bytes threshold ppbus0: on ppc0 plip0: on ppbus0 lpt0: on ppbus0 lpt0: Interrupt-driven port ppi0: on ppbus0 sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0 sio0: type 16550A sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0 sio1: type 16550A fdc0: port 0x3f4-0x3f7 irq 6 drq 2 on acpi0 fdc0: No FDOUT register! device_attach: fdc0 attach returned 6 pmtimer0 on isa0 orm0: at iomem 0xc0000-0xc7fff,0xc8000-0xcdfff, 0xce000-0xcf7ff on isa0 sc0: at flags 0x100 on isa0 sc0: VGA <16 virtual consoles, flags=0x300> vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0 ulpt0: EPSON USB Printer, rev 1.10/1.00, addr 2, iclass 7/1 ulpt0: using bi-directional mode Timecounters tick every 1.000 msec ad0: 152627MB at ata0-master UDMA33 acd0: CDRW at ata1-master UDMA33 ad4: 239372MB at ata2-master SATA150 GEOM_MIRROR: Device data1 created (id=618669616). GEOM_MIRROR: Device data1: provider ad4 detected. ad6: 239372MB at ata3-master SATA150 Waiting 5 seconds for SCSI devices to settle GEOM_MIRROR: Device data1: provider ad6 detected. GEOM_MIRROR: Device data1: provider ad6 activated. GEOM_MIRROR: Device data1: provider ad4 is stale. GEOM_MIRROR: Device data1: provider mirror/data1 launched. SMP: AP CPU #1 Launched! da0 at ahc0 bus 0 target 0 lun 0 da0: Fixed Direct Access SCSI-3 device da0: 80.000MB/s transfers (40.000MHz, offset 30, 16bit), Tagged Queueing Enabled da0: 8748MB (17916240 512 byte sectors: 255H 63S/T 1115C) da1 at ahc0 bus 0 target 1 lun 0 da1: Fixed Direct Access SCSI-3 device da1: 80.000MB/s transfers (40.000MHz, offset 30, 16bit), Tagged Queueing Enabled da1: 8748MB (17916240 512 byte sectors: 255H 63S/T 1115C) GEOM_MIRROR: Device sgm0 created (id=1390498924). GEOM_MIRROR: Device sgm0: provider da0 detected. GEOM_MIRROR: Device sgm0: provider da1 detected. GEOM_MIRROR: Device sgm0: provider da1 activated. GEOM_MIRROR: Device sgm0: provider da0 activated. GEOM_MIRROR: Device sgm0: provider mirror/sgm0 launched. Trying to mount root from ufs:/dev/mirror/sgm0s1a Regards Paolo