Date: Mon, 2 Jun 2008 22:31:45 +0300
From: Ruslan Kovtun <yalur@mail.ru>
To: freebsd-fs@freebsd.org
Cc: Andrew Hill <lists@thefrog.net>
Subject: Re: ZFS lockup in "zfs" state
Message-ID: <200806022231.46079.yalur@mail.ru>
In-Reply-To: <20080602064023.GA95247@eos.sc1.parodius.com>
References: <683A6ED2-0E54-42D7-8212-898221C05150@thefrog.net> <16a6ef710806012304m48b63161oee1bc6d11e54436a@mail.gmail.com> <20080602064023.GA95247@eos.sc1.parodius.com>

Hi.

I very often have the same problem with an HDD (READ_DMA UDMA ICRC error)
that is in a ZFS pool. Previously this HDD was in the mirror ar0, not in a
ZFS pool. It sometimes failed there too, but without any panic: it was only
detached from the mirror. After I added this HDD to the ZFS pool, the problem
appeared. I am sure that this is a problem with the hard disk.

Smartmontools notified me by mail that UDMA_CRC_Error_Count had increased
after the HDD failure, and according to smartctl the HDD has a hardware
problem. I replaced the cable and tried connecting this HDD to another port,
but with no result: it is 100% a hard disk problem.

I cannot create a kernel core dump during the panic:
savecore: no dumps found
:(

Only logs are available. In the log file:

Jun  1 10:43:11 yalur kernel: ad16: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=233909187
Jun  1 10:43:20 yalur kernel: ad16: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: WARNING - SET_MULTI taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: TIMEOUT - READ_DMA retrying (0 retries left) LBA=233909187
Jun  1 11:07:50 yalur syslogd: restart
Jun  1 11:07:50 yalur syslogd: kernel boot file is /boot/kernel/kernel
Jun  1 11:07:50 yalur kernel: ad16: FAILURE - device detached
Jun  1 11:07:50 yalur kernel: subdisk16: detached
Jun  1 11:07:50 yalur kernel: ad16: detached
Jun  1 11:07:50 yalur kernel:
Jun  1 11:07:50 yalur kernel:
Jun  1 11:07:50 yalur kernel: Fatal trap 12: page fault while in kernel mode
Jun  1 11:07:50 yalur kernel:
cpuid = 0; apic id = 00
Jun  1 11:07:50 yalur kernel: fault virtual address   = 0x2c
Jun  1 11:07:50 yalur kernel: fault code              = supervisor write, page not present
Jun  1 11:07:50 yalur kernel: instruction pointer     = 0x20:0x805aab85
Jun  1 11:07:50 yalur kernel: stack pointer           = 0x28:0xed71ac5c
Jun  1 11:07:50 yalur kernel: frame pointer           = 0x28:0xed71ac70
Jun  1 11:07:50 yalur kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
Jun  1 11:07:50 yalur kernel:                         = DPL 0, pres 1, def32 1, gran 1
Jun  1 11:07:50 yalur kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
Jun  1 11:07:50 yalur kernel: current process         = 3 (g_up)
Jun  1 11:07:50 yalur kernel: trap number             = 12
Jun  1 11:07:50 yalur kernel: panic: page fault
Jun  1 11:07:50 yalur kernel: cpuid = 0

[root@yalur /home/ruslan]# zpool status
  pool: data
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Jun  2 12:05:52 2008
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad6     ONLINE       0     0     0
            ad8     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
            ad14    ONLINE       0     0     0
            ad16    ONLINE       0     0     0
            ad20    ONLINE       0     0     0
        spares
          ad26      AVAIL

errors: No known data errors

In a message of Monday 02 June 2008, Jeremy Chadwick wrote:
> On Mon, Jun 02, 2008 at 04:04:12PM +1000, Andrew Hill wrote:
> > On Mon, May 19, 2008 at 1:11 AM, Andrew Hill <lists@thefrog.net> wrote:
> > > i tend to find that the timeouts occur on one or two disks at once -
> > > e.g. ad0 and 2 will complain of timeouts, and the system locks up
> > > shortly thereafter...
> >
> > after spitting out the usual errors from ad0 and ad2 (in this case) with
> > TIMEOUTs and subsequent FAILUREs on READ_DMA[48] and WRITE_DMA[48]...
> >
> > i got the following panic
> >
> > vm_fault: pager read error, pid 1552 (tlsmgr)
> > ad0: FAILURE - READ_DMA48 timed out LBA=352903900
> > swap_pager: indefinite wait buffer: bufobj: 0, blkno: 437, size: 4096
> > ad2: FAILURE - WRITE_DMA timed out LBA=239717693
> > panic: ZFS: I/O failure (write on <unknown> off 0: zio 0xffffff001d47c810
> > [L0 ZIL intent log] b000L/b000P DVA[0]=<0:c807795000:d000> zilog
> > uncompressed LE contiguous birth=750230 fill=0
> > cksum=69f76525a84e1816:f6d86fe1d94cd68c:39:8af): error 5
> > KDB: enter: panic
> > [thread pid 72 tid 100071 ]
> > Stopped at kdb_enter_why+0x3d: movq $0,0x39b248(%rip)
> > db>
>
> I would say the ZFS crash is a result of the ad0/ad2 timeouts. The ZIL
> log shows a hard checksum failure in the ZIL, which indicates a serious
> problem -- very likely hardware-related (or rather, at a lower level
> than ZFS).
>
> You've read this already, but maybe you missed the DMA error part:
>
> http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues
>
> The DMA errors can actually be legitimate too -- it's very hard to
> troubleshoot if they're superfluous (e.g. a FreeBSD bug) or if they're
> real. If the problem is reproducible, then this is convenient with
> regards to providing you additional help.
>
> I really need to sit down and write a huge HOWTO doc for people on how
> to diagnose whether or not their disks or cables are bad, etc... It's a
> very hard thing to document, because everyone's situation is different.
>
> The first piece to start with is simplest, though: install
> ports/sysutils/smartmontools and provide the output of "smartctl -a
> /dev/ad0" and /dev/ad2. Actual disk errors will very likely show up
> there in one of the counters, or in the SMART log. I'd personally like
> to see the output from smartctl, because it's something you can do while
> the system is up/working.
>
> The next step would involve replacing your cables.
> If the problem continues, you've at least removed one piece of the puzzle.
>
> Next, replace the disks -- especially if they were bought at the same
> time, and are from the same vendor. Hard disk vendors are known to have
> bad batches of disks. For sake of example, I just had two Western
> Digital disks (which I bought at the same time) fail a short I/O test,
> returning errors at different LBAs (blocks). The 2nd one only started
> showing problems a few weeks after the first. I obviously got both of
> them RMA'd.
>
> Finally, replace the controller or motherboard. Some people have
> reported success with this.
>
> > generally the lockups don't result in a panic (at least not in the short
> > term of 5-10 minutes), so i can't be sure that this panic is necessarily
> > caused by the same problem, but thought it might be worth posting in case
> > it gives an indication of the location/cause of the deadlock
>
> The DMA timeout errors you've seen, others have seen as well --
> including me -- even when the hardware, disks, cabling, and controllers
> are in a 100% working state. (Even switching OSes results in no errors,
> indicating there is a problem with FreeBSD in some way.)
>
> If the problem is reproducible, you should get in contact with Scott
> Long and let him poke at things. (I mentioned this last time. :-) )
> I myself am not familiar with the FreeBSD kernel, the device drivers, or
> working with the kernel at such a low level to debug things of this
> nature.
>
> > unfortunately i couldn't get a backtrace or core dump for 'political'
> > reasons (the system was required for use by others) but i'll see if i can
> > get a panic happening after-hours to get some more info...
>
> I can't tell you what to do or how to do your job, but honestly you
> should be pulling this system out of production and replacing it with a
> different one, or a different implementation, or a different OS.
> Your users/employees are probably getting ticked off at the crashes, and it
> probably irritates you too. The added benefit is that you could get
> Scott access to the box.

-- 
________________
Best regards, Ruslan Kovtun
mailto <yalur@mail.ru>
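
For anyone comparing notes on logs like the ones above, here is a minimal Python sketch that tallies ATA kernel messages per device and severity. It is only an illustration: the regular expression is an assumption based on the ad16 lines quoted in this message, and count_ata_events and SAMPLE_LOG are hypothetical names, not part of any FreeBSD tool.

```python
import re
from collections import Counter, defaultdict

# Assumed line shape, taken from the log excerpt in this message, e.g.:
#   Jun  1 10:43:11 yalur kernel: ad16: WARNING - READ_DMA UDMA ICRC error ...
ATA_EVENT = re.compile(r"kernel: (ad\d+): (WARNING|TIMEOUT|FAILURE) - ")


def count_ata_events(lines):
    """Return {device: {severity: count}} for ATA kernel messages in lines."""
    events = defaultdict(Counter)
    for line in lines:
        match = ATA_EVENT.search(line)
        if match:
            device, severity = match.groups()
            events[device][severity] += 1
    # Convert to plain dicts so the result is easy to print and compare.
    return {device: dict(counts) for device, counts in events.items()}


# A few lines copied from the log above (hypothetical sample variable).
SAMPLE_LOG = [
    "Jun  1 10:43:11 yalur kernel: ad16: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=233909187",
    "Jun  1 10:43:20 yalur kernel: ad16: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly",
    "Jun  1 10:43:36 yalur kernel: ad16: TIMEOUT - READ_DMA retrying (0 retries left) LBA=233909187",
    "Jun  1 11:07:50 yalur kernel: ad16: FAILURE - device detached",
    "Jun  1 11:07:50 yalur kernel: subdisk16: detached",
    "Jun  1 11:07:50 yalur kernel: ad16: detached",
]

if __name__ == "__main__":
    for device, counts in sorted(count_ata_events(SAMPLE_LOG).items()):
        print(device, counts)
```

Lines without a WARNING, TIMEOUT or FAILURE tag (such as "ad16: detached") are deliberately ignored, so the tally reflects only the error path leading up to the detach.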