Date: Thu, 16 Feb 2012 09:03:16 -0800 (PST)
From: john fleming <jflemingeds@yahoo.com>
To: "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>
Subject: Re: 6.2-Release ..ish.. CF + ata == freeze?
Message-ID: <1329411796.23457.YahooMailNeo@web111719.mail.gq1.yahoo.com>
In-Reply-To: <1056033736-1329271037-cardhu_decombobulator_blackberry.rim.net-225645010-@b15.c31.bise6.blackberry>
References: <1329194588.14324.YahooMailNeo@web111720.mail.gq1.yahoo.com> <20120214051828.GA89777@icarus.home.lan> <1056033736-1329271037-cardhu_decombobulator_blackberry.rim.net-225645010-@b15.c31.bise6.blackberry>
The plot is starting to thicken. I've noticed all the systems that have done this (so far) have this flash card on them.

STEC M2+ CF 9.0.2 K1186-2

From talking to Checkpoint, this is a newer flash they have started using. I just had a 4th machine do the same thing yesterday. Basic install, about 70% disk space free, very new install, like 1-2 months, and the uptime on the machine in question was only 16 days. After rebooting I did a few dd if=/dev/zero of=~/file bs=1m count=350 and didn't get any errors.

The latest machine is a 1 gig version of the flash listed above, so this ate almost all the free disk space. Checkpoint is asking that we RMA one of the flash cards so they can play with it.

________________________________
From: "jflemingeds@yahoo.com" <jflemingeds@yahoo.com>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc: "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>
Sent: Tuesday, February 14, 2012 7:57 PM
Subject: Re: 6.2-Release ..ish.. CF + ata == freeze?

2 of the 3 CF cards are very new, like less than 6 months old.

I think around 65-70 percent is in use. This number doesn't change unless the user dumps data in a home dir, which isn't the case so far.

You are correct that only writes are failing. Msgbuf has more than what I pasted but I'm pretty sure it's just more of the same errors. I'll double-check.

The other slices are very small. One is 35 meg, the other is 100 some odd meg. H is 1.2 gig.

I don't know if I'll be able to try the dd test for a few reasons but I'll check it out. Let me ask you this. Say zeroing out the drive works without error. Does that tell me anything?

I also don't have access to SMART tools as this is basically a closed system and the vendor would never give us access to a compiler. Granted I haven't tried just throwing on gcc from 6.2. I could play with that or maybe since said vendor's dev team is keeping track of this thread they could provide said binary :).

I really don't like the idea of replacing hardware as I'm looking at around 200 boxes. I really hope it doesn't come to that.

Thanks for the reply!

Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
Date: Mon, 13 Feb 2012 21:18:28
To: john fleming<jflemingeds@yahoo.com>
Cc: freebsd-stable@freebsd.org<freebsd-stable@freebsd.org>
Subject: Re: 6.2-Release ..ish.. CF + ata == freeze?

On Mon, Feb 13, 2012 at 08:43:08PM -0800, john fleming wrote:
> Just thought I would post over here as I'm not getting a warm fuzzy from Checkpoint about being able to find the root cause of an issue. I have a large install base of IPSO Checkpoint firewalls, which are based on FreeBSD 6.2. I've had 3 firewalls hang basically the same way, with something that looks like a filesystem issue or an issue with a CF card.

FreeBSD 6.2 was EOL'd in early-to-mid-2008.  The ATA driver has changed
significantly since then (present-day uses CAM).

> Does anyone happen to know of any bugs (I've been looking around) that could cause something like that? Granted, it could be a batch of bad CF cards, but it's odd that I'm seeing the same thing on 3 different boxes and once rebooted they seem ok.
>
> Also is it possible to get useful info from the ATA controller when things go south like this from the ddb prompt?

Not particularly.  What's shown below indicates that the driver had
issued some form of ATA write command (there are multiple kinds per ATA
specification), and either the underlying media (CF/disk) or controller
stalled/locked up/took too long.  I forget what the timeout value is in
6.2; I can't be bothered to remember such from 6 years ago.  :-)

> This is what shows in show msgbuf
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5
> g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5
> g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5
> g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5
> g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5

error 5 = EIO = Input/output error.  But this isn't too big of a
surprise given the timeouts you see prior.

Are these CF cards brand new -- meaning, are they completely unused
(having never had any writes done to them), or have they been in use a
while?  I'm betting they've been in use a while, and have probably been
doing many writes over the years.

Two things to note here:

1) The errors you've shown are only happening on writes, not reads.  Of
course if you omitted information then this isn't an accurate statement.
2) Timeouts are seen when issuing writes to some LBA regions.

How full is the CF card, disk-space-wise?  Not just ad0s4h, I'm talking
about the entire card.  How much space is roughly available?  They're
very small CF cards (1.8GByte roughly), and the less space available,
the less effectiveness of wear levelling (and in some cases the slower
the writes are).

Reason I ask: given that these are CF cards, this smells of cards which
are simply "worn down".  CF cards have limited numbers of writes, and
the card may be "freaking out" internally when attempting to write to
some LBAs which map to CF sectors that are, in effect, "bad".  The CF
cards' ECC implementation may be buggy, or may simply be "spinning hard"
for too long.  You can read about this sort of behaviour on Wikipedia's
CompactFlash article.

You wouldn't be able to verify this with dd if=/dev/ad0, because those
are read operations.  You could zero the media (dd if=/dev/zero
of=/dev/ad0) as a form of verification if you wanted.

Do you happen to know if these CF cards support SMART?  If so,
installing smartmontools (version 5.42 or newer please) and providing
output from "smartctl -a /dev/ad0" may be helpful to me, but I make no
guarantees anything of use will be shown there.

Overall my advice would be to replace the CF cards, especially if they
have been in use for a long while.  It really doesn't matter to me that
it's happening on 3 machines (honest), especially if these are 6.2
machines with CF cards that have been in use for years.  We're lucky to
get 2 years out of our CF cards on our Juniper M120/320s before they
start spitting I/O errors.  Pick larger CF cards as well; more space =
more room for effective wear levelling.

>
> ad0: 1882MB <STEC M2+ CF 9.0.2 K1186-2> at ata0-master PIO4
> atapci0: <Intel 6300ESB UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x5070-0x507f mem 0x80301000-0x803013ff at device 31.1 on pci0
> ata0: <ATA channel 0> on atapci0
> ata1: <ATA channel 1> on atapci0
> atapci1: <Intel 6300ESB SATA150 controller> port 0x5088-0x508f,0x50a4-0x50a7,0x5080-0x5087,0x50a0-0x50a3,0x5060-0x506f irq 15 at device 31.2 on pci0
> ata2: <ATA channel 0> on atapci1
> ata3: <ATA channel 1> on atapci1
>
> ad0s4h is basically a r/w ufs partition on the box where almost anything that needs to be written goes.
>
> trace
> Tracing pid 1101 tid 100043 td 0x656d8460
> kdb_enter(608cc388,6246,656d8460,64ba1400,6095d580,...) at kdb_enter+0x2b
> siointr1(64ba1400) at siointr1+0xf0
> siointr(64ba1400) at siointr+0x38
> intr_execute_handler(6095d580,f0a4ab04,6,6095d580,f0a4aafc,...) at intr_execute_handler+0x61
> intr_execute_handlers(6095d580,f0a4ab04,6,0,656d8460,...) at intr_execute_handlers+0x40
> atpic_handle_intr(4) at atpic_handle_intr+0x96
> Xatpic_intr4() at Xatpic_intr4+0x20
> --- interrupt, eip = 0x606044af, esp = 0xf0a4ab48, ebp = 0xf0a4ab5c ---
> lockmgr(e1456a04,6,0,656d8460) at lockmgr+0x58f
> getdirtybuf(e14569a4,60a405e4,1) at getdirtybuf+0x2e2
> flush_deplist(68b30850,1,f0a4abb8) at flush_deplist+0x30
> flush_inodedep_deps(656fa28c,1f235) at flush_inodedep_deps+0xcf
> softdep_sync_metadata(65964618) at softdep_sync_metadata+0x61
> ffs_syncvnode(65964618,1) at ffs_syncvnode+0x3a2
> ffs_fsync(f0a4ac74) at ffs_fsync+0x12
> VOP_FSYNC_APV(60949260,f0a4ac74) at VOP_FSYNC_APV+0x38
> fsync(656d8460,f0a4acb4) at fsync+0x170
> syscall(805003b,806003b,5fbf003b,8050000,288be450,...) at syscall+0x2ee
> Xint0x80_syscall() at Xint0x80_syscall+0x1f

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
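
Archive note: the g_vfs_done() lines quoted in this message can be checked mechanically. The sketch below is not part of the original thread. It parses the five failed writes exactly as quoted, confirms that error 5 maps to EIO, and tests whether the failed requests cover one contiguous byte range on ad0s4h. The regular expression and the helper names (parse_failures, is_contiguous) are hypothetical and written only for this illustration. The offsets are taken as relative to the ad0s4h partition, as the log line suggests.

```python
import errno
import re

# The five g_vfs_done() lines as quoted from "show msgbuf" in this message.
MSGBUF = """\
g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5
g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5
g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5
g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5
g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5
"""

# Hypothetical pattern for this log format; it is not taken from FreeBSD sources.
PATTERN = re.compile(
    r"g_vfs_done\(\):(\w+)\[(\w+)\(offset=(\d+), length=(\d+)\)\]error = (\d+)"
)


def parse_failures(text):
    """Return a list of (device, op, offset, length, error) tuples."""
    return [
        (m.group(1), m.group(2), int(m.group(3)), int(m.group(4)), int(m.group(5)))
        for m in PATTERN.finditer(text)
    ]


def is_contiguous(failures):
    """True if each request starts exactly where the previous one ended."""
    return all(
        prev[2] + prev[3] == cur[2]
        for prev, cur in zip(failures, failures[1:])
    )


failures = parse_failures(MSGBUF)
span_start = failures[0][2]
span_end = failures[-1][2] + failures[-1][3]

print("failed requests:", len(failures))
print("errno name     :", errno.errorcode[failures[0][4]])
print("contiguous     :", is_contiguous(failures))
print("byte range     : %d-%d (%d KiB)" % (
    span_start, span_end, (span_end - span_start) // 1024))
```

On this excerpt the five requests are back-to-back 128 KiB writes. That fits a single large sequential write being rejected once the device stopped accepting commands, and it does not by itself point to one specific bad region of the card.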
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1329411796.23457.YahooMailNeo>