Date:      Fri, 29 May 2009 11:29:15 -0700
From:      Kip Macy <kmacy@freebsd.org>
To:        Larry Rosenman <ler@lerctr.org>
Cc:        freebsd-current@freebsd.org
Subject:   Re: ZFS Crash
Message-ID:  <3c1674c90905291129h7bd6fb6ai6ab772e3aed624d@mail.gmail.com>
In-Reply-To: <alpine.BSF.2.00.0905291242060.77764@thebighonker.lerctr.org>
References:  <alpine.BSF.2.00.0905250040230.1781@borg> <3c1674c90905242253n544c3f0cqb10952f349391ce7@mail.gmail.com> <454b8cc37c60ab7af2663ba70ddbfd59.squirrel@webmail.lerctr.org> <5a9a181a12e9e4ef864d23ae063f7277.squirrel@webmail.lerctr.org> <alpine.BSF.2.00.0905250803350.79867@borg> <alpine.BSF.2.00.0905260702300.1820@borg> <3c1674c90905280055h740bce23p33b18fefacf31196@mail.gmail.com> <alpine.BSF.2.00.0905280724480.58845@borg> <alpine.BSF.2.00.0905291242060.77764@thebighonker.lerctr.org>

I'm fairly certain I know what the problem is. The (de)compress
functions allocate their own memory completely independently of the
ARC limits. The allocations are blocking, so the system will try to
page in an attempt to provide the requested memory.
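
As a toy illustration of that accounting gap (this is not ZFS code, and
every name and number below is made up for the sketch): a cache that
honours its own cap can still push the system past physical memory if
other allocations are never counted against that cap.

```python
class ToyMemory:
    """Toy model: a capped cache plus allocations the cap never sees."""

    def __init__(self, physical, cache_limit):
        self.physical = physical        # total memory in the model (arbitrary units)
        self.cache_limit = cache_limit  # hypothetical stand-in for the ARC limit
        self.cache_used = 0
        self.untracked = 0              # memory allocated outside the cache's accounting

    def cache_insert(self, size):
        # The cache enforces its own limit: it never grows past cache_limit.
        self.cache_used = min(self.cache_limit, self.cache_used + size)

    def scratch_alloc(self, size):
        # Hypothetical stand-in for a (de)compress scratch buffer.
        # It is never checked against cache_limit.
        self.untracked += size

    def shortfall(self):
        # Demand beyond physical memory; in a real system this is
        # roughly where paging would have to make up the difference.
        return max(0, self.cache_used + self.untracked - self.physical)


mem = ToyMemory(physical=1000, cache_limit=600)
mem.cache_insert(900)    # clamped to the cache limit
mem.scratch_alloc(500)   # bypasses the limit entirely
print(mem.cache_used, mem.untracked, mem.shortfall())
```

In this model the cache stays at its limit the whole time, yet total
demand still exceeds physical memory once the untracked allocations
are added in.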


Cheers,
Kip

On Fri, May 29, 2009 at 10:44 AM, Larry Rosenman <ler@lerctr.org> wrote:
> On Thu, 28 May 2009, Larry Rosenman wrote:
>
>> On Thu, 28 May 2009, Kip Macy wrote:
>>
>>> On Tue, May 26, 2009 at 5:04 AM, Larry Rosenman <ler@lerctr.org> wrote:
>>>>
>>>> On Mon, 25 May 2009, Larry Rosenman wrote:
>>>>
>>>>> On Mon, 25 May 2009, Larry Rosenman wrote:
>>>>>
>>>>>> after looking at the code, never mind the "don't call doadump", so
>>>>>> we'll get the textdump.
>>>>>>
>>>>>> Thanks rwatson for the textdump stuff!
>>>>>>
>>>>> Here are the current stats before we crash.  Does any of this look
>>>>> totally out of line?
>>>>>
>>>> It crashed again, but did *NOT* make it into ddb enough to do the
>>>> textdump.
>>>>
>>>> It was hung with the backtrace (looks like the same, but I couldn't
>>>> scroll the screen back).
>>>>
>>>> Ideas?
>>>>
>>>> I'm really concerned that there is a problem.
>>>>
>>>>
>>>>
>>>
>>>
>>> - Type of disks?
>>
>> 6 SATA Seagate 400GB (5) / 500 GB (1).
>>
>>
>> ATA channel 0:
>>   Master: acd0 <Memorex DVD+-RAM 510L v1/MWS7> ATA/ATAPI revision 7
>>   Slave:       no device present
>> ATA channel 2:
>>   Master:  ad4 <ST3400620AS/3.AAJ> SATA revision 2.x
>>   Slave:       no device present
>> ATA channel 3:
>>   Master:  ad6 <ST3400620AS/3.AAJ> SATA revision 2.x
>>   Slave:       no device present
>> ATA channel 4:
>>   Master:  ad8 <ST3500630AS/3.AAE> SATA revision 2.x
>>   Slave:       no device present
>> ATA channel 5:
>>   Master: ad10 <ST3400620AS/3.AAJ> SATA revision 2.x
>>   Slave:       no device present
>> ATA channel 6:
>>   Master: ad12 <ST3400620AS/3.AAJ> SATA revision 2.x
>>   Slave:       no device present
>> ATA channel 7:
>>   Master: ad14 <ST3400620AS/3.AAJ> SATA revision 2.x
>>   Slave:       no device present
>>>
>>>
>>> - Size of zpools?
>>
>> All 6.
>>
>>  pool: vault
>> state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>        corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>>        entire pool from backup.
>>  see: http://www.sun.com/msg/ZFS-8000-8A
>> scrub: none requested
>> config:
>>
>>        NAME        STATE     READ WRITE CKSUM
>>        vault       ONLINE       0     0     0
>>          raidz1    ONLINE       0     0     0
>>            ad6     ONLINE       0     0     0
>>            ad8     ONLINE       0     0     0
>>            ad10    ONLINE       0     0     0
>>            ad12    ONLINE       0     0     0
>>            ad14    ONLINE       0     0     0
>>          ad4s1f    ONLINE       0     0     0
>>          ad4s1e    ONLINE       0     0     0
>>          ad4s1d    ONLINE       0     0     0
>>
>> errors: 10 data errors, use '-v' for a list
>>
>>
>>  pool: vault
>> state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>        corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>>        entire pool from backup.
>>  see: http://www.sun.com/msg/ZFS-8000-8A
>> scrub: none requested
>> config:
>>
>>        NAME        STATE     READ WRITE CKSUM
>>        vault       ONLINE       0     0     0
>>          raidz1    ONLINE       0     0     0
>>            ad6     ONLINE       0     0     0
>>            ad8     ONLINE       0     0     0
>>            ad10    ONLINE       0     0     0
>>            ad12    ONLINE       0     0     0
>>            ad14    ONLINE       0     0     0
>>          ad4s1f    ONLINE       0     0     0
>>          ad4s1e    ONLINE       0     0     0
>>          ad4s1d    ONLINE       0     0     0
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>       /usr/local/sbin/p4d
>>       /var/db/bacula/borg-dir.conmsg
>>       vault/usr/obj:<0x16c3a>
>>       vault/usr/obj:<0x169bb>
>>       /usr/obj/usr/src/lib/libc/random.o
>>
>>>
>>>
>>> - Compression enabled?
>>
>> Yes.
>>
>>
>>
>
> Ok, it just crashed.  Unfortunately, I'm at work and the box is at home.
>
> I did have my script running every minute of that entire boot.
>
> What I saw was a full backup running, and then we started paging, and then
> the backup jobs got pager errors, and were killed.
>
> I'm not sure what else went on, so I restarted the bacula daemons that
> got killed, and was in the bacula console when it died.
>
> I'll see if I can get a cell-phone camera shot of the console.
>
> I'll also tar up the vmstat outputs and put them on my web server.
>
> What other forensics should I get?  Bear in mind the system is probably
> locked up with no dump taken :(
>
>
> --
> Larry Rosenman                     http://www.lerctr.org/~ler
> Phone: +1 512-248-2683                 E-Mail: ler@lerctr.org
> US Mail: 430 Valona Loop, Round Rock, TX 78681-3893
>



-- 
When bad men combine, the good must associate; else they will fall one
by one, an unpitied sacrifice in a contemptible struggle.

    Edmund Burke
