Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 12 Aug 2010 14:01:57 +0200
From:      =?ISO-8859-1?Q?Ren=E9_Ladan?= <rene@freebsd.org>
To:        freebsd-current@freebsd.org, kib@freebsd.org
Subject:   Re: Filesystem wedge, SUJ-related?
Message-ID:  <AANLkTimowSpdMSCfRKn2BV3GNO_MxeBBEyjbnRiacQ-0@mail.gmail.com>
In-Reply-To: <4C5E7F8A.7030800@freebsd.org>
References:  <alpine.LNX.2.00.1007171747130.19876@ury.york.ac.uk> <alpine.LNX.2.00.1007181352110.19876@ury.york.ac.uk> <4C5E7F8A.7030800@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
Hi,

if I understand correctly, this is what r211212 and r211213 fixes?

Rene

2010/8/8 Rene Ladan <rene@freebsd.org>:
> On 18-07-2010 15:02, Gavin Atkinson wrote:
>>
>> On Sat, 17 Jul 2010, Gavin Atkinson wrote:
>>>
>>> Semi-regularly (every two-three days) I'm seeing what appears to be som=
e
>>> sort of filesystem wedge. =A0I usually see it initially with web browse=
rs,
>>> but it's possible that's only because it's what produces most disk
>>> activity on this machine. =A0I've seen it with both Opera and Firefox.
>>>
> I've been seeing this too. It still happens with kernel r211000.
>
>>> What happens is that the process will just wedge. =A0A "procstat -kk" o=
n it
>>> shows the following stack backtrace:
>>>
>>> =A09012 100243 firefox-bin =A0 =A0 =A0initial thread =A0 mi_switch+0x21=
d
>>> sleepq_switch+0x123 sleepq_wait+0x4d _sleep+0x357 getdirtybuf+0x21e
>>> flush_deplist+0x6f softdep_sync_metadata+0x153 ffs_syncvnode+0x213
>>> ffs_fsync+0x43 fsync+0x148 syscallenter+0x1b5 syscall+0x4c
>>> Xfast_syscall+0xe2
>>
> Firefox is the usual first sign:
> acer % ps ax|grep firefox
> 82117 =A0v0 =A0T =A0 =A0 16:24,08 /usr/local/lib/firefox3/firefox-bin
> 13416 =A0 3 =A0S+ =A0 =A0 0:00,00 grep firefox
> acer % procstat -kk 82117
> =A0PID =A0 =A0TID COMM =A0 =A0 =A0 =A0 =A0 =A0 TDNAME =A0 =A0 =A0 =A0 =A0=
 KSTACK
> 82117 100195 firefox-bin =A0 =A0 =A0- =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0mi_s=
witch+0x219
> thread_suspend_switch+0x103 thread_single+0x25c exit1+0x81 sigexit+0x84
> cursig+0 ast+0x1aa doreti_ast+0x1f
> 82117 100221 firefox-bin =A0 =A0 =A0initial thread =A0 mi_switch+0x219
> sleepq_switch+0xfa sleepq_wait+0x46 _sleep+0x256 getdirtybuf+0x1af
> flush_deplist+0x6a softdep_sync_metadata+0x153 ffs_syncvnode+0x22d
> ffs_fsync+0x43 fsync+0x13d syscallenter+0x194 syscall+0x41
> Xfast_syscall+0xe2
> acer %
>
>> A bit more detail: it does look like whatever is supposed to periodicall=
y
>> flush the journal just stops doing it's job. =A0Presumably this is also =
the
>> root cause of the "softdep: Out of journal space!" messages I have been
>> seeing in the past, which I had assumed may have been fixed by r209717.
>>
> I haven't seen any "softdep: Out of journal space!" messages since June 2=
4,
> but I've indeed seen it once before (somewhere after June 11).
>
>> (I'm running r209723 at the moment)
>>
>> While processes are starting to hang, "sh ffs" from ddb shows:
>>
>> db> =A0sh ffs
>> mp 0xffffff0002c45be0 / devvp 0xffffff0002c51000 fs 0xffffff0002c67000
>> su_wl 0 su_wl_in 0 su_deps 0 su_req 0
>> mp 0xffffff0002d705f0 /tmp devvp 0xffffff0002d48780 fs 0xffffff0002c6480=
0
>> su_wl 0 su_wl_in 0 su_deps 0 su_req 0
>> mp 0xffffff0002c458e8 /usr devvp 0xffffff0002d485a0 fs 0xffffff0002c6600=
0
>> su_wl 0 su_wl_in 0 su_deps 17345 su_req 0
>> mp 0xffffff0002c455f0 /var devvp 0xffffff0002d483c0 fs 0xffffff0002c6680=
0
>> su_wl 0 su_wl_in 0 su_deps 55 su_req 0
>>
>> Leaving it another couple of hours, I then see:
>>
>> db> =A0sh ffs
>> mp 0xffffff0002c45be0 / devvp 0xffffff0002c51000 fs 0xffffff0002c67000
>> su_wl 0 su_wl_in 0 su_deps 0 su_req 0
>> mp 0xffffff0002d705f0 /tmp devvp 0xffffff0002d48780 fs 0xffffff0002c6480=
0
>> su_wl 0 su_wl_in 0 su_deps 36 su_req 0
>> mp 0xffffff0002c458e8 /usr devvp 0xffffff0002d485a0 fs 0xffffff0002c6600=
0
>> su_wl 0 su_wl_in 0 su_deps 31899 su_req 0
>> mp 0xffffff0002c455f0 /var devvp 0xffffff0002d483c0 fs 0xffffff0002c6680=
0
>> su_wl 0 su_wl_in 0 su_deps 95 su_req 0
>>
>> so, su_deps is increasing significantly.
>>
>> During reboot, vnlru failed to stop within 60 seconds, and gave up on
>> syncing 125 vnodes and 140 buffers (no idea if these are related). =A0On
>> reboot, SU+J fsck shows for /usr:
>>
>> ** SU+J Recovering /dev/ad4s1f
>> ** Reading 33554432 byte journal from inode 150.
>> ** Building recovery table.
>> ** Resolving unreferenced inode list.
>> ** Processing journal entries.
>> ** 405991 journal records in 18194944 bytes for 71.40% utilization
>> ** Freed 3872 inodes (0 dirs) 48157 blocks, and 8744 frags.
>>
> Similar here.
>
>> So it seems clear that somehow the journal is filling up, and never bein=
g
>> written.
>>
>> Any other suggestions as to where I should go from here?
>>
> Disabling the journal would be a "solution", but not desirable.
> Maybe any lock order reversals to look out for (most are ufs-related) ?
>
> I don't know if it is related, but yesterday a full fsck on /usr cleared =
up
> two unallocated files in /usr/ports/editors/openoffice-3/work/ (they were=
 in
> userland as having a bad file descriptor), which the journal didn't catch=
.
>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?AANLkTimowSpdMSCfRKn2BV3GNO_MxeBBEyjbnRiacQ-0>