Date:      Thu, 14 Feb 2013 11:41:45 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Marc Fournier <scrappy@hub.org>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, freebsd-stable@freebsd.org, Kostik Belousov <kib@freebsd.org>, John Baldwin <jhb@freebsd.org>
Subject:   Re: 9-STABLE -> NFS -> NetAPP:
Message-ID:  <1422247357.3019523.1360860105806.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <9A149E78-BB4F-414D-AAE5-331C5934FF82@hub.org>

Marc Fournier wrote:
> On 2013-02-13, at 3:54 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
> >>
> > The pid that is in "T" state for the "ps auxlH".
>
> Different server, last kernel update on Jan 22nd, httpd process this
> time instead of du last time.
>
> I've attached:
>
> ps auxlH
> ps auxlH of just the processes that are in TJ state (6 httpd servers)
> procstat output for each of the 6 processes
>
> They are included as attachments … if these don't make it through, let
> me know, just figured I'd try and keep it compact ...
Ok, I took a look and the interesting process seems to be 16693. It is
stopped ("T" state) and several of its threads (22, but not all) have
a procstat like this:
16693 104135 httpd            -                mi_switch+0x186 thread_suspend_check+0x19f sleepq_catch_signals+0x1c5
   sleepq_timedwait_sig+0x19 _sleep+0x2ca clnt_vc_call+0x763 clnt_reconnect_call+0xfb
   newnfs_request+0xadb nfscl_request+0x72 nfsrpc_accessrpc+0x1df nfs34_access_otw+0x56 nfs_access+0x306
   vn_open_cred+0x5a8 kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7

The sleep in clnt_vc_call is waiting for an RPC reply (while a vnode
lock is held) with PCATCH | PBDRY flags, since it is interruptible.
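
For anyone wanting to count how many threads share this stack, a rough
sketch along these lines could filter saved "procstat -kk" output. It is
only an illustration: the helper names are made up, and it assumes one
thread per line with pid, tid, command and thread name in the first four
columns (no spaces in the thread name).

```python
import re

# Frame names taken from the trace above. The column layout
# (pid, tid, comm, tdname, then frames) is an assumption.
SUSPECT_FRAMES = ("thread_suspend_check", "sleepq_catch_signals", "clnt_vc_call")


def frames_of(line):
    """Return the function names in one trace line, with +0x offsets stripped."""
    fields = line.split()
    # Skip the pid, tid, command and thread-name columns.
    return [re.sub(r"\+0x[0-9a-f]+$", "", f) for f in fields[4:]]


def suspended_in_rpc(lines):
    """Return the thread ids whose stack contains every suspect frame."""
    tids = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header, blank line, or a line without frames
        names = set(frames_of(line))
        if all(frame in names for frame in SUSPECT_FRAMES):
            tids.append(int(fields[1]))
    return tids
```

Passing it the lines of a saved "procstat -kk 16693" run would list the
thread ids that were suspended while waiting for an RPC reply.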

I can see that thread_suspend_check() is called with an argument of 1
(return_instead == 1), since there is only one call to
thread_suspend_check() in sleepq_catch_signals().

When looking at thread_suspend_check(), I basically got lost, although it
seems that it can only "return_instead" if there is a single thread and
not multiple threads doing this.

If these threads are stuck here and won't return from msleep(), that would
explain the hang.

If they would wake up and return from the msleep() when a wakeup occurs, it
would suggest that there is a lost reply or similar, so the wakeup isn't
occurring.

I also don't know whether a timeout of the msleep() will still occur and
make the msleep() return.

Although it wasn't done to fix this, it looks like jhb@'s recent patch to
head (r246417) might fix this, since it reworks how STOP signals are handled
for interruptible mounts.

Hopefully kib or jhb can provide more insight.

Btw Marc, if you just want this problem to go away, I suspect getting rid
of the "intr" mount option would do that.

rick



