Date:      Tue, 12 Feb 2013 20:50:39 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Marc Fournier <scrappy@hub.org>
Cc:        Kostik Belousov <kib@freebsd.org>, freebsd-stable@freebsd.org, John Baldwin <jhb@freebsd.org>
Subject:   Re: 9-STABLE -> NFS -> NetAPP:
Message-ID:  <339364797.2960794.1360720239431.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <61DAA500-EB20-4861-AA7F-402FF1047B81@hub.org>

Marc Fournier wrote:
> Just reset server, so any further details will have to be 'next time'
> … but, just did a csup and am rebuilding … the following three files
> were modified since last build:
>
> grep nfs /tmp/output
> Edit src/sys/fs/nfs/nfs_commonsubs.c
> Edit src/sys/fs/nfsclient/nfs_clrpcops.c
> Edit src/sys/fs/nfsserver/nfs_nfsdserv.c
>
>
> On 2013-02-10, at 4:56 PM, Marc Fournier <scrappy@hub.org> wrote:
>
> >
> > On 2013-02-10, at 4:31 PM, Rick Macklem <rmacklem@uoguelph.ca>
> > wrote:
> >
> >> Marc Fournier wrote:
> >>> Hi John …
> >>>
> >>> Does this help?
> >>>
> >>> root@io:~ # ps auxl | grep du
> >>> root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs
> >>> root 12353 0.0 0.1 16176 5104 ?? D Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs
> >>> root 64529 0.0 0.1 16176 5164 ?? D Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs
> >>> root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd
> >> It is probably too late, but all the lines (without the | grep du)
> >> would be
> >> more useful. I also include the "H" flag, so it lists threads as
> >> well as
> >> processes. The above just says the "du" command is waiting for a
> >> vnode lock.
> >> The interesting process/thread is the one that is holding a vnode
> >> lock
> >> while waiting for something else.
> >
> > As requested, 'ps auxlH' attached …
> >
> >
> > <ps.out.bz2>
> >
Well, I took a look at the ps output and I didn't see anything that would
identify what the hang is. There are a lot of processes sleeping on "newnfs"
(waiting for a vnode lock) and many sleeping on "vofflock" (waiting for the
f_offset lock).

Unfortunately, I can't spot any process/thread that is blocked on something
else (that is, not sleeping on one of these two), which is where I'd expect
to find the holder of either an NFS vnode lock or an f_offset lock.
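
A rough sketch of how that search can be narrowed down, assuming `ps auxlH`
output in which the wait channel (MWCHAN) is the last column, as it is in the
quoted `du` lines. The helper name `other_waiters` is made up for illustration.
It drops the header line, running threads ("-"), and anything sleeping on
"newnfs" or "vofflock", so whatever remains is a candidate for a thread that
holds a lock while waiting on something else.

```sh
# Hypothetical helper (not part of FreeBSD): filter ps output read from stdin.
# Assumption: the wait channel is the last whitespace-separated field and
# each thread is on one unwrapped line.
other_waiters() {
    awk 'NR > 1 && $NF != "newnfs" && $NF != "vofflock" && $NF != "-"'
}

# Intended use on the hung machine:
#   ps auxlH | other_waiters
```

Most of what is left will be ordinary idle sleeps (select, kqread and so on),
so this only shortens the list that has to be read by eye.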

There were changes about 5 months ago which appear to have fixed a deadlock
race between vnode locks and offset locks for paging (r236321 and friends).

I am wondering if there could be other similar races, possibly specific to
paging in over NFS? (I can't see any case where there is a LOR, so I can't
think of what it might be?)

If you just want the hangs to go away, I'd suggest moving the executable
in /usr/local/sbin (httpd maybe) to a local file system on the server,
since it does seem to be related to paging this executable in over NFS.

rick
ps: I've added kib@ to the cc, in case he is aware of other related races?

> >>
> >> Are you still getting the:
> >> nfs_getpages: error 13
> >> vm_fault: pager read error, pid 11355 (https)
> >
> > Fairly quiet:
> >
> > <Screen Shot 2013-02-10 at 4.43.55 PM.png>
> >
> > And that is it since last reboot ~20 days ago …
> >
> >>
> >> messages logged?
> >>
> >> With John's recent patch, the error# would no longer be 13 if it
> >> was
> >> caused by the "intr" flag resulting in a Read RPC terminating with
> >> EINTR.
> >> If you are still getting the above with "error 13", it suggests
> >> that
> >> the server is replying EACCES for the Read RPC.
> >> I suggested before that you check to make sure that the executable
> >> had
> >> read access for everyone on the file server. Since I didn't hear
> >> back,
> >> I'll assume this is the case.
> >
> > Don't understand this question … I have 34 VPSs running off of this
> > server right now … that 'du process' runs against each of those VPSs
> > every night, and this problem started happening on Friday night's
> > run … ~18 days into uptime … so the same process has run repeatedly,
> > with no issues, 18 times before it hung on Friday … also, the hang,
> > once 'triggered', only seems to recur against the same directory …
> > the same directory doesn't necessarily trigger it, but once it
> > starts, it appears to do it for the same directory … I'm not sure if
> > I've ever seen it happening to two different directories at the same
> > time …
> >
> > Also, please note that the du command is run from the physical
> > server, as root …
> >
> >> rick
> >> ps: If it is still up and hasn't been rebooted, you could:
> >>   sysctl debug.kdb.break_to_debugger=1
> >>   - then type <ctrl><alt><esc> at the console and do the following
> >>     from the debugger
> >>   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> >>   How well this works depends on what options your kernel was built
> >>   with.
> >
> > My remote console on that one doesn't work very well … I can view,
> > but I can't type …
> >
> >


