Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 1 May 2014 09:18:56 -0700
From:      David Wolfskill <david@catwhisker.org>
To:        fs@freebsd.org
Subject:   SU+J: 185 processes in state "suspfs" for >8 hrs. ... not good, right?
Message-ID:  <20140501161856.GH1120@albert.catwhisker.org>

next in thread | raw e-mail | index | archive | help

--T1STeyE7BzLxwmnN
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

I'm probably abusing things somewhat, but limits are to be pushed,
yeah...? :-}

At work, we have some build servers, presently running FreeBSD/amd64
stable/9 @r257221.  They have 2 "packages" with 6 cores each (Xeon(R)
CPU X5690  @ 3.47GHz); SMT is enabled, so the scheduler sees 24
cores.  The local "build space" is a RAID 5 array of 10 2TB drives
with a single UFS2+SU file system on it (~15TB).  The software
builds are performed within a jail (that is intended to look like
FreeBSD/i386 7.1-RELEASE).

My test workload is to:
* create a "sandbox" (by checking out the sources).
* tar up the sandbox (for future iterations).
* Iterate over:
  - Blow away the old sandbox
  - Unpack a new sandbox from the tarball
  - Enter the sandbox & perform a timed software build
  - exit the sandbox & record the results (exit status; elapsed time, &c.)

For the "blow away the old sandbox" step, I used to just use "rm
-rf" -- but because the sandbox is ... large, that tends to be
rather on the slow side.  So I cobbled up a shell script that
essentially does:

   max_proc=3D$( sysctl -n kern.maxprocperuid )
   max_proc=3D$(( $max_proc / 2 ))
   for sb in $@; do
     find $sb -type d -depth +$depth -prune -print0 | \
       xargs -0 -n 1 -P $max_proc rm -f -r &
     wait
     rm -fr $sb
   done

which tends to be faster, as the process is parallelized (vs., I
suppose, "paralyzed" :-}).

I have the use of a designated 'test machine," which I subject to my ...
experiments.

Based on various other events, as well as at least one suggestion from a
colleague, I thought I'd try turning on soft updates journaling -- so I
did.

My first set of tests were inconclusive -- I saw the load averages
increase quite a bit (from a max of ~18 to ~25); some build times were
around the same, while one was quite a bit longer.  It's possible that
one of my colleagues was doing something on the machine, though I had
tried to let them know that I was running timing tests and things were a
bit more "experimental" than usual.

So I fired off another round of tests yesterday evening.

This morning, load average was around 0 (and plenty of time had
elapsed), so I thought maybe the 2nd round of tests had completed.

I was mistaken in that belief. :-/

On resuming the tmux session, I found:

=2E.. [SU+J], iteration 5, terminated status 0 at Thu May  1 00:01:06 PDT 2=
014

So I hit ^T and saw:

load: 0.15  cmd: rm 73825 [suspfs] 16479.66r 0.00u 1.38s 0% 1436k

Hmmm...  So I waited a few minutes and hit ^T again and ... nothing.
No response.  Hmm...  I was able to get a response from another session
within the jail, as well as on the host.

It's now been almost 5 more hours after that, and logging in to the
host, I see:

test-machine(9.2)[6] ps axwwl | grep -cw suspfs
185
test-machine(9.2)[7]=20

I suspect that there may have been a bad interaction between what
I was doing and some cron-initiated activity that started just after
midnight.  But this particular mode of failure seems .... well,
"graceless" comes to mind (if I want to be quite charitable).

I think I'd like to file a PR on this, but I'd like to provide some
decent information with the PR -- and I'd like to be able to do
Other Things with the test machine in the interim.

So: does anyone have suggestions for information I might gather while
the machine is in this state that might help figure out a way to
prevent a recurrence?

Thanks!

[Please note that I've set Reply-To to include both the list and
me, as I'm not subscribed.]

Peace,
david
--=20
David H. Wolfskill				david@catwhisker.org
Taliban: Evil cowards with guns afraid of truth from a 14-year old girl.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.

--T1STeyE7BzLxwmnN
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (FreeBSD)

iQJ8BAEBCgBmBQJTYnPtXxSAAAAAAC4AKGlzc3Vlci1mcHJAbm90YXRpb25zLm9w
ZW5wZ3AuZmlmdGhob3JzZW1hbi5uZXQ4RThEMDY4QTIxMjc1MDZFRDIzODYzRTc4
QTY3RjlDOERFRjQxOTNCAAoJEIpn+cje9Bk7M/AP/3RvXjTi020uxHOYejS6j06q
JbFOb1Zx+TUre5wneRpwugA/1QFKWHYdqdkLPaN9Wvo2vPMcozw+kaT03/nVR6vK
CPbY6YMFu0ZkSgJxXYGiEV1+p5IU8uEfUN/28I4ih50LhZRba/b+B5/rJ0dKhJeB
r/ArgVKvLNxojnIjo4v6pd5UOTzXGq47/IoQU2AGk3n/KEEWGMaLdl1JXixHdybm
GRVazAGW52Oacx1SvlN0+6JlXob+D3fnfMvEbpASg/Wc0z2Ap8nhpDcHdL8irgMv
sWmSd7nzRVW1OUU10GyXEsrViYce3sBwxQquV5HJzp915Pwls9J/9M2VplT0Blet
al/UNwtlAeSY9uV3RVVOFSCeSyX+KiYnhRLyeQ5Yx4pEpFacH1lzoxLs7bwJmIsU
2isLa7TQybanqT0tCJA9IZrSQbJCvJbm9igbO3JzQI3i7mkfTwGP7Y8T4zt9TxrS
OA9VxVhotYjUfcz67/JUnGpxUuJsHOcwiXmDETaOdIohFpjct50QVjJeirR3Diee
0TtOfwcqyxBrqPugOrMHkL04vQVyLnUPoHoee8tN2Ng9+k/ai+H+jn6ncEXq5hU3
FRQ8jQG/jp9y4mvpf9cHq5wdQOwG6IL6y0QypgzrkMeCBBuu+dta+0AYg8T03adi
gXZ4mvpNaLwjLdcIibZQ
=mzUh
-----END PGP SIGNATURE-----

--T1STeyE7BzLxwmnN--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20140501161856.GH1120>