Date:      Sat, 1 Feb 2014 20:31:12 +0100
From:      Matthew Rezny <matthew@reztek.cz>
To:        freebsd-stable@freebsd.org
Subject:   Processes hang in state "kmem a", system hang follows
Message-ID:  <20140201203112.0000210c@unknown>

I'm seeing rather strange behavior from 10.0 on i386 thus far. This is
another long message, so if you want the summary without back-story,
skip to the end. Sometimes it's hard to include relevant details without
feeling like I'm rambling.

I started with FreeBSD not long before the 4.0 release and ran 4.x
releases on i386 and Alpha for a long time. I tried the 5.x releases
and had nothing but trouble, so I stuck with 4.x through that time. The
Alpha never did move off 4.x before it got retired, but some of my i386
boxes made it onto 6.x and then sat there until they were taken out of
active use. For years, FreeBSD 4.x and 6.x were the reliable OSes I
used for everything but my desktop (which ran OS X).

More recently I started using FreeBSD 8 on amd64 with ZFS and quickly
moved on to 9 as soon as 9.0 was released. At the same time, i386
hardware retired from desktop roles but suitable for network services
got 8.x installed on UFS. I had a rather good experience with 9-STABLE
on amd64 running with ZFS. For the most part it's solid; ZFS support is
much better than the sorry state Apple left it in before abandoning it
on OS X, though I did get a few kernel panics when simply connecting
disks that contained zpools from OS X. Due to both the compilation
speed difference and the fact that older hardware tends to be in more
entrenched roles, I left my i386 systems out of the ZFS and 9.x
experiments. I also tried 9.x on my one ppc64 box at various times to
see if that might be a good way to utilize hardware Apple had dropped
support for years prior. The state on ppc64 varied from panicking on
boot to being able to buildworld, but an idle system left for a few
days would randomly go zombie: the console freezes, yet there is
clearly some system activity and it responds to ping, though it might
not take an ssh connection. I chalked that up to the experimental state
of the port. I did see console freezes on i386 boxes booted from a 9.1
mfsbsd image but never investigated, because I was just using it to
image and erase disks on old machines where I considered the hardware
suspect.

In the last couple months I've been moving my amd64 systems to 10,
starting during the RCs and keeping up such that they are now all
10-STABLE. The transition was fairly smooth and they are running quite
well. Even one box with an older chipset and BIOS, which was panicking
with an early 10-BETA, is now running 10.0-RELEASE with KMS. All very
impressive. So, time to start migrating some i386 boxes I figure. I had
recently moved a number of them to 9.2 and figured I should just go
ahead and move everything up to 10.0 at close to the same time if
possible. I had seen no problems with 9.2 or 9-STABLE on the i386 boxes
that I was preparing to upgrade; I had already sorted out one Clang bug
that affected a few of them (though less severe than a similar GCC bug
that remains unfixed) since I had switched compilers when going to 9.

Since I started moving i386 boxes to 10.0, I've had nothing but strange
problems. Last night I wrote a message about kern.maxswzone, something
I started getting warnings about on one particular box when I put 9.2 on
it but which I didn't try to do anything about until now. I wrote that
message with this one in mind, mentioning that I would have another
about processes hanging. That one came first because it has at least
some hard numbers and not so much subjective feelings of performance
and reliability. Between then and now, the pattern struck me: all my
early successes with 10 were on amd64, and now all the i386 boxes I've
upgraded are barely functional.

I have four i386 boxes that I tried to put 10.0 on in the past week,
with varying degrees of failure. There are two sets within the four:
two are the low-end C3 boxes with 256MB and 384MB RAM described in my
prior message
to the list. The other two are Pentium4 systems, one with 2GB RAM and
the other with 3GB, substantially bigger disks, decent GPU, etc. In
other words, two are ancient and two are merely a little dated but
still very usable. This faster pair I will mention first, then I will
return to the slow pair. All these boxes are things I use around
the house for network services or as essentially terminals in other
rooms (kitchen pc to look up stuff, bedroom pc to watch movies, etc).
The i386 boxes that run important services (externally facing network
services, routing/firewall, etc.) are being left to a second round once
all issues are sorted out on these lower-importance boxes first.

The P4s had 9-STABLE installed on UFS volumes. I did the switch from
csup to svnup to pull the 10.0 sources, did the buildworld/kernel and
install on both and all looked good. Before I went on to reinstall
packages or anything else, I decided now might be a good time to try
switching from UFS to ZFS; everything in /home was already backed up. So
far I had only tried ZFS on amd64 due to early reports of flakiness on
i386 related to exhausting kernel memory. In the couple years since
initial support, the ZFS code has gotten better integrated, more people
have tried it, some tuning guides have been written, and I've seen
reports of it being used on boxes with 512MB RAM. Most of my i386 boxes
in server roles have 2GB and it would be nice to migrate those to ZFS
if possible. Best to test on these boxes first and try tuning if needed.
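For reference, the kind of tuning those guides describe goes in
/boot/loader.conf. The values below are only illustrative examples of
that style of tuning, not settings I have tested on these boxes:

    # illustrative i386 ZFS tuning only; values are examples, not recommendations
    vm.kmem_size="512M"
    vm.kmem_size_max="512M"
    vfs.zfs.arc_max="128M"
    vfs.zfs.vdev.cache.size="5M"
    # the guides also suggest rebuilding the i386 kernel with a larger KVA_PAGES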

I booted both P4 boxes from the mfsbsd CD, mounted the existing UFS
volumes, tarred up the whole mess, and dropped the uncompressed tar on
my file server. On the server, I fired off xz to compress the tar file
to speed the restore (or so I thought) while I prepared the machines. I
set up the zpools in the normal way I'd done all my amd64 boxes. One P4
box has a single disk, the other has two, so one is a single-vdev pool
and the other has multiple vdevs, which adds a little variety for
testing. Aside from vdevs, the pool properties, filesystems and their
properties are all identical to how I've been setting up my other ZFS
boxes: LZ4 on most filesystems, gzip or none on a few, sha256 checksums
throughout, no dedup, pretty normal. With the pools configured and
mounted on /zroot, I scp the tar.xz file for each box into /tmp (which
is tmpfs) and try tar xjpvf in /zroot.
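For concreteness, the pool setup was along these lines; the device name
and dataset layout here are placeholders rather than the exact commands
I typed:

    # placeholder device/dataset names; properties match the description above
    zpool create -o altroot=/zroot -O compression=lz4 -O checksum=sha256 -O dedup=off zroot ada0p3
    zfs create zroot/usr
    zfs create -o compression=gzip zroot/usr/src
    zfs create -o compression=off zroot/tmp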

After initial good progress, both boxes seemed to hang at about the
same time. Disk activity stops, tar is sitting there as if it's going
to do something, but no further progress on either when left for an
hour. I started top on both boxes and notice that the tar process on
each is in the state "kmem a" and the resident memory allocation on
each is exactly the same (around 750MB). My first thought was that I
used too much RAM with the 500MB tar.xz file in tmpfs. One box says
800MB free and the other says 1800MB free, but maybe there is a
shortage of kernel memory. I can't seem to kill tar, so I just reboot
each, clear the zpools to try again from a fresh state, enable the swap
before filling /tmp this time, then attempt another extract. No joy: it
stops the same way, with the exact same memory allocation, and each box
stops on the exact same file as on the first attempt. The free memory
reports are the same as before and no swap is being used, so whatever
is running out must be non-pageable.
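Incidentally, top truncates that state field. With the PID of the stuck
tar, something like the following should show the full wait channel and
the kernel stack of the hung process:

    # <pid> is the stuck tar process as reported by top
    ps -o pid,state,wchan,command -p <pid>
    procstat -kk <pid>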

The next thing I try is decoupling the stages. The tar process is
growing so large because it has to decompress LZMA, which requires a
huge dictionary. I figure maybe the heavy disk I/O is causing
buffers/cache to contend with the process in some way. Reboot again for
a fresh start, scp the .tar.xz to /zroot/tmp, xz -d so it's just a
plain tar, then tar xpvf in /zroot, and both complete without error
(the exact sequence is sketched just after this paragraph). Set the
mountpoint to / for each zroot and reboot into the running
system. That was strange but solvable. I don't know what the "kmem a"
state is but I can guess it's probably short for something like "kmem
alloc" which would suggest to me the process is waiting on a kernel
allocation. So I figure I've got some tuning to do and a hung process
isn't as bad as the kernel panics others had reported on i386 under
heavy I/O load (e.g. rsync) with default settings. After all, the boot
messages include two warnings about tuning ZFS memory on i386. In order
to do the tuning, I need some reproducible load, and buildworld is good
for that. So, the first thing is to switch from svnup to the svnlite
that is now in base and use that to get 10-STABLE sources. I do the
rm -r on /usr/src and /usr/ports and then fire off the svnlite co for
each. I find that the slowness of svn checkout is due to network
latency, and running the two in parallel doesn't create I/O contention
on either disk or network.
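To be concrete, the sequence that actually worked for the restore was
roughly this (hostname and file names are placeholders):

    # decompress first, then extract as a separate step
    scp fileserver:/backup/p4box.tar.xz /zroot/tmp/
    xz -d /zroot/tmp/p4box.tar.xz
    cd /zroot && tar xpvf /zroot/tmp/p4box.tar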

While the P4s are fetching their sources, I go to deal with the pair of
Via C3 boxes that I had taken to 10-PRERELEASE just a week prior and
was ready to upgrade to 10-STABLE. Since that upgrade, they sat unused
waiting for an impending MFC so I could do away with a local patch. As
mentioned in my other message, I made a mistake here on my first
attempt: I forgot to clear the existing /usr/src and /usr/ports before
starting the svnlite checkout. After realizing my mistake, I did the
now larger (as it includes a .svn dir) rm -r of those dirs to start
fresh. That's when I hit the problem with rm hanging on one box.
Without repeating all the details, I had to boot mfsbsd to do the rm on
the one box with only 256MB RAM, but what difference that made is simply
inexplicable. Once I had gotten that straightened out, I started off
the svnlite checkout fresh. On the box with 384MB, the checkout completed with
only one restart for network dropout (common since it takes 2-3 hours
per checkout). On the box with 256MB (which had previously fully
checked out and gotten to the point where it wanted to prompt me for
the conflict on every file in the tree), svnlite could only do a
hundred files or so before it seemed to hang in the same way as rm.
Running just one instance on /usr/src without the parallel checkout
on /usr/ports made no difference. When rm was hanging, I might be able
to kill it (after waiting several minutes) and reboot, or the console
might lock. When svnlite hung, I could not log in, but I might be able
to run a
command on another VT. I was able to catch that svnlite is getting
stuck in the state "kmem a". Hmmm... the same state that tar was
getting stuck in on the other boxes. How were those doing now?

I look back at the P4s, which should be done as it's been a few hours
spent on the C3 boxes. They are sitting there in the middle of
checkout, not making any visible progress. Ctrl-c doesn't work, I can't
switch VTs, and even ctrl-alt-del seems not to work. Seems like the
consoles are hung in a way eerily similar to what I'd seen from 9.x on
non-amd64 platforms (both ppc64 and i386). I attempted to initiate an
ssh connection into each of the P4s and then walked off for a minute
for refreshment. When I came back, expecting to find a login prompt or
a timeout, I found the ssh attempts had timed out and the two boxes had
rebooted. I don't know if the ctrl-alt-del finally registered or if the
incoming ssh connection pushed them over the edge. I wasn't there to
see and the logs for both stop sometime before the hang. With both
rebooted, I do a svnlite cleanup in /usr/src and /usr/ports on both,
then fire off the svnlite co for each directory on both boxes.
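That is, on each box, something along these lines (the repository URLs
here are from memory; use whichever mirror you normally would):

    svnlite cleanup /usr/src && svnlite cleanup /usr/ports
    svnlite checkout https://svn.freebsd.org/base/stable/10 /usr/src
    svnlite checkout https://svn.freebsd.org/ports/head /usr/ports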

While those were running, I started digging into the kern.maxswzone
tunable on the C3 box with less RAM. The box with more RAM was able
to do the rm, svn checkout of both src and ports in parallel, and showed
no obvious sign of trouble, though I hadn't started a buildworld yet.
The box with less RAM was failing all over the place and the only
obvious difference was the warning about that tunable. After I wasted
hours figuring out that the value is already sufficient but is
apparently reduced after it's set (so it can't effectively be turned
up, only down), I wrote my previous message to this list on that topic
specifically and then went to bed.
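(For anyone searching the archives later: kern.maxswzone is a loader
tunable, so it goes in /boot/loader.conf; the value below is only an
example, not a recommendation.)

    # example only; value is in bytes
    kern.maxswzone="33554432"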

This morning I got up and was already thinking about the correlation,
that 10 is a disaster on all my i386 boxes thus far. The first thing I
checked was the P4 boxes. Both completed the svn checkout on both src
and ports, good sign. However, the box with 3GB RAM has the message
"vm_thread_new: kstack allocation failed" repeated about a dozen times,
bad sign. First thing I do is try to run top to see what the size of
ARC is, free RAM, etc. "No more processes." Uh oh, that's no good at
all; I can't even run top (presumably those failed kstack allocations
are exactly why no new process can be started). Curiously, the box with
less RAM, only 2GB, has no messages, so I try to start top on it to see
what its state is. Nothing happens when I push return; the cursor is
just sitting there after top. On another VT, reboot gets the same
response: none, the cursor just sits. I can't type but I can switch VTs
and scroll, until I do ctrl-alt-del, and then every key press after
that is a beep. Back on the one that said there were no processes left
for top, reboot gets the same non-response. ctrl-alt-del doesn't beep,
it just spits out the ^[[3~
typical of a dead console. Ugh, not even a reset button to punch on
these P4 boxes.

So, svnlite checkout is a real strain that can bring a system to its
knees. I'm not sure if this should be regarded as horrible inefficiency
or as a means of checking the box before launching into a buildworld
(as if that wasn't enough strain to uncover most problems). While 10.0
is good on amd64, it seems a disaster on i386. Processes hang in this
"kmem a" state, and it doesn't take much more to get the box to
livelock. I've only seen the "kmem a" state a few times, as most other
times I can't inspect anything before the box is locked too hard to do
anything. In some cases I'm not sure there's even a way to get the box
shut down cleanly, as the most trivial of things lock it up hard. It's
not even required to do anything. When I was experimenting with
kern.maxswzone last night I rebooted one box a few dozen times, so if I
didn't need to look at sysctl output I just hit ctrl-alt-del at the
login prompt. Once, the console died right then: it had just booted,
ctrl-alt-del was met with a beep, and then it hung; I had to punch
reset. I'm guessing the console dies as a result of total wedging
of I/O systems following heavy disk I/O. The cause is not just ZFS
because the C3 boxes are UFS. The problem is not just the excess swap
on the smallest box because I see the same sort of troubles on the box
with the most RAM. Some kernel resource seems to be exhausted
regardless of how much RAM or swap is present. 
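If anyone wants numbers to compare, the things I can think to watch the
next time a box is still responsive are all standard sysctls and vmstat
output, roughly:

    # kernel memory limits and allocator usage
    sysctl vm.kmem_size vm.kmem_size_max
    sysctl kstat.zfs.misc.arcstats.size   # only applies where ZFS is loaded
    vmstat -m
    vmstat -z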

I'm going to try buildworld on 3 of these to see what happens. For the
fourth, I still need to get sources onto the disk before I can even
attempt that. I'm not sure what to expect. It might be instant
miserable failure, or it might actually run a long time since the I/O
load is in bursts with lots of recovery time between. It'll take a few
hours to see if the P4s succeed. It'll take two days to see a C3
succeed. Maybe by that time, someone will get through all I've written
and have some useful suggestion for debugging. To me, it's rather hard
to debug since I have little hint where to start, any logging stops
when the problem manifests, and the box ends up in a state where it is
essentially unobservable without a JTAG to jump in and directly inspect
the state of its world.
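One avenue I may try is building a kernel with the debugger compiled
in, so there is at least a chance of breaking in when the console looks
dead. This is just the standard DDB setup, nothing specific to this
problem:

    # kernel config additions
    options KDB
    options DDB
    # on the running system, allow breaking into the debugger from the console
    sysctl debug.kdb.break_to_debugger=1
    # at the db> prompt, ps and alltrace would be the first things to look at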


