Date:      Sun, 23 Dec 2007 04:08:09 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Kostik Belousov <kostikbel@gmail.com>
Cc:        "Freebsd-Net@Freebsd. Org" <freebsd-net@FreeBSD.org>, freebsd-stable@FreeBSD.org
Subject:   Re: Packet loss every 30.999 seconds
Message-ID:  <20071223032944.G48303@delplex.bde.org>
In-Reply-To: <20071222050743.GP57756@deviant.kiev.zoral.com.ua>
References:  <20071221234347.GS25053@tnn.dglawrence.com> <MDEHLPKNGKAHNMBLJOLKMEKLJAAC.davids@webmaster.com> <20071222050743.GP57756@deviant.kiev.zoral.com.ua>

On Sat, 22 Dec 2007, Kostik Belousov wrote:

> On Fri, Dec 21, 2007 at 05:43:09PM -0800, David Schwartz wrote:
>>
>> I'm just an observer, and I may be confused, but it seems to me that this is
>> motion in the wrong direction (at least, it's not going to fix the actual
>> problem). As I understand the problem, once you reach a certain point, the
>> system slows down *every* 30.999 seconds. Now, it's possible for the code to
>> cause one slowdown as it cleans up, but why does it need to clean up so much
>> 31 seconds later?

It is just searching for things to clean up, and doing this pessimally,
due to unnecessary cache misses and (more recently) the introduction of
overheads, for handling the case where the mount point is unlocked
during the search, into the fast path where the mount point is never
unlocked.

The search every 30 seconds or so is probably more efficient, and is
certainly simpler, than managing the list on every change to every vnode
for every file system.  However, it gives a high latency in non-preemptible
kernels.
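
For concreteness, a minimal sketch of the kind of scan in question,
modelled loosely on the vfs_msync()/ffs_sync() loops; the function name
and the exact "is it clean?" test here are assumptions for the example,
not quotes of the source:

/*
 * Illustrative sketch only: a whole-mount scan in the style of
 * vfs_msync()/ffs_sync().  The function name and the clean-vnode
 * test are assumptions for the example.
 */
static void
scan_mount_vnodes(struct mount *mp)
{
	struct vnode *vp, *mvp;	/* mvp: the iterator's marker vnode */

	MNT_ILOCK(mp);
	MNT_VNODE_FOREACH(vp, mp, mvp) {
		VI_LOCK(vp);
		/*
		 * The common case: the vnode is clean, so the whole
		 * visit costs two lock operations plus the cache
		 * misses for touching the vnode and the marker.
		 */
		if (vp->v_bufobj.bo_dirty.bv_cnt == 0) {
			VI_UNLOCK(vp);
			continue;
		}
		VI_UNLOCK(vp);
		MNT_IUNLOCK(mp);
		/* ... vget() the vnode and flush it here ... */
		MNT_ILOCK(mp);
	}
	MNT_IUNLOCK(mp);
}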

>> Why not find/fix the actual bug? Then work on getting the yield right if it
>> turns out there's an actual problem for it to fix.

Yielding is probably the correct fix for non-preemptible kernels.  Some
operations just take a long time, but are low priority, so they can
afford to be preempted.  This operation is partly under user control,
since any user can call sync(2) and thus generate the latency every
<latency> seconds.  But this is no worse than a user generating even
larger blocks of latency by reading huge amounts from /dev/zero.  My
old latency workaround for the latter (and other huge i/o's) is still
sort of necessary, though it now works bogusly (the hogticks check
doesn't work since switchticks is reset on context switches to
interrupt handlers; however, any context switch mostly fixes the
problem).  My old latency workaround only reduces the latency to a
multiple of 1/HZ, 200 ms by default, so it is still supposed to allow
latencies much larger than the ones that cause problems here; but its
bogus current operation tends to give latencies of more like 1/HZ,
which is short enough when HZ has its default misconfiguration of 1000.
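
For reference, the check involved looks like this (in the style of the
one in uiomove(); context elided, exact form approximate):

/*
 * Old workaround, sketched: reschedule once this thread has run for
 * hogticks ticks.  The bogosity described above: switchticks is also
 * reset by a switch to an interrupt thread, so the test tends to fire
 * after more like 1/HZ than after hogticks.
 */
if (ticks - PCPU_GET(switchticks) >= hogticks)
	uio_yield();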

I still don't understand the original problem, that the kernel is not
even preemptible enough for network interrupts to work (except in 5.2
where Giant breaks things).  Perhaps I misread the problem, and it is
actually that networking works but userland is unable to run in time
to avoid packet loss.

>> If the problem is that too much work is being done at a stretch and it turns
>> out this is because work is being done erroneously or needlessly, fixing
>> that should solve the whole problem. Doing the work that doesn't need to be
>> done more slowly is at best an ugly workaround.

Lots of necessary work is being done.

> Yes, rewriting the syncer is the right solution. It probably cannot be done
> quickly enough. If the yield workaround provides mitigation for now, it
> shall go in.

I don't think rewriting the syncer just for this is the right solution.
Rewriting the syncer so that it schedules the actual i/o more
efficiently might happen to include a solution, but better scheduling
would probably take more CPU and so increase this problem.

Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is
needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH()
itself (see the sketch after the list).  There are 3 places in vfs and
14 places in 6 file systems:

% ./ufs/ffs/ffs_snapshot.c:	MNT_VNODE_FOREACH(xvp, mp, mvp) {
% ./ufs/ffs/ffs_snapshot.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ffs/ffs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ffs/ffs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./fs/msdosfs/msdosfs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./fs/coda/coda_subr.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./gnu/fs/ext2fs/ext2_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./gnu/fs/ext2fs/ext2_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_default.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_subr.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_subr.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./nfs4client/nfs4_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./nfsclient/nfs_subs.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./nfsclient/nfs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
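
Doing it internally would mean moving the yield into the iterator's
helper, so that all 17 places inherit it.  A sketch against
__mnt_vnode_next() in kern/vfs_mount.c follows; the v_yield counter in
the marker vnode and the threshold of 500 vnodes are assumptions for
the example:

struct vnode *
__mnt_vnode_next(struct vnode **mvp, struct mount *mp)
{
	struct vnode *vp;

	mtx_assert(MNT_MTX(mp), MA_OWNED);
	/* Every 500 vnodes, drop the mount interlock and reschedule. */
	if ((*mvp)->v_yield++ == 500) {
		MNT_IUNLOCK(mp);
		(*mvp)->v_yield = 0;
		uio_yield();
		MNT_ILOCK(mp);
	}
	vp = TAILQ_NEXT(*mvp, v_nmntvnodes);
	/* Skip markers belonging to other concurrent iterations. */
	while (vp != NULL && vp->v_type == VMARKER)
		vp = TAILQ_NEXT(vp, v_nmntvnodes);
	if (vp == NULL) {
		/* End of the list: clean up the marker and stop. */
		__mnt_vnode_markerfree(mvp, mp);
		return (NULL);
	}
	/* Move our marker past the vnode we are about to return. */
	TAILQ_REMOVE(&mp->mnt_nvnodelist, *mvp, v_nmntvnodes);
	TAILQ_INSERT_AFTER(&mp->mnt_nvnodelist, vp, *mvp, v_nmntvnodes);
	return (vp);
}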

Only file systems that support writing need it (for VOP_SYNC() and for
MNT_RELOAD); otherwise there would be many more places.  There would
also be more places if MNT_RELOAD support were not missing from some
file systems.

Bruce


