From owner-freebsd-net@FreeBSD.ORG Tue Dec 18 12:37:02 2007
Date: Tue, 18 Dec 2007 23:36:50 +1100 (EST)
From: Bruce Evans
To: Scott Long
Cc: freebsd-net@FreeBSD.ORG, freebsd-stable@FreeBSD.ORG, David G Lawrence
Subject: Re: Packet loss every 30.999 seconds
Message-ID: <20071218233644.U756@besplex.bde.org>
In-Reply-To: <47676E96.4030708@samsco.org>
References: <20071217103936.GR25053@tnn.dglawrence.com> <20071218170133.X32807@delplex.bde.org> <47676E96.4030708@samsco.org>

On Mon, 17 Dec 2007, Scott Long wrote:

> Bruce Evans wrote:
>> On Mon, 17 Dec 2007, David G Lawrence wrote:
>>
>>> One more comment on my last email... The patch that I included is not
>>> meant as a real fix - it is just a bandaid.  The real problem appears to
>>> be that a very large number of vnodes (all of them?) are getting synced
>>> (i.e. calling ffs_syncvnode()) every time.
>>> This should normally only
>>> happen for dirty vnodes.  I suspect that something is broken with this
>>> check:
>>>
>>>         if (vp->v_type == VNON || ((ip->i_flag &
>>>             (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
>>>             vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
>>>                 VI_UNLOCK(vp);
>>>                 continue;
>>>         }
>>
>> Isn't it just the O(N) algorithm with N quite large?  Under ~5.2, on
>
> Right, it's a non-optimal loop when N is very large, and that's a fairly
> well understood problem.  I think what DG was getting at, though, is
> that this massive flush happens every time the syncer runs, which
> doesn't seem correct.  Sure, maybe you just rsynced 100,000 files 20
> seconds ago, so the upcoming flush is going to be expensive.  But the
> next flush 30 seconds after that shouldn't be just as expensive, yet it
> appears to be so.

I'm sure it doesn't cause many bogus flushes.  iostat shows zero writes
caused by calling this incessantly using "while :; do sync; done".

> This is further supported by the original poster's
> claim that it takes many hours of uptime before the problem becomes
> noticeable.  If vnodes are never truly getting cleaned, or never getting
> their flags cleared so that this loop knows that they are clean, then
> it's feasible that they'll accumulate over time, keep on getting flushed
> every 30 seconds, keep on bogging down the loop, and so on.
Using "find / >/dev/null" to grow the problem and make it bad after a few
seconds of uptime, and profiling of a single sync(2) call to show that
nothing much is done except the loop containing the above: under ~5.2, on
a 2.2GHz A64 UP in i386 mode:

after booting, with about 700 vnodes:

%   %  cumulative    self              self     total
%  time   seconds   seconds    calls  ns/call  ns/call  name
%  30.8      0.000     0.000        0  100.00%           mcount [4]
%  14.9      0.001     0.000        0  100.00%           mexitcount [5]
%   5.5      0.001     0.000        0  100.00%           cputime [16]
%   5.0      0.001     0.000        6    13312    13312  vfs_msync [18]
%   4.3      0.001     0.000        0  100.00%           user [21]
%   3.5      0.001     0.000        5    11321    11993  ffs_sync [23]

after "find / >/dev/null" was stopped after saturating at 64000 vnodes
(desiredvnodes is 70240):

%   %  cumulative    self              self     total
%  time   seconds   seconds    calls  ns/call  ns/call  name
%  50.7      0.008     0.008        5  1666427  1667246  ffs_sync [5]
%  38.0      0.015     0.006        6  1041217  1041217  vfs_msync [6]
%   3.1      0.015     0.001        0  100.00%           mcount [7]
%   1.5      0.015     0.000        0  100.00%           mexitcount [8]
%   0.6      0.015     0.000        0  100.00%           cputime [22]
%   0.6      0.016     0.000       34     2660     2660  generic_bcopy [24]
%   0.5      0.016     0.000        0  100.00%           user [26]

vfs_msync() is a problem too.  It uses an almost identical loop for the
case where the vnode is not dirty (but has a different condition for
being dirty).  ffs_sync() is called 5 times because there are 5 ffs file
systems mounted r/w.  There is another ffs file system mounted r/o, and
that combined with a missing r/o optimization might give the extra call
to vfs_msync().

With 64000 vnodes, the calls take 1-2 ms each.  That is already quite a
lot, and there are many calls.  Each call only looks at vnodes under the
mount point, so the number of mounted file systems doesn't affect the
total time much.  ffs_sync() is taking 125 ns per vnode.  That is more
than I would have expected.

Bruce