Date:      Tue, 21 May 2002 15:41:44 -0700 (PDT)
From:      Scott Hess <scott@avantgo.com>
To:        freebsd-net@freebsd.org
Subject:   High volume proxy server configuration.
Message-ID:  <Pine.LNX.4.44.0205211541210.8851-100000@river.avantgo.com>

Background: I'm working on an intelligent Apache-based proxy server for
backend servers running a custom Apache module.  The server does some
inspection of the incoming request to determine how to direct it, and
passes the response directly back to the client.  Thus, I'd like to be
able to set the TCP buffers fairly large, with the server merely acting as
a conduit to transfer data between the backend server and the client.  
Upstream data is relatively small (a handful of kilobytes), downstream can
be large (100k-2Meg).
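
For reference, by "set the TCP buffers fairly large" I just mean per-socket
SO_SNDBUF/SO_RCVBUF calls, something like the sketch below (not our actual
module code; the 256k figure is only an example).  As far as I know the
request is rejected if it exceeds kern.ipc.maxsockbuf, hence the sysctl.conf
entry further down:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <stdio.h>

    /* Enlarge the send/receive buffers on an already-connected socket. */
    int
    set_socket_buffers(int fd, int bufsize)
    {
        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize,
                       sizeof(bufsize)) < 0) {
            perror("setsockopt(SO_SNDBUF)");
            return -1;
        }
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize,
                       sizeof(bufsize)) < 0) {
            perror("setsockopt(SO_RCVBUF)");
            return -1;
        }
        return 0;
    }

    /* e.g. set_socket_buffers(backend_fd, 256 * 1024); */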

Setup: 2x SMP server running FreeBSD4.5.  Apache 1.3.x.  2Gig of memory.

When stress-testing, I am able to cause the kernel messages:

    m_clalloc failed, consider increase NMBCLUSTERS value
    fxp0: cluster allocation failed, packet dropped!

The system hangs for perhaps five minutes, and then comes back and is
able to continue operating.  pings work, but the console isn't responsive
(I mean no response at all until things clear a couple of minutes later).
I've spent some time trying to tweak things, but I haven't been able to
prevent the problem.  My /boot/loader.conf includes:

    kern.maxusers="512"
    kern.ipc.nmbclusters="65536"

The problem can happen at various points.  I've seen it happen with the
mbuf cluster count <1k.  Usually, the peak in netstat -m's current/peak/max
figures is nowhere near 65536.  This usually happens when I have on the
order of 2000 processes/connections running - the machine is 80% idle at
this point, though.

I wrote a program to specifically use up mbuf clusters (many servers write
lots of data, many clients sleep), and it didn't cause any problems until
hitting the maximum.  Even then, the machine wasn't locked up at the
console.  So I think the message is a symptom of something else.
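
In case it's useful, the stress program boiled down to the sketch below (a
simplified version, not the exact code I ran; NCONN and the port number are
arbitrary).  The parent plays the "servers" and keeps writing into loopback
TCP connections whose "client" children never read, so the queued data pins
mbuf clusters:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NCONN 200        /* connections; raise to pin more clusters */
    #define PORT  9999       /* arbitrary loopback port */

    int
    main(void)
    {
        int lsock, conns[NCONN], i;
        struct sockaddr_in sin;
        char buf[16384];

        memset(buf, 'x', sizeof(buf));
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(PORT);
        sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        lsock = socket(AF_INET, SOCK_STREAM, 0);
        if (lsock < 0 ||
            bind(lsock, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
            listen(lsock, 128) < 0) {
            perror("socket/bind/listen");
            exit(1);
        }

        for (i = 0; i < NCONN; i++) {
            /* "Client" children connect, then sleep without ever reading. */
            if (fork() == 0) {
                int s = socket(AF_INET, SOCK_STREAM, 0);
                connect(s, (struct sockaddr *)&sin, sizeof(sin));
                sleep(3600);
                _exit(0);
            }
            conns[i] = accept(lsock, NULL, NULL);
            fcntl(conns[i], F_SETFL, O_NONBLOCK);
        }

        /* The "server" parent writes until every buffer is full; the
         * queued data sits in mbuf clusters.  Watch netstat -m meanwhile. */
        for (;;) {
            int progress = 0;
            for (i = 0; i < NCONN; i++)
                if (write(conns[i], buf, sizeof(buf)) > 0)
                    progress = 1;
            if (!progress)
                sleep(1);
        }
        /* NOTREACHED */
        return 0;
    }

Watching netstat -m while that runs, the cluster count climbs steadily toward
the limit, but the console stays responsive the whole time.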

Here's my theory: when the space used by user processes and the kernel
fills all of memory, and a burst of packets arrives from the backend
servers, the kernel can't allocate pages and drops the packets, producing
the message above.  The senders retransmit, and things cascade from there.
Since this is a kernel VM issue, the console locks up as well.  [Well,
it's the best theory I have.]

I've tried upping vm.v_free_min, vm.v_free_target, and vm.v_free_reserved.  
It doesn't appear to have any impact.

I was also getting the message:

   pmap_collect: collecting pv entries -- suggest increasing PMAP_SHPGPERPROC

From what I can tell, this sounds like a direct result of running so many
processes forked from the same parent.  Each process is small (SIZE ~4M).
I increased PMAP_SHPGPERPROC to 400, and now I don't seem to get this message.
I've watched 'sysctl vm.zone', and the PV ENTRY line seems more
reasonable, now.
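
For the record, PMAP_SHPGPERPROC is a kernel config option rather than a
run-time sysctl (at least on 4.5, as far as I can tell), so the 400 meant a
kernel rebuild with roughly:

    options PMAP_SHPGPERPROC=400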

The last line of vmstat output when this happens (broadly similar to
previous lines):

 procs      memory      page                    disks     faults      cpu
 r b w     avm    fre  flt  re  pi  po  fr  sr da0 da1   in   sy  cs us sy id
 8 2 0 2141424  41184 8255  46   0   0 3982   0   0   0 3477 5416 1264 14 38 48

This is consistent with top:

last pid: 79636;  load averages:  3.51,  1.59,  0.83    up 0+22:23:16  16:37:14
2268 processes:9 running, 2259 sleeping
CPU states: 19.6% user,  0.0% nice, 19.6% system,  5.4% interrupt, 55.4% idle
Mem: 578M Active, 25M Inact, 361M Wired, 3528K Cache, 112M Buf, 37M Free
Swap: 2048M Total, 35M Used, 2012M Free, 1% Inuse

[Hmm, one note - I'm replicating this on a 1Gig machine, but we've also
seen it in an extreme case on the 2Gig machine which is in production.]

Hmm.  vmstat just came back, the first two lines:

 procs      memory      page                    disks     faults      cpu
 r b w     avm    fre  flt  re  pi  po  fr  sr da0 da1   in   sy  cs us sy id
2268 2 0 2352192  62236 7308  59   0  32 3306 161397  28   0 2454 5111 1153 12 40 49
 0 2 0  266364  46240 292845 1517   9 608 38036 6843317   1   0 334730 253302 17192  0 100  0

top shows increased space used in swap (42M now), so it looks like we've
got a bunch of swapping going on.  [Just to be clear - when the event
happens, things don't simply get a bad response time.  There's _no_
response until the problem clears and everything comes back.  Then it's
all shiny-happy again.]

/etc/sysctl.conf has:

    kern.ipc.somaxconn=4192
    net.inet.ip.portrange.last=40000
    kern.ipc.maxsockbuf=2097152

We are definitely not using the full maxsockbuf range!  Actually, we've
left things at the default (sendspace=32k, recvspace=64k).
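
In other words, net.inet.tcp.sendspace / net.inet.tcp.recvspace are still at
their defaults.  If we did decide to bump them, it would just be a couple
more /etc/sysctl.conf lines, e.g. (values purely illustrative):

    net.inet.tcp.sendspace=65536
    net.inet.tcp.recvspace=131072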

AFAICT, everything else is at default settings.

Thanks for any help,
scott

