Date:      Tue, 14 Jan 97 23:09:29 +0000
Subject:   Re: IPFW + Samba -> performance problem
Message-ID:  <"45f6-970114231001-B849*/G=Andrew/S=Gordon/O=NET-TEL Computer Systems Ltd/PRMD=NET-TEL/ADMD=Gold 400/C=GB/"@MHS>
In-Reply-To: <>

> This same server dials out with PPP. One day I got a fit of paranoia, and
> decided to install ipfw to throw away packets coming from the net.  The
> firewalling worked, performance for reads from Samba is the same as ever,
> but performance for writes dropped from well above 500KB/s to approx 20KB/s
>  (25-fold).

BTW, if you're using /sbin/ppp and your firewalling requirements are
simple, you may find that the ppp daemon's built-in packet filters
are adequate for your purposes (set ifilter xxxx etc.).  This would
avoid needing IPFW in the kernel.

> Has anybody got a clue?  Because, in this case, I haven't.  (A hypothesis
> is that something might happen to the TCP_NODELAY option when firewalling
> is enabled, but this sounds kind of unlikely.)

Maybe not so unlikely, though if that is the cause, TCP_NODELAY would be
needed at the client end (if my understanding is right).

I haven't hit this exact problem, but I did spend a long time looking
at tcpdump output some while ago to explain variable _read_ performance
we were seeing - all the old client machines (mostly 486s) had been working
fine, but a new P120 client was much slower than the other machines
at reading from Samba.  It turned out that TCP_NODELAY was the solution
(and at the time the FreeBSD port of Samba was missing a #include
so that the -O TCP_NODELAY option didn't work!).  Perhaps an explanation
of what I found will help diagnose your problem.

The SMB protocol is request/response: over a single TCP connection, the
client sends a request and waits for a response to come back, with the
next request not being issued until the previous response has been
completely received [I don't think this is a protocol restriction,
but in practice a single-user client doesn't know what to do next
until the previous block has come in].

In the case of a read request, the request is small, and the response
can be of variable size; but when loading .EXE files (the main benchmark
in real life) the reads seem to come in about 5K blocks.
If Samba generated the result in a single write()/writev() call 
on the socket there would be no problem, but in fact it does a
number of small write() calls [presumably to handle the case of
really big reads??].
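
As a rough illustration of the "single write()/writev()" alternative,
here is a minimal Python sketch of a gather-write: the pieces of one
response are handed to the socket together instead of one call per
piece.  The helper name send_coalesced and the chunk sizes are
hypothetical; this is not Samba's code, only a sketch of the idea.

```python
import socket

def send_coalesced(sock, chunks):
    """Send every chunk using gather-writes (the writev() idea).

    Hypothetical helper: all pending buffers are offered to the kernel in
    one sendmsg() call, so TCP sees the whole response at once rather than
    a series of small writes.
    """
    pending = [memoryview(c) for c in chunks if len(c)]
    while pending:
        sent = sock.sendmsg(pending)
        # Drop the buffers that went out completely and trim a partly
        # sent one, then loop in case the kernel took only some of the data.
        while sent and pending:
            if sent >= len(pending[0]):
                sent -= len(pending[0])
                pending.pop(0)
            else:
                pending[0] = pending[0][sent:]
                sent = 0
```

With a response built from, say, three buffers, send_coalesced(sock,
[part1, part2, part3]) would normally result in one system call rather
than three.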

The result is going to be packetized by TCP for transmission,
and so you have the Windows read size, the block size used by
Samba for its write() calls, the TCP MTU size, and the socket write
buffer size all interacting to control what happens - and all are
arbitrary numbers which don't fit in convenient multiples. In particular,
the write() size is typically one-and-a-bit times the TCP MTU.
Also, the reads are typically not aligned to filesystem blocks.

Now, suppose that the first few write() calls were made very quickly,
but there is a small delay (perhaps reading the disc) before the
last write().  It is extremely unlikely that the sum of all the write()
calls is an exact multiple of the MTU size, so the data will get
transmitted in a few full-size packets and a small one.
The last write() now happens, and since the read transactions for
loading .EXE files seem to be a mixture of sizes, the overall read
is not a multiple of Samba's write() size - so the last write()
will be shorter than the other ones.  If you are unlucky, the
last write() will be less than the TCP MTU size.
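
To make the packet arithmetic concrete, here is a small Python sketch.
All the numbers are assumptions chosen for illustration (a 1460-byte
segment size and a 5000-byte read written as 2048 + 2048 + 904 bytes);
they are not measurements from the traces described here.

```python
def packetize(nbytes, mss):
    """Split a burst of queued bytes into full-size packets plus a remainder."""
    full, rest = divmod(nbytes, mss)
    return [mss] * full + ([rest] if rest else [])

MSS = 1460                   # assumed maximum segment size
writes = [2048, 2048, 904]   # assumed write() sizes for one 5000-byte read

# The early writes arrive quickly and are packetized as one burst.
first_burst = packetize(sum(writes[:-1]), MSS)   # [1460, 1460, 1176]

# The last write arrives after a delay and is shorter than one segment.
last_packet = packetize(writes[-1], MSS)         # [904]
```

The burst ends in a short packet (1176 bytes) and the delayed final
write is itself short (904 bytes), which is the unlucky combination
described above.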

At this point, the TCP Nagle algorithm comes into play.  This says
that the transmitter should not transmit another 'short' (i.e. < MTU size)
packet when there is already a short packet unacknowledged.
However, if the outstanding packet(s) are less than the window
size, the receiving end implements delayed acknowledgements and will
wait 200ms in case there is data going the other way that can
carry the acknowledgement (or in case more data arrives).  As already
noted, neither of those things is going to happen in this case,
so nothing happens until the delayed ack timer goes off, the ack
is transmitted and the last piece of the transaction can be sent.
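
The interaction can be sketched in a few lines of Python.  This uses
the simplified textbook form of the Nagle test (a short segment is
held back while any data is unacknowledged), which is slightly broader
than the "short packet outstanding" wording above; the function names
and the example sizes are hypothetical.

```python
def nagle_allows_send(segment_len, mss, unacked_bytes):
    """Simplified Nagle test: full-size segments always go out; a short
    segment goes out only when nothing is awaiting acknowledgement."""
    return segment_len >= mss or unacked_bytes == 0

def final_write_delay_ms(last_write_len, mss, unacked_bytes, delayed_ack_ms=200):
    """Stall (in ms) before the last write() can leave the sender, assuming
    the only ack that will arrive is the receiver's delayed ack."""
    if nagle_allows_send(last_write_len, mss, unacked_bytes):
        return 0
    return delayed_ack_ms

# A short last write (904 bytes) behind an unacknowledged short packet
# (1176 bytes) has to wait for the delayed-ack timer.
stall = final_write_delay_ms(904, 1460, 1176)
```

If the earlier data has already been acknowledged (unacked_bytes == 0),
or the last write happens to fill a whole segment, the same function
returns 0 and there is no stall.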

So, if the numbers happen to stack up against you, there is a 200ms
delay per SMB read transaction - if the transactions are 5Kbyte,
this means only 25Kbyte/sec.  The whole thing is _very_ sensitive
to a large number of variables - if the network is busy, the
NIC is slow, or (window size permitting) the client is slow to
ack the first window full of data, then the server's buffers
never drain and the problem doesn't happen.
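
The 25Kbyte/sec figure is just the transaction size divided by the
stall.  A minimal Python check, assuming the stall dominates and the
actual transfer time is negligible (the function name is hypothetical):

```python
def stalled_throughput(transaction_bytes, stall_seconds, transfer_seconds=0.0):
    """Bytes per second when every transaction pays a fixed stall."""
    return transaction_bytes / (stall_seconds + transfer_seconds)

# One 5 Kbyte SMB read per 200 ms delayed-ack stall.
rate = stalled_throughput(5 * 1024, 0.200)
rate_kbytes = rate / 1024   # about 25 Kbyte/sec
```

Passing a non-zero transfer_seconds shows how the ceiling only gets
lower once real transmission time is included.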

Of course, setting TCP_NODELAY disables the Nagle algorithm
and so the problem doesn't happen [IMHO, this should really be
hard-wired in Samba, since the Nagle algorithm is designed to
optimise interactive traffic with character echo, and the case
where you "win" never happens in SMB traffic].
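
For reference, this is what setting the option looks like at the
socket level, sketched in Python rather than in Samba's own C.  The
helper name disable_nagle is hypothetical; setsockopt() with
IPPROTO_TCP and TCP_NODELAY is the standard sockets call.

```python
import socket

def disable_nagle(sock):
    """Turn off the Nagle algorithm on a TCP socket and report whether
    the option is now set."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

The option applies per socket, so each end has to set it on its own
connection; setting it on the server does nothing for data the client
sends.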

Everything I have been describing here applies to read transactions,
but of course the same considerations would apply at the client
end when doing write transactions.  Maybe Microsoft forgot the
TCP_NODELAY?  Or some similar malfunction occurs.

I would suggest watching the traffic between client and server
with tcpdump: you would hope for the gaps between packets to be
small and fairly constant - but the pattern I was observing
gave bursts of rapid transmission separated by pauses of over
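
A small Python sketch of that kind of check: given packet timestamps
already extracted from a capture (in seconds), list the gaps longer
than some threshold.  The function name, the sample timestamps and the
0.1 second threshold are all assumptions for illustration, not values
from the traces discussed here.

```python
def find_pauses(timestamps, threshold=0.1):
    """Return (start_time, gap) pairs for consecutive packets whose
    spacing exceeds the threshold (all values in seconds)."""
    pauses = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        gap = later - earlier
        if gap > threshold:
            pauses.append((earlier, gap))
    return pauses

# Hypothetical timestamps: two rapid bursts separated by one long pause.
sample = [0.000, 0.001, 0.002, 0.003, 0.210, 0.211, 0.212]
pauses = find_pauses(sample)
```

A healthy transfer should produce an empty list; a regular pattern of
long gaps after each burst points at the kind of stall described above.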

Good luck!

Andrew Gordon.
