From owner-freebsd-hackers  Thu Dec  7 21:18:14 2000
From owner-freebsd-hackers@FreeBSD.ORG  Thu Dec  7 21:18:06 2000
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mail.kyx.net (unknown [216.232.16.88])
	by hub.freebsd.org (Postfix) with ESMTP id 2F75637B400
	for <freebsd-hackers@freebsd.org>; Thu,  7 Dec 2000 21:18:06 -0800 (PST)
Received: from smp.kyx.net (unknown [10.22.22.45])
	by mail.kyx.net (Postfix) with SMTP
	id 8BF6E1DC03; Thu,  7 Dec 2000 21:19:11 -0800 (PST)
From: Dragos Ruiu <dr@kyx.net>
Organization: kyx.net
To: tcpdump-workers@tcpdump.org, ethereal-dev@ethereal.com,
	snort-devel@lists.sourceforge.net, freebsd-hackers@freebsd.org,
	tech@openbsd.org
Subject: Fwd: kyxtech: freebsd outsniffed by wintendo !!?!?
Date: Thu, 7 Dec 2000 21:06:04 -0800
X-Mailer: KYX-CP/M [version core00-mail-92]
Content-Type: text/plain
MIME-Version: 1.0
Message-Id: <0012072118150Q.09615@smp.kyx.net>
Content-Transfer-Encoding: 8bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


(Hurm.... Wintendo outperforming unix???!??  Something's
 improper about this, and it ought to be fixed...  :-) 
 Comments?  Other OS numbers: more recent 
 FreeBSD versions? Solaris? Tru64? Optimization
 patches? Can those OO MSDN lobotomies actually
 be good things? Hurm... The Italian gauntlet has
 been thrown down....   --dr :-)

url: http://netgroup-serv.polito.it/winpcap/docs/performance.htm

Performance and tests

1. Packet Capture Driver Performance

 The main goal of a packet capture driver is performance. This means low use
of system resources (memory and processor) but also a low probability of losing
packets.

 The following main parameters influence the performance of the capture
process: the efficiency of the filter, the size of the packet buffer, the
number of bytes copied and the number of system calls that need to be executed
by the application.

1. The efficiency of the packet filter is a very important parameter, because
the filter must be applied to every incoming packet (i.e. thousands of times
per second). The packet capture driver uses the fast and highly optimized BPF
filter (for more details about the performance of the BPF filter, see [McCanne
and Jacobson 1993]), whose virtual-processor architecture is well suited to
modern computer architectures.

2. More optimized packet filters have been developed since the original BPF.
The most interesting for this kind of application are MPF [13] and BPF+ [12].
The packet capture driver does not at the moment offer the advanced features of
these two filters. It could be very useful to add to the driver the ability to
handle similar filters efficiently, in a way similar to MPF.

3. The size of the kernel buffer influences the number of packets lost during a
capture; a bigger buffer means a lower loss probability. Since the right buffer
size depends on various factors such as network speed or machine
characteristics, the packet capture driver offers a dynamic buffer that can be
set to any size whenever the user wants. In this way it is possible to set very
big buffers on machines with a huge amount of RAM. Notice however that the
buffer is freed when the driver's instance is closed, so the memory is used by
the driver only during the capture process (i.e. when really needed).

4. Performance is strongly influenced by the number of bytes that need to be
copied by the system. Copying can absorb a lot of processor time and buffer
memory. To overcome the problem, the packet capture driver applies the filter
to an incoming packet as soon as it arrives in the system: the packet is
filtered while it is still in the NIC driver's memory, so no copy is needed to
filter it. The filter tells how many bytes of the packet are needed by the
user-level application (for example WinDump needs only the first 68 bytes of
each packet). The packet capture driver copies only this number of bytes
(instead of the whole packet) to the circular buffer. This is also important
because it reduces the space occupied by the packet in the circular buffer,
which is therefore used more efficiently. The selected packet is then copied to
the user-level application during a read system call. To summarize, there are
two copies of the truncated packet and none of the entire packet, which is the
same number of copies made by the UNIX version.

5. Each read system call implies a context switch from user-mode (ring 3) to
kernel-mode (ring 0) plus another to return to user-mode. This process is
notoriously slow and can affect capture performance. Since a user-level
application might want to look at every packet on the network and the time
between packets can be only a few microseconds, it is not possible to do a read
system call for each packet. The packet capture driver collects the data from
several packets and copies it to the application's buffer in a single read
call. The number of packets copied is not fixed and depends on the size of the
application's buffer that will receive the packets: the driver detects the size
of this buffer and copies packets to it until it is full. Therefore, it is
possible to decrease the number of system calls by increasing the size of the
application's read buffer.
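
The effect of the read buffer size on the system call rate can be estimated
with a few lines of arithmetic. This is only a sketch: the 68-byte snapshot and
the rate of about 67000 packets per second are the figures quoted in the tests
below, while the 20-byte per-packet header and the function name are
assumptions made for the example, not the driver's real record layout.

```python
import math

def reads_per_second(packets_per_second, snaplen, user_buffer_bytes,
                     header_bytes=20):
    """Estimate read() calls per second when each call fills the user buffer.

    header_bytes is a hypothetical per-packet bookkeeping overhead.
    """
    record = snaplen + header_bytes
    # At least one packet is returned per call, however small the buffer.
    packets_per_read = max(1, user_buffer_bytes // record)
    return math.ceil(packets_per_second / packets_per_read)

# One read per packet would mean 67000 calls per second; batching cuts
# that down as the application's buffer grows.
for size in (4096, 65536, 262144):
    print(f"{size:>7} byte buffer: "
          f"{reads_per_second(67000, 68, size)} reads per second")
```

Under these assumptions the call rate falls roughly in proportion to the buffer
size, which is the behaviour described in point 5.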


2. Tests 

 This section aims to give some indication of the performance of the capture
process on various operating systems. Results obtained under the various
Windows platforms have been compared with the ones provided by
BPF/libpcap/TCPdump in FreeBSD 3.3 in order to assess the quality of our
implementation.

2.1 Testbed

 The testbed (shown in the next figure) involves two PCs directly connected by
means of a Fast Ethernet link. This ensures the isolation of the testbed from
external sources (our LAN), allowing accurate tests.

 A Windows NT workstation generates traffic using the 'TG' tool (available in
the developer's pack), which is based on the packet capture device driver. This
program is able to send data to the network using NDIS primitives almost
directly, avoiding the overhead of the upper protocol layers and ensuring the
highest transfer rate compared to other traffic generator tools.

 Packet sizes have been selected so as to generate the maximum number of
packets per second, which is usually the worst operating situation for a
network analyzer. The packet size that maximized the number of packets sent was
101 bytes, as shown in the next figure.


 The generated traffic is usually able to fill all the available bandwidth and
there is no other traffic on that link. Tests are repeated several times in
order to get accurate results, and the average value is reported.

 The operating systems under test are installed in different disk partitions on
the same workstation in order to avoid differences due to the hardware. Traffic
is sent to a non-existent host so that there is no interaction between the
workstations. The second PC sets the interface in promiscuous mode and captures
the traffic using WinDump / TCPdump in various configurations. Depending on the
test type, captured packets are either saved to disk, printed on screen or
dropped.

 The top program in FreeBSD, the task manager in Windows NT4/2000 and cpumeter
in Windows 98 are the programs used to measure the CPU load. The first two
tools are shipped with the operating system, while the third one is available
on the Internet. The WinDump tested was version 2.02; TCPdump was the one
included in the FreeBSD 3.3 distribution (TCPdump version 3.4, libpcap version
0.4).

 Although our tests try to isolate the impact of each subsystem (BPF and
filtering, BPF and copying overhead), the results cannot compare exactly the
performance of each component. This is due to the different architecture of the
various versions, and to the impossibility of isolating each component from its
interactions with the others and with the operating system. In our opinion, the
most representative test is test number 3, which measures performance "as a
whole", including the packet driver, libpcap, WinDump as well as the operating
system overhead (kernel processing, data transfer to disk, etc). The reason is
that the "whole system" performance is what the end user is most interested in.


2.2 Results

Test 1: Filter performance

 This test aims at measuring the impact of the BPF filter machine on the
capture process. Packets are received by the network tap and checked by the BPF
filter. The filter receives and processes all the packets sent. WinDump/TCPdump
is launched with the following command line:

 windump 'filter'

 where 'filter' is a packet filter in TCPdump syntax. This test was executed
with two different filters:


 'udp': accepts only UDP packets. It consists of 5 instructions.

 'src host 1.1.1.1 and dst host 2.2.2.2': accepts only packets coming from
1.1.1.1 and going to 2.2.2.2. This filter is a bit more complex, and consists
of 13 instructions.

 Since no packet satisfying these filters passes on the network (because all
the packets are generated by the TG tool), the filters reject all the packets.
In this way, there is no copy and no interaction with the application. Only the
filter function uses system resources.
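
To give a feel for what "instructions" means here, the following is a minimal
sketch of a BPF-style filter machine. It is a toy interpreter with only four
opcodes, and the 'udp' program is hand-written for illustration: it is not
claimed to be the exact instruction sequence that WinDump/TCPdump compiled in
these tests, and its jump targets are absolute indexes (the way `tcpdump -d`
prints them) rather than the relative offsets a real BPF program stores. The
names run_filter, UDP_FILTER and make_frame are hypothetical.

```python
import struct

def run_filter(program, packet):
    """Run a tiny BPF-like program; return bytes to keep (0 means reject)."""
    acc = 0   # accumulator register
    pc = 0    # index of the next instruction
    while True:
        op, arg, jt, jf = program[pc]
        if op == "ldh":            # load 16-bit big-endian value at offset
            if arg + 2 > len(packet):
                return 0           # out-of-bounds load rejects the packet
            acc = struct.unpack_from("!H", packet, arg)[0]
            pc += 1
        elif op == "ldb":          # load one byte at offset
            if arg + 1 > len(packet):
                return 0
            acc = packet[arg]
            pc += 1
        elif op == "jeq":          # conditional jump on accumulator == arg
            pc = jt if acc == arg else jf
        elif op == "ret":          # verdict: number of bytes to accept
            return arg
        else:
            raise ValueError(f"unknown opcode {op!r}")

# Illustrative 'udp' filter for IPv4 over Ethernet, 68-byte snapshot.
UDP_FILTER = [
    ("ldh", 12, None, None),   # EtherType field
    ("jeq", 0x0800, 2, 5),     # IPv4? else reject
    ("ldb", 23, None, None),   # IP protocol field (14 + 9)
    ("jeq", 17, 4, 5),         # UDP? else reject
    ("ret", 68, None, None),   # accept the first 68 bytes
    ("ret", 0, None, None),    # reject
]

def make_frame(ethertype, ip_protocol):
    """Build a dummy Ethernet + IPv4 frame with the given type fields."""
    ethernet = bytes(12) + struct.pack("!H", ethertype)
    ip = bytearray(20)
    ip[0] = 0x45               # version 4, header length 5 words
    ip[9] = ip_protocol
    return ethernet + bytes(ip) + bytes(30)

print(run_filter(UDP_FILTER, make_frame(0x0800, 17)))  # UDP frame
print(run_filter(UDP_FILTER, make_frame(0x0800, 6)))   # TCP frame
```

A rejected packet costs only a handful of loads and compares and no copy at
all, which is the situation this test measures.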

 The filtering function does not use memory, so what is interesting to see
here is the processor usage, shown in next figure.


 The figure shows that differences between the OSs are very limited. This is
what we expected, and it confirms that our choice to implement the BPF at
protocol level was good enough to compete with the original BPF. CPU load
varies among platforms, remaining however at acceptable levels. Windows
platforms have noticeably better results. This is probably due to the fact that
NDIS usually invokes the packet_tap function before the DMA transfer of the
packet has finished, giving a bit of extra time to BPF. bpf_tap, instead, is
called after the end of the DMA transfer.

 Notice finally that the values are very similar for the two filters. This
confirms that the BPF filtering machine is well optimized, and that its
efficiency increases with longer filters. 

Test 2: Driver's performance

 For this second test, a "fake" capture application based on libpcap was
created and compiled for Windows and FreeBSD. This application receives ALL the
packets from the driver (setting an accept-all filter), but discards them
without any further processing, because the libpcap 'callback' function is
empty. All the packets are processed by the underlying network levels, then by
the packet driver, but there is NO packet processing at user level. The portion
of each packet to be accepted can be chosen by the user. This test aims at an
overall evaluation of the efficiency of the packet driver, including the copies
from the interface driver to the kernel buffer and then to the user buffer. No
selective filter was used in these tests, so the filtering function does not
influence the results. The next figure shows CPU usage for various combinations
of packet length and portion copied.

 FreeBSD has better performance than Windows, mainly because the tap function,
which in FreeBSD is very simple and fast, is more complex and slower in
Windows. For longer packets (i.e. for lower packet frequencies) the CPU use
under FreeBSD decreases, but this does not happen in Windows. This result stems
from the "delayed write" ability of the UNIX BPF, as explained in Section 2.1.
For high packet frequencies, the CPU load of the different systems is quite
similar. However the system call frequency (and therefore the CPU load) under
UNIX decreases considerably when the size of incoming packets increases (i.e.
the frequency is lower), while in Windows it remains stable.

 This behavior is not a problem for the Windows implementations because they
use more CPU time only when it is available. Figure 7, in fact, shows that all
systems lose very few packets.

  
Test 3: WinDump capture performance

 In our opinion, this is the most important test, because it uses WinDump to
measure the entire capture process. No filter is defined. Packets are captured
and stored on disk using the "-w" WinDump option.

 The next figure shows the results when, for each packet, a "snapshot" of 68
bytes is saved to file, i.e. when "windump -w test.acp" is executed.

 Results are very interesting when the network is overloaded by a high number
of packets per second (i.e. packet size 101 bytes, which means about 67000
packets per second). All systems suffer noticeable losses: a certain number of
packets are lost for lack of CPU time (a new packet arrives while the tap is
still processing a previous one), while others are dropped because the kernel
buffer has no more space to hold them. It can be noted that the Windows
versions work noticeably better than the FreeBSD one. This is due mainly to the
better buffering method of the Windows versions. Windows NT 4 is able to
'detect' fewer packets than FreeBSD (i.e. the number of packets received by the
filter is lower), but saves 20% more packets to disk. Windows 98 behaves very
well compared to FreeBSD, but the real surprise is Windows 2000, which is able
to save to disk 73% of the packets on an NTFS partition, and 89% on a FAT
partition. Since the packet driver for Windows 2000 is very similar to the one
for Windows NT 4, the differences are due mainly to the improvements to NDIS
and to the file systems brought by Windows 2000. The weight of the file system
is in fact a very important parameter in a test like this: notice that the same
machine can capture a larger number of packets under Windows 2000 if used with
a faster file system like FAT32. This is one of the reasons why Windows 98 is
faster than Windows NT 4.
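
For readers who want to relate these packet rates to link and disk bandwidth,
here is a rough back-of-the-envelope sketch. The 101-byte size, the 67000
packets per second and the 68-byte snapshot come from the text above. The rest
is assumption: the quoted sizes are taken to exclude the 4-byte frame check
sequence, each frame is charged the usual 8 bytes of preamble and 12 bytes of
inter-frame gap, and each saved packet is charged the 16-byte per-record header
of the libpcap savefile format. Function names are made up for the example.

```python
LINK_BPS = 100_000_000        # Fast Ethernet, bits per second
WIRE_OVERHEAD = 8 + 4 + 12    # preamble + FCS + inter-frame gap (assumed)
PCAP_RECORD_HEADER = 16       # per-packet header in a libpcap savefile

def max_packets_per_second(frame_bytes):
    """Theoretical ceiling on frame rate for a given frame size."""
    return LINK_BPS / ((frame_bytes + WIRE_OVERHEAD) * 8)

def frame_data_mbps(packets_per_second, frame_bytes):
    """Bandwidth taken by the frames themselves, in Mbit/s."""
    return packets_per_second * frame_bytes * 8 / 1e6

def savefile_bytes_per_second(packets_per_second, snaplen):
    """Bytes per second written to the capture file."""
    return packets_per_second * (snaplen + PCAP_RECORD_HEADER)

print(f"ceiling at 101 bytes:  {max_packets_per_second(101):.0f} pkt/s")
print(f"ceiling at 1514 bytes: {max_packets_per_second(1514):.0f} pkt/s")
print(f"67000 pkt/s of 101-byte frames: "
      f"{frame_data_mbps(67000, 101):.1f} Mbit/s")
print(f"disk write rate at 68-byte snapshot: "
      f"{savefile_bytes_per_second(67000, 68) / 1e6:.2f} MB/s")
```

Under these assumptions the sustained write rate in the 68-byte test is only a
few megabytes per second, so the per-packet cost of the tap, the buffering and
the file system matters more than raw disk throughput.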

 When the size of the packets grows (i.e. packet size 500 and 1514 bytes), the
situation becomes less critical because the packet frequency decreases while
the portion saved is still 68 bytes. The values obtained tend to become more
similar, and even the slower systems have good results.

 The next figure shows the results when whole packets are saved to disk, i.e.
when "windump -s 1514 -w test.acp" is executed.

(graphic omitted)

 This is quite a hard test, and the kernel buffer is very important. In fact
every packet must be stored in its entirety, and tends to fill the kernel
buffer, especially with big packet sizes. FreeBSD is the system with the most
serious problems because of its smaller buffer. Windows 2000 remains the system
with the best results, above all with long packets, where it is the only one
able to capture without losing anything.
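
The difference between the 68-byte and the whole-packet tests can be made
concrete with a toy model of a bounded kernel buffer that stores only the bytes
the filter asked for. The class, its method names and the helper function are
illustrative assumptions, not the driver's actual data structures, and
per-packet headers are ignored.

```python
from collections import deque

class CaptureBuffer:
    """Toy bounded buffer holding truncated copies of accepted packets."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.packets = deque()
        self.dropped = 0

    def tap(self, packet, accept_len):
        """accept_len is the filter verdict: 0 rejects, N keeps N bytes."""
        if accept_len == 0:
            return                      # rejected in place, nothing copied
        snapshot = packet[:accept_len]  # first copy: only the wanted bytes
        if self.used + len(snapshot) > self.capacity:
            self.dropped += 1           # no room left: the packet is lost
            return
        self.packets.append(snapshot)
        self.used += len(snapshot)

    def read(self):
        """Second copy: hand all buffered snapshots to the application."""
        out = list(self.packets)
        self.packets.clear()
        self.used = 0
        return out

def packets_before_first_drop(capacity_bytes, frame_len, snaplen):
    """Count how many frames fit if the application never reads."""
    buf = CaptureBuffer(capacity_bytes)
    frame = bytes(frame_len)
    while buf.dropped == 0:
        buf.tap(frame, snaplen)
    return len(buf.packets)

ONE_MB = 1024 * 1024
print(packets_before_first_drop(ONE_MB, 1514, 68))    # 68-byte snapshots
print(packets_before_first_drop(ONE_MB, 1514, 1514))  # whole packets
```

With whole 1514-byte packets the same buffer holds more than twenty times fewer
packets before the first drop, which is why the buffer size dominates this
test.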

Discussion

 The first and second tests show that the Windows implementation of the BPF
architecture has approximately the same impact on the system as the FreeBSD one
and that performance is quite similar; the differences are in the CPU load,
where FreeBSD is the clear winner. This is due to the smaller amount of code
needed to implement the architecture in FreeBSD (because of the possibility of
modifying the OS sources) compared to the Windows one, and to the "delayed
write" capability of TCPdump. The results obtained are very important because
they show that our main goal, i.e. to create a free capture architecture with
performance comparable to BPF for UNIX, has been reached. Moreover, the results
show that the choice to implement the packet driver at protocol level is good
enough to obtain performance comparable to that of BPF in UNIX.

 However the interesting test from the end-user standpoint is the third one,
because it shows the behavior of the BPF capture driver in conjunction with the
most important tool based on it: WinDump. WinDump for Windows 2000 is the clear
winner and TCPdump for FreeBSD is the clear loser. While the BPF architecture
performs on Windows 2000 as on other systems, Windows 2000 shows the best
performance because of its optimized storage management. Packets are quickly
saved to file, so buffers are freed and incoming packets can be received with a
small number of drops. FreeBSD is the clear loser because of its different
buffering architecture, which is not able to sustain heavy data rates.

 Notice that WinDump was launched with the standard kernel buffer (1MB); in the
presence of heavy traffic the size of this buffer can be increased with a
simple command line switch, further improving the overall performance of the
system. Our conclusions are that the BPF architecture for Windows performs
well, that the dynamic buffer effectively improves overall performance and
that, among all the Windows flavors, Windows 2000 is the best platform for a
high performance network analyzer.
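
As a rough illustration of what a larger kernel buffer buys, the sketch below
estimates how long a buffer can absorb traffic while the application is stalled
(for instance waiting on the disk). The 1 MB default, the 68-byte snapshot and
the rate of about 67000 packets per second are taken from the text above; the
20-byte per-packet header and the 16 MB alternative are assumptions chosen only
for the example.

```python
def seconds_until_full(buffer_bytes, packets_per_second, snaplen,
                       header_bytes=20):
    """Time an empty buffer can absorb packets if nothing drains it.

    header_bytes is a hypothetical per-packet bookkeeping overhead.
    """
    bytes_per_second = packets_per_second * (snaplen + header_bytes)
    return buffer_bytes / bytes_per_second

ONE_MB = 1024 * 1024
default_slack = seconds_until_full(ONE_MB, 67000, 68)
larger_slack = seconds_until_full(16 * ONE_MB, 67000, 68)
print(f"1 MB buffer:  {default_slack * 1000:.0f} ms before drops start")
print(f"16 MB buffer: {larger_slack * 1000:.0f} ms before drops start")
```

Under these assumptions the default buffer covers well under a second of
stalled reads at the highest packet rate, and the slack grows linearly with the
buffer size.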


-- 
Dragos Ruiu <dr@dursec.com>   dursec.com ltd. / kyx.net - we're from the future 
gpg/pgp key on file at wwwkeys.pgp.net

