From owner-freebsd-hackers Thu Dec 7 21:18:14 2000
From owner-freebsd-hackers@FreeBSD.ORG Thu Dec 7 21:18:06 2000
Return-Path:
Delivered-To: freebsd-hackers@freebsd.org
Received: from mail.kyx.net (unknown [216.232.16.88]) by hub.freebsd.org (Postfix) with ESMTP id 2F75637B400 for ; Thu, 7 Dec 2000 21:18:06 -0800 (PST)
Received: from smp.kyx.net (unknown [10.22.22.45]) by mail.kyx.net (Postfix) with SMTP id 8BF6E1DC03; Thu, 7 Dec 2000 21:19:11 -0800 (PST)
From: Dragos Ruiu
Organization: kyx.net
To: tcpdump-workers@tcpdump.org, ethereal-dev@ethereal.com, snort-devel@lists.sourceforge.net, freebsd-hackers@freebsd.org, tech@openbsd.org
Subject: Fwd: kyxtech: freebsd outsniffed by wintendo !!?!?
Date: Thu, 7 Dec 2000 21:06:04 -0800
X-Mailer: KYX-CP/M [version core00-mail-92]
Content-Type: text/plain
MIME-Version: 1.0
Message-Id: <0012072118150Q.09615@smp.kyx.net>
Content-Transfer-Encoding: 8bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

(Hurm.... Wintendo outperforming unix???!?? Something's improper about this, and it ought to be fixed... :-)

Comments? Other OS numbers: more recent FreeBSD versions? Solaris? Tru64? Optimization patches? Can those OO MSDN lobotomies actually be good things? Hurm... The Italian gauntlet has been thrown down....

--dr :-)

url: http://netgroup-serv.polito.it/winpcap/docs/performance.htm

Performance and tests

1. Packet Capture Driver Performance

The main goal of a packet capture driver is performance. This means low use of system resources (memory and processor) but also a low probability of losing packets. The following main parameters influence the performance of the capture process: the efficiency of the filter, the size of the packet buffer, the number of bytes copied and the number of system calls that need to be executed by the application.

1. The efficiency of the packet filter is a very important parameter, because the filter must be applied to every incoming packet (i.e.
thousands of times per second). The packet capture driver uses the fast and highly optimized BPF filter (for more details about the performance of the BPF filter, see [McCanne and Jacobson 1993]), whose virtual-processor architecture is well suited to modern computer architectures.

2. More optimized packet filters have been developed after the original BPF. The most interesting for this kind of application are MPF [13] and BPF+ [12]. The packet capture driver does not currently offer the advanced features of these two filters. It could be very useful to add to the driver the ability to handle similar filters efficiently, in a way similar to MPF.

3. The size of the kernel buffer determines how many packets are lost during a capture; a bigger buffer means a lower loss probability. Since the correct size of the buffer is very subjective and depends on factors like network speed or machine characteristics, the packet capture driver offers a dynamic buffer that can be set to any size whenever the user wants. In this way it is possible to set very big buffers on machines with a huge amount of RAM. Notice however that the buffer is freed when the driver's instance is closed, so the memory is used by the driver only during the capture process (i.e. when really needed).

4. Performance is strongly influenced by the number of bytes that need to be copied by the system. This task can absorb a lot of processor time and buffer memory. To overcome the problem, the packet capture driver applies the filter to an incoming packet as soon as it arrives in the system: the packet is filtered while it is still in the NIC driver's memory, without copying it. This means that no copy is needed to filter the packet. The filter tells how many bytes of the packet are needed by the user-level application (for example WinDump needs only the first 68 bytes of each packet).
The packet capture driver copies only this number of bytes (instead of the whole packet) to the circular buffer. This is also important because it reduces the space occupied by each packet in the circular buffer, which is therefore used more efficiently. The selected packet is then copied to the user-level application during a read system call. In summary, there are two copies of the truncated packet and none of the entire packet, which matches the number of copies made by the UNIX version.

5. Each read system call implies a context switch from user mode (ring 3) to kernel mode (ring 0), plus another to return to user mode. This process is notoriously slow and can affect capture performance. Since a user-level application might want to look at every packet on the network and the time between packets can be only a few microseconds, it is not possible to do a read system call for each packet. The packet capture driver collects the data from several packets and copies it to the application's buffer in a single read call. The number of packets copied is not fixed and depends on the size of the application's buffer that will receive the packets: the driver detects the size of this buffer and copies packets to it until it is full. Therefore, it is possible to decrease the number of system calls by increasing the size of the application's read buffer.

2. Tests

This section gives some indications about the performance of the capture process on various operating systems. Results obtained under the various Windows platforms have been compared with the ones provided by BPF/libpcap/TCPdump in FreeBSD 3.3 in order to assess the quality of our implementation.

2.1 Testbed

The testbed (shown in the next figure) involves two PCs directly connected by a Fast Ethernet link. This isolates the testbed from external sources (our LAN), allowing accurate tests.
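
For a sense of scale, the "few microseconds" between packets mentioned in point 5 can be checked for a Fast Ethernet link like the one in this testbed. The sketch below is not from the original page. It assumes the standard Ethernet per-frame overhead (8 bytes of preamble and start delimiter, 12 bytes of inter-frame gap) and frame sizes that already include the 4-byte FCS; the function names are made up for illustration.

```python
# Rough upper bound on frame rate for a 100 Mbit/s (Fast Ethernet) link.
# Assumptions (not from the original page): frame sizes include the FCS,
# and every frame costs 8 bytes of preamble/SFD plus a 12-byte inter-frame gap.

LINK_BITS_PER_SECOND = 100_000_000
PREAMBLE_BYTES = 8
INTERFRAME_GAP_BYTES = 12


def max_frames_per_second(frame_bytes: int) -> float:
    """Theoretical maximum number of frames per second for one frame size."""
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return LINK_BITS_PER_SECOND / wire_bits


def interarrival_microseconds(frame_bytes: int) -> float:
    """Time between the starts of two back-to-back frames, in microseconds."""
    return 1_000_000 / max_frames_per_second(frame_bytes)


if __name__ == "__main__":
    for size in (64, 512, 1518):
        print(size, round(max_frames_per_second(size)),
              round(interarrival_microseconds(size), 2))
```

Under these assumptions a minimum-size 64-byte frame gives about 148,810 frames per second, i.e. one frame every 6.72 microseconds. This is a theoretical bound for the link, not a measured figure from the tests.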
A Windows NT workstation generates traffic using the 'TG' tool (available in the developer's pack), which is based on the packet capture device driver. This program sends data to the network using NDIS primitives almost directly, avoiding the overhead of the upper protocols and achieving the highest transfer rate compared to other traffic generator tools. Packet sizes have been selected to generate the maximum number of packets per second, which is usually the worst operating situation for a network analyzer. The packet size that maximized the number of packets sent was 101 bytes, as shown in the next figure.

The generated traffic is usually able to fill all the available bandwidth and there is no other traffic on that link. Tests were repeated several times in order to get accurate results, and the average value was taken. The operating systems under test are installed in different disk partitions on the same workstation in order to avoid differences due to the hardware. Traffic is sent to a non-existent host in order not to have any interaction between the workstations.

The second PC sets the interface in promiscuous mode and captures the traffic using WinDump / TCPdump in various configurations. Depending on the test type, captured packets are either saved to disk, printed on screen or dropped.

The top program in FreeBSD, the task manager in Windows NT4/2000 and cpumeter in Windows 98 are the programs used to measure the CPU load. The first two tools are shipped with the operating system, while the third one is available on the Internet.

The WinDump tested was version 2.02; TCPdump was the one included in the FreeBSD 3.3 distribution (TCPdump version 3.4, libpcap version 0.4).

Even though our tests manage to isolate the impact of each subsystem (BPF and filtering, BPF and copying overhead), the results cannot compare the performance of each component exactly.
This is due to the different architecture of the various versions, and to the impossibility of isolating each component from its interactions with the others and with the operating system. In our opinion, the most representative test is test number 3, which measures performance "as a whole", including the packet driver, libpcap and WinDump as well as the operating system overhead (kernel processing, data transfer to disk, etc). The reason is that the "whole system" performance is what the end user is most interested in.

2.2 Results

Test 1: Filter performance

This test aims at measuring the impact of the BPF filter machine on the capture process. Packets are received by the network tap and checked by the BPF filter. The filter receives and processes all the packets sent. WinDump/TCPdump is launched with the following command line:

windump 'filter'

where 'filter' is a packet filter in the TCPdump syntax. This test was executed with two different filters:

'udp': accepts only UDP packets. It consists of 5 instructions.

'src host 1.1.1.1 and dst host 2.2.2.2': accepts only packets coming from 1.1.1.1 and going to 2.2.2.2. This filter is a bit more complex, and consists of 13 instructions.

Since no packet satisfying these filters passes on the network (because all the packets are generated by the TG tool), the filters reject all the packets. In this way, there is no copy and no interaction with the application. Only the filter function uses system resources. The filtering function does not use memory, so what is interesting to see here is the processor usage, shown in the next figure.

The figure shows that differences between the OSs are very limited. This is what we expected, and it confirms that our choice to implement BPF as a protocol was good enough to compete with the original BPF. CPU load varies among the platforms, remaining however at acceptable levels. Windows platforms have appreciably better results.
This is probably due to the fact that NDIS usually invokes the packet_tap function before the DMA transfer of the packet is finished, giving a bit of time to BPF. bpf_tap, instead, is called after the end of the DMA transfer. Notice finally that the values are very similar for the two filters. This confirms that the BPF filtering machine is well optimized, and that its efficiency increases with longer filters.

Test 2: Driver's performance

For this second test, a "fake" capture application based on libpcap was created and compiled for Windows and FreeBSD. This application receives ALL the packets from the driver (setting an accept-all filter), but discards them without any further processing, because the libpcap 'callback' function is empty. All the packets are processed by the underlying network levels, then by the packet driver, but there is NO packet processing at user level. The portion of each packet to be accepted can be decided by the user.

This test aims at a global evaluation of the efficiency of the packet driver, including the copies from the interface driver to the kernel buffer and then to the user buffer. There was no filter in these tests, so the filtering function does not influence the results.

The next figure shows CPU usage for various combinations of "packet length-portion copied". FreeBSD has better performance than Windows, mainly because the tap function, which is very simple and fast in FreeBSD, is more complex and slower in Windows. For longer packets (i.e. for lower frequencies) the CPU use under FreeBSD decreases, but this does not happen in Windows. This result stems from the "delayed write" ability of the UNIX BPF, as explained in Section 2.1.

For high packet frequencies, the CPU load of the different systems is quite similar. However, the system call frequency (and therefore the CPU load) under UNIX decreases considerably when the size of incoming packets increases (i.e. the frequency is lower), while in Windows it remains stable.
This behavior is not a problem for the Windows implementations because they use more CPU time only when it is available. Figure 7, in fact, shows that all systems lose very few packets.

Test 3: WinDump capture performance

In our opinion, this is the most important test, because it uses WinDump to measure the entire capture process. No filter is defined. Packets are captured and stored on disk using the "-w" WinDump option.

The next figure shows the results when, for each packet, a "snapshot" of 68 bytes is saved to file, i.e. when "windump -w test.acp" is executed.

Results are very interesting when the network is overloaded by a high number of packets per second (i.e. packet size 101 bytes, which means about 67000 packets per second). All systems suffer noticeable losses: a certain number of packets are lost for lack of CPU time (a new packet arrives while the tap is processing a previous one), while others are dropped because the kernel buffer has no more space to hold them. It can be noted that the Windows versions work noticeably better than the FreeBSD one. This is due mainly to the better buffering method of the Windows versions. Windows NT 4 is able to 'detect' fewer packets than FreeBSD (i.e. the number of packets received by the filter is lower), but saves to disk 20% more packets. Windows 98 behaves very well compared to FreeBSD, but the real surprise is Windows 2000, which is able to save to disk 73% of the packets on an NTFS partition, and 89% on a FAT partition. Since the packet driver for Windows 2000 is very similar to the one for Windows NT 4, the differences are due mainly to the improvements to NDIS and to the file systems in Windows 2000. The overhead of the file system is in fact a very important parameter in a test like this: notice that the same machine can capture a larger number of packets under Windows 2000 if used with a faster file system like FAT32.
This is one of the reasons why Windows 98 is faster than Windows NT 4.

When the size of the packets grows (i.e. packet size 500 and 1514 bytes), the situation becomes less critical because the frequency of the packets decreases, but the portion saved is always 68 bytes. The values obtained tend to become more similar, and even the slower systems have good results.

The next figure shows the results when whole packets are saved to disk, i.e. when "windump -s 1514 -w test.acp" is executed.

(graphic omitted)

This is quite a hard test, and the kernel buffer is very important. In fact every packet must be stored entirely, and tends to fill the kernel buffer, especially with big packet sizes. FreeBSD is the system with the most serious problems because of its smaller buffer. Windows 2000 remains the system with the best results, above all with long packets, where it is the only one able to capture without losing anything.

Discussion

The first and second tests show that the Windows implementation of the BPF architecture has approximately the same impact on the system as the FreeBSD one and that performance is quite similar; the differences are in the CPU load, where FreeBSD is the clear winner. This is due to the smaller amount of code needed to implement the architecture in FreeBSD (because of the possibility to modify the OS sources) compared to Windows, and to the "delayed write" capability of TCPdump.

The results obtained are very important because they show that our main goal, i.e. to create a free capture architecture with performance comparable to BPF for UNIX, has been reached. Moreover, the results show that the choice to implement the packet driver at protocol level is good enough to obtain performance comparable to that of BPF in UNIX.

However, the most interesting test from the end-user standpoint is the third one, because it shows the behavior of the BPF capture driver in conjunction with the most important tool based on it: WinDump.
WinDump for Windows 2000 is the clear winner and TCPdump for FreeBSD is the clear loser. While the BPF architecture performs on Windows 2000 as it does on other systems, Windows 2000 shows the best performance because of its optimized storage management. Packets are quickly saved to file, so buffers are freed and the incoming packets can be received with a small number of drops. FreeBSD is the clear loser because of its different buffering architecture, which is not able to sustain heavy data rates.

Notice that WinDump was launched with the standard kernel buffer (1MB); in the presence of heavy traffic the size of this buffer can be increased with a simple command line switch, further improving the overall performance of the system.

Our conclusions are that the BPF architecture for Windows performs well, that the dynamic buffer effectively improves overall performance and that, among all the Windows flavors, Windows 2000 is the best platform for a high-performance network analyzer.

--
Dragos Ruiu
dursec.com ltd. / kyx.net - we're from the future
gpg/pgp key on file at wwwkeys.pgp.net

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-hackers" in the body of the message