3.  Results of our observations

      When 4.2BSD was first installed on several large timesharing systems the degradation in performance was significant. Informal measurements showed 4.2BSD providing 80% of the throughput of 4.1BSD (based on load averages observed under a normal timesharing load). Many of the initial problems found were caused by programs that were not part of 4.1BSD. Using the techniques described in the previous section and standard process profiling, several problems were identified. Later work concentrated on the operation of the kernel itself. In this section we discuss the problems uncovered; in the next section we describe the changes made to the system.

3.1.  User programs

     

3.1.1.  Mail system

      The mail system was the first culprit identified as a major contributor to the degradation in system performance. At Lucasfilm the mail system is heavily used on one machine, a VAX-11/780 with eight megabytes of memory.[note 3] Message traffic is usually between users on the same machine and ranges from person-to-person telephone messages to per-organization distribution lists. After conversion to 4.2BSD, it was immediately noticed that mail to distribution lists of 20 or more people caused the system load to jump by anywhere from 3 to 6 points. The number of processes spawned by the sendmail program and the messages sent from sendmail to the system logging process, syslog, generated significant load both from their execution and their interference with basic system operation. The number of context switches and disk transfers often doubled while sendmail operated; the system call rate jumped dramatically. System accounting information consistently showed sendmail as the top cpu user on the system.

3.1.2.  Network servers

      The network services provided in 4.2BSD add new capabilities to the system, but are not without cost. The system uses one daemon process to accept requests for each network service provided. The presence of many such daemons increases the number of active processes and files, and requires a larger configuration to support the same number of users. The overhead of routing and status updates can consume several percent of the cpu. Remote logins and shells incur more overhead than their local equivalents. For example, a remote login uses three processes and a pseudo-terminal handler in addition to the local hardware terminal handler. When using a screen editor, sending and echoing a single character involves four processes on two machines. The additional processes, context switching, network traffic, and terminal handler overhead can roughly triple the load presented by one local terminal user.

3.2.  System overhead

      To measure the costs of various functions in the kernel, a profiling system was run for a 17-hour period on one of our general timesharing machines. While this is not as reproducible as a synthetic workload, it certainly represents a realistic test. This test was run on several occasions over a three month period. Despite the long periods of time that elapsed between the test runs, the shapes of the profiles, as measured by the number of times each system call entry point was called, were remarkably similar.

      These profiles turned up several bottlenecks that are discussed in the next section. Several of these were new to 4.2BSD, but most were caused by overloading of mechanisms that worked acceptably well in previous BSD systems. The general conclusion from our measurements was that the ratio of system to user time had increased from 45% system / 55% user in 4.1BSD to 57% system / 43% user in 4.2BSD.

3.2.1.  Micro-operation benchmarks

      To compare certain basic system operations between 4.1BSD and 4.2BSD a suite of benchmark programs was constructed and run on a VAX-11/750 with 4.5 megabytes of physical memory and two disks on a MASSBUS controller. Tests were run with the machine operating in single user mode under both 4.1BSD and 4.2BSD. Paging was localized to the drive where the root file system was located.

      The benchmark programs were modeled after the Kashtan benchmarks [Kashtan80], with identical sources compiled under each system. The programs and their intended purpose are described briefly before the presentation of the results. The benchmark scripts were run twice, with the results shown as the average of the two runs. The source code for each program and the shell scripts used during the benchmarks are included in the Appendix.

      The set of tests shown in Table 1 was concerned with system operations other than paging. The intent of most benchmarks is clear. The result of running signocsw is deducted from the csw benchmark to calculate the context switch overhead. The exec tests use two different jobs to gauge the cost of overlaying a larger program with a smaller one and vice versa. The ``null job'' and ``big job'' differ solely in the size of their data segments, 1 kilobyte versus 256 kilobytes. In both cases the text segment of the parent is larger than that of the child.[note 5] All programs were compiled into the default load format that causes the text segment to be demand paged out of the file system and shared between processes.

Test                     Description
syscall                  perform 100,000 getpid system calls
csw                      perform 10,000 context switches using signals
signocsw                 send 10,000 signals to yourself
pipeself4                send 10,000 4-byte messages to yourself
pipeself512              send 10,000 512-byte messages to yourself
pipediscard4             send 10,000 4-byte messages to child who discards
pipediscard512           send 10,000 512-byte messages to child who discards
pipeback4                exchange 10,000 4-byte messages with child
pipeback512              exchange 10,000 512-byte messages with child
forks0                   fork-exit-wait 1,000 times
forks1k                  sbrk(1024), fault page, fork-exit-wait 1,000 times
forks100k                sbrk(102400), fault pages, fork-exit-wait 1,000 times
vforks0                  vfork-exit-wait 1,000 times
vforks1k                 sbrk(1024), fault page, vfork-exit-wait 1,000 times
vforks100k               sbrk(102400), fault pages, vfork-exit-wait 1,000 times
execs0null               fork-exec ``null job''-exit-wait 1,000 times
execs0null (1K env)      execs0null above, with 1K environment added
execs1knull              sbrk(1024), fault page, fork-exec ``null job''-exit-wait 1,000 times
execs1knull (1K env)     execs1knull above, with 1K environment added
execs100knull            sbrk(102400), fault pages, fork-exec ``null job''-exit-wait 1,000 times
vexecs0null              vfork-exec ``null job''-exit-wait 1,000 times
vexecs1knull             sbrk(1024), fault page, vfork-exec ``null job''-exit-wait 1,000 times
vexecs100knull           sbrk(102400), fault pages, vfork-exec ``null job''-exit-wait 1,000 times
execs0big                fork-exec ``big job''-exit-wait 1,000 times
execs1kbig               sbrk(1024), fault page, fork-exec ``big job''-exit-wait 1,000 times
execs100kbig             sbrk(102400), fault pages, fork-exec ``big job''-exit-wait 1,000 times
vexecs0big               vfork-exec ``big job''-exit-wait 1,000 times
vexecs1kbig              sbrk(1024), fault pages, vfork-exec ``big job''-exit-wait 1,000 times
vexecs100kbig            sbrk(102400), fault pages, vfork-exec ``big job''-exit-wait 1,000 times

Table 1. Kernel Benchmark programs.
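
      To make the measurement technique concrete, the following is a minimal sketch of a signocsw-style benchmark: it delivers 10,000 signals to the invoking process and times the loop. It is illustrative only and is not the Appendix source; the handler, timing method, and output format are our own choices.

/*
 * Minimal sketch of a signocsw-style benchmark (not the Appendix
 * source): deliver 10,000 signals to the invoking process and time
 * the loop.
 */
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static volatile long handled;

static void
catcher(int sig)
{
	(void)sig;
	handled++;
}

int
main(void)
{
	const long n = 10000;		/* 10,000 signals, as in Table 1 */
	struct timeval start, end;
	double elapsed;
	long i;

	signal(SIGUSR1, catcher);	/* install a trivial handler */
	gettimeofday(&start, NULL);
	for (i = 0; i < n; i++)
		kill(getpid(), SIGUSR1);	/* signal ourselves: no context switch */
	gettimeofday(&end, NULL);
	elapsed = (end.tv_sec - start.tv_sec) +
	    (end.tv_usec - start.tv_usec) / 1e6;
	printf("%ld signals (%ld handled) in %.3f seconds\n", n, handled, elapsed);
	return 0;
}

Subtracting the time of a run like this from a run of the csw benchmark isolates the context switch cost, as described above.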

      The results of these tests are shown in Table 2. If the 4.1BSD results are scaled to reflect their being run on a VAX-11/750, they correspond closely to those found in [Joy80].[note 7]

                       Berkeley Software Distribution UNIX Systems
                            Elapsed Time             User Time             System Time
Test                      4.1    4.2    4.3      4.1    4.2    4.3      4.1    4.2    4.3
syscall                  28.0   29.0   23.0      4.5    5.3    3.5     23.9   23.7   20.4
csw                      45.0   60.0   45.0      3.5    4.3    3.3     19.5   25.4   19.0
signocsw                 16.5   23.0   16.0      1.9    3.0    1.1     14.6   20.1   15.2
pipeself4                21.5   29.0   26.0      1.1    1.1    0.8     20.1   28.0   25.6
pipeself512              47.5   59.0   55.0      1.2    1.2    1.0     46.1   58.3   54.2
pipediscard4             32.0   42.0   36.0      3.2    3.7    3.0     15.5   18.8   15.6
pipediscard512           61.0   76.0   69.0      3.1    2.1    2.0     29.7   36.4   33.2
pipeback4                57.0   75.0   66.0      2.9    3.2    3.3     25.1   34.2   29.7
pipeback512             110.0  138.0  125.0      3.1    3.4    2.2     52.2   65.7   57.7
forks0                   37.5   41.0   22.0      0.5    0.3    0.3     34.5   37.6   21.5
forks1k                  40.0   43.0   22.0      0.4    0.3    0.3     36.0   38.8   21.6
forks100k               217.5  223.0  176.0      0.7    0.6    0.4    214.3  218.4  175.2
vforks0                  34.5   37.0   22.0      0.5    0.6    0.5     27.3   28.5   17.9
vforks1k                 35.0   37.0   22.0      0.6    0.8    0.5     27.2   28.6   17.9
vforks100k               35.0   37.0   22.0      0.6    0.8    0.6     27.6   28.9   17.9
execs0null               97.5   92.0   66.0      3.8    2.4    0.6     68.7   82.5   48.6
execs0null (1K env)     197.0  229.0   75.0      4.1    2.6    0.9    167.8  212.3   62.6
execs1knull              99.0  100.0   66.0      4.1    1.9    0.6     70.5   86.8   48.7
execs1knull (1K env)    199.0  230.0   75.0      4.2    2.6    0.7    170.4  214.9   62.7
execs100knull           283.5  278.0  216.0      4.8    2.8    1.1    251.9  269.3  202.0
vexecs0null             100.0   92.0   66.0      5.1    2.7    1.1     63.7   76.8   45.1
vexecs1knull            100.0   91.0   66.0      5.2    2.8    1.1     63.2   77.1   45.1
vexecs100knull          100.0   92.0   66.0      5.1    3.0    1.1     64.0   77.7   45.6
execs0big               129.0  201.0  101.0      4.0    3.0    1.0    102.6  153.5   92.7
execs1kbig              130.0  202.0  101.0      3.7    3.0    1.0    104.7  155.5   93.0
execs100kbig            318.0  385.0  263.0      4.8    3.1    1.1    286.6  339.1  247.9
vexecs0big              128.0  200.0  101.0      4.6    3.5    1.6     98.5  149.6   90.4
vexecs1kbig             125.0  200.0  101.0      4.7    3.5    1.3     98.9  149.3   88.6
vexecs100kbig           126.0  200.0  101.0      4.2    3.4    1.3     99.5  151.0   89.0

Table 2. Kernel Benchmark results (all times in seconds).

      In studying the measurements we found that the basic system call and context switch overhead did not change significantly between 4.1BSD and 4.2BSD. The higher signocsw times were caused by the changes to the signal interface, which added a subroutine invocation to each call as well as complexity to the system's implementation.

      The times for the use of pipes are significantly higher under 4.2BSD because of their implementation on top of the interprocess communication facilities. Under 4.1BSD pipes were implemented without the complexity of the socket data structures and with simpler code. Further, while not obviously a factor here, 4.2BSD pipes have less system buffer space provided them than 4.1BSD pipes.
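
      For reference, a pipeself-style test can be approximated by the sketch below, which pushes 10,000 small messages through a pipe within a single process. It is illustrative only (the actual benchmark sources are in the Appendix); the message size constant and error handling are our own.

/*
 * Minimal sketch of a pipeself-style benchmark: write and then read
 * back N small messages through a pipe within a single process.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const int n = 10000;	/* 10,000 messages, as in Table 1 */
	const int msgsize = 4;	/* the 4-byte case; 512 for pipeself512 */
	char buf[512];
	int fds[2], i;

	if (pipe(fds) < 0) {
		perror("pipe");
		exit(1);
	}
	memset(buf, 'x', sizeof(buf));
	for (i = 0; i < n; i++) {
		/* write the message, then immediately read it back */
		if (write(fds[1], buf, msgsize) != msgsize ||
		    read(fds[0], buf, msgsize) != msgsize) {
			perror("pipe i/o");
			exit(1);
		}
	}
	printf("moved %d %d-byte messages through a pipe\n", n, msgsize);
	return 0;
}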

      The exec tests shown in Table 2 were performed with 34 bytes of environment information under 4.1BSD and 40 bytes under 4.2BSD. To determine the cost of passing data through the environment, the execs0null and execs1knull tests were rerun with 1065 additional bytes of data. The results are shown in Table 3.

                        Real            User           System
Test               4.1     4.2      4.1    4.2     4.1     4.2
execs0null       197.0   229.0      4.1    2.6   167.8   212.3
execs1knull      199.0   230.0      4.2    2.6   170.4   214.9

Table 3. Benchmark results with ``large'' environment (all times in seconds).

These results show that passing argument data through the environment is significantly slower under 4.2BSD than under 4.1BSD: 121 ms/byte versus 93 ms/byte. Even using this factor to adjust the basic overhead of an exec system call, this facility is more costly under 4.2BSD than under 4.1BSD.
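
(These per-byte figures appear to follow directly from Tables 2 and 3: for execs1knull under 4.2BSD the extra 1065 bytes of environment added 214.9 - 86.8 = 128.1 seconds of system time over the 1,000 execs, or roughly 120 ms per byte, while under 4.1BSD the corresponding increase was 170.4 - 70.5 = 99.9 seconds, or roughly 94 ms per byte, consistent with the numbers quoted above.)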

3.2.2.  Path name translation

      The single most expensive function performed by the kernel is path name translation. This has been true in almost every UNIX kernel [Mosher80]; we find that our general timesharing systems do about 500,000 name translations per day.

      Name translations became more expensive in 4.2BSD for several reasons. The single most expensive addition was the symbolic link. Symbolic links have the effect of increasing the average number of components in the path names to be translated. As an insidious example, consider a system manager who decides to change /tmp to be a symbolic link to /usr/tmp. A name such as /tmp/tmp1234 that previously required two component translations now requires four component translations plus the cost of reading the contents of the symbolic link.
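
      The extra translation cost is easy to observe from user code. The sketch below is our own illustration, not from the paper; the paths and iteration count are invented. It times repeated stat(2) calls on a direct path and on an equivalent path that passes through a symbolic link.

/*
 * Illustrative sketch: compare the cost of name translation through a
 * symbolic link against a direct path by timing repeated stat(2) calls.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>

static double
time_lookups(const char *path, int n)
{
	struct stat sb;
	struct timeval start, end;
	int i;

	gettimeofday(&start, NULL);
	for (i = 0; i < n; i++)
		(void)stat(path, &sb);	/* each call runs namei on 'path' */
	gettimeofday(&end, NULL);
	return (end.tv_sec - start.tv_sec) +
	    (end.tv_usec - start.tv_usec) / 1e6;
}

int
main(void)
{
	const int n = 100000;

	/* e.g. after: ln -s /usr/tmp /tmp-link */
	printf("direct:  %.2f sec\n", time_lookups("/usr/tmp", n));
	printf("symlink: %.2f sec\n", time_lookups("/tmp-link", n));
	return 0;
}

Each stat forces a full name translation, so the difference between the two times approximates the cost of the added components plus reading the link.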

      The new directory format also changes the characteristics of name translation. The more complex format requires more computation to determine where to place new entries in a directory. Conversely, the additional information allows the system to look only at active entries when searching, so directories that had once grown large but currently contain few active entries can be searched quickly. The new format also stores the length of each name, so that costly string comparisons are done only on names that are the same length as the name being sought.
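
      The following sketch illustrates the length-first comparison idea; it is not the 4.2BSD directory code, and the entry layout shown is simplified (real entries are variable length).

/*
 * Sketch of a length-first directory search: by recording each name's
 * length in the entry, the expensive string comparison is done only
 * for names whose length matches the one being sought.
 */
#include <string.h>

struct dirent_like {
	int	d_ino;		/* 0 means an unused (inactive) entry */
	int	d_namlen;	/* stored name length */
	char	d_name[256];
};

/* Return the entry matching 'name', or 0 if it is not present. */
struct dirent_like *
dirsearch(struct dirent_like *entries, int nentries, const char *name)
{
	int i, len = strlen(name);

	for (i = 0; i < nentries; i++) {
		if (entries[i].d_ino == 0)
			continue;		/* skip inactive entries */
		if (entries[i].d_namlen != len)
			continue;		/* cheap length check first */
		if (memcmp(entries[i].d_name, name, len) == 0)
			return &entries[i];	/* full comparison only here */
	}
	return 0;
}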

      The net effect of the changes is that the average time to translate a path name in 4.2BSD is 24.2 milliseconds, representing 40% of the time spent processing system calls, 19% of the total cycles in the kernel, and 11% of all cycles executed on the machine. The times are shown in Table 4. We have no comparable times for namei under 4.1BSD, though they are certain to be significantly less.

part      time            % of kernel
self      14.3 ms/call       11.3%
child      9.9 ms/call        7.9%
total     24.2 ms/call       19.2%

Table 4. Call times for namei in 4.2BSD.

3.2.3.  Clock processing

      Nearly 25% of the time spent in the kernel is spent in the clock processing routines. (This is a clear indication that to avoid sampling bias when profiling the kernel with our tools we need to drive them from an independent clock.) These routines are responsible for implementing timeouts, scheduling the processor, maintaining kernel statistics, and tending various hardware operations such as draining the terminal input silos. Only minimal work is done in the hardware clock interrupt routine (at high priority); the rest is performed (at a lower priority) in a software interrupt handler scheduled by the hardware interrupt handler. In the worst case, with a clock rate of 100 Hz and with every hardware interrupt scheduling a software interrupt, the processor must field 200 interrupts per second. The overhead of simply trapping and returning is 3% of the machine cycles; figuring out that there is nothing to do requires an additional 2%.

3.2.4.  Terminal multiplexors

      The terminal multiplexors supported by 4.2BSD have programmable receiver silos that may be used in two ways. With the silo disabled, each character received causes an interrupt to the processor. Enabling the receiver silo allows the silo to fill before generating an interrupt, allowing multiple characters to be read for each interrupt. At low rates of input, received characters will not be processed for some time unless the silo is emptied periodically. The 4.2BSD kernel uses the input silos of each terminal multiplexor and empties each silo on each clock interrupt. This allows high input rates without the cost of per-character interrupts while assuring low latency. However, as character input rates on most machines are usually low (about 25 characters per second), this can result in excessive overhead. At the current clock rate of 100 Hz, a machine with 5 terminal multiplexors configured makes 500 calls to the receiver interrupt routines per second. In addition, to achieve acceptable input latency for flow control, each clock interrupt must schedule a software interrupt to run the silo draining routines.[note 9] [note 12] This implies that the worst-case estimate given above for clock processing is in fact the normal overhead of clock processing.

3.2.5.  Process table management

      In 4.2BSD there are numerous places in the kernel where a linear search of the process table is performed; for example, process priorities are recomputed once per second in schedcpu by scanning the entire table, and signal delivery must locate its target processes the same way.

These linear searches can incur significant overhead. The rule for calculating the size of the process table is

	nproc = 20 + 8 * maxusers

which means a 48-user system will have a 404-slot process table. With the addition of network services in 4.2BSD, as many as a dozen server processes may be maintained simply to await incoming requests. These servers are normally created at boot time, which causes them to be allocated slots near the beginning of the process table. This means that process table searches under 4.2BSD are likely to take significantly longer than under 4.1BSD. System profiling shows that as much as 20% of the time spent in the kernel on a loaded system (a VAX-11/780) can be spent in schedcpu and, on average, 5-10% of the kernel time is spent in schedcpu. The other searches of the proc table are similarly affected. This shows that the system can no longer tolerate linear searches of the process table.
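
      One common remedy for such scans, sketched below purely as an illustration (this is not the kernel's code, and the table and hash sizes are invented), is to chain process table entries on a hash of the process id so that a pid lookup touches only a short chain rather than all 404 slots.

/*
 * Illustrative sketch: replacing a linear scan of a 404-slot process
 * table with chains hashed on process id.
 */
#include <stdlib.h>

#define NPROC	(20 + 8 * 48)	/* 20 + 8 * maxusers, maxusers = 48, i.e. 404 */
#define PIDHSZ	64		/* hash buckets; power of two, chosen arbitrarily */
#define PIDHASH(pid)	((pid) & (PIDHSZ - 1))

struct proc {
	int		p_pid;
	struct proc	*p_hash;	/* next process on this hash chain */
};

static struct proc proc[NPROC];
static struct proc *pidhash[PIDHSZ];

/* Linear search: examines up to NPROC slots. */
struct proc *
pfind_linear(int pid)
{
	int i;

	for (i = 0; i < NPROC; i++)
		if (proc[i].p_pid == pid)
			return &proc[i];
	return NULL;
}

/* Hashed search: examines only the processes on one short chain. */
struct proc *
pfind_hashed(int pid)
{
	struct proc *p;

	for (p = pidhash[PIDHASH(pid)]; p != NULL; p = p->p_hash)
		if (p->p_pid == pid)
			return p;
	return NULL;
}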

3.2.6.  File system buffer cache

      The trace facilities described in section 2.3 were used to gather statistics on the performance of the buffer cache. We were interested in measuring the effectiveness of the cache and the read-ahead policies. With the file system block size in 4.2BSD four to eight times that of a 4.1BSD file system, we were concerned that large amounts of read-ahead might be performed without being used. Also, we were interested in seeing if the rules used to size the buffer cache at boot time were severely affecting the overall cache operation.

      The tracing package was run for three hours during a peak mid-afternoon period on a VAX-11/780 with four megabytes of physical memory. This resulted in a buffer cache containing 400 kilobytes of memory spread among 50 to 200 buffers (the actual number of buffers depends on the size mix of disk blocks being read at any given time). The pertinent configuration information is shown in Table 5.

Controller      Drive            Device   File System
DEC MASSBUS     DEC RP06         hp0d     /usr
                                 hp0b     swap
Emulex SC780    Fujitsu Eagle    hp1a     /usr/spool/news
                                 hp1b     swap
                                 hp1e     /usr/src
                                 hp1d     /u0 (users)
                Fujitsu Eagle    hp2a     /tmp
                                 hp2b     swap
                                 hp2d     /u1 (users)
                Fujitsu Eagle    hp3a     /

Table 5. Active file systems during buffer cache tests.

      During the test period the load average ranged from 2 to 13 with an average of 5. The system had no idle time, 43% user time, and 57% system time. The system averaged 90 interrupts per second (excluding the system clock interrupts), 220 system calls per second, and 50 context switches per second (40 voluntary, 10 involuntary).

      The active virtual memory (the sum of the address space sizes of all jobs that have run in the previous twenty seconds) over the period ranged from 2 to 6 megabytes with an average of 3.5 megabytes. There was no swapping, though the page daemon was inspecting about 25 pages per second.

      On average 250 requests to read disk blocks were initiated per second. These include read requests for file blocks made by user programs as well as requests initiated by the system. System reads include requests for indexing information to determine where a file's next data block resides, file system layout maps to allocate new data blocks, and requests for directory contents needed to do path name translations.

      On average, an 85% cache hit rate was observed for read requests. Thus only 37 disk reads were initiated per second. In addition, 5 read-ahead requests were made each second filling about 20% of the buffer pool. Despite the policies to rapidly reuse read-ahead buffers that remain unclaimed, more than 90% of the read-ahead buffers were used.

      These measurements showed that the buffer cache was working effectively. Independent tests have also suggested that the size of the buffer cache may be reduced significantly on memory-poor systems without severe effects [Shannon83]; we have not yet tested this hypothesis ourselves.

3.2.7.  Network subsystem

      The overhead associated with the network facilities found in 4.2BSD is often difficult to gauge without profiling the system. This is because most input processing is performed in modules scheduled with software interrupts. As a result, the system time spent performing protocol processing is rarely attributed to the processes that actually receive the data. Since the protocols supported by 4.2BSD can involve significant overhead, this was a serious concern. Results from a profiled kernel show an average of 5% of the system time is spent performing network input and timer processing in our environment (a 3Mb/s Ethernet with most traffic using TCP). This figure can vary significantly depending on the network hardware used, the average message size, and whether packet reassembly is required at the network layer. On one machine we profiled over a 17-hour period (our gateway to the ARPANET), 206,000 input messages accounted for 2.4% of the system time, while another 0.6% of the system time was spent performing protocol timer processing. This machine was configured with an ACC LH/DH IMP interface and a DMA 3Mb/s Ethernet controller.

      The performance of TCP over slower long-haul networks was degraded substantially by two problems. The first was a bug that prevented round-trip timing measurements from being made, thus increasing retransmissions unnecessarily. The second was the maximum segment size chosen by TCP, which was well-tuned for the Ethernet but poorly chosen for the ARPANET, where it caused packet fragmentation. (The maximum segment size was actually negotiated upwards to a value that resulted in excessive fragmentation.)

      When benchmarked in Ethernet environments the main memory buffer management of the network subsystem presented some performance anomalies. The overhead of processing small ``mbufs'' severely affected throughput for a substantial range of message sizes. Although most system utilities made use of the throughput-optimal 1024-byte size, user processes faced large degradations for some arbitrary sizes. This was especially true for TCP/IP transmissions [Cabrera84, Cabrera85].
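
      As an illustration of the workaround available to user processes (our own sketch; the function name and the assumption of a connected TCP socket are ours), writes can be issued in the 1024-byte chunks noted above rather than in arbitrary sizes.

/*
 * Illustrative sketch: issue socket writes in 1024-byte chunks, the
 * size the system utilities of the time used, instead of arbitrary
 * sizes.  'sock' is assumed to be a connected TCP socket.
 */
#include <sys/types.h>
#include <unistd.h>

#define CHUNK	1024	/* the throughput-optimal write size noted above */

/* Write 'len' bytes from 'buf' to 'sock' in CHUNK-sized pieces. */
ssize_t
chunked_write(int sock, const char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		size_t want = len - off;
		if (want > CHUNK)
			want = CHUNK;
		ssize_t n = write(sock, buf + off, want);
		if (n <= 0)
			return -1;	/* error or connection closed */
		off += n;
	}
	return (ssize_t)off;
}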

3.2.8.  Virtual memory subsystem

      We ran a set of tests intended to exercise the virtual memory system under both 4.1BSD and 4.2BSD. The tests are described in Table 6. The test programs dynamically allocated a 7.3 megabyte array (using sbrk(2)), then referenced pages in the array either sequentially, in a purely random fashion, or such that the distance between successive pages accessed was randomly selected from a Gaussian distribution. In the last case, successive runs were made with increasing standard deviations.

Test             Description
seqpage          sequentially touch pages, 10 iterations
seqpage-v        as above, but first make vadvise(2) call
randpage         touch random page 30,000 times
randpage-v       as above, but first make vadvise call
gausspage.1      30,000 Gaussian accesses, standard deviation of 1
gausspage.10     as above, standard deviation of 10
gausspage.30     as above, standard deviation of 30
gausspage.40     as above, standard deviation of 40
gausspage.50     as above, standard deviation of 50
gausspage.60     as above, standard deviation of 60
gausspage.80     as above, standard deviation of 80
gausspage.inf    as above, standard deviation of 10,000

Table 6. Paging benchmark programs.
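
      A randpage-style test can be approximated with the sketch below. It is illustrative only and is not the Appendix source; the assumed page size, the fixed seed, and the output format are our own choices.

/*
 * Minimal sketch of a randpage-style test: grow the data segment by
 * about 7.3 megabytes with sbrk(2) and touch randomly chosen pages.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ARRAYSIZE	(7300 * 1024L)	/* roughly 7.3 megabytes */
#define PAGESIZE	4096		/* assumed page size */
#define NTOUCH		30000		/* 30,000 accesses, as in Table 6 */

int
main(void)
{
	char *base;
	long npages = ARRAYSIZE / PAGESIZE;
	long i;

	base = sbrk(ARRAYSIZE);		/* extend the data segment */
	if (base == (char *)-1) {
		perror("sbrk");
		exit(1);
	}
	srandom(1);			/* fixed seed for repeatability */
	for (i = 0; i < NTOUCH; i++) {
		long page = random() % npages;
		base[page * PAGESIZE] = 1;	/* fault the page in if not resident */
	}
	printf("touched %d random pages out of %ld\n", NTOUCH, npages);
	return 0;
}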

      The results in Table 7 show how the additional memory requirements of 4.2BSD can generate more work for the paging system. Under 4.1BSD, the system used 0.5 of the 4.5 megabytes of physical memory on the test machine; under 4.2BSD it used nearly 1 megabyte of physical memory.[note 13] This resulted in more page faults and, hence, more system time. To establish a common ground on which to compare the paging routines of each system, we compare instead the average page-fault service times for those test runs that had a statistically significant number of random page faults. These figures, shown in Table 8, show no significant difference between the two systems in the area of page fault servicing. We currently have no explanation for the results of the sequential paging tests.

                      Real          User           System        Page Faults
Test              4.1    4.2     4.1    4.2      4.1     4.2      4.1     4.2
seqpage           959   1126    16.7   12.8    197.0   213.0    17132   17113
seqpage-v         579    812     3.8    5.3    216.0   237.7     8394    8351
randpage          571    569     6.7    7.6     64.0    77.2     8085    9776
randpage-v        572    562     6.1    7.3     62.2    77.5     8126    9852
gausspage.1        25     24    23.6   23.8      0.8     0.8        8       8
gausspage.10       26     26    22.7   23.0      3.2     3.6        2       2
gausspage.30       34     33    25.0   24.8      8.6     8.9        2       2
gausspage.40       42     81    23.9   25.0     11.5    13.6        3     260
gausspage.50      113    175    24.2   26.2     19.6    26.3      784    1851
gausspage.60      191    234    27.6   26.7     27.4    36.0     2067    3177
gausspage.80      312    329    28.0   27.9     41.5    52.0     3933    5105
gausspage.inf     619    621    82.9   85.6     68.3    81.5     8046    9650

Table 7. Paging benchmark results (all times in seconds).

                   Page Faults         PFST
Test               4.1     4.2      4.1    4.2
randpage          8085    9776      791    789
randpage-v        8126    9852      765    786
gausspage.inf     8046    9650      848    844

Table 8. Page fault service times (all times in microseconds).