newvm, section 2.

2. User Interface

This section outlines our new virtual memory interface as it is currently envisioned. The details of the system call interface are contained in Appendix A.

2.1. Regions

The virtual memory interface is designed to support both large, sparse address spaces as well as small, densely-used address spaces. In this context, ``small'' is an address space roughly the size of the physical memory on the machine, while ``large'' may extend up to the maximum addressability of the machine. A process may divide its address space up into a number of regions. Initially a process begins with four regions; a shared read-only fill-on-demand region with its text, a private fill-on-demand region for its initialized data, a private zero-fill-on-demand region for its uninitialized data and heap, and a private zero-fill-on-demand region for its stack. In addition to these regions, a process may allocate new ones. The regions may not overlap and the system may impose an alignment constraint, but the size of the region should not be limited beyond the constraints of the size of the virtual address space.

Each new region may be mapped either as private or shared. When it is privately mapped, changes to the contents of the region are not reflected to any other process that map the same region. Regions may be mapped read-only or read-write. As an example, a shared library would be implemented as two regions; a shared read-only region for the text, and a private read-write region for the global variables associated with the library.

A region may be allocated with one of several allocation strategies. It may map some memory hardware on the machine such as a frame buffer. Since the hardware is responsible for storing the data, such regions must be exclusive use if they are privately mapped.

A region can map all or part of a file. As the pages are first accessed, the region is filled in with the appropriate part of the file. If the region is mapped read-write and shared, changes to the contents of the region are reflected back into the contents of the file. If the region is read-write but private, changes to the region are copied to a private page that is not visible to other processes mapping the file, and these modified pages are not reflected back to the file.

The final type of region is ``anonymous memory''. Uninitialed data uses such a region, privately mapped; it is zero-fill-on-demand and its contents are abandoned when the last reference is dropped. Unlike a region that is mapped from a file, the contents of an anonymous region will never be read from or written to a disk unless memory is short and part of the region must be paged to a swap area. If one of these regions is mapped shared, then all processes see the changes in the region. This difference has important performance considerations; the overhead of reading, flushing, and possibly allocating a file is much higher than simply zeroing memory.

If several processes wish to share a region, then they must have some way of rendezvousing. For a mapped file this is easy; the name of the file is used as the rendezvous point. However, processes may not need the semantics of mapped files nor be willing to pay the overhead associated with them. For anonymous memory they must use some other rendezvous point. Our current interface allows processes to associate a descriptor with a region, which it may then pass to other processes that wish to attach to the region. Such a descriptor may be bound into the UNIX file system name space so that other processes can find it just as they would with a mapped file.

2.2. Shared memory as high speed interprocess communication

The primary use envisioned for shared memory is to provide a high speed interprocess communication (IPC) mechanism between cooperating processes. Existing IPC mechanisms (i.e. pipes, sockets, or streams) require a system call to hand off a set of data destined for another process, and another system call by the recipient process to receive the data. Even if the data can be transferred by remapping the data pages to avoid a memory to memory copy, the overhead of doing the system calls limits the throughput of all but the largest transfers. Shared memory, by contrast, allows processes to share data at any level of granularity without system intervention.

However, to maintain all but the simplest of data structures, the processes must serialize their modifications to shared data structures if they are to avoid corrupting them. This serialization is typically done with semaphores. Unfortunately, most implementations of semaphores are done with system calls. Thus processes are once again limited by the need to do two system calls per transaction, one to lock the semaphore, the second to release it. The net effect is that the shared memory model provides little if any improvement in interprocess bandwidth.

To achieve a significant improvement in interprocess bandwidth requires a large decrease in the number of system calls needed to achieve the interaction. In profiling applications that use serialization locks such as the UNIX kernel, one typically finds that most locks are not contested. Thus if one can find a way to avoid doing a system call in the case in which a lock is not contested, one would expect to be able to dramatically reduce the number of system calls needed to achieve serialization.

In our design, cooperating processes manage their semaphores in their own address space. In the typical case, a process executes an atomic test-and-set instruction to acquire a lock, finds it free, and thus is able to get it. Only in the (rare) case where the lock is already set does the process need to do a system call to wait for the lock to clear. When a process is finished with a lock, it can clear the lock itself. Only if the ``WANT'' flag for the lock has been set is it necessary for the process to do a system call to cause the other process(es) to be awakened.

Another issue that must be considered is portability. Some computers require access to special hardware to implement atomic interprocessor test-and-set. For such machines the setting and clearing of locks would all have to be done with system calls; applications could still use the same interface without change, though they would tend to run slowly.

The other issue of compatibility is with System V's semaphore implementation. Since the System V interface has been in existence for several years, and applications have been built that depend on this interface, it is important that this interface also be available. Although the interface is based on system calls for both setting and clearing locks, the same interface can be obtained using our interface without system calls in most cases.

This implementation can be achieved as follows. System V allows entire sets of semaphores to be set concurrently. If any of the locks are unavailable, the process is put to sleep until they all become available. Under our paradigm, a single additional semaphore is defined that serializes access to the set of semaphores being simulated. Once obtained in the usual way, the set of semaphores can be inspected to see if the desired ones are available. If they are available, they are set, the guardian semaphore is released and the process proceeds. If one or more of the requested set is not available, the guardian semaphore is released and the process selects an unavailable semaphores for which to wait. On being reawakened, the whole selection process must be repeated.

In all the above examples, there appears to be a race condition. Between the time that the process finds that a semaphore is locked, and the time that it manages to call the system to sleep on the semaphore another process may unlock the semaphore and issue a wakeup call. Luckily the race can be avoided. The insight that is critical is that the process and the kernel agree on the physical byte of memory that is being used for the semaphore. The system call to put a process to sleep takes a pointer to the desired semaphore as its argument so that once inside the kernel, the kernel can repeat the test-and-set. If the lock has cleared (and possibly the wakeup issued) between the time that the process did the test-and-set and eventually got into the sleep request system call, then the kernel immediately resumes the process rather than putting it to sleep. Thus the only problem to solve is how the kernel interlocks between testing a semaphore and going to sleep; this problem has already been solved on existing systems.