No subject

David G. Hough on validgh validgh!dghaSun.COM
Mon Feb 4 10:46:31 PST 1991


# This is a shell archive.  Remove anything before this line,
# then unpack it by saving it in a file and typing "sh file".
#
# Wrapped by validgh!dgh on Mon Feb  4 10:46:31 PST 1991
# Contents:  odell
 
echo x - odell
sed 's/^a//' > "odell" <<'a//E*O*F odell//'
a.pp
a.ls 2
a.fo ;From USENIX Proceedings;-%-;Anaheim June 1990
a.po +.75i
a.sp 5
a.ce 99
a.nf
\s+4\fB
Putting UNIX on Very Fast Computers

or

What the Speed of Light Means to Your Favorite System Call

\s-2\fP
Michael D. O'Dell
Consultant
\s-2
a.fi
a.ce 0
a.sh 1 "Abstract"
a.pp
A computer with a 250 MHz clock and built from leading-edge
technology works in fundamentally different ways compared with
a one-chip CMOS VLSI processor
clocking at less than 50 MHz.
The interactions between a UNIX implementation and its supporting
hardware have always been quite subtle and remain a considerable headache for
those charged with porting the system.
But in addition to the imprecisions of fuzzy functional definitions
for some key system facilities,
the laws of physics conspire to make the marriage of
very fast computers and modern UNIX systems an even more interesting
challenge than it would normally be.
This paper discusses some of the matchmaking necessary to achieve
a matrimonious accommodation.
a.sh 1 "Introduction"
a.pp
Computers which operate on the hairy edge of physics
are radically different beasts than the bus-based
minicomputers where UNIX was born. 
Many machines of this latter class are
now called ``workstations'' or ``servers.''
Traditional supercomputers are not ideal platforms for UNIX,
given their penchant for word addressing and limited
memory management hardware.
For many years, though,
they were the fastest game in town by a large margin,
so the architectural inconveniences in the name of raw speed were
simply endured.
Recently, though, this has started to change.
High-performance implementations
of what one might call the new ``commodity instruction sets'' hold the
promise of providing supercomputer-class scalar performance
with all the amenities that modern UNIX systems want to see
in the underlying hardware (support for demand-paged virtual
memory, for example).
a.pp
The modern reality is that,
rightly or wrongly, 
the software base available for a machine
goes a long way toward determining its success in the marketplace,
so building a new computer with Yet Another Instruction Set,
and therefore no existing software base,
is an undertaking fraught with enormous risks.
To the reasonably risk-averse person (e.g., the people who fund such projects),
it is clearly advantageous to \fIleverage*\fP 
a.(f
* Boardroom talk for ``use'', ``run'', or ``execute unmodified.''
a.)f
the large installed software
base associated with an established instruction set and operating system
implementation.
(The IBM PC clone-builders are some of the most successful proponents of
this approach.)
Going even further, if the new machine is sufficiently like its
archetype,
one can even leverage a very large part of the
base operating system software implementation.
This approach has the potential of maximizing return on investment
while minimizing risks.
The overall implication is a very high degree of compatibility even
at the level seen by the kernel,
with perfect compatibility when seen by user-mode programs.
As was discovered, however, this ``perfect user mode compatibility''
can be a rather elusive goal when the differences between the
underlying implementations are dramatic enough.
a.pp 
There are two ways to pursue the goal of building a mainframe-class,
very fast instruction-set clone.
The first is to use the highest performance
off-the-shelf processor chipset available as the core of the
machine and then concentrate on the rest of the computer to add value.
This is a potentially workable approach because the instruction
pipeline is only a small part of a large computer \- the memory and I/O
systems of a mainframe-class machine can easily consume
the lion's share of the logic packages and the engineering effort
needed to build the whole machine,
particularly if the processor is but a few VLSI chips.
The big advantage here
is that many of the subtle instruction-set compatibility issues are resolved
by simply buying a working processor chip.
One problem with this approach, however,
is that such a machine probably can't go dramatically faster than
any other machine built from the same, commonly available chips.
If one intends for performance to be the fundamental,
market-distinguishing characteristic of a new machine,
this may not be the best approach.
a.pp
The other alternative is to license a machine architecture from
one of the several purveyors,
and then set out to build a new implementation which
squeezes every iota of performance
from whatever base technology one feels comfortable choosing.
In this approach, the down side is that you must now build
\fIall\fP of the computer;
the up side which may make the approach worth the risk
is that one now gets a \fIchance\fP to achieve a real performance margin
over machines built with commodity processor parts.
At Prisma Computers, Inc., we embarked on this second path \- 
construction of a supercomputer which would be binary compatible with a
popular work\%station family, and built from ``whole cloth.''
For the purpose of this effort, the definition of
\fIbinary compatible\fP was beguilingly simple: if a binary program
runs on one of the to-be-cloned workstations,
it must run identically on the new machine,
only much, much faster.
a.pp
In the course of designing this computer and preparing an
operating system for it,
we discovered many strange and often
confounding things.
Most of these issues relate directly to what promises a machine
and its operating system must make to programs.
Particularly troubling are the dark corners or boundary conditions
in an architecture which ultimately are
revealed to a user program in great detail.
Programs often have deeply-ingrained notions about what they
expect to see in these revelations,
and what they have come to expect
of certain hitherto inadequately-specified operating
system features.
These features can be viewed as an example of ``inadequate law,''
so as with traditional jurisprudence,
one must fall back on the available historical precedent,
which is seldom more compelling than the statement,
``This feature has always worked on other machines,
so my program is free to use it!''
Programs which take this view of the world
make ``absolute binary compatibility'' an interesting concept
and a tricky goal to achieve.
a.pp
At this point, it is useful to examine a new player in the compatibility
game: the Application Binary Interface, or ABI, which was developed for
UNIX System V Release 4.
To understand the importance of the ABI, we need to look at history for
a moment.
One of the great success factors in the IBM PC world has been the large
software base available for the machines.
Historically, this has been achieved by everyone simply
building the same computer (as seen by programs). 
In the PC world, the definition of ``compatible''  has been:
if a program or operating system works on
a genuine IBM Personal Computer of some flavor,
it should work on a Brand-Y PC clone; if not, the Brand-Y machine is
broken.
The problem is that in the UNIX world, the design space is much, much
larger than the PC world, so expecting everyone to build similar
machines isn't reasonable. However, if UNIX is to take off,
it is vitally important that all UNIX systems
using the same instruction set be able to execute the same binaries, so
vendors can market
``shrink-wrapped'' user application software.
The ABI, then,
is an attempt to provide the needed compatibility at the user level
while not overly constraining the operating system implementation.
a.pp
An ABI is created for a specific architecture and defines \fIall\fP system
interfaces available to user-level programs 
on any implementation of System V Release 4
running on that architecture.
The operating system's internal structure, on the other hand,
is allowed to be changed as desired
(e.g., for efficiency or performance), as long as
the changes don't violate the ABI specification
seen by user-mode ABI-compliant
programs.
The goal, then, is that ABI-compliant binaries for a given architecture
should run on all implementations of that architecture
which provide the consistent, basic functionality specified by the ABI.
In general, the ABI specifications try to be as high-level as possible,
striving to avoid documenting details which should ideally be left
as implementation choices.
However, a great many things must be specified if programs are to achieve their
portability goals, so some of the problems addressed later in this paper
are still real issues.
The reason that Prisma didn't simply provide an ABI-compliant interface,
instead of the more rigorous PC-style absolute user compatibility, was 
simply one of timing.
The operating system for the Prisma P1 was not going to be System V Release 4
until well after the first shipments,
and a very critical business issue was leveraging
the existing large third-party
software catalog without any changes.
Some day, ABI compliance will be sufficient for a new machine,
but it wasn't in Prisma's 
development time frame.
a.sh 1 "The Rules of the Game"
a.pp
The Prisma P1 was originally designed to a target clock cycle of
4 nanoseconds, and since the P1 was a RISC machine, the fervent hope was that
it would execute very nearly one instruction per clock.
To give you a feel for this target clock rate,
a 4 nanosecond clock period translates to a 250 MHz clock frequency
(well above the
VHF television broadcast channels).
At these frequencies, signals are \fInot\fP digital.
It is quite true that the intent is that the signals 
represent ones and zeroes, 
but at these frequencies, the signals themselves are profoundly analog.
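a.pp
For the record, the arithmetic behind that conversion is simply the
reciprocal relation between clock period and clock frequency:
a.nf
	f = 1/T = 1 / (4 x 10^-9 s) = 2.5 x 10^8 Hz = 250 MHz
	T = 1/f = 1 / (250 x 10^6 Hz) = 4 x 10^-9 s = 4 ns
a.fi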
a.pp
If the machine is to run with a 4 nanosecond clock, the logic used in
the machine must have
loaded gate delays of about a hundred picoseconds,  with
the rise times of a 250 MHz ``square wave'' logic signal 
measured in a few tens of picoseconds.
The harmonics of such signals extend well above 2 GHz
into the serious microwave and radar frequencies.
This means that ``wires'' or ``traces'' on 
a circuit board are all transmission lines, and keeping the signals
contained in the desired paths
becomes an RF engineering task.
Because the signal traces are transmission lines, the
trace routing is further complicated by the
restriction that there can be no ``stub'' paths
branching off a main run to a non-collinear connection on the way.
Instead, the signals must be routed in a purely linearizable
``connect-dots-without-lifting-your-pencil'' path.
Complicating matters even further, the transmission line traces were 
``differential pairs,'' which means that instead of
running one wire from place to place, Prisma ran two, with one
wire going from one to zero
while the other goes from zero to one at the same time. The
logic signal is taken as the difference between the two lines (hence the name
``differential'')
instead of between a single line and ground as in the case of standard TTL
logic.
This technique has important electrical propagation characteristics
as well as some handy logical advantages \-
for instance, inverters are almost unheard of \- just reverse the connections!
These traces must be routed on the circuit board maintaining a
specific distance between them in order
to preserve the characteristic impedance,
while being routed as a pair introducing only an \fIeven\fP
number of half-twists en route to maintain signal polarity.
PC board routing software which can adequately do this complex job proved
quite hard to come by.
a.pp
On real circuit board material
manufacturable at finite cost,
electronic signals
propagate noticeably slower than
the speed of light in a vacuum.  
The rule of thumb for the board materials used in the P1 was that 1 inch of
circuit board trace eats 180 picoseconds of flight time, so 10 inches
of trace will eat about half the available clock cycle.
Given the differential-pair stub-free routing requirements,
it was frighteningly 
easy to get routes on a 4 inch by 4 inch board which came dangerously
close to 10 inches,
thereby leaving the logic in such a path about 2 nanoseconds
in which to do all its work.
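a.pp
That flight-time budget reduces to one subtraction; the sketch below,
in C, is a minimal rendering using the rule-of-thumb numbers quoted
above (the function name and the program itself are illustrative only).
a.nf
/* Timing budget for a signal path, using the rule of thumb of
 * 180 picoseconds of flight time per inch of board trace.
 * The constants are from the text; the code is illustrative. */
#include <stdio.h>

#define CLOCK_PS       4000.0   /* 4 ns clock cycle, in picoseconds */
#define FLIGHT_PS_INCH  180.0   /* flight time per inch of trace    */

static double logic_budget_ps(double trace_inches)
{
    return CLOCK_PS - trace_inches * FLIGHT_PS_INCH;
}

int main(void)
{
    /* A 10-inch route eats 1800 ps, leaving roughly 2 ns for logic. */
    printf("%.0f ps left for logic", logic_budget_ps(10.0));
    return 0;
}
a.fi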
a.pp
All of these signalling and routing requirements also apply to
connectors and cables.
The most vexing issues are that signals generally propagate
even more slowly through cables, making impedance control to
minimize reflections a constant concern.
These issues particularly complicate the
physical construction of tiny, fragile connectors which require
many, many pins to interconnect modules with ever-growing complexity.
a.pp
One of the most serious problems Prisma faced was the level of 
integration available in the Gallium Arsenide (GaAs) technology
being used for the machine.
The GaAs logic chips were medium- to large-scale integration, meaning that
a chip contained between
about 200 and 500 gates, as opposed to 
the tens-of-thousands of gates available on a CMOS VLSI chip.
If a function needed more
gates than would fit on a chip, the signal path had to go off-chip,
paying dearly for driving the signals across the circuit board instead
of across the chip surface.
This made the partitioning and physical floor plan of the logic 
and the resulting chips quite critical.
There were never enough gates,
so only functions which were
absolutely essential for correct behavior
could be done in the GaAs logic.
The processor had no choice but to be a RISC \- there wasn't
enough real estate to allow alternate ideas.
a.pp
One other characteristic of the interconnection constraints
had a profound architectural impact.
At the speeds Prisma was targeting, there is no such thing
as an arbitrated ``bus.''
The data interconnects in the P1 were nominally 64-bit wide,
4ns clock, simplex point-to-point data paths.
An ``address path''  was just a data path which 
carried address information as the data and usually required fewer bits
than the 64-bit data paths.
A full-duplex data path (two simplex paths which 
connect the same two parts of the machine but in opposite 
directions) required about 300 wires total (the signals are
all differential, so 64 bits of data require 128 wires each direction;
differential parity and control signals consumed the rest).
These paths were hideously expensive in both space and time,
so adding even one path to the implementation required an overwhelming argument
in its favor \-
on the order of
``The machine cannot possibly get the right answer if it's not there.''
Because there were no busses carrying a combined data and address path,
there was no convenient place to put useful trinkets like
EPROMs or UARTs which have come to be expected for mundane tasks
like booting and printing last-gasp death messages when something
goes horribly wrong.
This constraint, more than any other, forced the machine to
have a service processor.
(The kernel development group successfully avoided having anything to do with
the service processor beyond defining some requirements and a few
necessary interfaces needed between the ersatz ``ROM'' code
placed into P1 memory by the service processor and the kernel.)
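a.pp
The 300-wire figure for a full-duplex path can be checked with
some quick arithmetic; the split of the remainder between parity and
control shown below is inferred, since the exact breakdown is not given.
a.nf
	64 data bits x 2 (differential)        = 128 wires per direction
	128 wires x 2 directions (full duplex) = 256 wires
	differential parity and control        = ~44 wires (inferred)
	total                                  = ~300 wires
a.fi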
a.pp
The final, most important rule of the game was that the machine
had to be realizable
using parts which are actually available \- not vaporspecs for
non-existent chips,
but
a.ul real
DRAMs and
a.ul real
SRAMs and
a.ul real
logic chips
(or at least a real fabrication process),
running at speed, which one could buy (or make) in production quantities,
at rational prices.
This constraint is a genuine nuisance.
a.pp
All the rules from physics and engineering
discussed above induced two fundamental constraints which
profoundly affected everything in the design of the Prisma P1:

(1) Doing anything takes at least a clock, and

(2) Getting there to do it often takes at least a clock as well.

This means that the back end of the instruction pipe \fIcannot\fP
communicate with the front of the pipe in one clock cycle,
and therefore, there can be no ``atomic'' decisions whereby the instruction
fetch unit decides
what to do next based on the entire state of the CPU (e.g., take a
pagefault).
This presents a non-trivial problem.
a.sh 1 "Suggestions on How to Play the Game"
a.pp
The general problem to be addressed is that processing an instruction 
always takes too long,
either because the logic takes too long to get the answer, or because
the logic to be consulted is too far away as well as being too slow.
One of the most productive ways to hide latency is to increase parallelism,
thereby letting more operations progress at once.
This means that functions which are one clock on other machines
became pipelines on the P1 and therefore had to cope with multiple
outstanding operations.
a.pp
The biggest latency issue was the dramatic disparity between the speeds
of the memory system and the processor.  Since the machine was
to be relatively low cost and support large memories, hidden somewhere behind
the curtain had to be commodity 80 nanosecond DRAMs.
Note well that the 80 nanosecond number is \fIaccess\fP time,
not the transaction time,
which is on the order of 180-200 ns.
This limit drove the decision to make the memory system a 3-level hierarchy.
The first level instruction and data caches were integral
to the GaAs portion of the CPU.  It was indeed a dark day when it became apparent
that the GaAs RAMs we had planned to use in the primary caches
were not real, and that high-performance ECL ``474'' RAM parts would have to
be used instead.
This meant that the primary caches would have to become pipelined:
the first 4 nanosecond cycle did the cache tag access, and the second cycle
accessed the cache line and returned data.  The primary caches 
could start an operation every clock and the data cache would have to support
multiple outstanding cache misses.
To do this for both loads and stores, the primary data cache implemented
the equivalent of load-store locks for all of main memory.
The second level cache was a very large, interleaved, integrated physical
cache front-ending the third-level DRAM arrays.
I/O traffic went through the secondary cache with
some clever tricks to prevent cache-wiping during I/O bursts.
For a load, a hit in the secondary cache returned data to the CPU register
in about 15 cycles round-trip (updating the primary data cache en route),
while awakening the DRAMs from their deep slumber required 60+ cycles
round-trip.
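a.pp
One way to see what the hierarchy buys is a back-of-the-envelope average
memory access time.  The sketch below uses the round-trip latencies
quoted above; the hit rates are invented purely for illustration and are
not Prisma measurements.
a.nf
/* Average memory access time (AMAT) for a 3-level hierarchy.
 * Latencies (in cycles) are the round-trip figures from the text;
 * the hit rates are assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
    double l1 = 2.0, l2 = 15.0, dram = 60.0;   /* cycles            */
    double h1 = 0.95, h2 = 0.80;               /* assumed hit rates */

    double amat = h1 * l1
                + (1.0 - h1) * (h2 * l2 + (1.0 - h2) * dram);
    printf("AMAT = %.2f cycles", amat);        /* 3.10 with these   */
    return 0;
}
a.fi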
a.pp
As was intimated in the description of the primary data cache,
load and store instructions were implemented as non-blocking operations.
For example, a primary data cache miss caused by a load did not stall
the pipe until a later instruction attempted to use the destination
register of the load as a source operand.
Because the cache miss could be answered by either the secondary cache
or the DRAM array,
load instructions could complete out of order.
When coupled with the non-atomic behavior of the pipeline, 
this could cause
multiple outstanding pagefaults to occur,
and resulted in considerable grief for the
operating system.
While the non-blocking loads and stores added significantly
to the complexity of both the hardware and software,
the system simulator clearly showed that this
feature had a dramatic impact on machine performance
and that the work in the kernel was well worth the effort.
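a.pp
A minimal sketch of the ``stall on use'' policy implied above follows:
a load miss marks its destination register busy, and the pipe stalls
only when a later instruction reads a busy register.  The data structure
and names are invented for illustration; the real P1 scoreboard is not
described here.
a.nf
#include <stdbool.h>

#define NREGS 32
static bool busy[NREGS];   /* set while a load miss is outstanding */

static void issue_load(int rd)    { busy[rd] = true;  }
static void load_complete(int rd) { busy[rd] = false; }

/* Checked before issuing any instruction reading rs1 and rs2. */
static bool must_stall(int rs1, int rs2)
{
    return busy[rs1] || busy[rs2];
}

int main(void)
{
    issue_load(8);                  /* load into r8 misses       */
    bool s1 = must_stall(8, 9);     /* r8 still in flight: stall */
    load_complete(8);               /* miss data arrives         */
    bool s2 = must_stall(8, 9);     /* no longer stalls          */
    return (s1 && !s2) ? 0 : 1;
}
a.fi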
a.pp
Another area impacted by the speed disparity between the processor and
memory system was the architecture
of the memory management unit (MMU) \- both in the page table structure and the
translation look-aside buffer (TLB).
In some circles, with processors operating at more modest performance levels,
it is considered reasonable to implement an MMU
with only a software-reloaded TLB; but for a machine like the P1,
such a design is untenable for performance reasons.
The P1 was designed to be a supercomputer and to run processes
requiring large address spaces, so the TLB-miss processing speed,
directly related to the average page-table walk length, was an
important issue.
The ``reference MMU'' for the P1's underlying RISC processor uses
a short-circuit 3-level page table with 4Kbyte leaf pages.
The problem with this design (from the P1 perspective)
is that the deep page table tree required a large and rather clever
TLB design to prevent multiple memory references when walking the
page tables for a TLB miss.
While CMOS implementations of the reference MMU easily have enough
transistors to support the architecture,
the required cleverness was beyond the means of the GaAs
implementation technology used in the P1.
Simulation studies of alternative designs led back to a two-level version of
the reference MMU using 8Kbyte pages.  This approach allowed the TLB
to be divided in half, dedicating half to level-one entries and half
to level-two entries.  The resulting design allowed processes
with very large address
spaces to exhibit very low TLB miss rates, 
and when a miss did occur, the mean page-table walk length was well
under one entry.
The entire MMU design effort was a case where the synergy of developing
the kernel and the hardware in parallel proved extremely valuable.
Several ``strong candidate'' designs were actually coded into the
architectural simulator being used for kernel development and were evaluated
by running programs (including the operating system)
before the final design was frozen.
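a.pp
For concreteness, the sketch below shows a two-level walk with 8 Kbyte
pages on a 32-bit virtual address.  The 8/11/13 bit split and the flag
bits are illustrative assumptions; the actual field widths of the P1
tables are not given here.
a.nf
#include <stdint.h>

#define L1_INDEX(va) (((va) >> 24) & 0xffu)   /* top 8 bits    */
#define L2_INDEX(va) (((va) >> 13) & 0x7ffu)  /* next 11 bits  */
#define PAGE_OFF(va) ((va) & 0x1fffu)         /* 13-bit offset */

/* Two-level walk; a ``short-circuit'' leaf at level one maps a
 * large region and ends the walk early, which is what keeps the
 * mean walk length under one entry on a TLB miss. */
uint32_t translate(const uint32_t *l1, uint32_t va)
{
    uint32_t e1 = l1[L1_INDEX(va)];
    if (e1 & 1u)                    /* assumed leaf flag at level 1 */
        return (e1 & 0xff000000u) | (va & 0x00ffffffu);
    const uint32_t *l2 = (const uint32_t *)(uintptr_t)(e1 & ~0x3u);
    return (l2[L2_INDEX(va)] & ~0x1fffu) | PAGE_OFF(va);
}
a.fi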
a.pp
In summary, building a machine as fast as we intended the P1 to be
requires all the tricks you can think of, and probably more.
It also requires a great deal of attention to matters which
are second-order effects (if not third!) on slower machines.  
In fact, you almost end up standing on your head;
first-order and higher-order issues almost seem to be reversed
in this world \- things which matter a lot in slower or more
cost-sensitive designs are often down in the noise, and things which 
most designs don't attempt to do anything about matter a great deal.
a.sh 1 "Great Expectations \- What UNIX Wants from Hardware"
a.pp
UNIX systems at least as early as the Sixth Edition
used their hardware in interesting
and aggressive ways.
The technique used by the V6 system to grow the stack segment
on demand assumed instruction restart capability which
was easy to do on the DEC PDP-11/45 and PDP-11/70 because of the
splendid support in the MMU.
Providing the equivalent capability on other members of the PDP-11 product line
was considerably less trivial.
The early microprocessor-based UNIX implementations (V7 from UNISOFT
on the Codata 3200, as an example) had to deal with the unavailability
of instruction restart on Motorola 68000 processors.  Various tricks
were devised, but the general approach was to add a ``sacrificial
instruction'' to the procedure prolog code which did a probe of
the address space designed
to provoke a stack-growing protection trap if required.
The instruction could not be salvaged, so the MMU trap recovery code
looked at the offending instruction to see if it had the designated
sacrificial form and seemed to be just probing the stack.  If so,
then the instruction was just ignored.  This is the
first example known to the author where a restricted 
but unprivileged instruction sequence was used to paper over an inadequacy
of the underlying hardware behavior.
Other UNIX implementations have used similar work-arounds to deal with
various problems, often relating to MMU and trap recovery behavior,
but the days of those tricks covering for inadequate hardware support
are over.
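a.pp
The following is a C rendering of the sacrificial-probe idea described
above; the real mechanism was a designated 68000 instruction form in the
prolog, so this fragment only illustrates the intent.
a.nf
/* Touch the lowest byte of the new stack frame so that any
 * stack-growth fault happens at one known, recognizable spot.
 * The trap handler recognizes the probe, grows the stack, and
 * simply skips the (sacrificial) instruction. */
static void probe_stack(volatile char *new_frame_low)
{
    char c = *new_frame_low;   /* may fault; never otherwise used */
    (void)c;
}
a.fi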
a.pp
Modern UNIX systems such as System V Release 4 demand much
more from the underlying hardware.
While one might argue academically that copy-on-write is not strictly
required for a System V Release 4 implementation, it is most certainly required
if the implementation is not to be at a serious disadvantage in the
real marketplace.  Furthermore, support for facilities such as copy-on-write
must not exact excessive performance penalties.
a.pp
Adding to the complexity of providing these facilities at high speed
are new processor architectures which specify
what we call ``atomic trap observance.''
Atomic traps appear to happen as part of the instruction execution.
For example, a load instruction which pagefaults is specified as
causing the fault, rather than completing the load,
as part of the instruction execution.
The machine must begin the trap processing instead of executing the next
instruction after a pagefault.
a.pp
The inclusion of this atomic behavior in the architecture is understandable, 
particularly if you have ever worked on fault recovery
code on a complex machine.
The beguilingly simple specification is a blessing for the low-level kernel
programmer because it makes fault recovery quite
straightforward and
generally makes instruction restart quite easy to achieve.
However, if the MMU which detects the pagefault is 3 or 4 pipeline stages
removed from the instruction fetch unit,
and, because of out-of-order completion,
at least two of those pipeline stages can irrevocably
alter registers
which were used to form the effective address of the load or store,
then the operating system is faced with several serious problems.
a.pp
The first problem is simply assuring the fundamental semantic correctness
of operations
such as copy-on-write by ``re-effecting'' out-of-order instructions
(since one cannot just back up the program counter!).
The approach taken in the P1 was to replace the usual ``MMU Registers''
which record the necessary trap recovery information
with a pagefault queue whose
entries provide enough information to ``re-effect'' the faulting
instruction in spite of its out-of-order completion.
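a.pp
The exact layout of a fault-queue entry is not given here; the struct
below is an illustrative guess at the minimum state needed to
``re-effect'' a faulted access.
a.nf
#include <stdint.h>

struct pf_entry {
    uint32_t fault_va;    /* effective address that faulted         */
    uint32_t pc;          /* address of the faulting instruction    */
    uint32_t insn;        /* instruction image, for re-effecting    */
    uint32_t store_data;  /* data to be written, if it was a store  */
    uint8_t  is_store;    /* distinguishes loads from stores        */
    uint8_t  dest_reg;    /* destination register, if it was a load */
};
a.fi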
a.pp
For the private purposes of the operating system kernel, the MMU Fault Queue
provided all the support needed to recover from pagefaults and continue,
and it worked quite well on the simulator.
The simulator used by the kernel group actually allowed more asynchrony than
the real P1, thereby catching a few subtle bugs which would have been
much harder to find running ``at speed'' on a real P1.
The real nightmare, however, related to facilities afforded to
User Programs.
a.pp
Modern UNIX systems (or possibly more aptly, systems gene-spliced with TENEX)
now have virtual memory systems which let user programs
actively participate in memory management functions
by allowing them to explicitly manipulate their memory mappings.
This capability is not a problem in and of itself, but rather,
serves as the courier of an engraved invitation to Hell.
In this Brave New World, programs may register
signal handlers for pagefaults and in turn, 
they expect to take the fault, diddle some mapping information,
and return from the fault to
carry on as if nothing happened.
All the clever design work to make the P1 go fast and still let
the operating system transparently make things come out
right seemed to be torpedoed by this one requirement.
Several loud voices suggested that such programs should just get what they
so richly deserved,
but when it was pointed out that the Bourne shell was 
one of the most commonly-executed offenders*,
that alternative was clearly unacceptable.
a.(f
* The Bourne shell uses \fIsbrk()\fP to
manipulate the mappings instead of \fImmap()\fP, but it does catch the 
appropriate signal
and expects to return and restart the failing instruction nonetheless.
a.)f
The stated business requirements for absolute binary compatibility
were also persuasive.
Even more unacceptable, however, was slowing down the entire machine
by a large factor just to support infrequent exception handling.
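a.pp
In miniature, the offending pattern looks like the sketch below: catch
the fault, grow the break, and return so the faulted access is re-run.
This is a deliberately minimal rendering of the behavior described in
the footnote, not the Bourne shell's actual allocator.
a.nf
#include <signal.h>
#include <unistd.h>

static void grow(int sig)
{
    (void)sig;
    sbrk(4096);             /* extend the data segment...          */
    signal(SIGSEGV, grow);  /* re-arm (old-style one-shot signals) */
}                           /* ...and return: the access restarts  */

int main(void)
{
    signal(SIGSEGV, grow);
    char *p = sbrk(0);      /* current break                       */
    p[10] = 1;              /* past the break: fault, grow, retry  */
    return 0;
}
a.fi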
a.pp
Interestingly enough, most architectures already have ``non-atomic'' or
deferred traps generated by long-running coprocessor instructions
(usually floating-point), and programs
seem to have resigned themselves to dealing with them
as a matter of course.
An interesting case to consider is what must be done in a language with
frame-based exception handling (Ada or Modula-3) to deal with returning from a
subroutine before all the floating-point instructions issued
in the subroutine have finished.
The unfinished instructions are still subject to the exception-handling
wrapper around the invoking subroutine call. Therefore, an explicit instruction
to synchronize the state of the floating-point coprocessor must be used
in the procedure return sequence
to prevent the integer pipeline sequence from completing too soon
(thereby abolishing the
exception-frame environment while incomplete coprocessor instructions
could still cause exceptions).
This need to be explicitly aware of the potential asynchrony 
has resulted in lowered expectations
as to exactly what information can be revealed about the
exact state of the machine when the trap is induced, and
lowered expectations about what a program can do as a result
of a trap (delivered to the user program as a UNIX signal),
at least for coprocessor traps.
The problem with atomic trap behavior for loads and stores is that it
essentially requires synchronizing the entire memory system pipe on \fIeach\fP
load or store instruction, and this completely defeats all the effort
to pipeline those functions in order to minimize latency by maximizing overlap.
It is unfortunate that processor architectures in general
seem to be headed toward
more synchrony and lock-step determinism in such behaviors,
when it can be readily shown that reducing synchrony and determinism
is a powerful way to improve performance.
a.pp
Another expectation that UNIX systems have about machines has to
do with the way kernel and user address spaces co-exist.
The UNIX system at issue clearly wanted the kernel to be mapped
into the upper one-half gigabyte of the user's 4 gigabyte virtual
address space, leaving 3.5 gigabytes for user programs.
Indeed, the System V Release 4 ABI  for the P1's underlying RISC architecture
requires that the valid user virtual address space be
3.5 gigabytes starting at location zero, so changing the virtual
memory layout in a substantial way was never a realistic option.
Since the primary caches use virtual address tags for reasons of speed,
redundantly mapping
the kernel into each user address space caused the
\fIentire\fP kernel to alias in every address space.
Since the primary cache also had process-id context bits in the tags
to prevent unnecessary flushing and cache-wiping, a naive
implementation would 
require flushes on every context switch to ensure kernel
coherency.
The first alternative (discarded rather quickly) was simply marking the
entire kernel ``no-cache'' in the page tables, but the performance
impact was too severe.
The alternative chosen was to modify the MMU behavior equations so that the
uppermost .5 gigabytes of virtual space always got processed
as if in context zero, with the proper considerations given
to the supervisor or user mode of the processor.
This prevented aliasing by forcing the kernel to only appear
in one context.
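a.pp
In effect, the modified behavior equations amount to one extra term in
the context match, sketched below (the constant and the names are
illustrative; protection still depends on the supervisor or user mode of
the processor).
a.nf
#include <stdint.h>

#define KERNEL_BASE 0xE0000000u   /* top 0.5 GB of the 4 GB space */

/* Kernel addresses always match as context zero, so the kernel
 * appears in exactly one context and cannot alias in the
 * virtually-tagged primary caches. */
static unsigned effective_context(uint32_t va, unsigned cur_ctx)
{
    return (va >= KERNEL_BASE) ? 0 : cur_ctx;
}
a.fi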
a.pp
The advantage of this design is that it allowed aggressive
cache-flush prevention algorithms which were added to the virtual memory
system to perform quite effectively.
The disadvantage is that to some degree, the decision permanently forces
a considerable chunk of user virtual space to be dedicated to 
kernel use, whether or not it is needed down the road.
The implications of not making these decisions, however,
were considered to be much worse, based both on calculations and
tests validating the effectiveness of the
cache-flush prevention algorithms.
a.sh 1 "Great Assumptions \- What Programs Think UNIX Provides"
a.pp
The complexity of the UNIX virtual machine assumed by programs
(and only partially documented by the system call interface
manual pages) only seems to be increasing.
In the halcyon days of the Sixth Edition,
the manual page for the \fIsignal()\fP system call warned that signals
were primarily a way for a suddenly-dying program to discover its
assailant,
and not the event-based interprocess communications
mechanism they have become of late.
Admittedly this author would be the last person to complain about
signals becoming reliable, but rather, the complaint is that
the flexibility of the interface has not tracked the richness
of use.
Again, the issue of user programs intercepting pagefault traps
with a signal handler was at the center of the problem.
a.pp
In order to resolve the problem of the significant performance impact
caused by the architecture specifying atomic trap behavior for pagefaults,
a cooperative effort between Prisma and the purveyor
of the architecture produced a revision to the architectural specification
which provided that \fIdeferred\fP traps, already legal for
floating point operations, could also be legally generated by certain
additional kinds of exceptions (e.g., pagefaults), but \fIonly\fP when
a User Program has not requested visibility of the trap.
On the face of it, this sounds like a reasonable compromise.
The system is allowed to do whatever it wants, ripping along
as fast as possible, as long as no one asks to see the details.
On the other hand, if a process asks to watch the entrails
in action, the machine must behave in such a way as
to allow the operating system to reveal to the interested User Program
all the relevant gore about what
was happening, and in such a way that the user program can diddle
with the state and have the diddling take immediate effect.
And how does a program request to view the entrails?
By registering a signal handler, of course.
a.pp
This still sounds just fine \- programs that ask get hurt; those
that don't, don't.
There is still a serious flaw in this model, however.
A great many programs (all programs linked with the
FORTRAN run-time system, to pick just one group at random) 
routinely catch \fIall\fP signals in an attempt to print
the moral equivalent of:

	yourprogram bailing out near line 1 with Segmentation Violation

all in the name of ``user-friendliness.''
The program isn't going to try and recover from the disaster;
it only wants to 
report the fatal mistake and then die peaceably.
By registering the signal handler for the appropriate signal, however,
it incurs all the penalties of a program expecting to examine every trap while
trying to do its own demand paging.
As was said above, in the case of the Prisma P1, these
penalties have a quite serious performance impact.
The crux of the problem is that a
user of the \fIsignal()\fP system call doesn't have any way to request
a specific
``quality of service'' \- so the system must provide the most expensive,
guaranteed-to-be-sufficient
service, when the barest minimum would be perfectly adequate in many cases.
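a.pp
The pattern at issue, in miniature: the program catches everything just
to print an epitaph and die, and registering the handlers is all it
takes to buy the most expensive trap service.  (The loop bound of 32
stands in for the system's actual signal count.)
a.nf
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void epitaph(int sig)
{
    fprintf(stderr, "yourprogram bailing out with signal %d", sig);
    exit(1);
}

int main(void)
{
    for (int sig = 1; sig < 32; sig++)
        signal(sig, epitaph);   /* catch *all* signals */
    /* ... the program's real work goes here ... */
    return 0;
}
a.fi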
a.sh 1 "Great Consternations \- Life at the Boundary Conditions"
a.pp
No UNIX Programmer's Manual known to the author contains
detailed information about what questions a signal handler can expect to ask
of the machine;  about what state information must be revealed in
response to those questions;
nor about how a signal handler can expect to affect the state of
the running program, particularly by modifying the machine state revealed 
to the signal handler and then ``returning'' from the signal handler.
Indeed, in some programs, the signal handler doesn't ever return \-
it does a \fIlongjmp()\fP to forcibly redirect the control
flow of the program!
The ANSI C standard attempts some limited statement of allowable
behavior 
applicable in the broadest, most portable cases which must
work under essentially any operating system, but it is insufficient.
The current state of confusion exists because under specific
operating systems, certain
behaviors have been discovered to work
and they have largely been perpetuated, even if undocumented.
One solution is to categorize ``quality
of service'' for signal handlers, thereby
allowing a handler to register its intent and its requirements
as to the completeness of information needed by the handler.
In this way, a program need not pay in reduced performance for
more information than it needs.
a.pp
Prisma implemented a very limited (lame?)
version of this quality-of-service notion
by providing a new system call named \fIsigdefer()\fP for manipulating a mask
which indicates that certain traps should remain
deferred even if the program registers a signal handler for them.
This mask is potentially ``sticky'' across \fIexec()\fP system calls,
thereby providing support for a \fBsigdefer\fP command
which takes another command as its argument.  The \fBsigdefer\fP command
sets the ``force deferred signal'' mask for the indicated
signals and then runs the command, thereby allowing it to run
unimpeded by \fIsignal()\fP calls intended to only print messages.
While knowledgeable programs can use the new system call
directly, the \fBsigdefer\fP command allows existing binaries
to run at full speed when the user decides it is safe.
Within the kernel group,
the \fBsigdefer\fP command was called, at various times,
\fBgofast\fP and \fBbenchmark\fP.
This approach is admittedly a bit disreputable, but it solved the
problem until someone defines a quality of service interface for
signal handlers.
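a.pp
As a concrete illustration, the fragment below sketches how a
knowledgeable program might use \fIsigdefer()\fP directly; the prototype
and the mask encoding shown are illustrative assumptions rather than the
exact Prisma interface.
a.nf
#include <signal.h>

extern int sigdefer(long mask);    /* assumed (Prisma-specific) prototype */

int main(void)
{
    /* Keep pagefault-style traps deferred even if a handler is
     * later registered for SIGSEGV; the encoding is hypothetical. */
    sigdefer(1L << (SIGSEGV - 1));
    /* ... run at full speed; a catch-and-print handler for
     * SIGSEGV no longer forces atomic trap observance ... */
    return 0;
}
a.fi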
a.pp
In return for specifying some behavior more flexibly,
this author would like to encourage the use of the word
\fIundefined\fP more frequently in standards and architectural
behavior specifications.
While \fIdefined\fP means that a certain behavior is guaranteed,
the author intends \fIundefined\fP to mean: 
a.(q
Even if you can somehow discover how an \fIundefined\fP facility
behaves on an implementation, you are expressly prohibited
from using that knowledge in any program because any implementation
is completely free to give you a different answer EVERY time
you use it!
a.)q
Maybe the threat is enough to make people stick to the
\fIdefined\fP behavior, but it is delicious to contemplate
making an \fIundefined\fP behavior actually work differently each time!
a.sh 1 "Comments and Conclusions"
a.pp
Building fast computers is an interesting and exciting challenge,
particularly when you first discover that \fIEverything You Know
Is Wrong!\fP
In doing the kernel for the P1,
the biggest single cause of grief was not the complexity arising from
nominal behavior of the machine
(even including multiple outstanding pagefaults),
but rather the fact that programs,
especially in signal handlers,
are allowed to ask arbitrary
questions and request arbitrary behavior of the machine at very
inopportune moments.
Given the current operating system interfaces (ABI specifications),
answering the questions and implementing the requested behavior can
only be done at the expense of serious performance penalties.
These penalties can unfortunately
be inflicted upon unsuspecting programs just trying to be friendly
with the user,
rather than only on those programs truly interested in exactly what the
machine is doing on a cycle-by-cycle basis.
a.pp
Historically, UNIX programs have taken a quite naive view of the world:
system calls always work (who checks error returns?),
\fImalloc()\fP never fails,
and a signal handler is free to crack stack-frames and diddle saved
register values to its heart's content with perfectly predictable
(or at least, experimentally discoverable on a specific machine) results.
Well, maybe you should check the return value from \fIunlink()\fP;
some machines rap your knuckles quite smartly
for dereferencing through Location Zero;
just maybe catching that signal is a lot more expensive than you
thought; and most certainly, meddling with the saved state to achieve
a particular effect is much harder than you thought.
In the future, machines will continue to get faster and more internally
concurrent, if not more parallel,
and continuing this fantasy of the world being
atomically lock-step-perfect simply flies in the face of the physical reality
of fast machines.
The sooner we can disabuse programs and ABI specifications
of this fantasy, the sooner we can really exploit the machines
we could build.
a.pp
The kernel for the Prisma P1 contained a lot of very creative work;
this paper has described only a small part of the issues we faced
in trying to make a workstation-tuned operating system safe for
a serious supercomputer.
The virtual memory system, the filesystem, the scheduler (see [Essick]),
the basic I/O system and fundamental system resource management were all
modified in various ways.
a.pp
One final note:
The kernel for the Prisma P1 was up and running in two forms.
A complete P1 kernel except for the lowest-level machine-specific
MMU hardware support and I/O system code ran on a production basis
on all the large compute servers at Prisma, day-in and day-out.
This system contained over 95% of the code in the ready-for-hardware P1 kernel.
The entire P1 kernel, complete with P1-specific MMU and I/O
system code,
was running solidly on the P1 Architectural Simulator under extreme
stress-test loads.
a.sh 1 "Acknowledgements"
a.pp
The author did not learn any of these things alone; rather, they
were learned in a very creative and rewarding group effort.
Brian Berliner, Ray Essick, Bill Fuller, Tony Schene,
and the author comprised the Prisma P1 Kernel Team.
Bill Wallace made the Prisma software group a great place to work.
Chris ``Roz'' Rozewski and Rob Ober created the MMU
and the memory system for the Prisma P1,
and Mike Baird's group designed the I/O system.
The author is very grateful to all of them for their contributions
and the chance to work with them.
a.sh 1 "References"
a.ip Essick
Raymond B. Essick, ``An Event-based Fair Share Scheduler,''
Proceedings of the Winter 1990 USENIX Conference,
Washington, DC, January 1990.  USENIX Association, Berkeley, CA.
a//E*O*F odell//
chmod u=rw,g=rw,o=r odell
 
echo Inspecting for damage in transit...
temp=/tmp/shar$$; dtemp=/tmp/.shar$$
trap "rm -f $temp $dtemp; exit" 0 1 2 3 15
cat > $temp <<\!!!
     860    6620   41450 odell
!!!
wc  odell | sed 's=[^ ]*/==' | diff -b $temp - >$dtemp
if [ -s $dtemp ]
then echo "Ouch [diff of wc output]:" ; cat $dtemp
else echo "No problems found."
fi
exit 0


