This application is a continuation of U.S. patent application Ser. No. 10/436,340, filed May 13, 2003, which is a continuation of U.S. patent application Ser. No. 09/534,745, filed Mar. 24, 2000, now U.S. Pat. No. 6,643,765, which is a continuation of U.S. patent application Ser. No. 09/382,402, filed Aug. 24, 1999, now U.S. Pat. No. 6,295,599, and which is a continuation-in-part of U.S. patent application Ser. No. 09/169,963, filed Oct. 13, 1998, now U.S. Pat. No. 6,006,318, which is a continuation of U.S. patent application Ser. No. 08/754,827, filed Nov. 22, 1996, now U.S. Pat. No. 5,822,603, which is a division of U.S. patent application Ser. No. 08/516,036, filed Aug. 16, 1995, now U.S. Pat. No. 5,742,840.
This application is a continuation of U.S. patent application Ser. No. 11/511,466, filed Aug. 29, 2006, which is a continuation of U.S. patent application Ser. No. 10/646,787, filed Aug. 25, 2003, now U.S. Pat. No. 7,216,217, which is a continuation of U.S. patent application Ser. No. 09/922,319, filed Aug. 2, 2001, which is a continuation of U.S. patent application Ser. No. 09/382,402, filed Aug. 24, 1999, now U.S. Pat. No. 6,295,599, which claims the benefit of priority to Provisional Application No. 60/097,635 filed Aug. 24, 1998, and is a continuation-in-part of U.S. patent application Ser. No. 09/169,963, filed Oct. 13, 1998, now U.S. Pat. No. 6,006,318, which is a continuation of U.S. patent application Ser. No. 08/754,827, filed Nov. 22, 1996 now U.S. Pat. No. 5,822,603, which is a divisional of U.S. patent application Ser. No. 08/516,036, filed Aug. 16, 1995 now U.S. Pat. No. 5,742,840.
The contents of all the U.S. patent applications and provisional applications listed above are hereby incorporated by reference including their appendices in their entirety.
The present invention relates to general purpose processor architectures, and particularly relates to general purpose processor architectures capable of executing group operations.
The performance level of a processor, and particularly a general purpose processor, can be estimated from the multiple of a plurality of interdependent factors: clock rate, gates per clock, number of operands, operand and data path width, and operand and data path partitioning. Clock rate is largely influenced by the choice of circuit and logic technology, but is also influenced by the number of gates per clock. Gates per clock is how many gates in a pipeline may change state in a single clock cycle. This can be reduced by inserting latches into the data path: when the number of gates between latches is reduced, a higher clock is possible. However, the additional latches produce a longer pipeline length, and thus come at a cost of increased instruction latency. The number of operands is straightforward; for example, by adding with carry-save techniques, three values may be added together with little more delay than is required for adding two values. Operand and data path width defines how much data can be processed at once; wider data paths can perform more complex functions, but generally this comes at a higher implementation cost. Operand and data path partitioning refers to the efficient use of the data path as width is increased, with the objective of maintaining substantially peak usage.
Embodiments of the invention pertain to systems and methods for enhancing the utilization of a general purpose processor by adding classes of instructions. These classes of instructions use the contents of general purpose registers as data path sources, partition the operands into symbols of a specified size, perform operations in parallel, catenate the results and place the catenated results into a general-purpose register. Some embodiments of the invention relate to a general purpose microprocessor which has been optimized for processing and transmitting media data streams through significant parallelism.
Some embodiments of the present invention provide a system and method for improving the performance of general purpose processors by including the capability to execute group operations involving multiple floating-point operands. In one embodiment, a programmable media processor comprises a virtual memory addressing unit, a data path, a register file comprising a plurality of registers coupled to the data path, and an execution unit coupled to the data path capable of executing group-floating point operations in which multiple floating-point operations stored in partitioned fields of one or more of the plurality of registers are operated on to produce catenated results. The group floating-point operations may involve operating on at least two of the multiple floating-point operands in parallel. The catenated results may be returned to a register, and general purpose registers may used as operand and result registers for the floating-point operations. In some embodiments the execution unit may also be capable of performing group floating-point operations on floating-point data of more than one precision. In some embodiments the group floating-point operations may include group add, group subtract, group compare, group multiply and group divide arithmetic operations that operate on catenated floating-point data. In some embodiments, the group floating-point operations may include group multiply-add, group scale-add, and group set operations that operate on catenated floating-point data.
In one embodiment, the execution unit is also capable of executing group integer instructions involving multiple integer operands stored in partitioned fields of registers. The group integer operations may involve operating on at least two of the multiple integer operands in parallel. The group integer operations may include group add, group subtract, group compare, and group multiply arithmetic operations that operate on catenated integer data.
In one embodiment, the execution unit is capable of performing group data handling operations, including operations that copy, operations that shift, operations that rearrange and operations that resize catenated integer data stored in a register and return catenated results. The execution unit may also be configurable to perform group data handling operations on integer data having a symbol width of 8 bits, group data handling operations on integer data having a symbol width of 16 bits, and group data handling operations on integer data having a symbol width of 32 bits. In one embodiment, the operations are controlled by values in a register operand. In one embodiment, the operations are controlled by values in the instruction.
In one embodiment, the multi-precision execution unit is capable of executing a Galois field instruction operation.
In one embodiment, the multi-precision execution unit is configurable to execute a plurality of instruction streams in parallel from a plurality of threads, and the programmable media processor further comprises a register file associated with each thread executing in parallel on the multi-precision execution unit to support processing of the plurality of threads. In some embodiments, the multi-precision execution unit executes instructions from the plurality of instruction streams in a round-robin manner. In some embodiments, the processor ensures only one thread from the plurality of threads can handle an exception at any given time.
Some embodiments of the present invention provide a multiplier array that is fully used for high precision arithmetic, but is only partly used for other, lower precision operations. This can be accomplished by extracting the high-order portion of the multiplier product or sum of products, adjusted by a dynamic shift amount from a general register or an adjustment specified as part of the instruction, and rounded by a control value from a register or instruction portion. The rounding may be any of several types, including round-to-nearest/even; toward zero, floor, or ceiling. Overflows are typically handled by limiting the result to the largest and smallest values that can be accurately represented in the output result.
When an extract is controlled by a register, the size of the result can be specified, allowing rounding and limiting to a smaller number of bits than can fit in the result. This permits the result to be scaled for use in subsequent operations without concern of overflow or rounding. As a result, performance is enhanced. In those instances where the extract is controlled by a register, a single register value defines the size of the operands, the shift amount and size of the result, and the rounding control. By placing such control information in a single register, the size of the instruction is reduced over the number of bits that such an instruction would otherwise require, again improving performance and enhancing processor flexibility. Exemplary instructions are Ensemble Convolve Extract, Ensemble Multiply Extract, Ensemble Multiply Add Extract, and Ensemble Scale Add Extract. With particular regard to the Ensemble Scale Add Extract Instruction, the extract control information is combined in a register with two values used as scalar multipliers to the contents of two vector multiplicands. This combination reduces the number of registers otherwise required, thus reducing the number of bits required for the instruction.
In one embodiment, the processor performs load and store instructions operable to move values between registers and memory. In one embodiment, the processor performs both instructions that verify alignment of memory operands and instructions that permit memory operands to be unaligned. In one embodiment, the processor performs store multiplex instructions operable to move to memory a portion of data contents controlled by a corresponding mask contents. In one embodiment, this masked storage operation is performed by indivisibly reading-modifying-writing a memory operand.
In one embodiment, all processor, memory and interface resources are directly accessible to high-level language programs. In one embodiment, assembler codes and high-level language formats are specified to access enhanced instructions. In one embodiment interface and system state is memory mapped, so that it can be manipulated by compiled code. In one embodiment, software libraries provide other operations required by the ANSI/IEEE floating-point standard. In one embodiment, software conventions are employed at software module boundaries, in order to permit the combination of separately compiled code and to provide standard interfaces between application, library and system software. In one embodiment, instruction scheduling is performed by a compiler.
FIG. 1 is a system level diagram showing the functional blocks of a system according to the present invention.
FIG. 2 is a matrix representation of a wide matrix multiply in accordance with one embodiment of the present invention.
FIG. 3 is a further representation of a wide matrix multiple in accordance with one embodiment of the present invention.
FIG. 4 is a system level diagram showing the functional blocks of a system incorporating a combined Simultaneous Multi Threading and Decoupled Access from Execution processor in accordance with one embodiment of the present invention.
FIG. 5 illustrates a wide operand in accordance with one embodiment of the present invention.
FIG. 6 illustrates an approach to specifier decoding in accordance with one embodiment of the present invention.
FIG. 7 illustrates in operational block form a Wide Function Unit in accordance with one embodiment of the present invention.
FIG. 8 illustrates in flow diagram form the Wide Microcache control function.
FIG. 9 illustrates Wide Microcache data structures.
FIGS. 10 and 11 illustrate a Wide Microcache control.
FIG. 12 is a timing diagram of a decoupled pipeline structure in accordance with one embodiment of the present invention.
FIG. 13 further illustrates the pipeline organization of FIG. 12.
FIG. 14 is a diagram illustrating the basic organization of the memory management system according to the present embodiment of the invention.
FIG. 15 illustrates the physical address of an LTB entry for thread th, entry en, byte b.
FIG. 16 illustrates a definition for AccessPhysicalLTB.
FIG. 17 illustrates how various 16-bit values are packed together into a 64-bit LTB entry.
FIG. 18 illustrates global access as fields of a control register.
FIG. 19 shows how a single-set LTB context may be further simplified by reserving the implementation of the lm and la registers.
FIG. 20 shows the partitioning of the virtual address space if the largest possible space is reserved for an address space identifier.
FIG. 21 shows how the LTB protect field controls the minimum privilege level required for each memory action of read (r), write (w), execute (x), and gateway (g), as well as memory and cache attributes of write allocate (wa), detail access (da), strong ordering (so), cache disable (cd), and write through (wt).
FIG. 22 illustrates a definition for LocalTranslation.
FIG. 23 shows how the low-order GT bits of the th value are ignored, reflecting that 2GT threads share a single GTB.
FIG. 24 illustrates a definition for AccessPhysicalGTB.
FIG. 25 illustrates the format of a GTB entry.
FIG. 26 illustrates a definition for GlobalAddressTranslation.
FIG. 27 illustrates a definition for GTBUpdateWrite.
FIG. 28 shows how the low-order GT bits of the th value are ignored, reflecting that 2GT threads share single GTB registers.
FIG. 29 illustrates the registers GTBLast, GTBFirst, and GTBBump.
FIG. 30 illustrates a definition for AccessPhysicalGTBRegisters.
FIGS. 31A-31C illustrate Group Boolean instructions in accordance with an exemplary embodiment of the present invention.
FIG. 31D illustrates Group Multiplex instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 32A-32C illustrate Group Add instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 33A-33C illustrate Group Subtract and Group Set instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 34A-34C illustrate Ensemble Divide and Ensemble Multiply instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 35A-35C illustrate Group Compare instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 36A-36C illustrate Ensemble Unary instructions in accordance with an exemplary embodiment of the present invention.
FIG. 37 illustrates exemplary functions that are defined for use within the detailed instruction definitions in other sections.
FIGS. 38A-38C illustrate Ensemble Floating-Point Add, Ensemble Floating-Point Divide, and Ensemble Floating-Point Multiply instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 38D-38F illustrate Ensemble Floating-Point Multiply Add instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 38G-38I illustrate Ensemble Floating-Point Scale Add instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 39A-39C illustrate Ensemble Floating-Point Subtract instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 39D-39G illustrate Group Set Floating-point instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 40A-40C illustrate Group Compare Floating-point instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 41A-41C illustrate Ensemble Unary Floating-point instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 42A-42D illustrate Ensemble Multiply Galois Field instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 43A-43D illustrate Compress, Expand, Rotate, and Shift instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 43E-43G illustrate Shift Merge instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 43H-43J illustrate Compress Immediate, Expand Immediate, Rotate Immediate, and Shift Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 43K-43M illustrate Shift Merge Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 44A-44D illustrate Crossbar Extract instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 44E-44K illustrate Ensemble Extract instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 45A-45F illustrate Deposit and Withdraw instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 45G-45J illustrate Deposit Merge instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 46A-46E illustrate Shuffle instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 47A-47C illustrate Swizzle instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 47D-47E illustrate Select instructions in accordance with an exemplary embodiment of the present invention.
FIG. 48 is a pin summary describing the functions of various pins in accordance with the one embodiment of the present invention.
FIGS. 49A-49G present electrical specifications describing AC and DC parameters in accordance with one embodiment of the present invention.
FIGS. 50A-50C illustrate Load instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 51A-51C illustrate Load Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 52A-52C illustrate Store and Store Multiplex instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 53A-53C illustrate Store Immediate and Store Multiplex Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIGS. 54A-54D illustrate Data-Handling Operations in accordance with an exemplary embodiment of the present invention.
FIG. 54F illustrates Procedure Calling Conventions in accordance with an exemplary embodiment of the present invention.
FIG. 54G illustrates alignment withing the dp region in accordance with an exemplary embodiment of the present invention.
FIG. 54H illustrates gateway with pointers to code and data spaces in accordance with an exemplary embodiment of the present invention.
FIGS. 55-56 illustrate an expected rate at which memory requests are serviced in accordance with an exemplary embodiment of the present invention.
FIG. 57F is a pinout diagram in accordance with an exemplary embodiment of the present invention.
FIG. 58A-58C illustrate Always Reserved instructions in accordance with an exemplary embodiment of the present invention.
FIG. 59A-59C illustrate Address instructions in accordance with an exemplary embodiment of the present invention.
FIG. 60A-60C illustrate Address Compare instructions in accordance with an exemplary embodiment of the present invention.
FIG. 61A-61C illustrate Address Copy Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 62A-62C illustrate Address Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 63A-63C illustrate Address Immediate Reversed instructions in accordance with an exemplary embodiment of the present invention.
FIG. 64A-64C illustrate Address Reversed instructions in accordance with an exemplary embodiment of the present invention.
FIG. 65A-65C illustrate Address Shift Left Immediate Add instructions in accordance with an exemplary embodiment of the present invention.
FIG. 66A-66C illustrate Address Shift Left Immediate Subtract instructions in accordance with an exemplary embodiment of the present invention.
FIG. 67A-67C illustrate Address Shift Left Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 68A-68C illustrate Address Ternary instructions in accordance with an exemplary embodiment of the present invention.
FIG. 69A-69C illustrate Branch instructions in accordance with an exemplary embodiment of the present invention.
FIG. 70A-70C illustrate Branch Back instructions in accordance with an exemplary embodiment of the present invention.
FIG. 71A-71C illustrate Branch Barrier instructions in accordance with an exemplary embodiment of the present invention.
FIG. 72A-72C illustrate Branch Conditional instructions in accordance with an exemplary embodiment of the present invention.
FIG. 73A-73C illustrate Branch Conditional Floating-Point instructions in accordance with an exemplary embodiment of the present invention.
FIG. 74A-74C illustrate Branch Conditional Visibility Floating-Point instructions in accordance with an exemplary embodiment of the present invention.
FIG. 75A-75C illustrate Branch Down instructions in accordance with an exemplary embodiment of the present invention.
FIG. 76A-76C illustrate Branch Gateway instructions in accordance with an exemplary embodiment of the present invention.
FIG. 77A-77C illustrate Branch Halt instructions in accordance with an exemplary embodiment of the present invention.
FIG. 78A-78C illustrate Branch Hint instructions in accordance with an exemplary embodiment of the present invention.
FIG. 79A-79C illustrate Branch Hint Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 80A-80C illustrate Branch Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 81A-81C illustrate Branch Immediate Link instructions in accordance with an exemplary embodiment of the present invention.
FIG. 82A-82C illustrate Branch Link instructions in accordance with an exemplary embodiment of the present invention.
FIG. 83A-83C illustrate Store Double Compare Swap instructions in accordance with an exemplary embodiment of the present invention.
FIG. 84A-84C illustrate Store Immediate Inplace instructions in accordance with an exemplary embodiment of the present invention.
FIG. 85A-85C illustrate Store Inplace instructions in accordance with an exemplary embodiment of the present invention.
FIG. 86A-86C illustrate Group Add Halve instructions in accordance with an exemplary embodiment of the present invention.
FIG. 87A-87C illustrate Group Copy Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 88A-88C illustrate Group Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 89A-89C illustrate Group Immediate Reversed instructions in accordance with an exemplary embodiment of the present invention.
FIG. 90A-90C illustrate Group Inplace instructions in accordance with an exemplary embodiment of the present invention.
FIG. 91A-91C illustrate Group Shift Left Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 92A-92C illustrate Group Shift Left Immediate Subtract instructions in accordance with an exemplary embodiment of the present invention.
FIG. 93A-93C illustrate Group Subtract Halve instructions in accordance with an exemplary embodiment of the present invention.
FIG. 94A-94C illustrate Ensemble instructions in accordance with an exemplary embodiment of the present invention.
FIG. 95A-95E illustrate Ensemble Convolve Extract Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 96A-96E illustrate Ensemble Convolve Floating-Point instructions in accordance with an exemplary embodiment of the present invention.
FIG. 97A-97G illustrate Ensemble Extract Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 98A-98F illustrate Ensemble Extract Immediate Inplace instructions in accordance with an exemplary embodiment of the present invention.
FIG. 99A-99C illustrate Ensemble Inplace instructions in accordance with an exemplary embodiment of the present invention.
FIG. 100A-100E illustrate Wide Multiply Matrix instructions in accordance with an exemplary embodiment of the present invention.
FIG. 101A-101E illustrate Wide Multiply Matrix Extract instructions in accordance with an exemplary embodiment of the present invention.
FIG. 102A-102E illustrate Wide Multiply Matrix Extract Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 103A-103E illustrate Wide Multiply Matrix Floating-Point Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 104A-104D illustrate Wide Multiply Matrix Galois Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 105A-105C illustrate Wide Switch Immediate instructions in accordance with an exemplary embodiment of the present invention.
FIG. 106A-106C illustrate Wide Translate instructions in accordance with an exemplary embodiment of the present invention.
In various embodiments of the invention, a computer processor architecture, referred to here as MicroUnity's Zeus Architecture is presented. MicroUnity's Zeus Architecture describes general-purpose processor, memory, and interface subsystems, organized to operate at the enormously high bandwidth rates required for broadband applications.
The Zeus processor performs integer, floating point, signal processing and non-linear operations such as Galois field, table lookup and bit switching on data sizes from 1 bit to 128 bits. Group or SIMD (single instruction multiple data) operations sustain external operand bandwidth rates up to 512 bits (i.e., up to four 128-bit operand groups) per instruction even on data items of small size. The processor performs ensemble operations such as convolution that maintain full intermediate precision with aggregate internal operand bandwidth rates up to 20,000 bits per instruction. The processor performs wide operations such as crossbar switch, matrix multiply and table lookup that use caches embedded in the execution units themselves to extend operands to as much as 32768 bits. All instructions produce at most a single 128-bit register result, source at most three 128-bit registers and are free of side effects such as the setting of condition codes and flags. The instruction set design carries the concept of streamlining beyond Reduced Instruction Set Computer (RISC) architectures, to simplify implementations that issue several instructions per machine cycle.
The Zeus memory subsystem provides 64-bit virtual and physical addressing for UNIX, Mach, and other advanced OS environments. Separate address instructions enable the division of the processor into decoupled access and execution units, to reduce the effective latency of memory to the pipeline. The Zeus cache supplies the high data and instruction issue rates of the processor, and supports coherency primitives for scaleable multiprocessors. The memory subsystem includes mechanisms for sustaining high data rates not only in block transfer modes, but also in non-unit stride and scattered access patterns.
The Zeus interface subsystem is designed to match industry-standard “Socket 7” protocols and pin-outs. In this way, Zeus can make use of the immense infrastructure of the PC for building low-cost systems. The interface subsystem is modular, and can be replaced with appropriate protocols and pin-outs for lower-cost and higher-performance systems.
The goal of the Zeus architecture is to integrate these processor, memory, and interface capabilities with optimal simplicity and generality. From the software perspective, the entire machine state consists of a program counter, a single bank of 64 general-purpose 128-bit registers, and a linear byte-addressed shared memory space with mapped interface registers. All interrupts and exceptions are precise, and occur with low overhead.
Examples discussed herein are for Zeus software and hardware developers alike, and defines the interface at which their designs must meet. Zeus pursues the most efficient tradeoffs between hardware and software complexity by making all processor, memory, and interface resources directly accessible to high-level language programs.
Conformance
To ensure that Zeus systems may freely interchange data, user-level programs, system-level programs and interface devices, the Zeus system architecture reaches above the processor level architecture.
Optional Areas
Optional areas include:
Number of processor threads
Size of first-level cache memories
Existence of a second-level cache
Size of second-level cache memory
Size of system-level memory
Existence of certain optional interface device interfaces
Upward-Compatible Modifications
Additional devices and interfaces, not covered by this standard may be added in specified regions of the physical memory space, provided that system reset places these devices and interfaces in an inactive state that does not interfere with the operation of software that runs in any conformant system. The software interface requirements of any such additional devices and interfaces must be made as widely available as this architecture specification.
Unrestricted Physical Implementation
Nothing in this specification should be construed to limit the implementation choices of the conforming system beyond the specific requirements stated herein. In particular, a computer system may conform to the Zeus System Architecture while employing any number of components, dissipate any amount of heat, require any special environmental facilities, or be of any physical size.
Common Elements
Notation
The descriptive notation used in this document is summarized in the table below:
| descriptive notation | |
| x + y | two's complement addition of x and y. Result is the same size as |
| the operands, and operands must be of equal size. | |
| x − y | two's complement subtraction of y from x. Result is the same |
| size as the operands, and operands must be of equal size. | |
| x * y | two's complement multiplication of x and y. Result is the same |
| size as the operands, and operands must be of equal size. | |
| x/y | two's complement division of x by y. Result is the same size as |
| the operands, and operands must be of equal size. | |
| x & y | bitwise and of x and y. Result is same size as the operands, and |
| operands must be of equal size. | |
| x | y | bitwise or of x and y. Result is same size as the operands, and |
| operands must be of equal size. | |
| x ˆ y | bitwise exclusive-OR of x and y. Result is same size as the |
| operands, and operands must be of equal size. | |
| ˜x | bitwise inversion of x. Result is same size as the operand. |
| x = y | two's complement equality comparison between x and y. Result |
| is a single bit, and operands must be of equal size. | |
| x ≠ y | two's complement inequality comparison between x and y. |
| Result is a single bit, and operands must be of equal size. | |
| x < y | two's complement less than comparison between x and y. Result |
| is a single bit, and operands must be of equal size. | |
| x ≧ y | two's complement greater than or equal comparison between x |
| and y. Result is a single bit, and operands must be of equal size. | |
| {square root over (x)} | floating-point square root of x |
| x || y | concatenation of bit field x to left of bit field y |
| x y | binary digit x repeated, concatenated y times. Size of result is y. |
| x y | extraction of bit y (using little-endian bit numbering) from |
| value x. Result is a single bit. | |
| x y . . . z | extraction of bit field formed from bits y through z of value x. |
| Size of result is y − z + 1; if z > y, result is an empty string, | |
| x?y:z | value of y, if x is true, otherwise value of z. Value of x is a |
| single bit. | |
| x | bitwise assignment of x to value of y |
| Sn | signed, two's complement, binary data format of n bytes |
| Un | unsigned binary data format of n bytes |
| Fn | floating-point data format of n bytes |
The ordering of bits in this document is always little-endian, regardless of the ordering of bytes within larger data structures. Thus, the least-significant bit of a data structure is always labeled 0 (zero), and the most-significant bit is labeled as the data structure size (in bits) minus one.
Memory
Zeus memory is an array of 2 64 bytes, without a specified byte ordering, which is physically distributed among various components.
Byte
A byte is a single element of the memory array, consisting of 8 bits:
Byte Ordering
Larger data structures are constructed from the concatenation of bytes in either little-endian or big-endian byte ordering. A memory access of a data structure of size s at address i is formed from memory bytes at addresses i through i+s−1. Unless otherwise specified, there is no specific requirement of alignment: it is not generally required that i be a multiple of s. Aligned accesses are preferred whenever possible, however, as they will often require one fewer processor or memory clock cycle than unaligned accesses.
With little-endian byte ordering, the bytes are arranged as:
With big-endian byte ordering, the bytes are arranged as:
Zeus memory is byte-addressed, using either little-endian or big-endian byte ordering. For consistency with the bit ordering, and for compatibility with x86 processors, Zeus uses little-endian byte ordering when an ordering must be selected. Zeus load and store instructions are available for both little-endian and big-endian byte ordering. The selection of byte ordering is dynamic, so that little-endian and big-endian processes, and even data structures within a process, can be intermixed on the processor.
Memory Read/Load Semantics
Zeus memory, including memory-mapped registers, must conform to the following requirements regarding side-effects of read or load operations:
A memory read must have no side-effects on the contents of the addressed memory nor on the contents of any other memory.
Memory Write/Store Semantics
Zeus memory, including memory-mapped registers, must conform to the following requirements regarding side-effects of read or load operations:
A memory write must affect the contents of the addressed memory so that a memory read of the addressed memory returns the value written, and so that a memory read of a portion of the addressed memory returns the appropriate portion of the value written.
A memory write may affect or cause side-effects on the contents of memory not addressed by the write operation, however, a second memory write of the same value to the same address must have no side-effects on any memory; memory write operations must be idempotent.
Zeus store instructions that are weakly ordered may have side-effects on the contents of memory not addressed by the store itself; subsequent load instructions which are also weakly ordered may or may not return values which reflect the side-effects.
Data
Zeus provides eight-byte (64-bit) virtual and physical address sizes, and eight-byte (64-bit) and sixteen-byte (128-bit) data path sizes, and uses fixed-length four-byte (32-bit) instructions. Arithmetic is performed on two's-complement or unsigned binary and ANSI/IEEE standard 754-1985 conforming binary floating-point number representations.
Fixed-Point Data
Bit
A bit is a primitive data element:
Peck
A peck is the catenation of two bits:
Nibble
A nibble is the catenation of four bits:
Byte
A byte is the catenation of eight bits, and is a single element of the memory array:
Doublet
A doublet is the catenation of 16 bits, and is the catenation of two bytes:
Quadlet
A quadlet is the catenation of 32 bits, and is the catenation of four bytes:
Octlet
A octlet is the catenation of 64 bits, and is the catenation of eight bytes:
Hexlet
A hexlet is the catenation of 128 bits, and is the catenation of sixteen bytes:
Triclet
A triclet is the catenation of 256 bits, and is the catenation of thirty-two bytes:
Address
Zeus addresses, both virtual addresses and physical addresses, are octlet quantities.
Floating-Point Data
Zeus's floating-point formats are designed to satisfy ANSI/IEEE standard 754-1985: Binary Floating-point Arithmetic. Standard 754 leaves certain aspects to the discretion of implementers: additional precision formats, encoding of quiet and signaling NaN values, details of production and propagation of quiet NaN values. These aspects are detailed below.
Zeus adds additional half-precision and quad-precision formats to standard 754's single-precision and double-precision formats. Zeus's double-precision satisfies standard 754's precision requirements for a single-extended format, and Zeus's quad-precision satisfies standard 754's precision requirements for a double-extended format.
Each precision format employs fields labeled s (sign), e (exponent), and f (fraction) to encode values that are (1) NaN: quiet and signaling, (2) infinities: (−1)ˆ s ∞, (3) normalized numbers: (−1)ˆ s 2ˆ e-bias (1.f), (4) denormalized numbers: (−1)ˆ s 2ˆ 1-bias (0.f), and (5) zero: (−1)ˆ s 0.
Quiet NaN values are denoted by any sign bit value, an exponent field of all one bits, and a non-zero fraction with the most significant bit set. Quiet NaN values generated by default exception handling of standard operations have a zero sign bit, an exponent field of all one bits, a fraction field with the most significant bit set, and all other bits cleared.
Signaling NaN values are denoted by any sign bit value, an exponent field of all one bits, and a non-zero fraction with the most significant bit cleared.
Infinite values are denoted by any sign bit value, an exponent field of all one bits, and a zero fraction field.
Normalized number values are denoted by any sign bit value, an exponent field that is not all one bits or all zero bits, and any fraction field value. The numeric value encoded is (−1)ˆ s 2ˆ e-bias (1.f). The bias is equal the value resulting from setting all but the most significant bit of the exponent field, half: 15, single: 127, double: 1023, and quad: 16383.
Denormalized number values are denoted by any sign bit value, an exponent field that is all zero bits, and a non-zero fraction field value. The numeric value encoded is (−1)ˆ s 2ˆ 1-bias (0.f).
Zero values are denoted by any sign bit value, and exponent field that is all zero bits, and a fraction field that is all zero bits. The numeric value encoded is (−1)ˆ s 0. The distinction between +0 and −0 is significant in some operations.
Half-Precision Floating-Point
Zeus half precision uses a format similar to standard 754's requirements, reduced to a 16-bit overall format. The format contains sufficient precision and exponent range to hold a 12-bit signed integer.
Single-Precision Floating-Point
Zeus single precision satisfies standard 754's requirements for “single.”
Double-Precision Floating-Point
Zeus double precision satisfies standard 754's requirements for “double.”
Quad-Precision Floating-Point
Zeus quad precision satisfies standard 754's requirements for “double extended,” but has additional fraction precision to use 128 bits.
Zeus Processor
MicroUnity's Zeus processor provides the general-purpose, high-bandwidth computation capability of the Zeus system. Zeus includes high-bandwidth data paths, register files, and a memory hierarchy. Zeus's memory hierarchy includes on-chip instruction and data memories, instruction and data caches, a virtual memory facility, and interfaces to external devices. Zeus's interfaces in the initial implementation are solely the “Super Socket 7” bus, but other implementations may have different or additional interfaces.
Architectural Framework
The Zeus architecture defines a compatible framework for a family of implementations with a range of capabilities. The following implementation-defined parameters are used in the rest of the document in boldface. The value indicated is for MicroUnity's first Zeus implementation.
| Parameter | Interpretation | Value | Range of legal values |
| T | number of execution threads | 4 | 1 ≦ T ≦ 31 |
| CE | log 2 cache blocks in first-level | 9 | 0 ≦ CE ≦ 31 |
| cache | |||
| CS | log 2 cache blocks in first-level | 2 | 0 ≦ CS ≦ 4 |
| cache set | |||
| CT | existence of dedicated tags in | 1 | 0 ≦ CT ≦ 1 |
| first-level cache | |||
| LE | log 2 entries in local TB | 0 | 0 ≦ LE ≦ 3 |
| LB | Local TB based on base | 1 | 0 ≦ LB ≦ 1 |
| register | |||
| GE | log 2 entries in global TB | 7 | 0 ≦ GE ≦ 15 |
| GT | log 2 threads which share a | 1 | 0 ≦ GT ≦ 3 |
| global TB | |||
The first implementation of Zeus uses “socket 7 ” protocols and pinouts.
Instruction
Assembler Syntax
Instructions are specified to Zeus assemblers and other code tools (assemblers) in the syntax of an instruction mnemonic (operation code), then optionally white space (blanks or tabs) followed by a list of operands.
The instruction mnemonics listed in this specification are in upper case (capital) letters, assemblers accept either upper case or lower case letters in the instruction mnemonics. In this specification, instruction mnemonics contain periods (“.”) to separate elements to make them easier to understand; assemblers ignore periods within instruction mnemonics. The instruction mnemonics are designed to be parsed uniquely without the separating periods.
If the instruction produces a register result, this operand is listed first. Following this operand, if there are one or more source operands, is a separator which may be a comma(“,”), equal (“=”), or at-sign (“@”). The equal separates the result operand from the source operands, and may optionally be expressed as a comma in assembler code. The at-sign indicates that the result operand is also a source operand, and may optionally be expressed as a comma in assembler code. If the instruction specification has an equal-sign, an at-sign in assembler code indicates that the result operand should be repeated as the first source operand (for example, “A.ADD.I r4@5” is equivalent to “A.ADD.I r4=r4,5”). Commas always separate the remaining source operands.
The result and source operands are case-sensitive; upper case and lower case letters are distinct. Register operands are specified by the names r0 (or r00) through r63 (a lower case “r” immediately followed by a one or two digit number from 0 to 63), or by the special designations of “lp” for “r0,” “dp” for “r1,” “fp” for “r62,” and “sp” for “r63.” Integer-valued operands are specified by an optional sign (−) or (+) followed by a number, and assemblers generally accept a variety of integer-valued expressions.
Instruction Structure
A Zeus instruction is specifically defined as a four-byte structure with the little-endian ordering shown below. It is different from the quadlet defined above because the placement of instructions into memory must be independent of the byte ordering used for data structures. Instructions must be aligned on four-byte boundaries; in the diagram below, i must be a multiple of 4.
Gateway
A Zeus gateway is specifically defined as an 8-byte structure with the little-endian ordering shown below. A gateway contains a code address used to securely invoke a system call or procedure at a higher privilege level. Gateways are marked by protection information specified in the TB. Gateways must be aligned on 8-byte boundaries; in the diagram below, i must be a multiple of 8.
The gateway contains two data items within its structure, a code address and a new privilege level:
The virtual memory system can be used to designate a region of memory as containing gateways. Other data may be placed within the gateway region, provided that if an attempt is made to use the additional data as a gateway, that security cannot be violated. For example, 64-bit data or stack pointers which are aligned to at least 4 bytes and are in little-endian byte order have pl=0, so that the privilege level cannot be raised by attempting to use the additional data as a gateway.
User State
The user state consists of hardware data structures that are accessible to all conventional compiled code. The Zeus user state is designed to be as regular as possible, and consists only of the general registers, the program counter, and virtual memory. There are no specialized registers for condition codes, operating modes, rounding modes, integer multiple/divide, or floating-point values.
General Registers
Zeus user state includes 64 general registers. All are identical; there is no dedicated zero-valued register, and there are no dedicated floating-point registers.
Some Zeus instructions have 64-bit register operands. These operands are sign-extended to 128 bits when written to the register file, and the low-order 64 bits are chosen when read from the register file.
Definition
| def val | |
| case size of | |
| 64: | |
| val | |
| 128: | |
| val | |
| endcase | |
| enddef | |
| def RegWrite(rn, size, val) | |
| case size of | |
| 64: | |
| REG[rn] | |
| 128: | |
| REG[rn] | |
| endcase | |
| enddef | |
The program counter contains the address of the currently executing instruction. This register is implicitly manipulated by branch instructions, and read by branch instructions that save a return address in a general register.
Privilege Level
The privilege level register contains the privilege level of the currently executing instruction. This register is implicitly manipulated by branch gateway and branch down instructions, and read by branch gateway instructions that save a return address in a general register.
Program Counter and Privilege Level
The program counter and privilege level may be packed into a single octlet. This combined data structure is saved by the Branch Gateway instruction and restored by the Branch Down instruction.
System State
The system state consists of the facilities not normally used by conventional compiled code. These facilities provide mechanisms to execute such code in a fully virtual environment. All system state is memory mapped, so that it can be manipulated by compiled code.
Fixed-Point
Zeus provides load and store instructions to move data between memory and the registers, branch instructions to compare the contents of registers and to transfer control from one code address to another, and arithmetic operations to perform computation on the contents of registers, returning the result to registers.
Load and Store
The load and store instructions move data between memory and the registers. When loading data from memory into a register, values are zero-extended or sign-extended to fill the register. When storing data from a register into memory, values are truncated on the left to fit the specified memory region.
Load and store instructions that specify a memory region of more than one byte may use either little-endian or big-endian byte ordering: the size and ordering are explicitly specified in the instruction. Regions larger than one byte may be either aligned to addresses that are an even multiple of the size of the region or of unspecified alignment: alignment checking is also explicitly specified in the instruction.
Load and store instructions specify memory addresses as the sum of a base general register and the product of the size of the memory region and either an immediate value or another general register. Scaling maximizes the memory space which can be reached by immediate offsets from a single base general register, and assists in generating memory addresses within iterative loops. Alignment of the address can be reduced to checking the alignment of the first general register.
The load and store instructions are used for fixed-point data as well as floating-point and digital signal processing data; Zeus has a single bank of registers for all data types.
Swap instructions provide multithread and multiprocessor synchronization, using indivisible operations: add-swap, compare-swap, multiplex-swap, and double-compare-swap. A store-multiplex operation provides the ability to indivisibly write to a portion of an octlet. These instructions always operate on aligned octlet data, using either little-endian or big-endian byte ordering.
Branch
The fixed-point compare-and-branch instructions provide all arithmetic tests for equality and inequality of signed and unsigned fixed-point values. Tests are performed either between two operands contained in general registers, or on the bitwise and of two operands. Depending on the result of the compare, either a branch is taken, or not taken. A taken branch causes an immediate transfer of the program counter to the target of the branch, specified by a 12-bit signed offset from the location of the branch instruction. A non-taken branch causes no transfer; execution continues with the following instruction.
Other branch instructions provide for unconditional transfer of control to addresses too distant to be reached by a 12-bit offset, and to transfer to a target while placing the location following the branch into a register. The branch through gateway instruction provides a secure means to access code at a higher privilege level, in a form similar to a normal procedure call.
Addressing Operations
A subset of general fixed-point arithmetic operations is available as addressing operations. These include add, subtract, Boolean, and simple shift operations. These addressing operations may be performed at a point in the Zeus processor pipeline so that they may be completed prior to or in conjunction with the execution of load and store operations in a “superspring” pipeline in which other arithmetic operations are deferred until the completion of load and store operations.
Execution Operations
Many of the operations used for Digital Signal Processing (DSP), which are described in greater detail below, are also used for performing simple scalar operations. These operations perform arithmetic operations on values of 8-, 16-, 32-, 64-, or 128-bit sizes, which are right-aligned in registers. These execution operations include the add, subtract, boolean and simple shift operations which are also available as addressing operations, but further extend the available set to include three-operand add/subtract, three-operand boolean, dynamic shifts, and bit-field operations.
Floating-Point
Zeus provides all the facilities mandated and recommended by ANSI/IEEE standard 754-1985: Binary Floating-point Arithmetic, with the use of supporting software.
Branch Conditionally
The floating-point compare-and-branch instructions provide all the comparison types required and suggested by the IEEE floating-point standard. These floating-point comparisons augment the usual types of numeric value comparisons with special handling for NaN (not-a-number) values. A NaN value compares as “unordered” with respect to any other value, even that of an identical NaN value.
Zeus floating-point compare-branch instructions do not generate an exception on comparisons involving quiet or signaling NaN values. If such exceptions are desired, they can be obtained by combining the use of a floating-point compare-set instruction, with either a floating-point compare-branch instruction on the floating-point operands or a fixed-point compare-branch on the set result.
Because the less and greater relations are anti-commutative, one of each relation that differs from another only by the replacement of an L with a G in the code can be removed by reversing the order of the operands and using the other code. Thus, an L relation can be used in place of a G relation by swapping the operands to the compare-branch or compare-set instruction.
No instructions are provided that branch when the values are unordered. To accomplish such an operation, use the reverse condition to branch over an immediately following unconditional branch, or in the case of an if-then-else clause, reverse the clauses and use the reverse condition.
The E relation can be used to determine the unordered condition of a single operand by comparing the operand with itself.
The following floating-point compare-branch relations are provided as instructions:
| compare-branch relations | |||||||
| Branch taken | |||||||
| Mnemonic | if values compare as: | Exception if | |||||
| code | C-like | Unordered | Greater | Less | Equal | unordered | invalid |
| E | == | F | F | F | T | no | no |
| LG | <> | F | T | T | F | no | no |
| L | < | F | F | T | F | no | no |
| GE | >= | F | T | F | T | no | no |
The compare-set floating-point instructions provide all the comparison types supported as branch instructions. Zeus compare-set floating-point instructions may optionally generate an exception on comparisons involving quiet or signaling NaNs.
The following floating-point compare-set relations are provided as instructions:
| compare-set relations | |||||||
| Mnemonic | Result if values compare as: | Exception if | |||||
| code | C-like | Unordered | Greater | Less | Equal | unordered | invalid |
| E | == | F | F | F | T | no | no |
| LG | <> | F | T | T | F | no | no |
| L | < | F | F | T | F | no | no |
| GE | >= | F | T | F | T | no | no |
| E.X | == | F | F | F | T | no | yes |
| LG.X | <> | F | T | T | F | no | yes |
| L.X | < | F | F | T | F | yes | yes |
| GE.X | <= | F | T | F | T | yes | yes |
The basic operations supported in hardware are floating-point add, subtract, multiply, divide, square root and conversions among floating-point formats and between floating-point and binary integer formats.
Software libraries provide other operations required by the ANSI/IEEE floating-point standard.
The operations explicitly specify the precision of the operation, and round the result (or check that the result is exact) to the specified precision at the conclusion of each operation. Each of the basic operations splits operand registers into symbols of the specified precision and performs the same operation on corresponding symbols.
In addition to the basic operations, Zeus performs a variety of operations in which one or more products are summed to each other and/or to an additional operand. The instructions include a fused multiply-add (E.MUL.ADD.F), convolve (E.CON.F), matrix multiply (E.MUL.MAT.F), and scale-add (E.SCAL.ADD.F).
The results of these operations are computed as if the multiplies are performed to infinite precision, added as if in infinite precision, then rounded only once. Consequently, these operations perform these operations with no rounding of intermediate results that would have limited the accuracy of the result.
NaN Handling
ANSI/IEEE standard 754-1985 specifies that operations involving a signaling NaN or invalid operation shall, if no trap occurs and if a floating-point result is to be delivered, deliver a quiet NaN as its result. However, it fails to specify what quiet NaN value to deliver.
Zeus operations that produce a floating-point result and do not trap on invalid operations propagate signaling NaN values from operands to results, changing the signaling NaN values to quiet NaN values by setting the most significant fraction bit and leaving the remaining bits unchanged. Other causes of invalid operations produce the default quiet NaN value, where the sign bit is zero, the exponent field is all one bits, the most significant fraction bit is set and the remaining fraction bits are zero bits. For Zeus operations that produce multiple results catenated together, signaling NaN propagation or quiet NaN production is handled separately and independently for each result symbol.
ANSI/IEEE standard 754-1985 specifies that quiet NaN values should be propagated from operand to result by the basic operations. However, it fails to specify which of several quiet NaN values to propagate when more than one operand is a quiet NaN. In addition, the standard does not clearly specify how quiet NaN should be propagated for the multiple-operation instructions provided in Zeus. The standard does not specify the quiet NaN produced as a result of an operand being a signaling NaN when invalid operation exceptions are handled by default. The standard leaves unspecified how quiet and signaling NaN values are propagated though format conversions and the absolute-value, negate and copy operations. This section specifies these aspects left unspecified by the standard.
First of all, for Zeus operations that produce multiple results catenated together, quiet and signaling NaN propagation is handled separately and independently for each result symbol. A quiet or signaling NaN value in a single symbol of an operand causes only those result symbols that are dependent on that operand symbol's value to be propagated as that quiet NaN. Multiple quiet or signaling NaN values in symbols of an operand which influence separate symbols of the result are propagated independently of each other. Any signaling NaN that is propagated has the high-order fraction bit set to convert it to a quiet NaN.
For Zeus operations in which multiple symbols among operands upon which a result symbol is dependent are quiet or signaling NaNs, a priority Rule will determine which NaN is propagated. Priority shall be given to the operand that is specified by a register definition at a lower-numbered (little-endian) bit position within the instruction (rb has priority over rc, which has priority over rd). In the case of operands which are catenated from two registers, priority shall be assigned based on the register which has highest priority (lower-numbered bit position within the instruction). In the case of tie (as when the E.SCAL.ADD scaling operand has two corresponding NaN values, or when a E.MUL.CF operand has NaN values for both real and imaginary components of a value), the value which is located at a lower-numbered (little-endian) bit position within the operand is to receive priority. The identification of a NaN as quiet or signaling shall not confer any priority for selection—only the operand position, though a signaling NaN will cause an invalid operand exception.
The sign bit of NaN values propagated shall be complemented if the instruction subtracts or negates the corresponding operand or (but not and) multiplies it by or divides it by or divides it into an operand which has the sign bit set, even if that operand is another NaN. If a NaN is both subtracted and multiplied by a negative value, the sign bit shall be propagated unchanged.
For Zeus operations that convert between two floating-point formats (INFLATE and DEFLATE), NaN values are propagated by preserving the sign and the most-significant fraction bits, except that the most-significant bit of a signalling NaN is set and (for DEFLATE) the least-significant fraction bit preserved is combined, via a logical-or of all fraction bits not preserved. All additional fraction bits (for INFLATE) are set to zero.
For Zeus operations that convert from a floating-point format to a fixed-point format (SINK), NaN values produce zero values (maximum-likelihood estimate). Infinity values produce the largest representable positive or negative fixed-point value that fits in the destination field. When exception traps are enabled, NaN or Infinity values produce a floating-point exception. Underflows do not occur in the SINK operation, they produce −1, 0 or +1, depending on rounding controls.
For absolute-value, negate, or copy operations, NaN values are propagated with the sign bit cleared, complemented, or copied, respectively. Signalling NaN values cause the Invalid operation exception, propagating a quieted NaN in corresponding symbol locations (default) or an exception, as specified by the instruction.
Floating-Point Functions
Referring to FIG. 37, the following functions are defined for use within the detailed instruction definitions in the following section. In these functions an internal format represents infinite-precision floating-point values as a four-element structure consisting of (1) s (sign bit): 0 for positive, 1 for negative, (2) t (type): NORM, ZERO, SNAN, QNAN, INFINITY, (3) e (exponent), and (4) f: (fraction). The mathematical interpretation of a normal value places the binary point at the units of the fraction, adjusted by the exponent: (−1)ˆ s *(2ˆ e )*f. The function F converts a packed IEEE floating-point value into internal format. The function PackF converts an internal format back into IEEE floating-point format, with rounding and exception control.
Digital Signal Processing
The Zeus processor provides a set of operations that maintain the fullest possible use of 128-bit data paths when operating on lower-precision fixed-point or floating-point vector values. These operations are useful for several application areas, including digital signal processing, image processing and synthetic graphics. The basic goal of these operations is to accelerate the performance of algorithms that exhibit the following characteristics:
Low-Precision Arithmetic
The operands and intermediate results are fixed-point values represented in no greater than 64 bit precision. For floating-point arithmetic, operands and intermediate results are of 16, 32, or 64 bit precision.
The fixed-point arithmetic operations include add, subtract, multiply, divide, shifts, and set on compare.
The use of fixed-point arithmetic permits various forms of operation reordering that are not permitted in floating-point arithmetic. Specifically, commutativity and associativity, and distribution identities can be used to reorder operations. Compilers can evaluate operations to determine what intermediate precision is required to get the specified arithmetic result.
Zeus supports several levels of precision, as well as operations to convert between these different levels. These precision levels are always powers of two, and are explicitly specified in the operation code.
When specified, add, subtract, and shift operations may cause a fixed-point arithmetic exception to occur on resulting conditions such as signed or unsigned overflow. The fixed-point arithmetic exception may also be invoked upon a signed or unsigned comparison.
Sequential Access to Data
The algorithms are or can be expressed as operations on sequentially ordered items in memory. Scatter-gather memory access or sparse-matrix techniques are not required.
Where an index variable is used with a multiplier, such multipliers must be powers of two. When the index is of the form: nx+k, the value of n must be a power of two, and the values referenced should have k include the majority of values in the range 0 . . . n−1. A negative multiplier may also be used.
Vectorizable Operations
The operations performed on these sequentially ordered items are identical and independent. Conditional operations are either rewritten to use Boolean variables or masking, or the compiler is permitted to convert the code into such a form.
Data-Handling Operations
The characteristics of these algorithms include sequential access to data, which permit the use of the normal load and store operations to reference the data. Octlet and hexlet loads and stores reference several sequential items of data, the number depending on the operand precision.
The discussion of these operations is independent of byte ordering, though the ordering of bit fields within octlets and hexlets must be consistent with the ordering used for bytes. Specifically, if big-endian byte ordering is used for the loads and stores, the figures below should assume that index values increase from left to right, and for little-endian byte ordering, the index values increase from right to left. For this reason, the figures indicate different index values with different shades, rather than numbering.
When an index of the nx+k form is used in array operands, where n is a power of 2, data memory sequentially loaded contains elements useful for separate operands. The “shuffle” instruction divides a triclet of data up into two hexlets, with alternate bit fields of the source triclet grouped together into the two results. An immediate field, h, in the instruction specifies which of the two regrouped hexlets to select for the result. For example, two X.SHUFFLE.256 rd=rc,rb,32,128,h operations rearrange the source triclet (c,b) into two hexlets as in FIG. 54A.
In the shuffle operation, two hexlet registers specify the source triclet, and one of the two result hexlets are specified as hexlet register.
The example above directly applies to the case where n is 2. When n is larger, shuffle operations can be used to further subdivide the sequential stream. For example, when n is 4, we need to deal out 4 sets of doublet operands, as shown in FIG. 54B (An example of the use of a four-way deal is a digital signal processing application such as conversion of color to monochrome).
When an array result of computation is accessed with an index of the form nx+k, for n a power of 2, the reverse of the “deal” operation needs to be performed on vectors of results to interleave them for storage in sequential order. The “shuffle” operation interleaves the bit fields of two octlets of results into a single hexlet. For example a X. SHUFFLE.16 operation combines two octlets of doublet fields into a hexlet as shown in FIG. 54C.
For larger values of n, a series of shuffle operations can be used to combine additional sets of fields, similarly to the mechanism used for the deal operations. For example, when n is 4, we need to shuffle up 4 sets of doublet operands, as shown in FIG. 54D (An example of the use of a four-way shuffle is a digital signal processing application such as conversion of monochrome to color).
When the index of a source array operand or a destination array result is negated, or in other words, if of the form nx+k where n is negative, the elements of the array must be arranged in reverse order. The “swizzle” operation can reverse the order of the bit fields in a hexlet. For example, a X.SWIZZLE rd=rc,127,112 operation reverses the doublets within a hexlet as shown in FIG. 47C.
In some cases, it is desirable to use a group instruction in which one or more operands is a single value, not an array. The “swizzle” operation can also copy operands to multiple locations within a hexlet. For example, a X.SWIZZLE 15,0 operation copies the low-order 16 bits to each double within a hexlet.
Variations of the deal and shuffle operations are also useful for converting from one precision to another. This may be required if one operand is represented in a different precision than another operand or the result, or if computation must be performed with intermediate precision greater than that of the operands, such as when using an integer multiply.
When converting from a higher precision to a lower precision, specifically when halving the precision of a hexlet of bit fields, half of the data must be discarded, and the bit fields packed together. The “compress” operation is a variant of the “deal” operation, in which the operand is a hexlet, and the result is an octlet. An arbitrary half-sized sub-field of each bit field can be selected to appear in the result. For example, a selection of bits 19 . . . 4 of each quadlet in a hexlet is performed by the X.COMPRESS rd=rc,16,4 operation as shown in FIG. 43D.
When converting from lower-precision to higher-precision, specifically when doubling the precision of an octlet of bit fields, one of several techniques can be used, either multiply, expand, or shuffle. Each has certain useful properties. In the discussion below, m is the precision of the source operand.
The multiply operation, described in detail below, automatically doubles the precision of the result, so multiplication by a constant vector will simultaneously double the precision of the operand and multiply by a constant that can be represented in m bits.
An operand can be doubled in precision and shifted left with the “expand” operation, which is essentially the reverse of the “compress” operation. For example the X.EXPAND rd=rc,16,4 expands from 16 bits to 32, and shifts 4 bits left as shown in FIG. 54E.
The “shuffle” operation can double the precision of an operand and multiply it by 1 (unsigned only), 2 m or 2 m +1, by specifying the sources of the shuffle operation to be a zeroed register and the source operand, the source operand and zero, or both to be the source operand. When multiplying by 2m, a constant can be freely added to the source operand by specifying the constant as the right operand to the shuffle.
Arithmetic Operations
The characteristics of the algorithms that affect the arithmetic operations most directly are low-precision arithmetic, and vectorizable operations. The fixed-point arithmetic operations provided are most of the functions provided in the standard integer unit, except for those that check conditions. These functions include add, subtract, bitwise Boolean operations, shift, set on condition, and multiply, in forms that take packed sets of bit fields of a specified size as operands. The floating-point arithmetic operations provided are as complete as the scalar floating-point arithmetic set. The result is generally a packed set of bit fields of the same size as the operands, except that the fixed-point multiply function intrinsically doubles the precision of the bit field.
Conditional operations are provided only in the sense that the set on condition operations can be used to construct bit masks that can select between alternate vector expressions, using the bitwise Boolean operations. All instructions operate over the entire octlet or hexlet operands, and produce a hexlet result. The sizes of the bit fields supported are always powers of two.
Galois Field Operations
Zeus provides a general software solution to the most common operations required for Galois Field arithmetic. The instructions provided include a polynomial multiply, with the polynomial specified as one register operand. This instruction can be used to perform CRC generation and checking, Reed-Solomon code generation and checking, and spread-spectrum encoding and decoding.
Software Conventions
The following section describes software conventions that are to be employed at software module boundaries, in order to permit the combination of separately compiled code and to provide standard interfaces between application, library and system software. Register usage and procedure call conventions may be modified, simplified or optimized when a single compilation encloses procedures within a compilation unit so that the procedures have no external interfaces. For example, internal procedures may permit a greater number of register-passed parameters, or have registers allocated to avoid the need to save registers at procedure boundaries, or may use a single stack or data pointer allocation to suffice for more than one level of procedure call.
Register Usage
All Zeus registers are identical and general-purpose; there is no dedicated zero-valued register, and no dedicated floating-point registers. However, some procedure-call-oriented instructions imply usage of registers zero (0) and one (1) in a manner consistent with the conventions described below. By software convention, the non-specific general registers are used in more specific ways.
| register usage | |||
| register | assembler | ||
| number | names | usage | how saved |
| 0 | lp, r0 | link pointer | caller |
| 1 | dp, r1 | data pointer | caller |
| 2-9 | r2-r9 | parameters | caller |
| 10-31 | r10-r31 | temporary | caller |
| 32-61 | r32-r61 | saved | callee |
| 62 | fp, r62 | frame pointer | callee |
| 63 | sp, r63 | stack pointer | callee |
At a procedure call boundary, registers are saved either by the caller or callee procedure, which provides a mechanism for leaf procedures to avoid needing to save registers. Compilers may choose to allocate variables into caller or callee saved registers depending on how their lifetimes overlap with procedure calls.
Procedure Calling Conventions
Procedure parameters are normally allocated in registers, starting from register 2 up to register 9. These registers hold up to 8 parameters, which may each be of any size from one byte to sixteen bytes (hexlet), including floating-point and small structure parameters. Additional parameters are passed in memory, allocated on the stack. For C procedures which use varargs.h or stdarg.h and pass parameters to further procedures, the compilers must leave room in the stack memory allocation to save registers 2 through 9 into memory contiguously with the additional stack memory parameters, so that procedures such as_doprnt can refer to the parameters as an array.
Procedure return values are also allocated in registers, starting from register 2 up to register 9. Larger values are passed in memory, allocated on the stack.
There are several pointers maintained in registers for the procedure calling conventions: lp, sp, dp, fp.
The lp register contains the address to which the callee should return to at the conclusion of the procedure. If the procedure is also a caller, the lp register will need to be saved on the stack, once, before any procedure call, and restored, once, after all procedure calls. The procedure returns with a branch instruction, specifying the lp register.
The sp register is used to form addresses to save parameter and other registers, maintain local variables, i.e., data that is allocated as a LIFO stack. For procedures that require a stack, normally a single allocation is performed, which allocates space for input parameters, local variables, saved registers, and output parameters all at once. The sp register is always hexlet aligned.
The dp register is used to address pointers, literals and static variables for the procedure. The dp register points to a small (approximately 4096-entry) array of pointers, literals, and statically-allocated variables, which is used locally to the procedure. The uses of the dp register are similar to the use of the gp register on a Mips R-series processor, except that each procedure may have a different value, which expands the space addressable by small offsets from this pointer. This is an important distinction, as the offset field of Zeus load and store instructions are only 12 bits. The compiler may use additional registers and/or indirect pointers to address larger regions for a single procedure. The compiler may also share a single dp register value between procedures which are compiled as a single unit (including procedures which are externally callable), eliminating the need to save, modify and restore the dp register for calls between procedures which share the same dp register value.
Load- and store-immediate-aligned instructions, specifying the dp register as the base register, are generally used to obtain values from the dp region. These instructions shift the immediate value by the logarithm of the size of the operand, so loads and stores of large operands may reach farther from the dp register than of small operands. Referring to FIG. 54F, the size of the addressable region is maximized if the elements to be placed in the dp region are sorted according to size, with the smallest elements placed closest to the dp base. At points where the size changes, appropriate padding is added to keep elements aligned to memory boundaries matching the size of the elements. Using this technique, the maximum size of the dp region is always at least 4096 items, and may be larger when the dp area is composed of a mixture of data sizes.
The dp register mechanism also permits code to be shared, with each static instance of the dp region assigned to a different address in memory. In conjunction with position-independent or pc-relative branches, this allows library code to be dynamically relocated and shared between processes.
To implement an inter-module (separately compiled) procedure call, the lp register is loaded with the entry point of the procedure, and the dp register is loaded with the value of the dp register required for the procedure. These two values are located adjacent to each other as a pair of octlet quantities in the dp region for the calling procedure. For a statically-linked inter-module procedure call, the linker fills in the values at link time. However, this mechanism also provides for dynamic linking, by initially filling in the lp and dp fields in the data structure to invoke the dynamic linker. The dynamic linker can use the contents of the lp and/or dp registers to determine the identity of the caller and callee, to find the location to fill in the pointers and resume execution. Specifically, the lp value is initially set to point to an entry point in the dynamic linker, and the dp value is set to point to itself: the location of the lp and dp values in the dp region of the calling procedure. The identity of the procedure can be discovered from a string following the dp pointer, or a separate table, indexed by the dp pointer.
The fp register is used to address the stack frame when the stack size varies during execution of a procedure, such as when using the GNU C alloca function. When the stack size can be determined at compile time, the sp register is used to address the stack frame and the fp register may be used for any other general purpose as a callee-saved register.
Typical Static-Linked, Intra-Module Calling Sequence:
| caller (non-leaf): | |||
| caller: | A.ADDI | sp@-size | // allocate caller stack frame |
| S.I.64.A | lp,sp,off | // save original lp register | |
| ... (callee using same dp as caller) | |||
| B.LINK.I | callee | ||
| ... | |||
| ... (callee using same dp as caller) | |||
| B.LINK.I | callee | ||
| ... | |||
| L.I.64.A | lp=sp,off | // restore original lp register | |
| A.ADDI | sp@size | // deallocate caller stack frame | |
| B | lp | // return |