[0001] The present invention is related to the following application, which is filed on even date herewith, and which is incorporated herein by reference:
[0002] U.S. Pat. Appl. Serial No. ______, filed Dec. 14, 2001, entitled “NODE TRANSLATION AND PROTECTION IN A CLUSTERED MULTIPROCESSOR SYSTEM” (Atty. Docket No. 499.710US1).
[0003] The present invention generally relates to the field of high-speed digital data processing systems and, more particularly, relates to a method of remote address translation which takes place at a user-specified remote node within a multiprocessor network.
[0004] Multiprocessor computer systems comprise a number of processing element nodes connected together by an interconnect network. Typically, each processing element node includes at least one processor, a local memory, and an interface circuit connecting the processing element node to the interconnect network. The interconnect network is used for transmitting packets of information or messages between the processing element nodes.
[0005] Distributed shared memory multiprocessor systems include a number of processing element nodes which share a distributed memory and are located within a single machine. By increasing the number of processing element nodes, or the number of processors within each node, such systems can often be scaled to handle increased demand. In such a system, each processor can directly access all of memory, including its own local memory and the memory of the other (remote) processing element nodes. Typically, the virtual address used for all memory accesses within a distributed shared memory multiprocessor system is translated to a physical address in the requesting processor's translation-lookaside buffer (“TLB”). Thus, the requesting processor's TLB will need to contain address translation information for all of the memory that the processor accesses within the machine, which includes both local and remote memory. This amount of address translation information can be substantial, and can result in much duplication of translation information throughout the multiprocessor system (e.g., if the same page of memory is accessed by 64 different processors, the TLB used by each processor will need to contain an entry for that page).
[0006] Some multiprocessor systems employ block transfer engines to transfer blocks of data from one area of memory to another area of memory. Block transfer engines provide several advantages, such as asynchronous operation (i.e., by operating without further processor involvement after being initially kicked off by the processor, block transfer engines free up the processor to perform other tasks) and faster transfer performance than could be achieved by the processor (e.g., since block transfer engines do not use processor-generated cacheable references, there is less overhead on the coherence protocol of the read-modify-write cycle, and cache blowouts can be avoided).
[0007] Unfortunately, existing block transfer engines suffer from problems that limit their utility. For example, since address translations are performed in on-chip TLBs at the requesting processors, external block transfer engines are prevented from being programmed using virtual addresses. Instead, with existing block transfer engines, user software makes an operating system (OS) call to inform the OS that it wants to transfer a particular length of data from a particular source (specified by its virtual address) to a particular destination (also specified by its virtual address). In response, the OS first checks whether it has address translations for all of the virtual addresses, and then generates separate block-transfer requests for each physical page. For example, if the virtual address range spans 15 physical pages, an OS may have to generate 15 separate queued block-transfer requests to cause 15 separate physical transfers. The large amount of overhead associated with such OS intervention means that much of the advantage that is associated with performing the block transfer in the first place is lost.
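As a rough illustration of this overhead, the following Python sketch splits one virtual-address transfer into pieces that never cross a page boundary in either the source or the destination, which is roughly the work an OS would have to do before queuing physical transfers. The 16 KB page size and the function name `split_into_page_requests` are assumptions made for the example; neither comes from the text.

```python
PAGE_SIZE = 16 * 1024  # assumed page size; the text does not specify one


def split_into_page_requests(src_va, dst_va, length, page_size=PAGE_SIZE):
    """Split a transfer into (src, dst, nbytes) pieces that stay within one page.

    Hypothetical helper: each piece would still need its own address
    translation and its own queued block-transfer request.
    """
    requests = []
    offset = 0
    while offset < length:
        # Bytes left before the source / destination cross a page boundary.
        src_room = page_size - (src_va + offset) % page_size
        dst_room = page_size - (dst_va + offset) % page_size
        chunk = min(length - offset, src_room, dst_room)
        requests.append((src_va + offset, dst_va + offset, chunk))
        offset += chunk
    return requests
```

With page-aligned buffers, a range spanning 15 pages yields 15 requests, matching the example above; misaligned buffers can yield more.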
[0008] Clustered multiprocessor systems include collections of processing machines, with each processing machine including a single processor system or distributed shared memory multiprocessor system. Clustering advantageously limits the scaling required of a single OS, and provides fault containment if one of the machines should suffer a hardware or OS error. In a clustered system, however, memory accesses to remote machines are typically performed via a network interface I/O device that requires OS intervention to send messages, and can target only specific memory buffers that were reserved for this communication at the remote machine. Thus, memory must be specifically “registered” by a user process on the remote machine, which prevents the memory on the remote machine from being accessed arbitrarily. Also, state must be set up on the remote machine to direct the incoming data, or the OS on the remote machine must intervene to handle the data, copying the data at least once. More recently, some network interface cards have been designed to support user-level communication using the VIA, ST or similar “OS bypass” interface. Such approaches, while successful in avoiding OS intervention on communication events, do not unify local and remote memory accesses. Thus, programs must use different access mechanisms for intra-machine and inter-machine communication.
[0009] Thus, there is a need for an address translation mechanism for a multiprocessor system which decreases the amount of address translation information needed to perform local and remote memory accesses, and reduces TLB pressure. There is also a need for an address translation mechanism for a multiprocessor system which decreases the duplication of address translation information throughout the system. There is a further need for an address translation mechanism for a multiprocessor system which is accessible to an external block transfer engine, allows that engine to be directly programmed by user processes using virtual addresses, and performs translations with minimal TLB pressure. There is also a need for a block transfer engine for a multiprocessor system which can perform block transfers without being penalized by the high overhead associated with prior systems. In addition, there is a need for an address translation mechanism for a clustered multiprocessor system which allows access to arbitrary memory on a remote machine, with the remote memory protected by the translation mechanism. Further, there is a need for an address translation mechanism which allows a user process to communicate with other nodes in the local machine, and other nodes in a remote machine, using the same interface.
[0010] According to one aspect of the invention, a method of performing remote address translation in a multiprocessor system includes determining a virtual address at a local node, accessing a local connection table at the local node to produce a system node identifier for a remote node, communicating the virtual address to the remote node, and translating the virtual address to a physical address at the remote node. The translation may include matching the virtual address with an entry of a translation-lookaside buffer at the remote node, and may also use a remote address space number as a qualification or validation for the match.
[0011] According to another aspect of the invention, a mechanism for performing a memory access operation in a multiprocessor system uses a communication engine located in a local processing element node and a translation-lookaside buffer located in a remote processing element node. The communication engine is programmable by a user process to perform a user-level memory access operation using a user-specified virtual address. The translation-lookaside buffer receives the virtual address from the communication engine and translates the virtual address to a physical address, wherein the physical address is used to perform the memory access operation. The memory access operation may involve a block transfer, an atomic memory operation, an immediate data send, or a scalar fill.
[0012] Other aspects of the invention will be apparent upon reading the following detailed description of the invention and viewing the drawings that form a part thereof.
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019] In the following detailed description, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that the embodiments may be combined, or that other embodiments may be utilized and that structural, logical and electrical changes may be made without departing from the spirit and scope of the invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims and their equivalents.
[0020] Referring to
[0021] Referring to
[0022] Each processor
[0023] In one embodiment, processors
[0024] As shown in FIGS.
[0025] As also shown in FIGS.
[0026] Referring to
[0027] In one embodiment, all of the coherence information is passed across the bus in the form of messages, and each processor on the bus “snoops” by monitoring the addresses on the bus and, if it finds the address of data within its own cache, invalidating that cache entry. Other cache coherence schemes can be used as well. Memory/directory interface
[0028] Each SHUB
[0029] CE
[0030] Referring to
[0031] Each CI
[0032] In one embodiment, each CI
[0033] Referring again to
[0034] A user process running on either local processor
[0035] The one or more CI_DATA registers are used to hold immediate data for AMOs and for immediate data sends in an immediate mode of operation (i.e., where a user process supplies the data to be sent to target memory). In one embodiment, the CI_DATA registers hold up to 11 64-bit words of immediate data that may be sent. In other embodiments, the CI_DATA registers may hold less or more immediate data. The CI_SOURCE_VA and CI_DEST_VA registers hold virtual addresses for an (optional) source memory buffer and a destination memory buffer, respectively. The source memory buffer is optional since it is not needed in the immediate mode of operation (since the data being sent is stored in the CI_DATA registers). These virtual addresses are combined with connection descriptors from the CI_COMMAND register to fully specify the memory addresses, with each of the connection descriptors being defined as a handle for the system node identification (SNID) and the address space number (ASN) used to translate the corresponding virtual address.
[0036] The fields of the CI_COMMAND register are shown at reference numeral
[0037] For a scalar fill (i.e., TYPE=1), the memory block specified by the CI_DEST_VA register is filled with a scalar value from the CI_DATA[0] register. For byte and sword (i.e., single 32-bit word) scalar fills, the scalar is taken from the lower bits of the 64-bit CI_DATA[0] register, and the destination address must be naturally aligned. Scalar fills are used, for example, to zero out an area of memory or fill an array with a constant value.
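The scalar fill can be modeled in software as follows. This is only a sketch: the function name, the use of a `bytearray` to stand in for target memory, and the little-endian byte order in memory are assumptions, and the element sizes are taken to be 1 byte, 4 bytes (sword), and 8 bytes (dword).

```python
def scalar_fill(memory, dest, count, elem_size, ci_data0):
    """Fill `count` elements at byte offset `dest` with the low bits of ci_data0.

    Hypothetical model of a TYPE=1 request; `memory` is a bytearray.
    """
    if elem_size not in (1, 4, 8):
        raise ValueError("element size must be 1, 4 or 8 bytes")
    if dest % elem_size != 0:
        raise ValueError("destination must be naturally aligned")
    end = dest + count * elem_size
    if end > len(memory):
        raise ValueError("fill runs past the end of memory")
    # The scalar comes from the lower bits of the 64-bit CI_DATA[0] value.
    scalar = ci_data0 & ((1 << (8 * elem_size)) - 1)
    pattern = scalar.to_bytes(elem_size, "little")  # assumed byte order
    memory[dest:end] = pattern * count
```

For example, filling two swords from `0x1122334455667788` writes `0x55667788` twice, and a fill value of 0 zeroes the block.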
[0038] For an immediate data send (i.e., TYPE=2), the source data is taken from the CI_DATA registers, starting at CI_DATA[0], rather than from memory. For byte and sword transfers, the CI_DATA registers are interpreted as little endian (i.e., byte 0 is stored in bits 7 . . . 0 of CI_DATA[0], etc.). The length of the data transfer is limited to the size of the CI_DATA registers (e.g., <=88 bytes (i.e., <=11 dwords)), and the destination address for the immediate data send must be naturally aligned for the specified data element size.
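The little-endian interpretation of the CI_DATA registers can be sketched as below. The function name `pack_immediate` is hypothetical, and the 11-register limit follows the embodiment described above.

```python
def pack_immediate(payload, num_regs=11):
    """Pack bytes into a list of 64-bit CI_DATA values, little endian.

    Byte 0 of the payload lands in bits 7..0 of CI_DATA[0], byte 8 in
    bits 7..0 of CI_DATA[1], and so on. Unused bytes are left as zero.
    """
    if len(payload) > num_regs * 8:
        raise ValueError("payload exceeds the CI_DATA registers")
    padded = bytes(payload).ljust(num_regs * 8, b"\x00")
    return [
        int.from_bytes(padded[i * 8:(i + 1) * 8], "little")
        for i in range(num_regs)
    ]
```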
[0039] For an AMO (i.e., TYPE=3), the command is taken from the AMO_CMD field of the CI_COMMAND register (the SRC_CD and AMO_CMD fields may share the same space in the CI_COMMAND register since their use is mutually exclusive). The AMO_CMD field is used to write a code that specifies, for example, one or more of the following AMO operations: 0=AtomicAdd; 1=MaskedStore; 4=FetchAdd; 5=MaskedSwap; and 6=CompSwap. Each AMO operation uses one or two immediate operands, which are taken from the CI_DATA[0] and CI_DATA[1] registers, and some of the AMO operations return the previous value from memory in the CI_RESULT register. The address for the AMO is specified by the CI_DEST_VA register, which may be 64-bit aligned. For example, the AtomicAdd operation atomically adds the integer addend (in CI_DATA[0]) to the value in memory, and does not return a result. The MaskedStore operation atomically writes those bits of the StoreValue (in CI_DATA[0]) corresponding to zero bits in the Mask (in CI_DATA[1]) into memory, leaves the other bits of the memory value unchanged, and does not return a result. The FetchAdd operation atomically adds the integer Addend (in CI_DATA[0]) to the value in memory, and returns the old value of the memory (in CI_RESULT). The MaskedSwap operation atomically writes those bits of the Swaperand (in CI_DATA[0]) corresponding to zero bits in the Mask (in CI_DATA[1]) into memory, leaves other bits of the memory value unchanged, and returns the old value (in CI_RESULT). The CompSwap operation compares the Comperand (in CI_DATA[0]) to the value in memory and, if equal, stores the Swaperand in memory, while returning the old value of the memory (in CI_RESULT). These AMO operations are exemplary, and any combination of these and other AMOs may be supported. AMOs are performed on arbitrary, aligned 64-bit words in memory, and are kept coherent with other references.
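The effect of each exemplary AMO on a single 64-bit memory word can be written out as a pure function. This is a sketch of the described semantics only, not of the hardware: atomicity is not modeled, additions are assumed to wrap modulo 2^64, and the Swaperand for CompSwap is assumed to be held in CI_DATA[1], which the text does not state. The function name `apply_amo` is hypothetical.

```python
MASK64 = (1 << 64) - 1


def apply_amo(amo_cmd, mem, data0, data1=0):
    """Return (new_memory_value, ci_result); ci_result is None if nothing is returned."""
    if amo_cmd == 0:  # AtomicAdd: add the addend, no result
        return (mem + data0) & MASK64, None
    if amo_cmd == 1:  # MaskedStore: take StoreValue bits where the Mask bit is 0
        return (mem & data1) | (data0 & ~data1 & MASK64), None
    if amo_cmd == 4:  # FetchAdd: add the addend, return the old value
        return (mem + data0) & MASK64, mem
    if amo_cmd == 5:  # MaskedSwap: like MaskedStore, but return the old value
        return (mem & data1) | (data0 & ~data1 & MASK64), mem
    if amo_cmd == 6:  # CompSwap: data0 is the Comperand, data1 the assumed Swaperand
        return (data1 if mem == data0 else mem), mem
    raise ValueError("unsupported AMO_CMD code")
```

For instance, `apply_amo(4, 10, 5)` models a FetchAdd that leaves 15 in memory and returns the old value 10.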
[0040] The INTERRUPT field of the CI_COMMAND, if set to 1, causes a processor interrupt to be generated upon completion of a request. In response to the interrupt, which may be used with any type of CI request, the processor can call a user-provided interrupt handler. If the INTERRUPT field is cleared (i.e., 0), then no interrupt will be generated.
[0041] If the PHYSICAL field is set to 1 and the CI is set to kernel mode, then the block transfer or AMO takes place using untranslated physical addresses. In this case, the connection descriptors from the SRC_CD and DEST_CD fields of the CI_COMMAND register are ignored, and the CI_SOURCE_VA and CI_DEST_VA registers are interpreted by CE
[0042] If the MERGE field is set to 1, then the present CI request will be merged with the next request. If cleared to 0, then the present CI request is interpreted as a stand-alone request or the last of a series of merged requests, in which case the present request must completely finish before CE
[0043] The SRC_CD field is used by a user process to specify a connection descriptor (CD) for the (optional) source virtual address for a memory copy request. The DEST_CD field is used by a user process to specify a connection descriptor (CD) for the destination virtual address. The operation of the SRC_CD and DEST_CD fields is discussed below.
[0044] To write a request directly into the CI registers, user software (e.g., library code) writes either the CI_DATA register(s) or CI_SOURCE_VA register (depending on the type of request) and the CI_DEST_VA register, and then writes the CI_COMMAND register. Writing the CI_COMMAND register causes the block transfer or AMO operation to be initiated. Upon completion of a CI request, state is saved in the CI_STATUS register, and the CI_RESULT register for certain AMO operations (as described above). No completion status is returned for requests with the MERGE bit set. The CI_STATUS register, and its use, are described below in relation to the CI request flow control and completion status.
[0045] As an alternative to directly programming a CI by writing a transfer request directly to its user-accessible MMRs to initiate a block transfer or AMO, in one embodiment user software (e.g., library code) writes a transfer descriptor to a memory-resident transfer descriptor queue, and then notifies CE
[0046] In order to use a memory-resident transfer descriptor queue, the CI_MEMQ_DEF register is first written to define the base virtual address and size of the queue. This register includes a BASE field and a SIZE field. The BASE field specifies the virtual address of the base of the transfer descriptor queue. This virtual address is translated using the same address translation mechanism used for the transfers themselves (as described below), with CE
[0047] For each CI
[0048] To add a new transfer descriptor to the queue, user software increments the tail pointer by 1, modulo the queue SIZE, writes the transfer descriptor into the memory-resident transfer descriptor queue at the new tail index location, performs a memory barrier, and then writes the updated tail pointer into the CI_MEMQ_TAIL register. Updating the CI_MEMQ_TAIL register notifies CE
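The enqueue sequence just described can be sketched in software as follows. The class and attribute names are hypothetical, a Python list stands in for the memory-resident queue, and a plain attribute stands in for the CI_MEMQ_TAIL register. The sketch does not check for a full queue, since that check depends on the head pointer, which is not part of this step.

```python
class TransferQueueModel:
    """Toy model of the user-side view of a transfer descriptor queue."""

    def __init__(self, size):
        self.size = size              # queue SIZE, in descriptors
        self.slots = [None] * size    # stands in for the memory-resident queue
        self.tail = 0                 # software copy of the tail pointer
        self.ci_memq_tail = 0         # stands in for the CI_MEMQ_TAIL register

    def enqueue(self, descriptor):
        # 1. Increment the tail pointer by 1, modulo the queue SIZE.
        new_tail = (self.tail + 1) % self.size
        # 2. Write the descriptor at the new tail index.
        self.slots[new_tail] = descriptor
        # 3. A memory barrier would go here on real hardware, so that the
        #    descriptor is visible before the tail update.
        # 4. Publish the updated tail pointer.
        self.tail = new_tail
        self.ci_memq_tail = new_tail
        return new_tail
```

Note that, following the text, the first descriptor goes to index 1 rather than index 0, because the tail is incremented before the write.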
[0049] The structure of each memory-resident transfer descriptor is similar in structure to the user-accessible MMRs of each CI
[0050] Each CI
[0051] The current state of each CI
[0052] The read-only HEAD field is the current head pointer for the memory-resident transfer descriptor queue. The current head pointer is initialized to 0 on a write to the CI_MEMQ_DEF register, and is incremented by one, modulo the queue size, as CE
[0053] If a CI
[0054] The READY bit is cleared to 0 by CE
[0055] If the MERGE bit is set in a request command, the CI will be ready to process the next request (which will be treated as a continuation of the current request) as soon as the request packets for the current request have been sent. In this case, if some of the request packets require retransmission due to receiving a negative acknowledgement (i.e., “nack”), these request packets may be interleaved with the new request packets from a new request.
[0056] The completion status for a directly-programmed request is reported by a code in the STATUS field of the CI_STATUS register. The request completion codes include: 0=NO_STATUS (no status; request may be currently processing); 1=SUCCESS (successful completion); 2=PROCESSED (for a request with the MERGE bit set, this bit indicates only that CE
[0057] Upon completing a request from the transfer descriptor queue, CE
[0058] The VIA specification for an interface for cluster-based computing requires that queued transfers for a given connection not be performed after a transfer fails. To provide support for this requirement, the SHUTDOWN bit will be set in the CI_STATUS register if a failure associated with a given connection occurs, and CE
[0059] As noted above, each CI
[0060] Although CE
[0061] In order to perform a block transfer or AMO, CE
[0062] 1. A user process running at a local node provides a “connection descriptor” (CD) and a virtual address (with the “connection descriptor” being defined as a handle that specifies the endpoint node of a virtual connection);
[0063] 2. The CD indexes into a protected “local connection table” (LCT) to produce an endpoint SHUB number (i.e., the local SHUB number if the connection endpoint is the local node, or a remote SHUB number if the connection endpoint is a remote node) and an associated address space number (ASN);
[0064] 3. The virtual address and ASN are communicated to the endpoint SHUB identified by the endpoint SHUB number (i.e., the local SHUB if endpoint is the local node, or the remote SHUB if the endpoint is the remote node);
[0065] 4. The virtual address is translated to a physical address on the endpoint SHUB, as qualified using the ASN (note that the physical address may be mapped to anywhere in the coherence domain of the endpoint SHUB).
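The four steps above can be sketched as a small software model. All names here (`LctEntry`, `ShubModel`, `translate_via_cd`) are hypothetical, the network hop in step 3 is reduced to a dictionary lookup, the TLB is reduced to a dictionary keyed by (ASN, virtual page number), and a 16 KB page is assumed.

```python
from dataclasses import dataclass

PAGE_BITS = 14  # assumed 16 KB pages; not specified at this point in the text


@dataclass(frozen=True)
class LctEntry:
    snid: int  # endpoint SHUB number
    asn: int   # address space number used to qualify the translation


class ShubModel:
    """Toy endpoint SHUB holding a TLB keyed by (ASN, virtual page number)."""

    def __init__(self, snid):
        self.snid = snid
        self.tlb = {}

    def translate(self, asn, va):
        vpn, offset = va >> PAGE_BITS, va & ((1 << PAGE_BITS) - 1)
        ppn = self.tlb.get((asn, vpn))
        return None if ppn is None else (ppn << PAGE_BITS) | offset


def translate_via_cd(lct, shubs, cd, va):
    """Steps 1-4: the user supplies (cd, va); returns (endpoint SNID, physical address)."""
    entry = lct[cd]                  # step 2: the CD indexes the protected LCT
    endpoint = shubs[entry.snid]     # step 3: VA and ASN go to the endpoint SHUB
    return entry.snid, endpoint.translate(entry.asn, va)  # step 4
```

The point of the design is visible in the model: the requesting side stores only the small LCT entry, while the page-level translation state lives in the TLB of whichever SHUB is the endpoint.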
[0066] By performing virtual-to-physical address translations at the remote end (i.e., when the CD specifies that the endpoint of the virtual connection is the remote node), the amount of context that the local SHUB TLB needs to hold will be significantly reduced, which will result in a corresponding improvement in SHUB TLB hit rates. The memory management between the coherence domains of a clustered system is also advantageously decoupled.
[0067] The address translation mechanism used by CE
[0068] Referring again to
[0069] A CD is a handle that specifies the endpoint of a virtual connection. The CD is obtained from the OS by a local user process (e.g., which makes an OS call to set up the CD), and points the process to an address space, thereby providing a handle through which the process can access that space. The CD provides translation and protection for accessing the connection (analogous to the manner in which a file descriptor is used to access an open device). Once initially obtained from the OS, a user process can use the CD as part of a communication request to CE
[0070] The OS maintains an LCT for each user process, which is indexed by the CD. In one embodiment, the LCT is a memory-based table with a plurality of 32-bit entries, and the CD is an offset into the LCT. The LCT, which is set up by the OS and is not writeable by the user, provides extra information which is added to the virtual address provided by the user to allow the endpoint or target SHUB to perform address translations. The extra information provided by each LCT entry
[0071] As described above, each entry of the LCT contains the node number of the connection endpoint (i.e., the SNID), which may be local or on a different machine, and a key which is used for validation at the specified node. The key is sent by hardware to the specified node with the memory request, and is used as part of the remote translation step at the specified node. The key can be used to qualify a translation in a general purpose TLB. Alternatively, the key may be used to access a specific memory translation table, if memory associated with the particular connection has been registered and pinned down.
[0072] The first entry of the LCT (i.e., LCT[0]), pointed to by CD zero, is a special case. In particular, the SNID of LCT[0] always points to the local SHUB, and the ASN of LCT[0] is set to the ASN of the local process' virtual address space. This special case allows CD zero to be used to translate accesses to the CI's memory-resident transfer descriptor queue. CD zero is also often used as the source and/or destination CD for memory transfers since at least one of the source and the destination memory is often local. For example, a local block copy operation (i.e., a block copy operation from one local memory location to another) may use CD zero to specify both the source and destination.
[0073] In one embodiment, a local connection cache (LCC) is provided on the SHUB to maintain a cache of LCT entries. The LCC is hardware-reloaded from the current virtual interface's LCT as necessary. If an access of the LCC does not result in a hit, a BAD_CD error occurs. The entries in the LCC may include a communication interface number (CIN), which must match the CIN associated with the accessing CI to result in a cache hit.
[0074] To support local address translations, each SHUB contains a translation-lookaside buffer (TLB)
[0075] The VALID field of each TLB entry indicates whether that entry is valid (0=no; 1=yes). The RW field indicates whether the destination memory location is write-protected (0=read only; 1=read/write), and a PUT request that matches a TLB entry with the RW field set to 0 generates a write protection error response. The ASN in each TLB entry is used to qualify an address match (i.e., the ASN in the TLB entry must match the ASN presented to the TLB for the TLB entry to hit). Including the ASN in the TLB entry allows entries for multiple processes/address spaces to reside in the TLB concurrently.
[0076] The VPN field holds the virtual page number for the TLB entry. For a TLB entry hit to occur, the bits above the page boundary (determined by the page size) of the supplied virtual address must match the corresponding bits of the VPN. The PPN field holds the physical page number which is mapped to the virtual address. Upon a hit, the bits above the page boundary of the virtual address are replaced by the corresponding bits of the PPN. Note that a TLB cache hit requires: (1) the supplied virtual address bits above the page boundary to match the corresponding bits of the VPN; (2) the supplied ASN to match the ASN stored in the ASN field; and (3) the VALID bit in the TLB entry to be set to 1.
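The three hit conditions, together with the write-protection check on the RW field, can be expressed as a short lookup routine. This is a sketch only: the `TlbEntry` layout, the per-entry `page_bits` field, the linear search, and the status strings are assumptions for illustration, and the VPN and PPN are stored here as page numbers (i.e., already shifted down by the page size).

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    valid: int      # 0 = no, 1 = yes
    rw: int         # 0 = read only, 1 = read/write
    asn: int        # address space number qualifying the match
    vpn: int        # virtual page number
    ppn: int        # physical page number
    page_bits: int  # log2 of the page size (assumed to be per entry)


def tlb_lookup(entries, asn, va, is_write=False):
    """Return (status, physical_address_or_None) for a supplied ASN and VA."""
    for e in entries:
        hit = (
            e.valid == 1                        # (3) entry is valid
            and e.asn == asn                    # (2) ASN matches
            and (va >> e.page_bits) == e.vpn    # (1) bits above the page boundary match
        )
        if not hit:
            continue
        if is_write and e.rw == 0:
            return "WRITE_PROTECT_ERROR", None
        offset = va & ((1 << e.page_bits) - 1)
        # Replace the bits above the page boundary with the PPN.
        return "HIT", (e.ppn << e.page_bits) | offset
    return "MISS", None
```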
[0077] The RID field stores the region identifier used for TLB shootdown, and is set to a number matching the RID used in processor TLB shootdown. The shootdown mechanism waits for any outstanding packets currently using an address translated by a given TLB entry to complete before acknowledging completion of the SHUB TLB shootdown event.
[0078] If the NACK (“negative acknowledgment”) bit is set, CE
[0079] The steps involved in performing a local address translation are as follows. First, a user process running on the local node provides a virtual address and an appropriate CD. For example, a user process performing a local block copy by directly programming a CI would write the virtual address of the source memory buffer into the CI_SOURCE_VA register, write the virtual address of the destination memory buffer into the CI_DEST_VA register, and then write a command to the CI_COMMAND register to initiate the transfer. In this case, the TYPE field of the CI_COMMAND register would be set to 0 to command a memory copy operation, and the SRC_CD and DEST_CD fields of the CI_COMMAND register would be written with zero to indicate that the local node is the endpoint of the virtual connection for both the source and destination. Then, after the transfer is initiated, hardware accesses the LCT using the CD as an index, and produces the SNID for the local SHUB and an associated ASN. In the example, the hardware would access the first entry of the LCT twice (since CD zero was used for both the source and destination), and would retrieve the SNID for the local SHUB for both the source and destination. Then, the virtual address supplied by the user and the ASN from the LCT entry would be applied to the TLB of the local SHUB to look for a matching TLB entry. Again, in this example, this would occur for both the source and destination. If a matching TLB entry is found (i.e., both the virtual address and ASN match), the VPN of the virtual address is replaced by the PPN from the matching TLB entry, and the physical address is used to perform the memory access. Thus, the address translation mechanism can perform local address translations.
[0080] As this discussion illustrates, an LCT entry can point to the local SHUB for local memory accesses. In this case, any access using the associated CD need not be forwarded to a remote SHUB for translation. Rather, the TLB-based translation from a virtual address to a physical address occurs at the local SHUB using the TLB located on the local SHUB.
[0081] The SHUB at each node of multiprocessor system
[0082] Remote address translation is first described in reference to an exemplary transfer of data from local memory on a local node to remote memory on a remote node. This transfer involves a first CE
[0083] Referring to
[0084] Then, after the transfer is initiated, the master CE performs a series of GET requests to obtain the source data from the local memory, and a series of PUT requests to send the data to the remote node. In one embodiment, the master CE includes a buffer
[0085] To perform the PUT requests, hardware accesses the LCT using the DEST_CD as an index to produce the SNID for the remote SHUB and an associated ASN. Then, the master CE formats a series of PUT requests into packets. Each PUT request packet contains the SNID (which tells the system to route the request to the remote node), the ASN (which tells the remote SHUB which address space the write will be made to), the destination virtual address, and the PUT data (i.e., the data being written, which came from the local memory via buffer
[0086] Referring now to
[0087] Thus, the address translation mechanism may be used to perform both local and remote address translations, with the TLB on the local SHUB used for translating a virtual address if a CD indicates that the local node is the connection endpoint, and the TLB on a remote SHUB used for translating the virtual address if a CD indicates that the remote node is the endpoint. As this example shows, each CE is capable of acting as a master or a slave.
[0088] Remote address translation is next described in reference to an exemplary transfer of data from source memory on a remote node to local memory on a local node. This transfer involves a first CE
[0089] Referring to
[0090] Then, after the transfer is initiated, the master CE performs a series of GET requests to obtain the source data from the remote node, and a series of PUT requests to write the data to local memory. In one embodiment, the master CE includes a buffer
[0091] Referring to
[0092] Referring back to
[0093] As indicated above, in one embodiment, the master CE
[0094] As this discussion illustrates, an LCT entry can point to the local SHUB or a remote SHUB. If an LCT entry points to the local SHUB, an address translation for a block transfer or AMO takes place using the external TLB on the local SHUB. If the LCT entry points to a remote SHUB, however, a virtual address is communicated across the network, and address translation takes place using the external TLB on the remote SHUB. Such remote translation relieves pressure in the SHUB TLBs, and supports communication with remote coherence domains in clustered systems. Additional buffering may be allocated at the remote end of the connection to deal with protocol nacks for physical memory accesses.
[0095] In another embodiment, the master CE is not located on the local node (i.e., is not located on the same node as the source memory). Instead, the master CE is located on a node remote from the node of the source memory. In this embodiment, it is sufficient that the master CE and the source memory are in the same coherence domain.
[0096] In yet another embodiment, the slave CE (i.e., the target CE which performs the memory address translation) is not located on the same node as the target memory. Instead, the slave CE is located on a node remote from the node of the target memory. In this embodiment, it is sufficient that the slave CE and the target memory are in the same coherence domain.
[0097] In still another embodiment, which is a combination of the previous two embodiments, the master CE is located on a node remote from the source memory node, and the slave CE is located on a node remote from the target memory node.
[0098] The remote address translation feature of CE
[0099] The first type of congestion, which may occur even in the desired case of all remote address translation attempts hitting in the TLB on the remote SHUB, is due to CE
[0100] The second (and potentially more serious) type of congestion arises due to remote translation faults (i.e., no matching entry in the TLB on the remote SHUB). On a remote translation fault, the slave SHUB responds to the request packet by sending a remote-translation nack (i.e., negative acknowledgment) to the master CE, and by interrupting one of local processors
[0101] Typically, the interrupt handler for the remote TLB miss can quickly fix the TLB by, for example, loading the missing TLB entry from a page table
[0102] In some cases, however, the interrupt handler for the remote TLB miss will not be able to quickly fix the TLB. This will occur, for example, where a page fault is detected (i.e., the missing TLB entry is not located in page table
[0103] Upon receiving a nack due to a remote translation fault, the master CE goes into a retry mode. In this mode, the master CE repeatedly resends the nacked packet until the packet is finally accepted. Thus, the nack merely informs the master CE that the resource is busy and should be retried since it will eventually be successful (i.e., there is no error). The retry mode uses relatively little network bandwidth since there is only one outstanding packet that is being sent back and forth between the master CE and slave CE. In the non-error case, the TLB on the remote SHUB will eventually be updated with a new, matching entry, and the retried packet will be accepted. At that time, the master CE will turn the retry mode off and, in the case of a block transfer, will restart the pipelined transfer.
[0104] In the case where the missing TLB entry can be quickly loaded from page table
[0105] For a block transfer, the master CE may have many packets in flight at the time of the first nacked packet. Most of these packets are also likely to come back nacked, although some packets may succeed if they were targeted at a different page. Retries for these other packets are preferably suppressed (i.e., the retry mode is used only for the first nacked address). In one embodiment, after the retried packet is eventually accepted, the master CE is allowed to retransmit some of the previously accepted packets.
[0106] Since a new TLB entry could be replaced before the nacked packet can use it, and since an unlucky packet could always arrive to find the remote processor servicing a different TLB miss, a forward progress mechanism may be implemented. This mechanism would eventually inhibit TLB miss interrupts for newer packets, thereby guaranteeing that the older nacked packets are able to be serviced in the TLB. Successful translations would not be affected by this priority mechanism. Since this priority mechanism should be resilient to nacked packets that are not being retried (e.g., the corresponding transfer may have timed out), this priority mechanism should be capable of timing out priority levels.
[0107] In the case of a legal translation resulting in either a TLB refill from the page table, or a page fault, the retry mechanism may suffice. However, in the case of an illegal access where no translation is available, a considerable delay may occur before the transfer is shut down (e.g., by a time-out, or by the remote OS notifying the local OS). In this case, it may not be desirable to repeatedly interrupt the remote processor. To handle this case, the master CE may use a programmable delay (e.g., using the RETRY_DELAY field of the privileged RETRY_CTRL register to specify a number of clocks to delay before re-sending a nacked transfer packet) and an exponential back-off policy (e.g., using the BACKOFF field of the RETRY_CTRL register to specify a number of nacked retries before doubling the current retry delay) in the retry mechanism. Alternatively, the remote OS can load a matching TLB entry with the ERROR bit set, which will cause an error response to be sent to any subsequent, matching request packet, so the associated block transfer or AMO will be completed with an error condition. The ERROR bit may then be cleared at the next time any matching page is actually allocated. It may also be desirable to notify the master CE's OS about the illegal translation, such that this OS may choose to kill the offending process, or to simply go into the CI register set to mark the transfer as complete with an error status.
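As a non-limiting example, the effect of the RETRY_DELAY and BACKOFF fields on successive retries can be sketched as follows. The function name `retry_delays` is hypothetical, and the exact schedule is an assumption: the delay is taken to double once after every BACKOFF nacked retries, and a BACKOFF value of zero is taken to disable doubling. The actual register semantics may differ.

```python
from typing import List


def retry_delays(retry_delay: int, backoff: int, nacks: int) -> List[int]:
    """Return the delay, in clocks, applied before each of `nacks` resends.

    retry_delay: initial delay (modeled on the RETRY_DELAY field).
    backoff:     nacked retries between doublings (modeled on BACKOFF).
    """
    delays = []
    current = retry_delay
    for n in range(nacks):
        # Assumed policy: double the delay after every `backoff` nacks.
        if backoff > 0 and n > 0 and n % backoff == 0:
            current *= 2
        delays.append(current)
    return delays


if __name__ == "__main__":
    # Six consecutive nacks, 16-clock initial delay, doubling every 2 nacks.
    print(retry_delays(16, 2, 6))
```

Under these assumptions, a transfer that targets an illegal address interrupts the remote processor progressively less often until it is shut down.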
[0108] Since SHUB TLB faults do not immediately shut down the entities creating the references, and since the TLB miss interrupt handler steals cycles from a processor other than the processor running the process which caused the fault, it is desirable to keep SHUB TLB faults to a minimum. This can be accomplished by using a relatively large and associative TLB to minimize TLB misses. For example, the TLB may have 1024 or 2048 entries, and may be 8-way associative. The TLB may also support multiple page sizes (e.g., the same set used in the processor's TLBs), with each way being configurable to its own page size and resulting in a maximum of 8 page sizes which can be used at one time.
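For purposes of illustration, a set-associative lookup in which each way is configured with its own page size may be sketched as below. The class `ExternalTlb`, its methods, and the choice of 128 sets (1024 entries divided among 8 ways) are assumptions made for this example and are not taken from the disclosed hardware.

```python
from typing import List, Optional, Tuple

Slot = Optional[Tuple[int, int]]  # (virtual page number, physical page number)


class ExternalTlb:
    """Toy model of a TLB whose ways each use a different page size."""

    def __init__(self, page_shifts: List[int], num_sets: int = 128) -> None:
        self.num_sets = num_sets
        # One (page_shift, slots) pair per way; page_shift is log2(page size).
        self.ways: List[Tuple[int, List[Slot]]] = [
            (shift, [None] * num_sets) for shift in page_shifts
        ]

    def insert(self, way: int, vaddr: int, paddr: int) -> None:
        shift, slots = self.ways[way]
        vpn = vaddr >> shift
        slots[vpn % self.num_sets] = (vpn, paddr >> shift)

    def translate(self, vaddr: int) -> Optional[int]:
        # Each way indexes with a page number computed at its own page size.
        for shift, slots in self.ways:
            vpn = vaddr >> shift
            slot = slots[vpn % self.num_sets]
            if slot is not None and slot[0] == vpn:
                offset = vaddr & ((1 << shift) - 1)
                return (slot[1] << shift) | offset
        return None  # miss: would raise a remote translation fault


if __name__ == "__main__":
    # Four ways of 16 KB pages and four ways of 16 MB pages (assumed sizes).
    tlb = ExternalTlb([14] * 4 + [24] * 4)
    tlb.insert(0, 0x12344000, 0x80000000)
    tlb.insert(4, 0x40000000, 0xC0000000)
    print(hex(tlb.translate(0x12344ABC)), hex(tlb.translate(0x40123456)))
```

Because every way carries its own shift, an 8-way structure of this kind can hold up to 8 page sizes at once, as described above.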
[0109] Note that, while some examples herein use an external TLB on a remote node to perform remote translation to access memory located on the remote node, the resulting physical address may map to another node in the multiprocessor (i.e., the target or source physical memory need not be on the remote node). When the memory is not on the remote node, the physical address resulting from the translation is forwarded over the interconnection network. The remote node may include a forwarding mechanism which allocates temporary buffer space for forwarded requests (or uses extra virtual channels in the network) to prevent deadlock. Nacks may be used if buffering cannot be allocated. A forward progress mechanism ensures that requests nacked due to TLB misses or buffer overflows cannot be indefinitely starved.
[0110] To avoid network deadlock caused by forwarding packets after remote translation, multiprocessor system
[0111] In small, routerless multiprocessor systems, all four virtual channels are used to allow torus routing configurations, and forwarding is not allowed. In such systems, either all translations are required to occur at the mastering CE (by restricting LCT entries), or remote translations are only allowed to translate to memory attached to the remote SHUB.
[0112] In larger multiprocessor systems, since SHUB-to-SHUB packet transmission requires only one virtual channel, the two request and two response channels are available for forwarding. If the virtual channel assignments in the routing table use the REQ
[0113] Referring to
[0114] Since the remote SHUB can accept nacks for the request packets that it forwards, the remote SHUB allocates buffering for these packets at the time it performs the translation. The SHUB associated with the remote memory thus sends the response back through the remote SHUB (rather than performing three-hop forwarding) in order to clear up these buffer entries. This buffering could be used to allow the remote SHUB to accept memory response packets on RESP
[0115] The remote SHUB keeps counters associated with the TLB for tracking outstanding requests. On a TLB shootdown event, the remote SHUB does not acknowledge completion of the shootdown until any currently outstanding requests that used the relevant TLB entry are completed. This may be done by coloring translated requests red or black and keeping a counter for each color. On a shootdown event, the active color for new requests is changed, and the shootdown acknowledge is delayed until the counter for the old color decrements to zero.
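The two-color counting scheme described above may be sketched, by way of example only, as follows. The class `ShootdownTracker` and its method names are hypothetical, and the sketch assumes that at most one shootdown is in progress at a time, since two colors cannot distinguish more than one old generation of requests.

```python
RED, BLACK = 0, 1


class ShootdownTracker:
    """Toy model of per-color counters for outstanding translated requests."""

    def __init__(self) -> None:
        self.active = RED          # color given to newly translated requests
        self.outstanding = [0, 0]  # one counter per color
        self.draining = None       # color being drained by a shootdown, if any

    def issue(self) -> int:
        """Tag a newly translated request with the active color."""
        self.outstanding[self.active] += 1
        return self.active

    def complete(self, color: int) -> None:
        """Record completion of a request carrying the given color."""
        self.outstanding[color] -= 1

    def begin_shootdown(self) -> None:
        """Switch the active color; the old color must now drain."""
        # Assumption: no earlier shootdown is still draining.
        self.draining = self.active
        self.active ^= 1

    def shootdown_done(self) -> bool:
        """True once the acknowledge may be sent (old color reached zero)."""
        return self.draining is not None and self.outstanding[self.draining] == 0


if __name__ == "__main__":
    t = ShootdownTracker()
    a, b = t.issue(), t.issue()   # two requests before the shootdown
    t.begin_shootdown()
    c = t.issue()                 # a new request takes the other color
    t.complete(a)
    print(t.shootdown_done())     # one old request is still outstanding
    t.complete(b)
    print(t.shootdown_done())     # old color drained; new request is ignored
```

Requests issued after the color change do not delay the acknowledge, because only the counter for the old color is examined.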
[0116] In the multiprocessor systems disclosed herein, intra-coherence and inter-coherence domain block transfers and AMOs appear the same from a software interface perspective. Since local memory and memory in another coherence domain are both accessed in the same way (i.e., via a CD), messaging software (e.g., MPI and shmem( ) software) can operate in the same manner when performing intra- and inter-coherence domain transfers. This eliminates the need for user/library software to know if the endpoint of a transfer or AMO address is in the same coherence domain. Thus, the multiprocessor appears very much like a single machine in terms of performance scaling and ease of programming applications. Various standard interfaces for cluster-based computing, such as ST and VIA, are also supported.
[0117] The address translation mechanism disclosed herein supports communication within a scalable multiprocessor, or across machine boundaries in a cluster, by performing local address translation using an external TLB in a local SHUB and remote address translation using an external TLB in a remote SHUB. This allows memory management to be performed local to the memory being accessed, and significantly reduces the amount of address translation information required at the source node for remote memory accesses (e.g., using this mechanism, it is not necessary for one OS image to be aware of the virtual-to-physical address translations used by the other OS images in the multiprocessor system). This reduction in the amount of address translation information required at the source node may be particularly advantageous in the larger multiprocessor systems that have large amounts of memory and many processors. The disclosed address t