| 6035426 | System for memory error checking in an executable | Applegate | 714/54 | |
| 6119145 | Multithreaded client application storing a separate context for each transaction thus allowing threads to resume transactions started by other client threads | Ikeda et al. | 709/203 | |
| 6192486 | Memory defect steering circuit | Correale et al. | 714/8 | |
| 6256775 | Facilities for detailed software performance analysis in a multithreaded processor | Flynn | 717/127 | |
| 6418542 | Critical signal thread | Yeager | 714/38 | |
| 6434714 | Methods, systems, and articles of manufacture for analyzing performance of application programs | Lewis et al. | 714/38 | |
| 6457142 | Method and apparatus for target application program supervision | Klemm et al. | 714/38 | |
| 6567839 | Thread switch control in a multithreaded processor system | Borkenhagen et al. | 709/103 | |
| 20020059503 | Protocol for coordinating the distribution of shared memory | Dennie | 711/153 |
| JP2002108630 |
This invention relates generally to the field of computer memory circuits and more particularly relates to a method to protect against the loss of computer operations, after manufacture and sale of the computer, resulting from failure of a register/array associated with only one thread of operation.
As consumers of computer processing services, businesses dread the occurrence of down time resulting from a computer failure, yet these events actually happen in any number of ways. A hard drive with data that has not been backed up may crash and the data may be lost. A larger fear, perhaps, is that the processor core of a computer fails to perform properly. New or additional memory will not render the computer functional: the computer itself is broken! Such loss of processing power can be disastrous to many businesses; recall a particular airline's dilemma when its routing and scheduling were unavailable for many hours. Millions or even billions of dollars can be lost as a result of computer downtime.
To avoid the failure of a computer after its components have been fabricated and assembled and the computer has been sold, many manufacturers test their processors and memory components before the computers are sold to customers and eliminate the computer components having errors. In spite of the best efforts to detect and eliminate defective computer processors, computer hardware may still fail during normal stressed processing operations at a customer's location. One type of failure of a processor may be attributed, inter alia, to AC defects of the general and special purpose registers within the processor core and of the computer's main random access memory caused by stressing the components under normal usage.
Some tests are performed at the customer's location, e.g., static random access memory (SRAM) arrays are typically tested at boot-up. If a processor fails its test, it is marked as a “failure” and the system is typically disabled until customer service arrives. In the case of multiple processors, a particular processor may be disabled, leaving the other processors to take on extra processing so that computer performance is compromised until customer service arrives.
SRAMs, however, do not have multithreaded memory cells. Currently, if a failure occurs in a multithreaded memory array or register, the multithreaded computer system displays an error code indicating a failure as data collision from multiple threads and either the entire system is disabled or, if the system is made up of multiple processors, it operates without the failing processor. In any event, the processor having the failed thread is disabled and processing is compromised.
There is a need in the industry of multithreaded computers to detect defects in registers and/or memory arrays having multithreaded storage cells during normal processing operations. If only those storage elements in a multithreaded memory or register associated with a failed thread become unavailable to the processor, the processor itself would not have to be disabled; rather, the processor could continue normal processing of the other thread(s) which do not have defective storage elements and could reroute the thread associated with the defective storage elements to intact storage elements. This method would keep the processor in service rather than disabling the entire processor.
These needs and others that will become apparent to one skilled in the art are satisfied by a method to continue normal computer processing in a hardware multithreaded computer processing system executing a plurality of threads despite the failure of one of the threads, the method comprising the steps of: executing instructions of at least one of the threads in a hardware multithreaded processor having a plurality of hardware register/arrays, each one of the register/arrays associated with each one of the threads; performing a test on a particular thread and the at least one register/array associated with the particular thread; detecting the failure of the at least one register/array associated with the particular thread; disabling the failed register/array associated with the particular thread; and rerouting data of the particular thread to other individual register/arrays that are not defective.
The register/array may be a multithreaded register/array having multithreaded storage cells each comprising a number of storage elements each associated with one thread. In this case, the method of the invention to continue normal computer processing in a hardware multithreaded computer processing system executing a plurality of threads despite the failure of one of the threads comprises the steps of: executing instructions of at least one of the plurality of threads in a hardware multithreaded processor having at least one multithreaded register/array with individual storage elements associated with each of the plurality of threads; performing a test on a particular thread and the at least one multithreaded register/array having individual storage elements associated with the particular thread; detecting the failure of at least one individual storage element associated with the particular thread; disabling all storage elements associated with the particular thread; and rerouting data of the particular thread to other individual storage elements associated with other of the plurality of threads not having defective storage elements.
The method may further comprise generating an error signal indicating failure of the at least one individual storage element associated with the particular thread. Normal processing of the other threads not having defective storage elements may continue.
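The sequence of steps summarized above (test a thread, detect a failed storage element, disable that thread's storage elements, reroute its data, and signal the error) can be illustrated with a small software model. This is only a sketch under stated assumptions and not the hardware mechanism itself: the class `MultithreadedArray`, the `stuck_at` fault model, the `route` table, and the function names are all hypothetical, and rerouting is modeled simply as pointing the failed thread at the storage elements of the first thread that passed.

```python
class MultithreadedArray:
    """Toy model: each storage cell has one storage element per thread."""

    def __init__(self, num_threads, num_cells, stuck_at=None):
        # elements[t][c] is the storage element of cell c owned by thread t.
        self.elements = [[0] * num_cells for _ in range(num_threads)]
        # Hypothetical fault model: (thread, cell) -> bit the element is stuck at.
        self.stuck_at = dict(stuck_at or {})
        self.enabled = [True] * num_threads
        # route[t] names the thread whose storage elements hold thread t's data.
        self.route = list(range(num_threads))
        self.errors = []

    def write(self, thread, cell, bit):
        self.elements[self.route[thread]][cell] = bit

    def read(self, thread, cell):
        owner = self.route[thread]
        return self.stuck_at.get((owner, cell), self.elements[owner][cell])


def _round_trip(array, thread, cell, bit):
    """Write one bit and report whether the same bit is read back."""
    array.write(thread, cell, bit)
    return array.read(thread, cell) == bit


def thread_passes(array, thread):
    """Write 0 and 1 to every element of one thread and read each back."""
    cells = range(len(array.elements[thread]))
    return all(
        _round_trip(array, thread, cell, bit) for cell in cells for bit in (0, 1)
    )


def isolate_failed_threads(array):
    """Test each thread, disable the failing ones, and reroute their data."""
    num_threads = len(array.enabled)
    failed = [t for t in range(num_threads) if not thread_passes(array, t)]
    healthy = [t for t in range(num_threads) if t not in failed]
    for t in failed:
        array.enabled[t] = False  # disable all storage elements of thread t
        array.errors.append(f"thread {t}: storage element failure")
        if healthy:
            array.route[t] = healthy[0]  # reroute to intact storage elements
    return failed


# Usage: thread 1 has one element stuck at 0; thread 0 is intact.
array = MultithreadedArray(num_threads=2, num_cells=4, stuck_at={(1, 2): 0})
failed = isolate_failed_threads(array)
array.write(1, 2, 1)  # thread 1's data now lands in thread 0's elements
```

In this model the rerouted thread shares the surviving thread's storage elements, so the processor in effect continues with fewer hardware threads instead of being taken out of service.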
The step of performing a test on a particular thread and the at least one multithreaded register/array having individual storage elements associated with the particular thread may further comprise running a functional test to execute instructions under stressed processing conditions. Alternatively, the test may be an ABIST and/or LBIST test.
The step of disabling all storage elements associated with the particular thread may comprise generating a thread select signal to select others of individual storage elements associated with other threads. In another embodiment, the step of disabling all storage elements associated with the particular thread further may comprise disabling all or some of a plurality of thread switch control events pertaining to the particular thread in a thread switch control register.
The invention may further be considered a multithreaded computer system capable of disabling one thread in the field comprising: at least one multithreaded computer processor; at least one thread switch control register for each of a plurality of threads of operation in the multithreaded computer processor; at least one hardware multithreaded memory/register array having multithreaded storage cells in which each of the storage cells has a storage element uniquely associated with one thread; a main memory connected to the at least one multithreaded computer processor; a bus interface connecting the multithreaded computer processor and the main memory to at least one of the group consisting of: a plurality of data storage devices, one or more external communication networks, one or more input/output devices for providing user input to/from the computer processor; a functional test generator to perform a functional test of at least one thread in the multithreaded computer processor in the field during normal processing; a storage element failure detector which detects the failure of a storage element uniquely associated with the thread undergoing the functional test; a storage element disabler to disable all the storage elements associated with the thread experiencing the failure in the functional test; a data rerouter to redirect data from the thread to storage elements associated with other threads to continue processing; and an error signal generator to propagate a message indicating that the one thread has failed.
The invention is also a program product for use with a hardware multithreaded computer processor for detecting the failure of one of a plurality of threads of operation, the program product comprising a signal-bearing medium carrying thereon: a functional test having a series of instructions of at least one of the plurality of threads; a thread disabler to disable individual storage elements in multithreaded storage cells in hardware registers/memory arrays; the disabled individual storage elements each associated with the at least one thread failing the functional test; a data rerouter to reconfigure any programmable registers to reroute data of the at least one thread failing the functional test to other storage elements in multithreaded storage cells associated with other threads not failing the functional test; and an error message generator to indicate that the at least one thread failed the functional test.
The novel features believed characteristic of the invention are set forth in the claims. The invention itself, however, as well as a preferred mode of use, objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying Drawing, wherein:
The major hardware components of a computer system
Each CPU
| Thread Switch Control Register Bit Assignment | |
| (0) | Switch on L1 data cache fetch miss |
| (1) | Switch on L1 data cache store miss |
| (2) | Switch on L1 instruction cache miss |
| (3) | Switch on instruction TLB miss |
| (4) | Switch on L2 cache fetch miss |
| (5) | Switch on L2 cache store miss |
| (6) | Switch on L2 instruction cache miss |
| (7) | Switch on data TLB/segment lookaside buffer miss |
| (8) | Switch on L2 cache miss and dormant thread not L2 cache miss |
| (9) | Switch when thread switch time-out value reached |
| (10) | Switch when L2 cache data returned |
| (11) | Switch on IO external accesses |
| (12) | Switch on double-X store: miss on first of two* |
| (13) | Switch on double-X store: miss on second of two* |
| (14) | Switch on store multiple/string: miss on any access |
| (15) | Switch on load multiple/string: miss on any access |
| (16) | Reserved |
| (17) | Switch on double-X load: miss on first of two* |
| (18) | Switch on double-X load: miss on second of two* |
| (19) | Switch on or 1, 1, 1 instruction if machine state register (problem state) bit, msr(pr) = 1. Allows software priority change independent of msr(pr). If bit 19 is one, or 1, 1, 1 instruction sets low priority. If bit 19 is zero, priority is set to low only if msr(pr) = 0 when the or 1, 1, 1 instruction is executed. See changing priority with software, to be discussed later. |
| (20) | Reserved |
| (21) | Thread switch priority enable |
| (22:29) | Thread enablement - one per thread |
| (30:31) | Forward progress count |
| (32:63) | Reserved in 64 bit register implementation |
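The bit assignment above can be exercised with a few mask operations. The following is a minimal sketch, not the register logic itself, and it rests on several assumptions that the table does not state: a 32-bit register, bit 0 being the most significant bit, a set bit meaning "enabled", and thread n being controlled by thread enablement bit 22 + n. The helper names `bit_mask`, `disable_thread`, and `disable_switch_events` are hypothetical.

```python
REG_WIDTH = 32  # assumption: 32-bit register (bits 32:63 are reserved in 64-bit form)

THREAD_ENABLE_FIRST = 22  # bits 22:29, thread enablement, one per thread
THREAD_ENABLE_LAST = 29


def bit_mask(bit, width=REG_WIDTH):
    """Mask for one bit, assuming bit 0 is the most significant bit."""
    if not 0 <= bit < width:
        raise ValueError(f"bit {bit} is outside a {width}-bit register")
    return 1 << (width - 1 - bit)


def disable_thread(tsc, thread):
    """Clear the enablement bit of one thread (assumed mapping: bit 22 + thread)."""
    bit = THREAD_ENABLE_FIRST + thread
    if not THREAD_ENABLE_FIRST <= bit <= THREAD_ENABLE_LAST:
        raise ValueError(f"no thread enablement bit for thread {thread}")
    return tsc & ~bit_mask(bit)


def disable_switch_events(tsc, event_bits):
    """Clear selected thread switch event bits, e.g. 0 (L1 data cache fetch miss)."""
    for bit in event_bits:
        tsc &= ~bit_mask(bit)
    return tsc


# Usage: start from a register with every bit set.
tsc = 0xFFFFFFFF
without_thread_1 = disable_thread(tsc, 1)             # clears bit 23
without_events = disable_switch_events(tsc, [0, 4])   # clears bits 0 and 4
```

Under these assumptions, clearing a failing thread's enablement bit, or the switch event bits in that thread's register, keeps the processor from switching to the thread whose storage elements are defective.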
Additional background information concerning multithreaded processor design is contained in the following commonly assigned copending U.S. patent applications, herein incorporated by reference in their entireties: Serial No. unknown, filed concurrently herewith entitled Changing the Thread Capacity of a Multithreaded Computer Processor; Ser. No. 09/439,581 filed Nov. 12, 1999 entitled Master-Slave Latch Circuit for Multithreaded Processing; Ser. No. 09/266,133 filed Mar. 10, 1999 entitled Instruction Cache for Multithreaded Processor; Ser. No. 08/976,533 filed Nov. 21, 1997 entitled Accessing Data from a Multiple Entry Fully Associative Cache Buffer in a Multithread Data Processing System; Ser. No. 08/966,706 filed Nov. 10, 1997 entitled Effective-To-Real Address Cache Managing Apparatus and Method; Ser. No. 08/958,718 filed Oct. 23, 1997, entitled Altering Thread Priorities in a Multithreaded Processor; Ser. No. 08/958,716 filed Oct. 23, 1997, entitled Method and Apparatus for Selecting Thread Switch Events in a Multithreaded Processor; Ser. No. 08/957,002 filed Oct. 23, 1997 entitled Thread Switch Control in a Multithreaded Processor System; Ser. No. 08/956,875 filed Oct. 23, 1997 entitled An Apparatus and Method to Guarantee Forward Progress in a Multithreaded Processor; Ser. No. 08/956,577 filed Oct. 23, 1997 entitled Method and Apparatus To Force a Thread Switch in a Multithreaded Processor; Ser. No. 08/773,572 filed Dec. 27, 1996 entitled Background Completion of Instruction and Associated Fetch Request in a Multithread Processor. While the multithreaded processor design described in the above applications is a coarse-grained multithreading implementation, it should be understood that the present invention is applicable to either coarse-grained or fine-grained multithreading.
A multithreaded memory array contrasts with a conventional two-threaded memory array having a common read data bus. The paradigm of computer architecture having the common read bus assumed that read independence is necessary and requires a separate read decoder for data of each thread to be read simultaneously. It was discovered, however, that a multithreaded memory having an optimized wireability and associated optimally minimized transistor count can be achieved by eliminating read independence without suffering significant negative consequences because the number of instances in which data for two or more threads are simultaneously required is negligible. The multithreaded storage cell of
A processor (not shown) can read the data in the storage element
Each of the write decoders
After manufacture, the registers and memory arrays having the multithreaded storage cells are tested. There are several test procedures, of which LBIST and ABIST are only two. Simply stated, a known bit sequence is input into the array and compared with the output of the array. In a properly functioning memory array, the input matches the output. These sequences can be performed at high speeds and can involve many memory cells simultaneously to “stress” the processor to determine its failure parameters, if any. Another kind of test is a functional test in which actual coded instructions typical of customer performance requirements are executed under stress to exercise critical sections of the memory arrays. Again, these tests can be performed at different processor speeds to determine if and when the processor may fail. Such failures are most commonly the result of memory arrays and general and specialized registers being unable to capture and hold a bit value in the multithreaded storage cell. An important feature of the invention is the ability to test the values stored in individual storage elements associated with separate threads. In other words, in the context of the invention, each thread can now be tested separately.
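The pattern comparison described above, applied to one thread at a time, might be sketched as follows. This is an illustrative software analogue and not an ABIST or LBIST implementation: the checkerboard pattern, the `write` and `read` callables, the dictionary-backed storage, and the injected stuck-at fault are all assumptions made for the example.

```python
def checkerboard(num_cells, phase):
    """Alternating 0/1 pattern; phase 1 gives the inverted pattern."""
    return [(i + phase) % 2 for i in range(num_cells)]


def failing_cells(write, read, thread, num_cells):
    """Return the cells of one thread whose output differs from the input."""
    bad = set()
    for phase in (0, 1):
        pattern = checkerboard(num_cells, phase)
        for cell, bit in enumerate(pattern):
            write(thread, cell, bit)
        for cell, bit in enumerate(pattern):
            if read(thread, cell) != bit:
                bad.add(cell)
    return sorted(bad)


# Usage with a hypothetical storage model: the element of cell 3 belonging
# to thread 0 is stuck at 1, while thread 1's elements are intact.
storage = {}
STUCK = {(0, 3): 1}


def write(thread, cell, bit):
    storage[(thread, cell)] = bit


def read(thread, cell):
    return STUCK.get((thread, cell), storage[(thread, cell)])


results = {t: failing_cells(write, read, t, 8) for t in range(2)}
```

Running both the pattern and its inverse matters here: a stuck-at-1 element passes whenever the expected bit is 1, so only the inverted pass exposes it. Because each thread is tested through its own storage elements, the result identifies which thread is affected and leaves the others untouched.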
As will be described in detail below, aspects of the preferred embodiment pertain to specific method steps implementable in a multithreaded computer system
If there are no register or memory array failures in step
In this fashion, the flow chart of
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example and not limitation and that variations are possible. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.