Title:
Control-quasi-independent-points guided speculative multithreading
Kind Code:
A1


Abstract:
A method for generating instructions to facilitate control-quasi-independent-point multithreading is provided. A spawn point and control-quasi-independent-point are determined. An instruction stream is generated to partition a program so that portions of the program are parallelized by speculative threads. A method of performing control-quasi-independent-point guided speculative multithreading includes spawning a speculative thread when the spawn point is encountered. An embodiment of the method further includes performing speculative precomputation to determine a live-in value for the speculative thread.



Inventors:
Marcuello, Pedro (Barcelona, ES)
Gonzalez, Antonio (Barcelona, ES)
Wang, Hong (Fremont, CA, US)
Shen, John P. (San Jose, CA, US)
Hammarlund, Per (Hillsboro, OR, US)
Hoflehner, Gerolf F. (Santa Clara, CA, US)
Wang, Perry H. (San Jose, CA, US)
Liao, Steve Shih-wei (Palo Alto, CA, US)
Application Number:
10/356435
Publication Date:
08/05/2004
Filing Date:
01/31/2003
Assignee:
MARCUELLO PEDRO
GONZALEZ ANTONIO
WANG HONG
SHEN JOHN P.
HAMMARLUND PER
HOFLEHNER GEROLF F.
WANG PERRY H.
LIAO STEVE SHIH-WEI
Primary Class:
Other Classes:
712/E9.047, 712/E9.05, 712/E9.053, 714/E11.207, 717/119, 717/149, 712/E9.032
International Classes:
G06F9/00; G06F9/30; G06F9/38; G06F9/44; G06F9/45; G06F9/46; G06F9/48; (IPC1-7): G06F9/45; G06F9/44
Related US Applications:



Primary Examiner:
DAO, THUY CHAN
Attorney, Agent or Firm:
WOMBLE BOND DICKINSON (US) LLP/Mission (Attn: IP Docketing P.O. Box 7037, Atlanta, GA, 30357-0037, US)
Claims:

What is claimed is:



1. A method of compiling a software program, comprising: selecting a spawning pair that includes a spawn point and a control-quasi-independent point (CQIP); providing for calculation of a live-in value for a speculative thread; and generating an enhanced binary file that includes instructions, the instructions including a trigger instruction to cause spawning of the speculative thread at the CQIP.

2. The method of claim 1, further comprising: performing profile analysis.

3. The method of claim 1, further comprising: computing a plurality of reaching probabilities.

4. The method of claim 1, further comprising: identifying a plurality of candidate basic blocks.

5. The method of claim 4, wherein: selecting a spawning pair further comprises selecting the spawning pair from the plurality of candidate basic blocks.

6. The method of claim 1, wherein: generating the enhanced binary file further comprises embedding a trigger at a spawn point associated with the spawning pair.

7. The method of claim 1, wherein selecting the spawning pair further comprises: selecting a spawning pair having at least a minimum average number of instructions between the spawn point and the CQIP of the spawning pair.

8. The method of claim 3, wherein selecting the spawning pair further comprises: selecting a spawning pair having at least a minimum reaching probability.

9. The method of claim 1, wherein providing for calculation of the live-in value further comprises: providing an instruction to invoke hardware prediction of the live-in value.

10. The method of claim 1, wherein providing for calculation of the live-in value further comprises: generating one or more instructions to perform speculative precomputation of the live-in values.

11. The method of claim 1, wherein: selecting a spawning pair further comprises selecting a first spawning pair and a second spawning pair; and generating an enhanced binary file that includes instructions further comprises generating an enhanced binary file that includes a trigger instruction for each spawning pair.

12. An article comprising: a machine-readable storage medium having a plurality of machine accessible instructions; wherein, when the instructions are executed by a processor, the instructions provide for selecting a spawning pair that includes a spawn point and a control-quasi-independent point (CQIP); providing for calculation of a live-in value for a speculative thread; and generating an enhanced binary file that includes instructions, the instructions including a trigger instruction to cause spawning of a speculative thread at the control-quasi-independent point.

13. The article of claim 12, wherein the instructions further comprise: instructions that provide for performing profile analysis.

14. The article of claim 12, wherein the instructions further comprise: instructions that provide for computing a plurality of reaching probabilities.

15. The article of claim 12, wherein the instructions further comprise: instructions that provide for identifying a plurality of candidate basic blocks.

16. The article of claim 15, wherein: the instructions that provide for selecting a spawning pair further comprise instructions that provide for selecting the spawning pair from the plurality of candidate basic blocks.

17. The article of claim 12, wherein: the instructions that provide for generating the enhanced binary file further comprise instructions that provide for embedding a trigger at a spawn point associated with the spawning pair.

18. The article of claim 12, wherein the instructions that provide for selecting the spawning pair further comprise: instructions that provide for selecting a spawning pair having at least a minimum average number of instructions between the spawn point and the CQIP of the spawning pair.

19. The article of claim 14, wherein the instructions that provide for selecting the spawning pair further comprise: instructions that provide for selecting a spawning pair having at least a minimum reaching probability.

20. The article of claim 12, wherein the instructions that provide for providing for calculation of the live-in value further comprise: instructions that provide for providing an instruction to invoke hardware prediction of the live-in value.

21. The article of claim 12, wherein instructions that provide for providing for calculation of the live-in value further comprise: instructions that provide for generating one or more instructions to perform speculative precomputation of the live-in values.

22. A method, comprising: executing one or more instructions in a first instruction stream in a non-speculative thread; spawning a speculative thread at a spawn point in the first instruction stream, wherein the computed probability of reaching a control quasi-independent point during execution of the first instruction stream, after execution of the spawn point, is higher than a predetermined threshold; and simultaneously: executing in the speculative thread a speculative thread instruction stream that includes a subset of the instructions in the first instruction stream, the speculative thread instruction stream including the control-quasi-independent point; and executing one or more instructions in the first instruction stream following the spawn point.

23. The method of claim 22, wherein: executing one or more instructions in the first instruction stream following the spawn point further comprises executing instructions until the CQIP is reached.

24. The method of claim 23, further comprising: determining, responsive to reaching the CQIP, whether speculative execution performed in the speculative thread is correct.

25. The method of claim 24, further comprising: responsive to determining the speculative execution performed in the speculative thread is correct, relinquishing the non-speculative thread.

26. The method of claim 24, further comprising: responsive to determining that the speculative execution performed in the speculative thread is not correct, squashing the speculative thread.

27. The method of claim 26, further comprising: responsive to determining that the speculative execution performed in the speculative thread is not correct, squashing all active successor threads, if any, of the speculative thread.

28. The method of claim 22, wherein: the speculative thread instruction stream includes a precomputation slice for the speculative computation of a live-in value.

29. The method of claim 22, wherein: spawning the speculative thread triggers hardware prediction of a live-in value.

30. The method of claim 28, wherein: the speculative thread instruction stream includes, after the precomputation slice, a branch instruction to the CQIP.

31. The method of claim 22, further comprising: spawning a second speculative thread at a spawn point in the speculative thread instruction stream.

32. An article comprising: a machine-readable storage medium having a plurality of machine accessible instructions; wherein, when the instructions are executed by a processor, the instructions provide for executing one or more instructions in a first instruction stream in a non-speculative thread; spawning a speculative thread at a spawn point in the first instruction stream, wherein the computed probability of reaching a control quasi-independent point during execution of the first instruction stream, after execution of the spawn point, is higher than a predetermined threshold; and simultaneously: executing in the speculative thread a speculative thread instruction stream that includes a subset of the instructions in the first instruction stream, the speculative thread instruction stream including the control-quasi-independent point; and executing one or more instructions in the first instruction stream following the spawn point.

33. The article of claim 32, wherein: the instructions that provide for executing one or more instructions in the first instruction stream following the spawn point further comprise instructions that provide for executing instructions until the CQIP is reached.

34. The article of claim 33, wherein the instructions further comprise: instructions that provide for determining, responsive to reaching the CQIP, whether speculative execution performed in the speculative thread is correct.

35. The article of claim 34, wherein the instructions further comprise: instructions that provide for, responsive to determining that the speculative execution performed in the speculative thread is correct, relinquishing the non-speculative thread.

36. The article of claim 34, further comprising: instructions that provide for, responsive to determining that the speculative execution performed in the speculative thread is not correct, squashing the speculative thread.

37. The article of claim 36, wherein the instructions further comprise: instructions that provide for, responsive to determining that the speculative execution performed in the speculative thread is not correct, squashing all active successor threads, if any, of the speculative thread.

38. The article of claim 32, wherein: the speculative thread instruction stream includes a precomputation slice for the speculative computation of a live-in value.

39. The article of claim 32, wherein: the instruction that provides for spawning the speculative thread triggers hardware prediction of a live-in value.

40. The article of claim 38, wherein: the speculative thread instruction stream includes, after the precomputation slice, a branch instruction to the CQIP.

41. A compiler comprising: a spawning pair selector module to select a spawning pair that includes a control-quasi-independent point (“CQIP”) and a spawn point; and a code generator to generate an enhanced binary file that includes a trigger instruction at the spawn point.

42. The compiler of claim 41, wherein: the trigger instruction is to spawn a speculative thread to begin execution at the CQIP.

43. The compiler of claim 41, further comprising: a slicer to generate a slice for precomputation of a live-in value; wherein the code generator is further to include the precomputation slice in the enhanced binary file.

44. The compiler of claim 41, wherein: the spawning pair selector module is further to select the spawning pair such that a computed probability of reaching the control-quasi-independent point after execution of the spawn point is higher than a predetermined threshold.

45. The compiler of claim 44, further comprising: a matrix builder to compute the reaching probability for the spawning pair.

46. The compiler of claim 41, further comprising: a profile analyzer to build a control flow graph.

47. The compiler of claim 41, wherein: the trigger instruction is to trigger hardware value prediction for a live-in value.

48. The compiler of claim 41, further comprising: a matrix builder to compute the reaching probability for the spawning pair.

Description:

BACKGROUND

[0001] 1. Technical Field

[0002] The present invention relates generally to information processing systems and, more specifically, to spawning of speculative threads for speculative multithreading.

[0003] 2. Background Art

[0004] In order to increase performance of information processing systems, such as those that include microprocessors, both hardware and software techniques have been employed. One software approach that has been employed to improve processor performance is known as “multithreading.” In multithreading, an instruction stream is split into multiple instruction streams that can be executed in parallel. In software-only multithreading approaches, such as time-multiplex multithreading or switch-on-event multithreading, the multiple instruction streams are alternately executed on the same shared processor.

[0005] Increasingly, multithreading is supported in hardware. For instance, in one approach, processors in a multi-processor system, such as a chip multiprocessor (“CMP”) system, may each act on one of the multiple threads simultaneously. In another approach, referred to as simultaneous multithreading (“SMT”), a single physical processor is made to appear as multiple logical processors to operating systems and user programs. That is, each logical processor maintains a complete set of the architecture state, but nearly all other resources of the physical processor, such as caches, execution units, branch predictors, control logic, and buses, are shared. The threads execute simultaneously and make better use of shared resources than time-multiplex multithreading or switch-on-event multithreading.

[0006] For those systems, such as CMP and SMT multithreading systems, that provide hardware support for multiple threads, one or more threads may be idle during execution of a single-threaded application. Utilizing otherwise idle threads to speculatively parallelize the single-threaded application can increase speed of execution, but it is often difficult to determine which sections of the single-threaded application should be speculatively executed by the otherwise idle thread. Speculative thread execution of a portion of code is only beneficial if the application's control flow ultimately reaches that portion of code. In addition, speculative thread execution can be delayed, and rendered less effective, due to latencies associated with data fetching. Embodiments of the method and apparatus disclosed herein address these and other concerns related to speculative multithreading.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] The present invention may be understood with reference to the following drawings in which like elements are indicated by like numbers. These drawings are not intended to be limiting but are instead provided to illustrate selected embodiments of a method and apparatus for facilitating control-quasi-independent-points guided speculative multithreading.

[0008] FIG. 1 is a flowchart illustrating at least one embodiment of a method for generating instructions for control-quasi-independent-points guided speculative multithreading.

[0009] FIG. 2 is a flowchart illustrating at least one embodiment of a method for identifying control-quasi-independent-points for speculative multithreading.

[0010] FIG. 3 is a data flow diagram showing at least one embodiment of a method for generating instructions for control-quasi-independent-points guided speculative multithreading.

[0011] FIG. 4 is a flowchart illustrating at least one embodiment of a software compilation process.

[0012] FIG. 5 is a flowchart illustrating at least one embodiment of a method for generating instructions to precompute a speculative thread's live-in values for control-quasi-independent-points guided speculative multithreading.

[0013] FIGS. 6 and 7 are flowcharts illustrating at least one embodiment of a method for performing speculative multithreading using a combination of control-quasi-independent-points guided speculative multithreading and speculative precomputation of live-in values.

[0014] FIG. 8 is a block diagram of a processing system capable of performing at least one embodiment of control-quasi-independent-points guided speculative multithreading.

DETAILED DISCUSSION

[0015] FIG. 1 is a flowchart illustrating at least one embodiment of a method for generating instructions to facilitate control-quasi-independent-points (“CQIP”) guided speculative multithreading. For at least one embodiment of the method 100, instructions are generated to reduce the execution time in a single-threaded application through the use of one or more simultaneous speculative threads. The method 100 thus facilitates the parallelization of a portion of an application's code through the use of the simultaneous speculative threads. A speculative thread, referred to as the spawnee thread, executes instructions that are ahead of the code being executed by the thread that performed the spawn. The thread that performed the spawn is referred to as the spawner thread. For at least one embodiment, the spawnee thread is an SMT thread that is executed by a second logical processor on the same physical processor as the spawner thread. One skilled in the art will recognize that the method 100 may be utilized in any multithreading approach, including SMT, CMP multithreading or other multiprocessor multithreading, or any other known multithreading approach that may encounter idle thread contexts.

[0016] Traditional software program parallelization techniques are usually applied to numerical and regular applications. However, traditional automated compiler parallelization techniques do not perform well for irregular or non-numerical applications such as those that require accesses to memory based on linked data structures. Nonetheless, various studies have demonstrated that these irregular and integer applications still have large amounts of thread level parallelism that could be exploited through judicious speculative multithreading. The method 100 illustrated in FIG. 1 provides a mechanism to partition a single-threaded application into sub-tasks that can be speculatively executed using additional threads.

[0017] In contrast to some types of traditional speculative multithreading techniques, which spawn speculative threads based on known control dependent structures such as calls or loops, the method 100 of FIG. 1 determines spawn points based on control independency, yet makes provision for handling data flow dependency among parallel threads. The following discussion explains that the method 100 selects thread spawning points based on an analysis of control independence, in an effort to achieve speculative parallelization with minimal misspeculation in relation to control flow. In addition, the method addresses data flow dependency in that live-in values are supplied. For at least one embodiment, live-in values are predicted using a value prediction approach. In at least one other embodiment, live-in values are pre-computed using speculative precomputation based on backward dependency analysis.

[0018] FIG. 1 illustrates that a method 100 for generating instructions to facilitate CQIP-guided multithreading includes identification 10 of spawning pairs that each include a spawn point and a CQIP. At block 50, the method 100 provides for calculation of live-in values for data dependences in the helper thread to be spawned. At block 60, instructions are generated such that, when the instructions are executed by a processor, a speculative thread is spawned and speculatively executes a selected portion of the application's code.

[0019] FIG. 2 is a flowchart further illustrating at least one embodiment of identification 10 of control-quasi-independent-points for speculative multithreading. FIG. 2 illustrates that the method 10 performs 210 profile analysis. During the analysis 210, a control flow graph (see, e.g., 330 of FIG. 3) is generated to represent flow of control among the basic blocks associated with the application. The method 10 then computes 220 reaching probabilities. That is, the method 10 computes 220 the probability that a second basic block will be reached during execution of the source program, if a first basic block is executed. Candidate basic blocks are identified 230 as potential spawning pairs based on the reaching probabilities previously computed 220. At block 240, the candidates are evaluated according to selected metrics in order to select one or more spawning pairs. Each of blocks 210 (performing profile analysis), 220 (computing reaching probabilities), 230 (identifying candidate basic blocks), and 240 (selecting spawning pair) is described in further detail below in connection with FIG. 3.

[0020] FIG. 3 is a data flow diagram. The flow of data is represented in relation to an expanded flowchart that incorporates the actions illustrated in both FIGS. 1 and 2. FIG. 3 illustrates that, for at least one embodiment of the method 100 illustrated in FIG. 1, certain data is consulted, and certain other data is generated, during execution of the method 100. FIG. 3 illustrates that a profile 325 is accessed to aid in profile analysis 210. Also, a control flow graph 330 (“CFG”) is accessed to aid in computation 220 of reaching probabilities.

[0021] Brief reference to FIG. 4 illustrates that the profile 325 is typically generated by one or more compilation passes prior to execution of the method. In FIG. 4, a typical compilation process 400 is represented. The process 400 involves two compiler-performed passes 405, 410 and also involves a test run 407 that is typically initiated by a user, such as a software programmer. During a first pass 405, the compiler (e.g., 808 in FIG. 8) receives as an input the source code 415 for which compilation is desired. The compiler then generates instrumented binary code 420 that corresponds to the source code 415. The instrumented binary code 420 includes, in addition to the binary for the source code 415 instructions, extra binary code that causes, during a run of the instrumented code 420, statistics to be collected and recorded in a profile 325 and a call graph 424. When a user initiates a test run 407 of the instrumented binary code 420, the profile 325 and call graph 424 are generated. During the normal compilation pass 410, the profile 325 is used as an input into the compiler and a binary code file 340 is generated. The profile 325 may be used, for example, by the compiler during the normal compilation pass 410 to aid with performance enhancements such as speculative branch prediction.

[0022] Each of the passes 405, 410 and the test run 407 is optional to the method 100. Accordingly, first pass 405 and normal pass 410, as well as test run 407, are depicted with broken lines in FIG. 4 to indicate their optional nature. One skilled in the art will recognize that any method of generating the information represented by profile 325 may be utilized, and that the actions 405, 407, 410 depicted in FIG. 4 are provided for illustrative purposes only. One skilled in the art will also recognize that the method 100 described herein may be applied, in an alternative embodiment, to a binary file. That is, the profile 325 may be generated for a binary file rather than a high-level source code file, and the profile analysis 210 (FIG. 2) may be performed using such binary-based profile as an input.

[0023] Returning to FIG. 3, one can see that the profile analysis 210 utilizes the profile 325 as an input and generates a control flow graph 330 as an output. The method 100 builds the CFG 330 during the profile analysis 210 such that each node of the CFG 330 represents a basic block of the source program. Edges between nodes of the CFG 330 represent possible control flows among the basic blocks. For at least one embodiment, edges of the CFG 330 are weighted with the frequency that the corresponding control flow has been followed (as reflected in the profile 325). Accordingly, the edges are weighted by the probability that one basic block follows the other, without revisiting the latter node. In contrast to other CFG representations, such as “edge profiling” which represents only intra-procedural edges, at least one embodiment of the CFG 330 created during profile analysis 210 includes representation of inter-procedural edges.
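The weighting of CFG edges described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the profile is available as an ordered trace of basic-block identifiers; the function name build_cfg and the trace format are hypothetical and are not taken from the disclosure.

```python
from collections import Counter

def build_cfg(trace):
    """Build a weighted CFG from a hypothetical trace of basic-block ids.

    Returns (edge_counts, probs):
      edge_counts[(src, dst)] is how often control flowed from src to dst;
      probs[src][dst] is that count divided by all flows leaving src.
    """
    # Each adjacent pair in the trace is one observed control-flow edge.
    edge_counts = Counter(zip(trace, trace[1:]))
    out_totals = Counter()
    for (src, _dst), n in edge_counts.items():
        out_totals[src] += n
    probs = {}
    for (src, dst), n in edge_counts.items():
        probs.setdefault(src, {})[dst] = n / out_totals[src]
    return edge_counts, probs

counts, probs = build_cfg(["A", "B", "D", "A", "C", "D", "A", "B", "D"])
```

In this example, block A is followed by B twice and by C once, so the edge from A to B carries two thirds of the flow leaving A.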

[0024] For at least one embodiment, the CFG 330 is pruned to simplify the CFG 330 and control its size. The least frequently executed basic blocks are pruned from the CFG 330. To determine which nodes should remain in the CFG 330, and which should be pruned, the weights of the edges to a block are used to determine the basic block's execution count. The basic blocks are ordered by execution count, and are selected to remain in the CFG 330 according to their execution count. For at least one embodiment, the basic blocks are chosen from highest to lowest execution count until a predetermined threshold percentage of the total executed instructions are included in the CFG 330. Accordingly, after weighting and pruning, the most frequently-executed basic blocks are represented in the CFG 330.

[0025] For at least one embodiment, the predetermined threshold percentage of executed instructions chosen to remain in the CFG 330 during profile analysis 210 is ninety (90) percent. For selected embodiments, the threshold may be varied to numbers higher or lower than ninety percent, based on factors such as application requirements and/or machine resource availability. For instance, if a relatively large number of hardware thread contexts are supported by the machine resources, then a lower threshold may be chosen in order to facilitate more aggressive speculation.
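The selection of basic blocks by execution count may be sketched as follows. This is an illustrative sketch only: the function name select_hot_blocks and the two input dictionaries are assumptions, and the number of executed instructions per block is assumed to be its execution count multiplied by its static instruction count.

```python
def select_hot_blocks(exec_counts, block_sizes, threshold=0.90):
    """Return the set of basic blocks that remain in the CFG.

    exec_counts: block -> execution count (hypothetical profile data)
    block_sizes: block -> number of instructions in the block (assumed)
    threshold:   fraction of executed instructions the kept blocks must cover
    """
    # Executed instructions contributed by each block.
    dynamic = {b: exec_counts[b] * block_sizes[b] for b in exec_counts}
    total = sum(dynamic.values())
    kept, covered = set(), 0
    # Visit blocks from highest to lowest execution count.
    for b in sorted(exec_counts, key=exec_counts.get, reverse=True):
        if total == 0 or covered / total >= threshold:
            break
        kept.add(b)
        covered += dynamic[b]
    return kept

hot = select_hot_blocks({"A": 100, "B": 50, "C": 5, "D": 1},
                        {"A": 10, "B": 10, "C": 10, "D": 10})
```

With the ninety percent default, blocks A and B already cover more than ninety percent of the executed instructions in this example, so C and D are pruned.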

[0026] In order to retain control flow information about pruned basic blocks, the following processing may also occur during profile analysis 210. When a node is pruned from the CFG 330, an edge from a predecessor to the pruned node is transformed to one or more edges from that predecessor to the node's successor(s). Also, an edge from the pruned node to a successor is transformed to one or more edges from the pruned node's predecessor(s) to the successor. If, during this transformation, an edge is transformed into multiple edges, the weight of the original edge is proportionally apportioned across the new edges.
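The edge transformation performed when a node is pruned may be sketched as follows. The sketch assumes the CFG is held as a mapping from (source, destination) pairs to weights, and that the weight of each incoming edge is apportioned across the pruned node's successors in proportion to the weights of its outgoing edges; the function name prune_node and that particular apportioning rule are assumptions made for illustration.

```python
from collections import defaultdict

def prune_node(edges, node):
    """Remove `node` from a weighted edge map {(src, dst): weight}.

    Each predecessor of the pruned node is connected directly to each
    successor, and the incoming weight is split across the new edges in
    proportion to the pruned node's outgoing edge weights (assumed rule).
    """
    preds = {s: w for (s, d), w in edges.items() if d == node and s != node}
    succs = {d: w for (s, d), w in edges.items() if s == node and d != node}
    out_total = sum(succs.values())
    result = defaultdict(float)
    # Keep every edge that does not touch the pruned node.
    for (s, d), w in edges.items():
        if s != node and d != node:
            result[(s, d)] += w
    # Replace predecessor -> node -> successor paths with direct edges.
    for p, w_in in preds.items():
        for s, w_out in succs.items():
            result[(p, s)] += w_in * w_out / out_total
    return dict(result)

pruned = prune_node({("A", "B"): 10, ("B", "C"): 6,
                     ("B", "D"): 4, ("A", "D"): 2}, "B")
```

In this example the edge from A to B, of weight 10, becomes an edge from A to C of weight 6 and adds weight 4 to the existing edge from A to D, so the total weight leaving A is unchanged.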

[0027] FIG. 3 illustrates that the CFG 330 produced during profile analysis 210 is utilized to compute 220 reaching probabilities. At least one embodiment of reaching probability computation 220 utilizes the profile CFG 330 as an input and generates a reaching probability matrix 335 as an output. As stated above, as used herein the “reaching probability” is the probability that a second basic block will be reached after execution of a first basic block, without revisiting the first basic block. For at least one embodiment, the reaching probabilities computed at block 220 are stored in a two-dimensional square matrix 335 that has as many rows and columns as nodes in the CFG 330. Each element of the matrix represents the probability to execute the basic block represented by the column after execution of the basic block represented by the row.

[0028] For at least one embodiment, this probability is computed as the sum of the frequencies for all the various sequences of basic blocks that exist from the source node to the destination node. In order to simplify the computation, a constraint is imposed such that the source and destination nodes may only appear once in the sequence of nodes as the first and last nodes, respectively, and may not appear again as intermediate nodes. (For determining the probability of reaching a basic block again after it has been executed, the basic block will appear twice—as both the source and destination nodes). Other basic blocks are permitted to appear more than once in the sequence.
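One way to obtain the same quantity without enumerating every sequence of basic blocks is sketched below. This is an assumed alternative formulation, not necessarily the computation used in any embodiment: the CFG is treated as a set of transition probabilities, and an iterative solve finds the probability of arriving at the destination before returning to the source. The function name reaching_probability and its parameters are hypothetical.

```python
def reaching_probability(trans, src, dst, iters=10_000, tol=1e-12):
    """Probability of reaching dst after src without revisiting src.

    trans: node -> {successor: transition probability} (assumed input form)
    """
    nodes = set(trans) | {v for succ in trans.values() for v in succ}
    # h[v]: probability of reaching dst from v before returning to src.
    h = {v: 0.0 for v in nodes}

    def value(v):
        if v == dst:
            return 1.0  # destination reached; the sequence ends here
        if v == src:
            return 0.0  # source revisited; the sequence is not counted
        return h[v]

    # Repeatedly propagate probabilities until they stop changing.
    for _ in range(iters):
        delta = 0.0
        for v in nodes:
            if v == src or v == dst:
                continue
            new = sum(p * value(w) for w, p in trans.get(v, {}).items())
            delta = max(delta, abs(new - h[v]))
            h[v] = new
        if delta < tol:
            break
    # The first step leaves the source; intermediate nodes may repeat.
    return sum(p * value(w) for w, p in trans.get(src, {}).items())

cfg = {"A": {"B": 0.7, "C": 0.3}, "B": {"D": 1.0},
       "C": {"D": 0.5, "A": 0.5}, "D": {}}
p_a_to_d = reaching_probability(cfg, "A", "D")
```

Calling the function for every ordered pair of nodes would fill the two-dimensional square matrix described above, with the source as the row and the destination as the column.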

[0029] At block 230, the reaching probability matrix 335 is traversed to evaluate pairs of basic blocks and identify those that are candidates for a spawning pair. As used herein, the term “spawning pair” refers to a pair of instructions associated with the source program. One of the instructions is a spawn point, which is an instruction within a first basic block. For at least one embodiment, the spawn point is the first instruction of the first basic block.

[0030] The other instruction is a target point and is, more specifically, a control quasi-independent point (“CQIP”). The CQIP is an instruction within a second basic block. For at least one embodiment, the CQIP is the first instruction of the second basic block. A spawn point is the instruction in the source program that, when reached, will activate creation of a speculative thread at the CQIP, where the speculative thread will start its execution.

[0031] For each element in the reaching probability matrix 335, two basic blocks are represented. The first block includes a potential spawn point, and the second block includes a potential CQIP. An instruction (such as the first instruction) of the basic block for the row is the potential spawn point. An instruction (such as the first instruction) of the basic block for the column is the potential CQIP. Each element of the reaching probability matrix 335 is evaluated, and those elements that satisfy certain selection criteria are chosen as candidates for spawning pairs. For at least one embodiment, the elements are evaluated to determine those pairs whose probability is higher than a certain predetermined threshold; that is, the probability to reach the control quasi-independent point after execution of the spawn point is higher than a given threshold. This criterion is designed to minimize spawning of speculative threads that are not executed. For at least one embodiment, a pair of basic blocks associated with an element of the reaching probability matrix 335 is considered as a candidate for a spawning pair if its reaching probability is higher than 0.95.

[0032] A second criterion for selection of a candidate spawning pair is the average number of instructions between the spawn point and the CQIP. Ideally, a minimum average number of instructions should exist between the spawn point and the CQIP in order to reduce the relative overhead of thread creation. If the distance is too small, the overhead of thread creation may outweigh the benefit of run-ahead execution because the speculative thread will not run far enough ahead. For at least one embodiment, a pair of basic blocks associated with an element of the reaching probability matrix 335 is considered as a candidate for a spawning pair if the average number of instructions between them is greater than 32 instructions.

[0033] Distance between the basic blocks may be additionally stored in the matrix 335 and considered in the identification 230 of spawning pair candidates. For at least one embodiment, this additional information may be calculated during profile analysis 210 and included in each element of the reaching probability matrix 335. The average may be calculated as the sum of the number of instructions executed by each sequence of basic blocks, multiplied by their frequency.
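The two candidate criteria described above, a reaching probability higher than 0.95 and an average distance greater than 32 instructions, may be applied in a single traversal of the matrix. The following is a minimal sketch; the function name candidate_pairs and the use of two parallel nested lists for the probabilities and distances are assumptions made for illustration.

```python
def candidate_pairs(reach, avg_dist, min_prob=0.95, min_dist=32):
    """Return (row, column) index pairs that qualify as spawning-pair candidates.

    reach[i][j]:    reaching probability from block i (row) to block j (column)
    avg_dist[i][j]: average number of instructions between the two blocks
    """
    candidates = []
    for i, row in enumerate(reach):
        for j, p in enumerate(row):
            # Row block holds the potential spawn point, column block the CQIP.
            if p > min_prob and avg_dist[i][j] > min_dist:
                candidates.append((i, j))
    return candidates

reach = [[0.10, 0.99, 0.96],
         [0.50, 0.20, 0.97],
         [1.00, 0.00, 0.30]]
dist = [[0, 40, 10],
        [5, 0, 64],
        [100, 0, 0]]
pairs = candidate_pairs(reach, dist)
```

In this example the pair (0, 2) is rejected despite its reaching probability of 0.96, because only 10 instructions on average separate the two blocks.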

[0034] At block 240, the spawning pair candidates are evaluated based on analysis of one or more selected metrics. These metrics may be prioritized. Based on the evaluation of the candidate spawning pairs in relation to the prioritized metrics, one or more spawning pairs are selected.

[0035] The metrics utilized at block 240 may include the minimum average distance between the basic blocks of the potential spawning pair (described above), as well as an evaluation of mispredicted branches, load misses and/or instruction cache misses. The metrics may also include additional considerations. One such additional consideration is the maximum average distance between the basic blocks of the potential spawning pair. It should be noted that there are also potential performance penalties involved with having the average number of instructions between the spawn point and CQIP be too large. Accordingly, the selection of spawning pairs may also impose a maximum average distance. If the distance between the pair is too large, the speculative thread may incur stalls in a scheme where the speculative thread has limited storage for speculative values. In addition, if the sizes of speculative threads are sufficiently dissimilar, speculative threads may incur stalls in a scheme where the speculative thread cannot commit its states until it becomes the non-speculative thread (see discussion of “join point” in connection with FIGS. 6 and 7, below). Such stalls are likely to result in ineffective holding of critical resources that otherwise would be used by non-speculative threads to make forward progress.

[0036] Another additional consideration is the number of dependent instructions that the speculative thread includes in relation to the application code between the spawning point and the CQIP. Preferably, the average number of speculative thread instructions dependent on values generated by a previous thread (also referred to as “live-ins”) should be relatively low. A smaller number of dependent instructions allows for more timely computation of the live-in values for the speculative thread.

[0037] In addition, for selected embodiments it is preferable that a relatively high number of the live-in values for the speculative thread are value-predictable. For those embodiments that use value prediction to provide for calculation 50 of live-in values (discussed further below), value-predictability of the live-in values facilitates faster communication of live-in values, thus minimizing overhead of spawning while also allowing correctness and accuracy of speculative thread computation.

[0038] It is possible that the candidate spawning pairs identified at block 230 may include several good candidates for CQIP's associated with a given spawn point. That is, for a given row of the reaching probability matrix 335, more than one element may be selected as a candidate spawning pair. In such case, during the metrics evaluation at block 240, the best CQIP for the spawn point is selected because, for a given spawn point, a speculative thread will be spawned at only one CQIP. In order to choose the best CQIP for a given spawn point, the potential CQIP's identified at block 230 are prioritized according to the expected benefit.
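The selection of a single CQIP per spawn point may be sketched as shown below. The expected-benefit scores are supplied as plain numbers because no particular benefit formula is specified here; the tuple layout and the function name are likewise assumptions for illustration.

```python
def best_cqip_per_spawn_point(candidates):
    """Keep, for each spawn point, the CQIP with the highest expected benefit.

    `candidates` is a hypothetical iterable of
    (spawn_point, cqip, expected_benefit) tuples produced by an earlier
    candidate-identification step.
    """
    best = {}
    for spawn_point, cqip, benefit in candidates:
        current = best.get(spawn_point)
        if current is None or benefit > current[1]:
            best[spawn_point] = (cqip, benefit)
    return {spawn_point: cqip for spawn_point, (cqip, _) in best.items()}


example_candidates = [
    ("S1", "C1", 10.0),
    ("S1", "C2", 25.0),  # same row of the matrix, higher expected benefit
    ("S2", "C3", 5.0),
]
print(best_cqip_per_spawn_point(example_candidates))
```

On ties the first candidate encountered is retained; a real implementation might break ties using another of the prioritized metrics.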

[0039] In at least one alternative embodiment, if there are sufficient hardware thread resources, more than one CQIP can be chosen for a corresponding spawn point. In such case, multiple concurrent, albeit mutually exclusive, speculative threads may be spawned and executed simultaneously to perform “eager” execution of speculative threads. The spawning condition for these multiple CQIPs can be examined and verified, after the speculative threads have been executed, to determine the effectiveness of the speculation. If one of these multiple speculative threads proves to be good speculation, and another bad, then the results of the former can be reused by the main thread while the results of the latter may be discarded.

[0040] In addition to those spawning pairs selected according to the metrics evaluation, at least one embodiment of the method 100 selects 240 CALL return point pairs (pairs of subroutine calls and the return points) if they satisfy the minimum size constraint. These pairs might not otherwise be selected at block 240 because the reaching probability for such pairs is sometimes too low to satisfy the selection criteria discussed above in connection with candidate identification 230. In particular, if a subroutine is called from multiple locations, it will have multiple predecessors and multiple successors in the CFG 330. If all the calls are executed a similar number of times, the reaching probability of any return point pair will be low since the graph 330 will have multiple paths with similar weights.
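A short numerical sketch may clarify why such pairs tend to fall below the threshold. It assumes, for illustration only, that the CFG merges all calling contexts and that the subroutine always runs to its exit, so that the probability of reaching a particular return point equals that call site's share of all calls. The site names and counts are hypothetical.

```python
def return_point_reaching_probability(call_counts, site):
    """Approximate probability of reaching `site`'s return point.

    `call_counts` maps each call site to the number of times it invoked the
    subroutine in the profile. With calling contexts merged in the CFG, the
    exit of the subroutine fans out to every return point in proportion to
    these counts (an assumption of this sketch).
    """
    return call_counts[site] / sum(call_counts.values())


example_counts = {"site_a": 100, "site_b": 100, "site_c": 100, "site_d": 100}
print(return_point_reaching_probability(example_counts, "site_a"))
```

With four equally frequent call sites the estimate is 0.25 for each pair, far below a 0.95 threshold, even though each individual call does return to its own return point.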

[0041] At block 50, the method 100 provides for calculation of live-in values for the speculative thread to be executed at the CQIP. By “provides for” it is meant that instructions are generated, wherein execution of the generated instructions, possibly in conjunction with some special hardware support, will result in calculation of a predicted live-in value to be used as an input by the spawnee thread. Of course, block 50 might determine that no live-in values are necessary. In such case, “providing for” calculation of live-in values simply entails determining that no live-in values are necessary.

[0042] Predicting thread input values allows the processor to execute speculative threads as if they were independent. At least one embodiment of block 50 generates instructions to perform or trigger value prediction. Any known manner of value prediction, including hardware value prediction, may be implemented. For example, instructions may be generated 50 such that the register values of the spawned thread are predicted to be the same as those of the spawning thread at spawn time.
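The example prediction policy mentioned above, in which the spawned thread's register values are predicted to equal those of the spawning thread at spawn time, may be sketched as follows. Register files are modeled as dictionaries, and the helper that lists mispredictions is an assumption added for illustration.

```python
def predict_live_ins(spawner_registers, live_in_registers):
    """Predict each live-in as the spawner's register value at spawn time."""
    return {name: spawner_registers[name] for name in live_in_registers}


def mispredicted(predicted, actual_at_cqip):
    """Hypothetical helper: names whose predicted value turned out wrong."""
    return sorted(name for name, value in predicted.items()
                  if actual_at_cqip[name] != value)


registers_at_spawn = {"r1": 10, "r2": 0, "r3": 7}
predicted = predict_live_ins(registers_at_spawn, ["r1", "r2"])

# Values the spawner actually holds when it later reaches the CQIP.
registers_at_cqip = {"r1": 10, "r2": 4, "r3": 7}
print(predicted, mispredicted(predicted, registers_at_cqip))
```

Here r1 is unchanged between the spawn point and the CQIP and is predicted correctly, while r2 is modified in between and would be reported as a misprediction.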

[0043] Another embodiment of the method 100 identifies, at block 50, a slice of instructions from the application's code that may be used for speculative precomputation of one or more live-in values. While value prediction is a promising approach, it often requires rather complex hardware support. In contrast, no additional hardware support is necessary for speculative precomputation. Speculative precomputation can be performed at the beginning of the speculative thread execution in an otherwise idle thread context, providing the advantage of minimizing misspeculations of live-in values without requiring additional value prediction hardware support. Speculative precomputation is discussed in further detail below in connection with FIG. 5.

[0044] FIG. 5 illustrates an embodiment of the method 100 wherein block 50 is further specified to identify 502 precomputation instructions to be used for speculative precomputation of one or more live-in values. For at least one embodiment, a set of instructions, called a slice, is computed at block 502 to include only those instructions identified from the original application code that are necessary to compute the live-in value. The slice therefore is a subset of instructions from the original application code. The slice is computed by following the dependence edges backward from the instruction including the live-in value until all instructions necessary for calculation of the live-in value have been identified. A copy of the identified slice instructions is generated for insertion 60 into an enhanced binary file 350 (FIG. 3).
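A minimal sketch of the backward walk over dependence edges is given below. It assumes straight-line code between the spawn point and the CQIP, considers register dependences only, and models each instruction as a (destination, sources) pair; control flow and memory dependences, which a practical slicer would need to handle, are omitted.

```python
def backward_slice(instructions, live_in):
    """Return indices of the instructions needed to compute `live_in`.

    `instructions` is a hypothetical list of (destination, sources) pairs in
    program order. The list is scanned from last to first, following
    dependence edges backward from the value of interest.
    """
    needed = {live_in}
    slice_indices = []
    for index in range(len(instructions) - 1, -1, -1):
        destination, sources = instructions[index]
        if destination in needed:
            # This is the most recent definition; its inputs are now needed.
            needed.discard(destination)
            needed.update(sources)
            slice_indices.append(index)
    slice_indices.reverse()
    return slice_indices


example_code = [
    ("a", ["r1"]),      # 0: a = load r1
    ("b", ["r2"]),      # 1: b = load r2   (not needed for x)
    ("c", ["a"]),       # 2: c = a + 4
    ("d", ["b"]),       # 3: d = b * 2     (not needed for x)
    ("x", ["c", "a"]),  # 4: x = c + a     (x is the live-in)
]
print(backward_slice(example_code, "x"))
```

The resulting slice contains instructions 0, 2 and 4 and omits the two instructions that do not contribute to the live-in value, which is what makes the slice a subset of the original code.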

[0045] FIGS. 3 and 5 illustrate that the methods 100, 500 for generating instructions for CQIP-guided multithreading generate an enhanced binary file 350 at block 60. The enhanced binary file 350 includes the binary code 340 for the original single-threaded application, as well as additional instructions. A trigger instruction to cause the speculative thread to be spawned is inserted into the enhanced binary file 350 at the spawning point(s) selected at block 240. The trigger instruction can be a conventional instruction in the existing instruction set of a processor, denoted with special marks. Alternatively, the trigger instruction can be a special instruction such as a fork or spawn instruction. Trigger instructions can be executed by any thread.

[0046] In addition, the instructions to be performed by the speculative thread are included in the enhanced binary file 350. These instructions may include instructions added to the original code binary file 340 for live-in calculation, and also some instructions already in the original code binary file 340, beginning at the CQIP, that the speculative thread is to execute. That is, regarding the speculative-thread instructions in the enhanced binary file 350, two groups of instructions may be distinguished for each spawning pair, if the speculative thread is to perform speculative precomputation for live-in values. In contrast, for a speculative thread that is to utilize value prediction for its live-in values, only the latter group of instructions described immediately below appears in the enhanced binary file 350.

[0047] The first group of instructions are generated at block 50 (or 502, see FIG. 5) and are incorporated 60 into the enhanced binary code file 350 in order to provide for the speculative thread's calculation of live-in values. For at least one embodiment, the instructions to be performed by the speculative thread to pre-compute live-in values are appended at the end of the file 350, after those instructions associated with the original code binary file 340.

[0048] Such instructions do not appear for speculative threads that use value prediction. Instead, specialized value prediction hardware may be used for value prediction. The value prediction hardware is fired by the spawn instruction. When the processor executes a spawn instruction, the hardware initializes the speculative thread registers with the predicted live-in value.

[0049] Regardless of whether the speculative thread utilizes value prediction (no additional instructions in the enhanced binary file 350) or speculative precomputation (slice instructions in the enhanced binary file 350), the speculative thread is associated with the second group of instructions alluded to above. The second set of instructions are instructions that already exist in the original code binary file 340. The subset of such instructions that are associated with the speculative thread are those instructions in the original code binary file 340 starting at the CQIP. For speculative threads that utilize speculative pre-computation for live-ins, the precomputation slice (which may be appended at the end of the enhanced binary file) terminates with a branch to the corresponding CQIP, which causes the speculative thread to begin executing the application code instructions at the CQIP. For speculative threads that utilize value prediction for live-in values, the spawnee thread begins execution of the application code instructions beginning at the CQIP.

[0050] In an alternative embodiment, the enhanced binary file 350 includes, for the speculative thread, a copy of the relevant subset of instructions from the original application, rather than providing for the speculative thread to branch to the CQIP instruction of the original code. However, the inventors have found the non-copy approach discussed in the immediate preceding paragraph, which is implemented with appropriate branch instructions, efficiently allows for reduced code size.

[0051] Accordingly, the foregoing discussion illustrates that, for at least one embodiment, method 100 is performed by a compiler 808 (FIG. 8). In such embodiment, the method 100 represents an automated process in which a compiler identifies a spawn point and an associated control-quasi-independent point (“CQIP”) target for a speculative thread, generates the instructions to pre-compute its live-ins, and embeds a trigger at the spawn point in the binary. The pre-computation instructions for the speculative thread are incorporated (such as, for example, by appending) into an enhanced binary file 350. One skilled in the art will recognize that, in alternative embodiments, the method 100 may be performed manually such that one or more of 1) identifying CQIP spawning pairs 10, 2) providing for calculation of live-in values 50, and 3) modification of the main thread binary 60 may be performed interactively with human intervention.

[0052] In sum, a method for identifying spawning pairs and adapting a binary file to perform control-quasi-independent points guided speculative multithreading has been described. An embodiment of the method is performed by a compiler, which identifies proper spawn points and CQIP, provides for calculation of live-in values in speculative threads, and generates an enhanced binary file.

[0053] FIGS. 6 and 7 illustrate at least one embodiment of a method 600 for performing speculative multithreading using a combination of control-quasi-independent-points guided speculative multithreading and speculative precomputation of live-in values. For at least one embodiment, the method 600 is performed by a processor (e.g., 804 of FIG. 8) executing the instructions in an enhanced binary code file (e.g., 350 of FIG. 3). For the method 600 illustrated in FIGS. 6 and 7, it is assumed that the enhanced binary code file has been generated according to the method illustrated in FIG. 5, such that instructions to perform speculative precomputation of live-in values have been identified 502 and inserted into the enhanced binary file.

[0054] FIGS. 6 and 7 illustrate that, during execution of the enhanced binary code file, multiple threads T0, T1, . . . Tx may be executing simultaneously. The flow of control associated with each of these multiple threads is indicated by the notations T0, T1, and Tx on the edges between the blocks illustrated in FIGS. 6 and 7. One skilled in the art will recognize that the multiple threads may be spawned from a non-speculative thread. Also, in at least one embodiment, a speculative thread may spawn one or more additional speculative successor threads.

[0055] FIG. 6 illustrates that processing begins at 601, where the thread T0 begins execution. At block 602, a check is made to determine whether the thread T0 previously encountered a join point while it (T0) was still speculative. Block 602 is discussed in further detail below. One skilled in the art will understand that block 602 will, of course, evaluate to “false” if the thread T0 was never previously speculative.

[0056] If block 602 evaluates to “false”, then an instruction for the thread T0 is executed at block 604. If a trigger instruction associated with a spawn point is encountered 606, then processing continues to block 608. Otherwise, the thread T0 continues execution at block 607. At block 607, it is determined whether a join point has been encountered in the thread T0. If neither a trigger instruction nor join point is encountered, then the thread T0 continues to execute instructions 604 until it reaches 603 the end of its instructions.

[0057] If a trigger instruction is detected at block 606, then a speculative thread T1 is spawned in a free thread context at block 608. If slice instructions are encountered by the speculative thread T1 at block 610, the processing continues at block 612. If not, then processing continues at 702 (FIG. 7).

[0058] At block 612, slice instructions for speculative precomputation are iteratively executed until the speculative precomputation of the live-in value is complete 614. In the meantime, after spawning the speculative thread T1 at block 608, the spawner thread T0 continues to execute 604 its instructions. FIG. 6 illustrates that, while the speculative thread T1 executes 612 the slice instructions, the spawner thread continues execution 604 of its instructions until another spawn point is encountered 606, a join point is encountered 607, or the instruction stream ends 603. Accordingly, the spawner thread T0 and the spawnee thread T1 execute in parallel during speculative precomputation.
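The control flow followed by the spawner thread through blocks 604, 606, 607 and 603 may be sketched as a simple loop. The event names, the modeling of the instruction stream as (kind, payload) pairs, and the use of a set to track the CQIPs of active speculative threads are assumptions of this sketch; the spawnee's own execution is not modeled.

```python
def run_spawner(instruction_stream, active_cqips=None):
    """Walk one thread's instructions and record what the thread does.

    Each item of `instruction_stream` is a hypothetical (kind, payload) pair:
    ("trigger", cqip) marks a spawn point, ("cqip", name) marks a potential
    join point, and any other kind is an ordinary instruction.
    """
    active_cqips = set() if active_cqips is None else set(active_cqips)
    events = []
    for kind, payload in instruction_stream:
        if kind == "trigger":
            # Spawn point encountered: spawn a speculative thread.
            events.append(("spawn", payload))
            active_cqips.add(payload)
        elif kind == "cqip" and payload in active_cqips:
            # Join point: an active speculative thread started here.
            events.append(("join", payload))
            return events
        else:
            events.append(("execute", payload))
    events.append(("end", None))  # end of the instruction stream
    return events


example_stream = [
    ("op", "i0"),
    ("trigger", "L9"),
    ("op", "i1"),
    ("cqip", "L9"),
    ("op", "i2"),
]
print(run_spawner(example_stream))
```

In the example the spawner executes i0, spawns a thread targeting L9, continues with i1, and stops at the join point; i2 is left to the speculative thread that began at L9.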

[0059] When live-in computation is determined complete 614, or if no slice instructions for speculative precomputation are available to the speculative thread T1 610, then processing continues at A in FIG. 7.

[0060] FIG. 7 illustrates that, at block 702, the speculative thread T1 executes instructions from the original code. At the first iteration of block 702, the CQIP instruction is executed. The execution 702 of spawnee thread instructions is performed in parallel with the execution of the spawner thread code until a terminating condition is reached.

[0061] At block 708, the speculative thread T1 checks for a terminating condition. The check 708 evaluates to “true” when the spawnee thread T1 has encountered a CQIP of an active, more speculative thread or has encountered the end of the program. As long as neither condition is true, the spawnee thread T1 proceeds to block 710.

[0062] If the speculative thread T1 determines 708 that a join point has been reached, then it is theoretically ready to perform processing to switch thread contexts with the more speculative thread (as discussed below in connection with block 720). However, at least one embodiment of the method 600 limits such processing to non-speculative threads. Accordingly, when speculative thread T1 determines 708 that it has reached the join point of a more speculative, active thread, T1 waits 706 to continue processing until it (T1) becomes non-speculative.

[0063] At block 710, the speculative thread T1 determines whether a spawning point has been reached. If the 710 condition evaluates to “false”, then T1 continues execution 702 of its instructions.

[0064] If a spawn point is encountered at block 710, then thread T1 creates 712 a new speculative thread Tx. Thread T1 then continues execution 702 of its instructions, while new speculative thread Tx proceeds to continue speculative thread operation at block 610, as described above in connection with speculative thread T1. One skilled in the art will recognize that, while multiple speculative threads are active, each thread follows the logic described above in connection with T1 (blocks 610 through 614 and blocks 702 through 710 of FIGS. 6 and 7).

[0065] When the spawner thread T0 reaches a CQIP of an active, more speculative thread, then we say that a join point has been encountered. The join point of a thread is the control quasi-independent point at which an on-going speculative thread began execution. It should be understood that multiple speculative threads may be active at one time. Hence the terminology “more speculative.” A “more speculative” thread is a thread that is a spawnee of the reference thread (in this case, thread T0) and includes any subsequently-spawned speculative thread in the spawnee's spawning chain.
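The notion of a "more speculative" thread may be illustrated by walking the spawning chain. The sketch below assumes a hypothetical mapping from each thread to the threads it has spawned, listed in spawn order.

```python
def more_speculative_threads(spawned_by, reference):
    """Return every thread in the spawning chain below `reference`.

    `spawned_by` maps a thread name to the list of threads it spawned.
    The result lists each spawnee followed by that spawnee's own spawnees.
    """
    result = []
    stack = list(reversed(spawned_by.get(reference, [])))
    while stack:
        thread = stack.pop()
        result.append(thread)
        stack.extend(reversed(spawned_by.get(thread, [])))
    return result


example_chain = {"T0": ["T1"], "T1": ["T2"], "T2": []}
print(more_speculative_threads(example_chain, "T0"))
print(more_speculative_threads(example_chain, "T1"))
```

Relative to T0, both T1 and T2 are more speculative; relative to T1, only T2 is.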

[0066] Thus, the join point check 607 (FIG. 6) evaluates to true when the thread T0 reaches the CQIP at which any on-going speculative thread began execution. One skilled in the art will recognize that, if multiple speculative threads are simultaneously active, then any one of the multiple CQIP's for the active speculative threads could be reached at block 607. For simplicity of illustration, FIG. 7 assumes that when T0 hits a join point at block 607, the join point is associated with T1, the next thread in program order, which is the speculative thread whose CQIP has been reached by the non-speculative thread T0.

[0067] Upon reaching the join point at block 607 (FIG. 6), a thread T0 proceeds to block 703. The thread T0 determines 703 if it is the non-speculative active thread and, if not, waits until it becomes the non-speculative thread.

[0068] When T0 becomes non-speculative, it initiates 704 a verification of the speculation performed by the spawnee thread T1. For at least one embodiment, verification 704 includes determining whether the speculative live-in values utilized by the spawnee thread T1 reflect the actual values computed by the spawner thread.

[0069] If the verification 704 fails, then T1 and any other thread more speculative than T1 are squashed 730. Thread T0 then proceeds to C (FIG. 6) to continue execution of its instructions. Otherwise, if the verification 704 succeeds, then thread T0 and thread T1 proceed to block 720. At block 720, the thread context where the thread T0 has been executing becomes free and is relinquished. Also, the speculative thread T1 that started at the CQIP becomes the non-speculative thread and continues execution at C (FIG. 6).
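The verification 704 and its two outcomes may be sketched as follows. Threads are identified by name, live-in values are held in dictionaries, and the active threads are listed from least to most speculative; all of these representations, and the function names, are assumptions made for illustration.

```python
def verify_live_ins(speculative_live_ins, actual_values):
    """True if every live-in used by the spawnee matches the spawner's value."""
    return all(actual_values[name] == value
               for name, value in speculative_live_ins.items())


def process_join(spawner, program_order, speculative_live_ins, actual_values):
    """Decide the outcome when the non-speculative `spawner` hits a join point.

    `program_order` is a hypothetical list of active threads, least
    speculative first; the spawnee is assumed to be the next thread after
    the spawner in that list.
    """
    spawnee_index = program_order.index(spawner) + 1
    spawnee = program_order[spawnee_index]
    if verify_live_ins(speculative_live_ins, actual_values):
        # Successful speculation: the spawner's context is relinquished and
        # the spawnee becomes the non-speculative thread.
        return {"non_speculative": spawnee,
                "freed_context": spawner,
                "squashed": []}
    # Failed speculation: squash the spawnee and all more speculative threads.
    return {"non_speculative": spawner,
            "freed_context": None,
            "squashed": program_order[spawnee_index:]}


active = ["T0", "T1", "T2"]
good = process_join("T0", active, {"r1": 5}, {"r1": 5, "r2": 9})
bad = process_join("T0", active, {"r1": 5}, {"r1": 6, "r2": 9})
print(good)
print(bad)
```

In the failing case both T1 and T2 are squashed, since any work done by T2 may depend on values produced by the mis-speculated T1.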

[0070] Reference to FIG. 6 illustrates that the newly non-speculative thread T0 checks at block 602 to determine whether it encountered a CQIP at block 708 (FIG. 7) while it was still speculative. If so, then the thread T0 proceeds to B in order to begin join point processing as described above.

[0071] The combination of both CQIP-based spawning point selection and speculative computation of live-in values illustrated in FIGS. 5, 6 and 7 provides a multithreading method that helps improve the efficacy and accuracy of speculative multithreading. Such improvements are achieved because data dependencies among speculative threads are minimized since the values of live-ins are computed before execution of the speculative thread.

[0072] In the preceding description, various aspects of a method and apparatus for facilitating control-quasi-independent-points guided speculative multithreading have been described. For purposes of explanation, specific numbers, examples, systems and configurations were set forth in order to provide a more thorough understanding. However, it is apparent to one skilled in the art that the described method may be practiced without the specific details. In other instances, well-known features were omitted or simplified in order not to obscure the method.

[0073] Embodiments of the method may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs executing on programmable systems comprising at least one processor, a data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Program code may be applied to input data to perform the functions described herein and generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.

[0074] The programs may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The programs may also be implemented in assembly or machine language, if desired. In fact, the method described herein is not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.

[0075] The programs may be stored on a storage media or device (e.g., hard disk drive, floppy disk drive, read only memory (ROM), CD-ROM device, flash memory device, digital versatile disk (DVD), or other storage device) readable by a general or special purpose programmable processing system. The instructions, accessible to a processor in a processing system, provide for configuring and operating the processing system when the storage media or device is read by the processing system to perform the procedures described herein. Embodiments of the invention may also be considered to be implemented as a machine-readable storage medium, configured for use with a processing system, where the storage medium so configured causes the processing system to operate in a specific and predefined manner to perform the functions described herein.

[0076] An example of one such type of processing system is shown in FIG. 8. System 800 may be used, for example, to execute the processing for a method of performing control-quasi-independent-points guided speculative multithreading, such as the embodiments described herein. System 800 may also execute enhanced binary files generated in accordance with at least one embodiment of the methods described herein. System 800 is representative of processing systems based on the Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, and Itanium® and Itanium® II microprocessors available from Intel Corporation, although other systems (including personal computers (PCs) having other microprocessors, engineering workstations, set-top boxes and the like) may also be used. In one embodiment, sample system 800 may be executing a version of the Windows™ operating system available from Microsoft Corporation, although other operating systems and graphical user interfaces, for example, may also be used.

[0077] Referring to FIG. 8, processing system 800 includes a memory system 802 and a processor 804. Memory system 802 may store instructions 810 and data 812 for controlling the operation of the processor 804. For example, instructions 810 may include a compiler program 808 that, when executed, causes the processor 804 to compile a program 415 (FIG. 4) that resides in the memory system 802. Memory 802 holds the program to be compiled, intermediate forms of the program, and a resulting compiled program. For at least one embodiment, the compiler program 808 includes instructions to select spawning pairs and generate instructions to implement CQIP-guided multithreading. For such embodiment, instructions 810 may also include an enhanced binary file 350 (FIG. 3) generated in accordance with at least one embodiment of the present invention.

[0078] Memory system 802 is intended as a generalized representation of memory and may include a variety of forms of memory, such as a hard drive, CD-ROM, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM) and related circuitry. Memory system 802 may store instructions 810 and/or data 812 represented by data signals that may be executed by processor 804. The instructions 810 and/or data 812 may include code for performing any or all of the techniques discussed herein. At least one embodiment of CQIP-guided speculative multithreading is related to the use of the compiler 808 in system 800 to select spawning pairs and generate instructions as discussed above.

[0079] Specifically, FIG. 8 illustrates that compiler 808 may include a profile analyzer module 820 that, when executed by the processor 804, analyzes a profile to generate a control flow graph as described above in connection with FIG. 3. The compiler 808 may also include a matrix builder module 824 that, when executed by the processor 804, computes 220 reaching probabilities and generates a reaching probabilities matrix 335 as discussed above. The compiler 808 may also include a spawning pair selector module 826 that, when executed by the processor 804, identifies 230 candidate basic blocks and selects 240 one or more spawning pairs. Also, the compiler 808 may include a slicer module 822 that identifies 502 (FIG. 5) instructions for a slice to be executed by a speculative thread in order to perform speculative precomputation of live-in values. The compiler 808 may further include a code generator module 828 that, when executed by the processor 804, generates 60 an enhanced binary file 350 (FIG. 3).

[0080] While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that changes and modifications can be made without departing from the present invention in its broader aspects. The appended claims are to encompass within their scope all such changes and modifications that fall within the true scope of the present invention.