Multithreaded Architectures and Systems

--------Term Paper for CS522 (Computer Organization and Architecture)

In my view, multithreaded architecture is a promising and practical direction for future microprocessor design, so I studied three research papers on this topic in order to familiarize myself with the technology and the design considerations involved. Studying this subject also illuminated and clarified much of the material covered in class.

  1. Introduction

    As the rapid progress of VLSI (Very Large Scale Integrated circuit) technology allows processor designers to incorporate more functional units on a single chip, future microprocessors will soon be able to issue more than a dozen instructions per machine cycle. Existing superscalar and VLIW (Very Long Instruction Word) microprocessors use multiple functional units to exploit instruction-level parallelism from a single thread of control flow. Such microprocessors rely on the compiler or the hardware to extract instruction-level parallelism from programs and to schedule independent instructions on multiple functional units in each machine cycle. As the issue rate of future microprocessors increases, however, the compiler or the hardware will have to extract more instruction-level parallelism by analyzing a larger instruction buffer (the reservation station entries, or RSE). Unfortunately, it is very difficult to extract enough parallelism from a single thread of control, even for a small number of functional units.

    A single-threaded sequencing mechanism has several major limitations in exploiting program parallelism. To exploit instruction-level parallelism, independent instructions from different basic blocks in a single instruction stream need to be examined and issued together. As the issue rate increases, a larger instruction buffer (RSE) is needed to hold several basic blocks, which may be control dependent on different branch conditions but must be examined together. Instruction-level speculative execution with branch prediction is also needed to move independent instructions across basic block boundaries. However, a large RSE requires deeper levels of branch prediction and speculation, which quickly degrades prediction accuracy. Also, in VLIW architectures, the code size can expand exponentially, since single-threaded code needs to include all possible combinations of branch outcomes. This problem is especially serious when a compiler attempts to pipeline a loop with many conditional branches. In superscalar architectures, the processor must perform run-time dependence checking for both register and memory accesses. The hardware overhead for such dependence checking is very high and can grow quadratically with the size of the RSE. In VLIW architectures, the suboperations of each long word are executed in lock-step fashion to enforce the dependencies between instructions. Any asynchronous event caused by one of the suboperations, such as a cache miss, will stall all of the other suboperations in the same long-word instruction, even if they are independent. These unnecessary stalls occur more frequently as the issue rate of the VLIW processor increases.
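
    As a small illustration (hypothetical code, not taken from the source papers), consider the loop below. The statement t += b[i] does not depend on the branch above it, but it sits in a different basic block, so a single-threaded processor can only issue it early by predicting the branch and speculating across the basic block boundary:

    /* Hypothetical example: t += b[i] is independent of the branch outcome,
     * yet issuing it together with the if-block requires moving it across a
     * basic block boundary via branch prediction and speculation. */
    int sum_example(int a[], int b[], int n) {
        int s = 0, t = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] > 0)      /* branch ends the first basic block */
                s += a[i];     /* control dependent on the branch */
            t += b[i];         /* independent of the branch outcome */
        }
        return s + t;
    }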

    With these limitations in the single-threaded execution model, it is important to consider other possible models for future microprocessors, which will necessarily have more functional units and higher issue rates. The emerging multithreaded architecture is one such solution: it integrates compilation techniques and run-time hardware support to exploit both thread-level and instruction-level parallelism in programs. This architecture uses a thread pipelining execution model to enhance the overlap between threads. It also supports compiler-directed thread-level control speculation and run-time data dependence checking. These features allow the compiler to exploit more of the potential thread-level parallelism in general-purpose applications.

  2. Features of Multithreaded Architecture

The concurrent multithreaded architecture exploits thread-level parallelism with multiple threads of control. We now describe one design of a multithreaded processor that reflects most of the features and advantages of this emerging architecture. In its general form, a multithreaded processor comprises a number of thread processing elements connected to each other by a unidirectional ring, as shown below:

[Figure: Overall organization of a multithreaded processor. Multiple thread processing elements are connected by a unidirectional ring. Figure not reproduced.]

The multiple thread processing elements each have a private level-one instruction cache, but share the level-one data cache and the level-two cache. There is also a shared register file that maintains global registers and a lock register. At run time, the thread processing elements, each with its own program counter and instruction execution path, can fetch and execute instructions from multiple program locations simultaneously. Each thread processing element also has a private memory buffer to cache speculative stores and to support run-time data dependence checking. The microarchitecture of a thread processing element is shown below:

[Figure: Microarchitecture of a single thread processing element. Figure not reproduced.]

The compiler statically partitions the control flow graph of a program into threads, each corresponding to a portion of the graph. A thread performs a round of computation on a set of data that has no, or only a few, dependencies on other concurrent threads. The compiler determines the granularity of the threads, which is typically one or several iterations of a loop.

The execution of a program starts from its entry thread. This thread can then fork a successor thread on another thread processing element. The successor thread can further fork its own successor thread. This process continues until all thread processing elements are busy.

When multiple threads are executed on the multithreaded processor, the oldest thread in the sequential order is referred to as the head thread. All of the other threads derived from it are called successor threads. After the head thread completes its computation, it will retire and release the thread processing element. Its successor thread then becomes the new head thread. The completion and retirement of the threads must follow the original sequential execution order. In the multithreaded execution model, a thread can fork one of its successor threads with or without control speculation. When forking a successor thread without control speculation, the thread performing the fork operation must ensure that all of the control dependencies of the newly generated successor thread have been satisfied. If a thread forks a successor thread with control speculation, however, it must later verify all of the speculated control dependencies. If any of the speculated control dependencies are false, the thread must issue a command to kill the successor thread and all of its subsequent threads.
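
A minimal C sketch of this fork-and-kill protocol, under assumed data structures (the struct tpe type, its status field, and verify_speculation are invented names for illustration, not from the source papers):

    #include <stdbool.h>

    enum status { FREE, RUNNING, KILLED };

    struct tpe {                   /* one thread processing element */
        enum status status;
        struct tpe *successor;     /* next element on the unidirectional ring */
    };

    /* Called by a thread once it resolves a branch it speculated on when
     * forking its successor. On a misspeculation, the successor thread and
     * all threads derived from it are killed. */
    void verify_speculation(struct tpe *self, bool speculation_was_correct)
    {
        if (speculation_was_correct)
            return;
        for (struct tpe *t = self->successor;
             t != self && t->status == RUNNING;
             t = t->successor)
            t->status = KILLED;
    }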

Normally the multithreaded architecture uses a thread pipelining execution model to enforce data dependencies between concurrent threads. Unlike the instruction pipelining mechanism in a superscalar processor, where instruction sequencing, data dependence checking, and forwarding are performed automatically by the processor hardware, the multithreaded architecture performs thread initiation and data forwarding through explicit thread management and communication instructions. The execution of a thread in the multithreaded model is partitioned into several stages, each of which performs a specific function. The execution stages of a thread, and the relationship between concurrent threads, are shown below:

[Figure: The execution stages of a thread (continuation, target-store-address generation, computation, and write-back) and the overlap between concurrent threads. Figure not reproduced.]

After a thread is initiated by its predecessor, it begins executing its continuation stage. The major function of this stage is to compute the recurrence variables, such as loop index variables, needed to fork the next thread. The values of these variables are forwarded to the next thread processing element before the next thread is activated. In the case of a loop, the index variable, updated by a statement such as i = i + 1 or p = p->next, is computed and forwarded to the next thread processing element. The continuation stage of a thread ends with a fork instruction, which causes the next thread to begin.
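
For instance, a pointer-chasing loop of the form while (p != NULL) { ...; p = p->next; } would have a continuation stage like the following sketch, written in the same notation as the example program in Section 3 below (the code is hypothetical):

    /* Continuation Stage for a pointer-chasing loop (sketch) */
    L1:
    p_1 = p;                  /* save the incoming value of p for local use */
    STORE_TS(&p, p_1->next);  /* compute the recurrence variable and forward
                                 its address and value to the next thread
                                 processing element */
    fork L1;                  /* initiate the successor thread */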

A thread may perform store operations on which later concurrent threads are data dependent. These store operations are referred to as target stores (TS). To reduce hardware complexity, many implementations of the multithreaded model do not allow speculation on data dependencies, so to facilitate run-time data dependence checking, the addresses of these target stores need to be computed as early as possible. The target-store-address-generation (TSAG) stage performs the address computation for these target stores. The addresses are stored in the memory buffer of each thread processing element and are forwarded to the memory buffers of all succeeding concurrent threads. After a thread completes the TSAG stage and all of its target store addresses have been forwarded, it sends a tsag_done flag to the successor thread. This flag informs the next thread that it can start the computation that may depend on its predecessor threads. Before receiving the tsag_done flag, a thread can only perform computation that does not depend on any of the target stores of its active predecessor threads. To increase the overlap between threads, the TSAG stage can be further partitioned into two parts. The first part generates the target store addresses that have no data dependencies on earlier threads; these can be computed quickly and forwarded to the next thread. The second part computes the unsafe target store addresses that may be data dependent on an earlier thread; these computations must wait for the tsag_done flag from the predecessor thread before beginning.
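
A sketch of this two-part split, again in the notation of the example program in Section 3 (the variables a and b[k] are hypothetical; a's address is known locally, while the index k may be produced by a predecessor's target store):

    /* TSAG Stage, part 1: safe addresses with no dependence on earlier threads */
    ALLOCATE_TS(&a);        /* address known locally; forwarded immediately */

    WAIT_TSAG_DONE;         /* wait for the predecessor's tsag_done flag */

    /* TSAG Stage, part 2: unsafe addresses that may depend on earlier threads */
    ALLOCATE_TS(&b[k]);     /* k may come from a predecessor's target store */

    RELEASE_TSAG_DONE;      /* all addresses forwarded; notify the successor */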

The computation stage performs the main computation of a thread. If the address of a load operation matches that of a target store entry in its memory buffer during this stage, the thread will either read the data from the entry if it is available or it will wait until the data is forwarded from an earlier concurrent thread. On the other hand, if the value of a target store is computed during this stage, the thread needs to forward the address and the data to the memory buffers of all its concurrent successor threads. The computation stage of a thread ends with a stop instruction.
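
The dependence check this implies can be modeled in plain C (a software sketch with invented structure and function names; in the real design this lookup is performed by the thread processing element's hardware):

    #include <stdint.h>
    #include <stdbool.h>

    struct ts_entry {               /* one target store entry in the memory buffer */
        uintptr_t     addr;         /* address forwarded during the TSAG stage */
        volatile bool data_valid;   /* set when a predecessor forwards the data */
        uint32_t      data;
    };

    /* Behavior of a load during the computation stage (software model). */
    uint32_t load(struct ts_entry *buf, int n, uintptr_t addr)
    {
        for (int i = 0; i < n; i++) {
            if (buf[i].addr == addr) {
                while (!buf[i].data_valid)
                    ;               /* wait for an earlier thread to forward it */
                return buf[i].data;
            }
        }
        return *(uint32_t *)addr;   /* no match: read directly from memory */
    }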

If the control dependencies are cleared after the computation stage, the thread completes its execution by writing all of the data from the store operations in its memory buffer to memory, including data from both target stores and regular stores. Data from store operations must be kept in the memory buffer until this write-back stage to prevent the memory state from being altered by a speculative thread that is later aborted by an earlier concurrent thread because of an incorrect control speculation. To maintain the correct memory state, concurrent threads must perform their write-back stages in their original order: a thread must wait for a wb_done flag from its predecessor before it can perform its write-back stage, and it must forward a wb_done flag to the next thread after completing its own write-back. Because all of the stores are committed thread by thread, write-after-read and write-after-write hazards cannot occur at run time.
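
A software sketch of this ordered commit (the flag variables and structures are invented for illustration; in hardware the wb_done flag travels along the ring):

    #include <stdbool.h>

    struct buffered_store { unsigned *addr; unsigned value; };

    /* Each thread waits for its predecessor's wb_done flag, commits its
     * buffered stores (both target and regular) to memory in one batch, and
     * then passes the flag to its successor, so commits follow program order. */
    void write_back(volatile bool *pred_wb_done, volatile bool *my_wb_done,
                    struct buffered_store *buf, int n)
    {
        while (!*pred_wb_done)
            ;                              /* preserve the sequential order */
        for (int i = 0; i < n; i++)
            *buf[i].addr = buf[i].value;   /* drain the memory buffer */
        *my_wb_done = true;                /* enable the successor's write-back */
    }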

  3. Example Program Running on the Multithreaded Architecture

    Now let's look at an example of how programs are compiled and executed on a multithreaded processor. The code segment below is one of the most time-consuming loops in a benchmark program:

    while( funct_units[i].class != ILLEGAL_CLASS ) {
        if( f->class == funct_units[i].class ) {
            if( minclk > funct_units[i].busy ) {
                minclk = funct_units[i].busy;
                j = i;
                if( minclk == 0 ) break;
            }
        }
        i++;
    }

    This is a while loop with exit conditions in the loop head as well as in the loop body. There is a potential read-after-write data dependence across loop iterations caused by the variable minclk. This loop is very difficult to parallelize using conventional software pipelining techniques because of its control-flow-intensive loop body and the conditional loop-carried data dependence. However, with architectural support for multiple threads of control, control speculation, and run-time dependence checking, this loop can be easily parallelized and executed on a multithreaded architecture.

    Below is the multithreaded code compiled for this loop. In this code, each thread corresponds to one loop iteration.

    /* Continuation Stage */
    L1:
    i_1 = i;
    STORE_TS(&i, i_1 + 1);      /* stores the data to the memory buffer entry
                                   and forwards the address and the data to the
                                   next thread processing element */
    fork L1;

    /* Target-Store-Address-Generation Stage */
    ALLOCATE_TS(&minclk);       /* allocates a target store entry for minclk in
                                   the memory buffer and forwards the address to
                                   the next thread processing element */
    WAIT_TSAG_DONE;
    RELEASE_TSAG_DONE;

    /* Computation Stage */
    if( funct_units[i_1].class == ILLEGAL_CLASS ) {
        ABORT_FUTURE;
        i = i_1;
        goto L2;
    }
    if( f->class == funct_units[i_1].class ) {
        if( minclk > funct_units[i_1].busy ) {
            STORE_TS(&minclk, funct_units[i_1].busy);
            j = i_1;
            /* if minclk is zero, break to terminate the search */
            if( minclk == 0 ) {
                ABORT_FUTURE;
                i = i_1;
                goto L2;
            }
        }
        else
            RELEASE_TS(&minclk);
    }
    else
        RELEASE_TS(&minclk);
    STOP;

    /* Write-back Stage */
    /* performed automatically after STOP */
    /* End of thread pipelining */
    L2:

    The execution of each loop iteration is transformed into three explicit thread pipelining stages and an implicit write-back stage. In the continuation stage, each thread increments the recurrence variable i and forwards its new value to the next thread processing element. The original value of i is saved in i_1 for later local use. The continuation stage ends with a fork instruction that initiates the successor thread. In each thread there is only one target store, corresponding to the update of the variable minclk. The address of minclk is forwarded to the next thread in the TSAG stage. Since the TSAG stage does not depend on predecessor threads, it can proceed immediately after the continuation stage. The computation stage, however, must wait until the thread receives the tsag_done flag from its predecessor. In the computation stage, a thread first checks whether the first exit condition is true. If it is, the thread aborts its successor threads and then jumps out of the loop. Otherwise, the thread performs the computation of the loop body. The update to the variable minclk is performed with a STORE_TS instruction, which forwards the result to the successor threads. If the control path that executes the STORE_TS is not taken, the thread executes a RELEASE_TS instruction to release the target store entry so that the successor threads will not wait for the corresponding target store data. If both exit conditions are false, the thread executes a STOP instruction, and the write-back is then performed automatically.

  4. Compiler Techniques for the Multithreaded Architecture

    To fully utilize the hardware support for thread-level speculative execution and run-time data dependence checking, the multithreaded architecture relies on the compiler to extract thread-level parallelism and to generate multithreaded code. Given a sequential program, the compiler first partitions the execution of the program into threads for concurrent execution. It then performs thread pipelining to facilitate run-time data dependence checking. Both tasks require powerful program analysis and transformation techniques, many of which were previously developed for traditional parallelizing compilers, and compilers for the multithreaded architecture leverage them heavily. In addition to generating multithreaded code, the compiler can further enhance parallelism between threads and reduce run-time dependence checking overhead by applying program transformations unique to the multithreaded processor, such as converting data speculation to control speculation, distributed heap memory management, using critical sections for order-independent operations, and memory buffering in main memory. The multithreaded architecture is thus heavily dependent on compilation techniques: it needs advanced compilers to achieve its high parallelism.
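
    As one illustration, consider the use of critical sections for order-independent operations. An update such as a sum reduction produces the same result in any commit order, so instead of serializing it through a target store, the compiler can place it in a critical section (a hypothetical sketch; the LOCK and UNLOCK primitives are assumed to be built on the shared lock register described in Section 2, and sum_lock, sum, and local_contribution are invented names):

    /* Computation Stage (fragment): order-independent update of a global sum.
     * The additions commute, so mutual exclusion alone is sufficient; no
     * target-store forwarding between threads is needed. */
    LOCK(&sum_lock);           /* assumed primitive using the lock register */
    sum = sum + local_contribution;
    UNLOCK(&sum_lock);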

  5. Performance Evaluation of Multithreaded Architecture

    To test and evaluate the performance of the multithreaded architecture, we can run the original benchmark programs on single-threaded superscalar processor models and run the transformed multithreaded programs on the multithreaded models. The chart below compares the IPC (Instructions Per Cycle) achieved on the multithreaded and single-threaded processors.

    [Chart: IPC comparison between the multithreaded and single-threaded processor models. Chart not reproduced.]

    We can see that the multithreaded model can further improve the performance of a single-threaded superscalar architectural model for many programs, especially those with high thread-level parallelism (that is, few intensive loop-carried data dependencies). Many application fields can therefore benefit from this architecture.

  6. Applications and Advantages of Multithreaded Architecture

Exploiting more parallelism from programs is the key to improving the performance of future microprocessors. While the instruction-level parallelism available in a basic block or a small set of basic blocks is very limited, there is far more loop-level parallelism available in most programs. Concurrent multithreaded architectures, which can exploit loop-level parallelism efficiently, have great potential for use in future microprocessor designs.

In addition, another important application of multithreaded architecture is event handling: hardware support for multithreading in a microprocessor can alleviate both the latency and the hardware complexity of handling events. When the hardware detects an event in a traditional single-threaded processor, an expensive sequence of actions must take place before any of the event handling code is executed: the pipeline must be halted and drained, the stalled thread's registers and protection domain must be saved to memory, and the event handler's state and protection domain must be restored. Once the event handler has completed, this process must be reversed. With the appropriate mechanism, a multithreaded processor can avoid most of this overhead by forking the event handler into a separate hardware context. Multithreading also provides the opportunity to execute applications and event handlers simultaneously. Using multithreading for event handling has two advantages. First, it reduces the overhead of invoking and returning from an exception, since the faulting thread is not removed from the pipeline. Second, event handling can proceed in parallel with the application, enabling faster event execution for both internal and external events. Hardware support for concurrent event handling allows the traditional requirement for sequentially precise interrupts to be relaxed, improving performance without sacrificing correctness. Interleaving the execution of event handlers and an application helps improve program throughput and reduce overall execution time; applications that exhibit a high event rate show substantial performance improvements with concurrent event handling.
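
In the fork notation used earlier, the multithreaded alternative can be sketched as follows (event_handler is a hypothetical label; the faulting thread stays in the pipeline while the handler runs in a separate hardware context):

    /* On detecting an event (sketch): instead of halting and draining the
     * pipeline, the processor forks the handler into a free thread
     * processing element, and the application continues to execute. */
    fork event_handler;    /* handler runs in parallel with the application */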

In conclusion, the common single-threaded execution model limits processors to exploiting only the relatively small amount of instruction-level parallelism available in application programs. At the same time, this execution model unnecessarily hinders the performance of exception handling through pipeline stalls and context switching. The concurrent multithreaded processor, on the other hand, can exploit multiple granularities of parallelism available in application programs and greatly improve the performance of event handling. Multithreaded architecture is therefore a trend for future microprocessor design, and it has been attracting increasing interest and emphasis from the microprocessor-design industry.

 

 

Nomenclature:

IPC - Instructions Per Cycle
RSE - Reservation Station Entries (the processor's instruction buffer)
TS - Target Store
TSAG - Target-Store-Address Generation
VLIW - Very Long Instruction Word
VLSI - Very Large Scale Integrated circuit