Multithreaded Algorithms
Analysis and Design of Algorithms
Jesus A. Gonzalez
July 17, 2019
Multithreaded Algorithms
- Chapter 27 of Introduction to Algorithms (3rd Edition), Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
Multithreaded Algorithms
- Most algorithms in the book are serial algorithms
- Run on a uniprocessor computer
- Execute one instruction at a time
- In this chapter, we will see parallel algorithms
- Run on a multiprocessor computer
- Allows multiple instructions to execute concurrently
- Explore an elegant model: dynamic multithreaded algorithms
Multithreaded Algorithms
- Parallel computers
- Computers with multiple processing units
- Very common, spanning a wide range of prices and performance levels
- Relatively inexpensive:
- Desktop and laptop chip multiprocessors
- Single multi-core integrated circuit chip
- Houses multiple processing cores
- Each core is a full processor that can access a common memory
- Intermediate price/performance:
- Clusters built from individual computers
- Can be simple PC-class machines with a dedicated network interconnecting them
- Highest priced machines:
- Supercomputers
- Combination of custom architectures and custom networks
- Designed to deliver the highest performance, measured in instructions executed per second or floating-point operations per second (FLOPS)
Multithreaded Algorithms
- Multiprocessor computers have been around for a while
- There is not a single model for parallel computing with wide acceptance
- Vendors have not agreed on a single parallel architectural model
- Some feature shared memory
- Others employ distributed memory
- With multi-core technology, new laptops and desktops are shared-memory parallel computers
- Trend appears to be toward shared-memory multiprocessing
Multithreaded Algorithms - Static threading
- Common means of programming chip multiprocessors (and other shared memory parallel computers)
- Static threading
- Provides a software abstraction of virtual processors or threads sharing common memory
- Each thread maintains an associated program counter
- Can execute code independently of the other threads
- The OS loads a thread onto a processor for execution
- Switches it out when another thread needs to run
Multithreaded Algorithms - Static threading
- Static threading
- The OS allows the user to create and destroy threads
- But these operations are very slow
- Thus, in most apps, threads persist for the duration of a computation
- That’s why they are called static
- Programming when using static threads
- It’s difficult and error-prone
- The programmer must use complex communication protocols
- To implement a scheduler that load-balances the work
- Concurrency platforms were created to deal with this complexity
- Provide a software layer to coordinate, schedule, and manage the parallel-computing resources
Multithreaded Algorithms - Dynamic multithreading
- Dynamic multithreading programming
- Important concurrency platform
- The model used in this lecture
- Allows programmers to specify parallelism in applications without worrying about
- Communication protocols
- Load balancing
- Other…
Multithreaded Algorithms - Dynamic multithreading
- Concurrency platform contains a scheduler
- Load-balances the computation (automatically)
- Simplifies the work of the programmer
- Functionality of dynamic-multithreading provides 2 features
- Nested parallelism
- Allows a subroutine to be spawned, so that the caller can proceed while the spawned subroutine computes its result
- Parallel loops
- It is like an ordinary for loop but the iterations of the loop can execute concurrently
- Key aspect of this model
- Programmer specifies only the logical parallelism within a computation
- Threads in the platform schedule and load-balance the computation among themselves
Multithreaded Algorithms - Dynamic multithreading
- Advantages of the model
- It’s a simple extension of the serial programming model
- We just add three keywords to our pseudocode: parallel, spawn, and sync
- If we remove the keywords, we get the serial pseudocode for the same problem
- The serialization of the multithreaded algorithm
Multithreaded Algorithms - Dynamic multithreading
- Advantages of the model
- Provides a theoretically clean way to quantify parallelism
- Based on notions of work and span
Multithreaded Algorithms - Dynamic multithreading
- Advantages of the model
- Many multithreaded algorithms involving nested parallelism follow from the divide and conquer paradigm
- We use recurrences also for multithreaded algorithms of this type
Multithreaded Algorithms - Dynamic multithreading
- Advantages of the model
- Model shows how parallel-computing practice evolves
- Growing number of concurrency platforms
- Support a variant of dynamic multithreading
- Cilk
- Cilk++
- OpenMP
- Task Parallel Library
- Threading Building Blocks
The Basics of Dynamic Multithreading
- Example: Computing Fibonacci numbers recursively
- Fibonacci numbers defined by recurrence
- \(F_0 = 0\)
- \(F_1 = 1\)
- \(F_i = F_{i-1} + F_{i-2}\) for \(i \geq 2\).
The Basics of Dynamic Multithreading
- A simple, recursive, serial algorithm to compute the \(n\)th Fibonacci number:
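- The FIB pseudocode itself is not reproduced in these slides; a minimal serial C++ rendering of the same recursion (the name fib is illustrative, not from the book) could look like this:

```cpp
#include <cstdint>
#include <iostream>

// Serial recursive Fibonacci in the style of the book's FIB procedure:
// no memoization, so the same subproblems are recomputed many times.
std::int64_t fib(int n) {
    if (n <= 1) return n;            // base cases: F_0 = 0, F_1 = 1
    std::int64_t x = fib(n - 1);     // first recursive call
    std::int64_t y = fib(n - 2);     // second recursive call
    return x + y;
}

int main() {
    std::cout << fib(10) << '\n';    // prints 55
}
```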
The Basics of Dynamic Multithreading
- This algorithm performs a great deal of repeated work (it doesn’t memoize)
- The recursion tree for the call FIB(6)
The Basics of Dynamic Multithreading
- The recurrence for the Fibonacci algorithm
- \(T(n) = T(n-1) + T(n-2) + \Theta(1)\).
- Solution: \(T(n) = \Theta(F_n)\)
- \(F_n\) grows exponentially in \(n\)
- This is slow, but it serves well to illustrate the analysis of multithreaded algorithms
The Basics of Dynamic Multithreading
- The 2 recursive calls in lines 3 and 4 are independent of each other
- \(FIB(n-1)\) and \(FIB(n-2)\)
- They can be called in either order
- The computation of one of them doesn’t affect the other
- They can be run in parallel
- Augment the pseudocode to indicate parallelism
The Basics of Dynamic Multithreading
- Pseudocode rewritten to use dynamic multithreading
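- The P-FIB pseudocode is not reproduced here; a rough C++ analogue is sketched below, where std::async plays the role of spawn and future::get the role of sync (the cutoff of 20 and the name p_fib are choices of this sketch, not from the book):

```cpp
#include <cstdint>
#include <future>
#include <iostream>

// Parallel Fibonacci in the spirit of P-FIB: the call for n-1 is "spawned"
// (std::async), the parent continues with n-2 in parallel, and get() plays
// the role of sync before the results are combined.
std::int64_t p_fib(int n) {
    if (n <= 1) return n;
    if (n < 20)                                   // cutoff (a choice of this sketch):
        return p_fib(n - 1) + p_fib(n - 2);       // small cases run serially
    auto x = std::async(std::launch::async, p_fib, n - 1);  // spawn child
    std::int64_t y = p_fib(n - 2);                // parent works in parallel with child
    return x.get() + y;                           // sync: wait for the child, then add
}

int main() {
    std::cout << p_fib(30) << '\n';               // prints 832040
}
```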
The Basics of Dynamic Multithreading
- Deleting the concurrency keywords (spawn and sync) from P-FIB
- Produces the FIB pseudocode
- This is known as serialization
- Serialization of a multithreaded algorithm
- The serial algorithm resulting when deleting the multithreading keywords
- spawn, sync, and parallel
The Basics of Dynamic Multithreading
- Nested parallelism
- When the spawn keyword precedes a procedure call
- e.g., as in line 3 of P-FIB
- When we make a call with spawn
- Similar to a fork in Unix
- The parent may continue execution in parallel with the spawned subroutine, the child
- It doesn’t wait for the child to complete in order to continue with the execution
- The child computes \(P-FIB(n-1)\)
- The parent computes \(P-FIB(n-2)\) in parallel with the spawned child
- Because \(P-FIB\) is recursive, the 2 subroutine calls themselves create nested parallelism
- And also their children
- Creates a tree of subcomputations, all executing in parallel
The Basics of Dynamic Multithreading
- The concurrency keywords express the logical parallelism of the computation
- Indicates which parts of code may proceed in parallel
- At runtime, the scheduler determines which subcomputations run concurrently
- Assigns available processors
The Basics of Dynamic Multithreading
- The sync keyword
- The parent procedure executes the sync statement
- In order to safely use the results returned by its children
- Sync indicates that the parent procedure must wait for all its spawned children to complete, before proceeding
- In \(P-FIB\) the procedure calls sync before the return statement
A Model for Multithreaded Execution
- A multithreaded computation can be seen as
- The set of runtime instructions executed by processors on behalf of a multithreaded program
- A multithreaded computation can be modeled as a directed acyclic graph \(G = (V, E)\), called a computation dag
A Model for Multithreaded Execution
- A Computation dag for the \(P-FIB\) procedure
A Model for Multithreaded Execution
- Computation dag
- Vertices \(V\) are instructions
- Edges \(E\) represent dependencies between instructions
- \((u,v) \in E\) means instruction \(u\) must execute before instruction \(v\)
- If a chain of instructions contains no parallel control keyword, we group it into a single strand
- A strand represents one or more instructions
- Instructions involving parallel control are not included in strands but appear in the structure of the dag
- If a strand has 2 successors, one of them must have been spawned
- A strand with multiple predecessors means its predecessors were joined by a sync statement
- In general:
- \(V\) represents the strands
- \(E\) represents the dependencies between strands induced by parallel control
A Model for Multithreaded Execution
- Computation dag
- If \(G\) has a directed path from strand \(u\) to strand \(v\)
- The two strands are (logically) in series
- Otherwise, strands \(u\) and \(v\) are (logically) in parallel
- We picture a multithreaded computation with a dag of strands embedded in a tree of procedure instances
A Model for Multithreaded Execution
- Tree of procedure instances for \(P-FIB(6)\), no detailed structure of strands is shown
A Model for Multithreaded Execution
- Zoomed-in section of the previous tree, showing the strands of each procedure
- Directed edges connecting strands run either
- Within a procedure
- Along the procedure tree
A Model for Multithreaded Execution
- Edges of a computation dag indicate a type of dependency between strands
- Continuation edge \((u, u')\), drawn horizontally
- Connects a strand \(u\) to its successor \(u'\) within the same procedure instance
- Spawn edge \((u, v)\), points downward in the figure
- When a strand \(u\) spawns a strand \(v\)
- Call edge, also points downward
- Represent normal procedure calls
- Return edge \((u, x)\), points upward
- When a strand \(u\) returns to its calling procedure
- \(x\) is the strand immediately following the next sync
- Initial strand, starts a computation
- Final strand, ends a computation
- Difference between a spawn edge and a call edge
- Strand \(u\) spawning strand \(v\)
- Induces a horizontal continuation edge from \(u\) to the strand \(u'\) following \(u\) in its procedure
- \(u'\) is free to execute at the same time as \(v\)
- \(u\) calling \(v\)
- Doesn’t induce such an edge
A Model for Multithreaded Execution
- We assume our multithreaded algorithms execute on an ideal parallel computer
- Set of processors
- Sequentially consistent shared memory
- Sequentially consistent
- The memory behaves as if all instructions were executed in a single linear order that preserves the individual order in which each processor issues its own instructions
- For dynamic multithreaded computations
- Memory behaves as if instructions were interleaved to produce a linear order to preserve the partial order of the computation dag
- Order might differ from one run to the other (depending on scheduling) but is consistent with the computation dag
A Model for Multithreaded Execution
- Performance assumptions
- Each processor has equal computing power
- Ignores the cost of scheduling
- With sufficient parallelism the overhead of scheduling is minimal in practice
A Model for Multithreaded Execution
- Performance measures
- Work
- Total time to execute the entire computation on one processor
- The sum of the times taken by each of the strands
- In a computation dag in which each strand takes a unit of time
- The work is the number of vertices in the dag
- Span
- The longest time to execute the strands along any path in the dag
- If each strand takes a unit time: the number of vertices on a longest or critical path in the dag
A Model for Multithreaded Execution
- The actual running time of a multithreaded computation depends on
- Work and span
- How many processors are available
- How the scheduler allocates strands to processors
- We denote running time on \(P\) processors using subscript \(P\)
- Running time of an algorithm on \(P\) processors: \(T_P\)
- Work is \(T_1\): the time to run the entire computation on a single processor
- Span is \(T_{\infty}\), corresponds to running each strand on its own processor
- As if having \(\infty\) number of processors
- Work and span are lower bounds on running time \(T_P\) of a multithreaded computation on \(P\) processors
A Model for Multithreaded Execution
- Work law: \(T_P \geq T_1/P\)
- An ideal parallel computer with \(P\) processors can do at most \(P\) units of work per time step
- In \(T_P\) time it can perform at most \(PT_P\) work, while the total work is \(T_1\)
- Hence \(PT_P \geq T_1\); dividing by \(P\) gives the work law
A Model for Multithreaded Execution
- Span law: \(T_P \geq T_{\infty}\)
- A \(P\)-processor ideal computer can’t run faster than a machine with an unlimited number of processors
- The machine with unlimited processors can emulate a \(P\)-processor machine by using just \(P\) of its processors
A Model for Multithreaded Execution
- Speedup of a computation on \(P\) processors is the ratio \(T_1/T_P\)
- How many times faster is the computation with \(P\) processors than with 1 processor
- By work law: \(T_P \geq T_1/P\), then \(T_1/T_P \leq P\)
- Then speedup on \(P\) processors can be at most \(P\)
- Linear speedup
- The speedup is linear in the number of processors
- \(T_1/T_P = \Theta(P)\)
- Perfect linear speedup: \(T_1/T_P = P\)
- Parallelism of the multithreaded computation
- The ratio of the work to the span: \(T_1/T_{\infty}\)
- 3 perspectives of parallelism
- As a ratio: average amount of work (in parallel) for each step along the critical path
- As an upper bound: maximum possible speedup that can be achieved on any number of processors
- As a limit: on the possibility of attaining perfect linear speedup
A Model for Multithreaded Execution
- Example: computation of \(P-FIB(4)\), assume each strand takes unit time
- Work is \(T_1 = 17\)
- Span is \(T_{\infty} = 8\)
- Parallelism is \(T_1/T_{\infty} = 17/8 = 2.125\)
- This means that achieving more than double the speedup is impossible (for the specific example of \(P-FIB(4)\))
- No matter how many processors we use
A Model for Multithreaded Execution
- Parallel slackness of a multithreaded computation
- Executed on an ideal parallel computer with \(P\) processors
- \((T_1/T_{\infty})/P = T_1/(PT_{\infty})\)
- Factor by which parallelism of computation exceeds number of processors in the machine
Scheduling
- Good performance requires efficient scheduling
- In the multithreaded programming model, there is no way to specify which strand executes on which processor
- Rely on the concurrency platform’s scheduler
- How it works
- Scheduler maps strands to static threads
- OS schedules threads to the processors
- We assume that the scheduler directly maps strands to processors
Scheduling
- We assume an on-line centralized scheduler
- On-line: doesn’t have knowledge in advance about when strands will be spawned or when they will complete
- Centralized: knows global state of the computation at any given time
- Greedy schedulers
- Assign as many strands to processors as possible in each time step
- Complete step: There are at least \(P\) strands ready to execute during a time step
- Incomplete step: There are fewer than \(P\) strands ready to execute
Scheduling
- Theorem 27.1
- On an ideal parallel computer with \(P\) processors, a greedy scheduler executes a multithreaded computation with work \(T_1\) and span \(T_{\infty}\) in time
- \(T_P \leq T_1/P + T_{\infty}\)
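- A sketch of the standard counting argument behind this bound (not reproduced in the slides): each step of a greedy schedule is either complete or incomplete, and the two kinds are bounded separately.

\[
T_P \;=\; \#\{\text{complete steps}\} + \#\{\text{incomplete steps}\}
    \;\le\; \left\lfloor \frac{T_1}{P} \right\rfloor + T_{\infty}
    \;\le\; \frac{T_1}{P} + T_{\infty},
\]

- since each complete step performs \(P\) units of work (so more than \(\lfloor T_1/P \rfloor\) of them would exceed the total work \(T_1\)), and each incomplete step executes every ready strand and therefore shortens the longest remaining path in the dag by 1.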
Scheduling
- Corollary 27.2
- The running time \(T_P\) of any multithreaded computation scheduled by a greedy scheduler on an ideal parallel computer with \(P\) processors is within a factor of 2 of optimal
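- A brief justification, combining Theorem 27.1 with the work and span laws; here \(T_P^*\) denotes the optimal running time on \(P\) processors (a symbol introduced only for this sketch):

\[
T_P^* \;\ge\; \max\!\left(\frac{T_1}{P},\, T_{\infty}\right)
\quad\Longrightarrow\quad
T_P \;\le\; \frac{T_1}{P} + T_{\infty}
    \;\le\; 2\max\!\left(\frac{T_1}{P},\, T_{\infty}\right)
    \;\le\; 2\,T_P^*.
\]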
Scheduling
- Corollary 27.3
- Let \(T_P\) be the running time of a multithreaded computation produced by a greedy scheduler on an ideal parallel computer with \(P\) processors, and let \(T_1\) and \(T_{\infty}\) be the work and span of the computation, respectively. Then, if \(P \ll T_1/T_{\infty}\), we have \(T_P \approx T_1/P\), or equivalently, a speedup of approximately \(P\).
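- The idea behind the corollary, sketched: when the number of processors is well below the parallelism, the span term in Theorem 27.1 is negligible.

\[
P \ll \frac{T_1}{T_{\infty}}
\;\Longrightarrow\;
T_{\infty} \ll \frac{T_1}{P}
\;\Longrightarrow\;
\frac{T_1}{P} \;\le\; T_P \;\le\; \frac{T_1}{P} + T_{\infty} \;\approx\; \frac{T_1}{P},
\]

- so the speedup is \(T_1/T_P \approx P\) (the first inequality in the chain is the work law).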
Analyzing multithreaded algorithms
- Analyzing the work
- Straightforward (relatively)
- The work is the running time of an ordinary serial algorithm: the serialization of the multithreaded algorithm
- Analyzing the span
- More interesting/complicated
- For 2 computations in series, their spans add to form the span of their composition
- For 2 computations in parallel, the span of their composition is the maximum of the spans of the 2 subcomputations
Analyzing multithreaded algorithms
- When analyzing \(P-FIB(n)\)
- The spawned call to \(P-FIB(n-1)\) in line 3 runs in parallel with call to \(P-FIB(n-2)\) in line 4
- We express the span of \(P-FIB(n)\) as a recurrence
- \(T_{\infty}(n) = max(T_{\infty}(n-1), T_{\infty}(n-2)) + \Theta(1)\)
- \(= T_{\infty}(n-1) + \Theta(1)\)
- With solution: \(T_{\infty}(n) = \Theta(n)\)
- The parallelism of \(P-FIB(n)\) is \(T_1(n)/T_{\infty}(n) = \Theta(\phi^n/n)\), where \(\phi\) is the golden ratio
Parallel loops
- Many algorithms have loops for which iterations can run in parallel
- We can parallelize those loops using spawn and sync
- It’s more convenient to specify that the iterations of the loop can run concurrently
- We use the parallel concurrency keyword
- Precedes the for keyword in a for loop statement
Parallel loops
- Example: multiplying an \(n \times n\) matrix \(A = (a_{ij})\) by an \(n\)-vector \(x = (x_j)\)
- \(y_i = \sum\limits_{j=1}^n a_{ij}x_j\)
- for \(i = 1, 2, \dots, n\)
- We can compute all the entries of \(y\) in parallel as follows:
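- The MAT-VEC pseudocode is not reproduced in these slides; a C++ sketch of the same idea, with one task per entry of \(y\) (a naive rendering of the parallel for; the name mat_vec and the flat row-major layout are choices of this sketch):

```cpp
#include <future>
#include <iostream>
#include <vector>

// Multiply an n x n matrix A (flat, row-major) by an n-vector x, computing
// every entry of y in parallel: one task per row, a naive rendering of the
// "parallel for i" loop (a real platform would chunk the iterations).
std::vector<double> mat_vec(const std::vector<double>& A,
                            const std::vector<double>& x, int n) {
    std::vector<double> y(n, 0.0);
    std::vector<std::future<void>> tasks;
    for (int i = 0; i < n; ++i)
        tasks.push_back(std::async(std::launch::async, [&, i] {
            double sum = 0.0;
            for (int j = 0; j < n; ++j)   // ordinary (serial) inner loop
                sum += A[i * n + j] * x[j];
            y[i] = sum;                   // each task writes a distinct y[i]
        }));
    for (auto& t : tasks) t.get();        // implicit sync at the end of the loop
    return y;
}

int main() {
    std::vector<double> A = {1, 0, 0,
                             0, 2, 0,
                             0, 0, 3};    // diag(1, 2, 3)
    std::vector<double> x = {1, 1, 1};
    for (double yi : mat_vec(A, x, 3)) std::cout << yi << ' ';  // 1 2 3
    std::cout << '\n';
}
```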
Parallel loops
- The parallel for keywords indicate that the iterations of the loop may run concurrently
- A compiler can implement each parallel loop as a divide and conquer routine using nested parallelism
- A call to \(MAT-VEC-MAIN-LOOP(A, x, y, n, 1, n)\) implements the parallel loop over all \(n\) rows (see the sketch on the next slide)
Parallel loops
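- The MAT-VEC-MAIN-LOOP pseudocode is not reproduced here; a C++ sketch of how a compiler might turn the parallel loop into divide-and-conquer recursion (a 0-based, half-open row range instead of the book's 1-based inclusive one; otherwise the spawn/sync structure is as described above):

```cpp
#include <future>
#include <vector>

// Divide-and-conquer implementation of the parallel loop over rows
// i in [lo, hi): split the range in half, "spawn" one half (std::async),
// recurse on the other half, then "sync" (get()).
void mat_vec_main_loop(const std::vector<double>& A,
                       const std::vector<double>& x,
                       std::vector<double>& y, int n, int lo, int hi) {
    if (hi - lo == 1) {                    // a single iteration: one row, serial
        for (int j = 0; j < n; ++j)
            y[lo] += A[lo * n + j] * x[j];
        return;
    }
    int mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, [&] {   // spawn left half
        mat_vec_main_loop(A, x, y, n, lo, mid);
    });
    mat_vec_main_loop(A, x, y, n, mid, hi);             // recurse on right half
    left.get();                                         // sync
}
```

- With \(y\) initialized to zeros, a call such as mat_vec_main_loop(A, x, y, n, 0, n) plays the role of the \(MAT-VEC-MAIN-LOOP(A, x, y, n, 1, n)\) call mentioned on the previous slide.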
Parallel loops
- The work \(T_1(n)\) of MAT-VEC on an \(n \times n\) matrix
- The running time of its serialization
- Replace parallel for with ordinary for loops
- \(T_1(n) = \Theta(n^2)\)
- The overhead of recursive spawning does increase the work of a parallel loop compared with its serialization, but only by a constant factor
- For a parallel loop with \(n\) iterations, where the \(i\)th iteration has span \(iter_{\infty}(i)\), the span is
- \(T_{\infty}(n) = \Theta(\lg n) + \max\limits_{1 \leq i \leq n} iter_{\infty}(i)\)
- The \(\Theta(\lg n)\) term comes from the depth of the recursive spawning, which is logarithmic in the number of iterations
Race Conditions
- A multithreaded algorithm is
- Deterministic
- It always does the same thing on the same input
- Independently of how instructions are scheduled on the multicore computer
- Nondeterministic
- If the algorithm’s behavior might vary from run to run
Race Conditions
- Sometimes an algorithm intended to be deterministic fails
- Because it contains a determinacy race
- Some notorious bugs of this type
- The Therac-25 radiation therapy machine
- Killed 3 people and injured several more
- The North American Blackout of 2003
- Left over 50 million people without power
- These bugs are hard to find
Race Conditions
- Determinacy race
- Occurs when 2 logically parallel instructions access the same memory location and at least one of them performs a write
- RACE-EXAMPLE (sketched on the next slide) should always print the value 2, as its serialization does
Race Conditions
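- The RACE-EXAMPLE pseudocode is not reproduced in these slides; a C++ sketch of the same determinacy race (in C++ the unsynchronized access is formally a data race, i.e. undefined behavior, which is exactly the bug being illustrated):

```cpp
#include <iostream>
#include <thread>

// Two logically parallel increments of a shared variable with no
// synchronization.  The intent is to print 2 (as the serialization does),
// but the unsynchronized read-modify-write sequences can interleave so
// that one update is lost: a determinacy race.  (In C++ this is formally
// a data race, i.e. undefined behavior -- shown only to illustrate the bug.)
int main() {
    int x = 0;
    auto increment = [&x] { x = x + 1; };  // read x, add 1, write x back
    std::thread t1(increment);             // "spawn" the first increment
    std::thread t2(increment);             // second increment runs in parallel
    t1.join();                             // "sync"
    t2.join();
    std::cout << x << '\n';                // usually 2, but not guaranteed
}
```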
Race Conditions
- We need to ensure that strands that run in parallel are independent
- Thus, they won’t have determinacy races
- We could also use mutual exclusion locks or other synchronization methods
- In a parallel for construct, all iterations should be independent
- Between a spawn and the corresponding sync
- Code of spawned child should be independent of the code of the parent
- Including code of additional spawned/called children
Race Conditions
- Another example of incorrect code, with races (see the sketch below)
- Races occur when updating \(y_i\) in line 7
- Which executes concurrently for all \(n\) values of \(j\)
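- A C++ sketch of such an incorrect variant (the name mat_vec_wrong is hypothetical, not from the book): the inner loop over \(j\) is parallelized, so all \(n\) tasks update the same \(y_i\):

```cpp
#include <future>
#include <vector>

// Incorrect variant: the *inner* loop over j is parallelized, so all n
// tasks for a fixed row i update the same y[i] concurrently -- a
// determinacy race on the y[i] update.  Shown only to illustrate the bug.
void mat_vec_wrong(const std::vector<double>& A,
                   const std::vector<double>& x,
                   std::vector<double>& y, int n) {
    for (int i = 0; i < n; ++i) {
        std::vector<std::future<void>> tasks;
        for (int j = 0; j < n; ++j)
            tasks.push_back(std::async(std::launch::async, [&, i, j] {
                y[i] = y[i] + A[i * n + j] * x[j];   // racy read-modify-write
            }));
        for (auto& t : tasks) t.get();
    }
}
```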
Multithreaded Matrix Multiplication
- SQUARE-MATRIX-MULTIPLY (previously seen in the book, page 75)
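- The multithreaded pseudocode analyzed on the next slide is not reproduced here; a C++ sketch of its structure, with the two outer loops parallelized (one task per entry of \(C\)) and the inner loop over \(k\) kept serial (the name p_square_matrix_multiply is illustrative, not the book's):

```cpp
#include <future>
#include <vector>

// Triply nested matrix multiplication with the two outer loops run in
// parallel (one task per output entry C[i][j]) and the innermost loop
// over k kept serial, mirroring the structure analyzed on the next slide.
std::vector<double> p_square_matrix_multiply(const std::vector<double>& A,
                                             const std::vector<double>& B,
                                             int n) {
    std::vector<double> C(n * n, 0.0);
    std::vector<std::future<void>> tasks;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)         // both loops are logically parallel
            tasks.push_back(std::async(std::launch::async, [&, i, j] {
                double sum = 0.0;
                for (int k = 0; k < n; ++k) // ordinary serial loop
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;         // each task owns one entry of C
            }));
    for (auto& t : tasks) t.get();          // wait for all entries
    return C;
}
```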
Multithreaded Matrix Multiplication
- The work is \(T_1(n) = \Theta(n^3)\)
- The serialization of this algorithm is the SQUARE-MATRIX-MULTIPLY algorithm
- The span is \(T_{\infty}(n) = \Theta(n)\)
- The parallel for loop of line 3 contributes \(\Theta(\lg n)\)
- The parallel for loop of line 4 contributes \(\Theta(\lg n)\)
- The \(n\) iterations of the ordinary for loop of line 6 contribute \(\Theta(n)\)
- A total span of \(\Theta(\lg n) + \Theta(\lg n) + \Theta(n) = \Theta(n)\)
- The parallelism is \(\Theta(n^3)/\Theta(n) = \Theta(n^2)\)
A Divide-and-conquer Multithreaded Algorithm for Matrix Multiplication
- We can multiply \(n \times n\) matrices serially in time \(\Theta(n^{\lg 7}) = O(n^{2.81})\)
- Using Strassen’s divide-and-conquer strategy
- This motivates multithreading divide-and-conquer matrix multiplication
- We begin with the simpler (non-Strassen) recursive algorithm
- It multiplies 2 \(n \times n\) matrices \(A\) and \(B\) to produce the \(n \times n\) matrix \(C\)
- Partitions each of the 3 matrices into 4 \(n/2 \times n/2\) submatrices
- To multiply 2 \(n \times n\) matrices it performs 8 multiplications of \(n/2 \times n/2\) matrices and 1 addition of \(n \times n\) matrices
A Divide-and-conquer Multithreaded Algorithm for Matrix Multiplication
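- The pseudocode for P-MATRIX-MULTIPLY-RECURSIVE is not reproduced in these slides; below is a C++ sketch of its structure under some simplifying assumptions of this sketch (the names View and pmm_rec, flat row-major storage, \(C\) assumed zero-initialized, and a serial final addition where the book uses doubly nested parallel for loops):

```cpp
#include <future>
#include <vector>

// Divide-and-conquer multithreaded matrix multiplication in the spirit of
// the procedure analyzed on the next slide: it adds A*B into C for n x n
// matrices, n a power of 2.  Submatrix blocks are identified by their
// top-left corner in flat row-major storage with leading dimension ld.
// Seven of the eight half-size products are "spawned" with std::async,
// the eighth runs in the parent, and get() provides the sync.
struct View { int row, col; };                // top-left corner of a block

void pmm_rec(std::vector<double>& C, View c,
             const std::vector<double>& A, View a,
             const std::vector<double>& B, View b,
             int n, int ld) {
    if (n == 1) {                             // 1 x 1 block: scalar multiply-add
        C[c.row * ld + c.col] += A[a.row * ld + a.col] * B[b.row * ld + b.col];
        return;
    }
    int h = n / 2;
    // Temporary matrix T, zero-initialized; full-size for simple indexing.
    std::vector<double> T(ld * ld, 0.0);
    auto at = [h](View v, int i, int j) { return View{v.row + i * h, v.col + j * h}; };

    std::vector<std::future<void>> spawned;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j) {
            // C_ij += A_i0 * B_0j   (spawned)
            spawned.push_back(std::async(std::launch::async, [&, i, j] {
                pmm_rec(C, at(c, i, j), A, at(a, i, 0), B, at(b, 0, j), h, ld);
            }));
            if (i == 1 && j == 1)             // last product runs in the parent
                pmm_rec(T, at(c, i, j), A, at(a, i, 1), B, at(b, 1, j), h, ld);
            else                              // T_ij += A_i1 * B_1j   (spawned)
                spawned.push_back(std::async(std::launch::async, [&, i, j] {
                    pmm_rec(T, at(c, i, j), A, at(a, i, 1), B, at(b, 1, j), h, ld);
                }));
        }
    for (auto& f : spawned) f.get();          // sync: wait for all products

    // C += T over this n x n block.  The book parallelizes this addition
    // with doubly nested parallel for loops (the "lines 15-17" referenced
    // in the span analysis); this sketch keeps it serial for brevity.
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            C[(c.row + i) * ld + c.col + j] += T[(c.row + i) * ld + c.col + j];
}
```

- With \(C\) zero-initialized, calling pmm_rec(C, {0,0}, A, {0,0}, B, {0,0}, n, n) computes \(C = A \cdot B\) for \(n\) a power of 2.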
A Divide-and-conquer Multithreaded Algorithm for Matrix Multiplication
- The work of P-MATRIX-MULTIPLY-RECURSIVE
- \(M_1(n)\)
- Partition in \(\Theta(1)\) time
- Eight recursive multiplications of \(n/2 \times n/2\) matrices
- Adding the two \(n \times n\) matrices takes \(\Theta(n^2)\) work
- \(M_1(n) = 8M_1(n/2) + \Theta(n^2) = \Theta(n^3)\)
- The span \(M_{\infty}(n)\) of P-MATRIX-MULTIPLY-RECURSIVE
- Span for partitioning is \(\Theta(1)\)
- Span of doubly nested parallel for loops (lines 15 - 17) is \(\Theta(\lg n)\)
- The maximum span of the eight recursive calls equals the span of any one of them (they all have the same size)
- The recurrence: \(M_{\infty}(n) = M_{\infty}(n/2) + \Theta(\lg n)\)
- This recurrence doesn’t fall under any of the cases of the master theorem; its solution (see the expansion below) is:
- \(M_{\infty}(n) = \Theta(\lg^2 n)\)
- The parallelism of P-MATRIX-MULTIPLY-RECURSIVE
- \(M_1(n)/M_{\infty}(n)=\Theta(n^3/\lg^2 n)\), very high
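- A sketch of how the span recurrence unrolls to the stated solution (each level of recursion halves \(n\) and contributes a logarithmic term):

\[
M_{\infty}(n) \;=\; \sum_{i=0}^{\lg n} \Theta\!\bigl(\lg(n/2^i)\bigr)
             \;=\; \Theta\!\Bigl(\sum_{i=0}^{\lg n} (\lg n - i)\Bigr)
             \;=\; \Theta(\lg^2 n).
\]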
THE END
Multithreaded Merge Sort