Jesus A. Gonzalez

July 17, 2019

- Chapter 27 of Introduction to Algorithms (3rd Edition), Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.

- Most algorithms in the book are *serial algorithms*
  - Run on a uniprocessor computer
  - Execute one instruction at a time

- In this chapter, we will see *parallel algorithms*
  - Run on a multiprocessor computer
  - Allow multiple instructions to execute concurrently
- We explore an elegant model: *dynamic multithreaded algorithms*

- Parallel computers
  - Computers with multiple processing units
  - Very common, with a wide range of prices and performance
- Relatively inexpensive: desktop and laptop *chip multiprocessors*
  - A single *multi-core* integrated-circuit chip houses multiple processing *cores*
  - Each core is a full processor that can access a common memory
- Intermediate price/performance: clusters built from individual computers
  - Can be simple PC-class machines with a dedicated network interconnecting them

- Highest priced machines: supercomputers
  - Combinations of custom architectures and custom networks
  - Designed to deliver the highest performance (instructions per second, i.e., flops)

- Multiprocessor computers have been around for a while
  - There is no single model of parallel computing with wide acceptance
  - Vendors have not agreed on a single parallel architectural model
  - Some feature *shared memory*; others employ *distributed memory*
- With multi-core technology, new laptops and desktops are *shared-memory* parallel computers
  - The trend appears to be toward *shared-memory multiprocessing*

- Common means of programming chip multiprocessors (and other shared-memory parallel computers): *static threading*
  - Provides a software abstraction of *virtual processors*, or *threads*, sharing a common memory
  - Each thread maintains an associated program counter
  - Can execute code independently of the other threads
  - The OS loads a thread onto a processor for execution
  - And switches it out when another thread needs to run
- *Static threading*
  - The OS allows the user to create and destroy threads
  - But these operations are slow
  - Thus, in most applications, threads persist for the duration of a computation
  - That's why they are called *static*

- Programming with static threads
  - It's difficult and error-prone
  - The programmer must implement complex communication protocols
  - And a scheduler to load-balance the work
- *Concurrency platforms* were created to deal with this complexity
  - They provide a software layer to coordinate, schedule, and manage the parallel-computing resources

- Dynamic multithreading programming
  - An important concurrency platform
  - The *model used in this lecture*
  - Allows programmers to specify parallelism in applications without worrying about
    - Communication protocols
    - Load balancing
    - Other low-level details

- A concurrency platform contains a scheduler
  - Load-balances the computation automatically
  - Simplifies the work of the programmer
- Dynamic multithreading provides two features
  - Nested parallelism
    - Allows a subroutine to be *spawned*, so the caller can proceed while the spawned subroutine computes its result
  - Parallel loops
    - Like an ordinary **for** loop, but the iterations of the loop can execute concurrently

- Nested parallelism
- Key aspect of this model
- Programmer specifies only the logical parallelism within a computation
- Threads in the platform schedule and load-balance the computation among themselves

- Advantages of the model
  - It's a simple extension of the serial programming model
  - We just add three keywords to our pseudocode: **parallel**, **spawn**, and **sync**
  - If we remove the keywords, we get the serial pseudocode for the same problem
    - The serialization of the multithreaded algorithm

- Advantages of the model
  - Provides a theoretically clean way to quantify parallelism
  - Based on the notions of *work* and *span*

- Advantages of the model
  - Many multithreaded algorithms involving nested parallelism follow naturally from the divide-and-conquer paradigm
  - We also use recurrences to analyze multithreaded algorithms of this type

- Advantages of the model
  - The model reflects how parallel-computing practice is evolving
  - A growing number of concurrency platforms support a variant of dynamic multithreading
    - Cilk
    - Cilk++
    - OpenMP
    - Task Parallel Library
    - Threading Building Blocks

- Example: Computing Fibonacci numbers recursively
- Fibonacci numbers defined by recurrence
- \(F_0 = 0\)
- \(F_1 = 1\)
- \(F_i = F_{i-1} + F_{i-2}\) for \(i \geq 2\).

- A simple, recursive, serial algorithm to compute the \(n\)th Fibonacci number:
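
The pseudocode listing the slides refer to is not reproduced here; as a sketch, the CLRS FIB procedure translates to Python roughly as follows (the function name is ours):

```python
def fib(n):
    """Naive serial Fibonacci, mirroring CLRS FIB: no memoization."""
    if n <= 1:
        return n          # base cases: F_0 = 0, F_1 = 1
    x = fib(n - 1)        # first recursive call (line 3 of the pseudocode)
    y = fib(n - 2)        # second recursive call (line 4)
    return x + y

print(fib(6))  # F_6 = 8
```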

- This algorithm repeats too much work (it doesn't memoize)
- The recursion tree for the call \(FIB(6)\)

- The recurrence for the Fibonacci algorithm
  - \(T(n) = T(n-1) + T(n-2) + \Theta(1)\)
  - Solution: \(T(n) = \Theta(F_n)\)
  - \(F_n\) grows exponentially in \(n\)
  - This is slow, but it is illustrative for our purpose:
    - The analysis of multithreaded algorithms

- The two recursive calls in lines 3 and 4 are independent of each other
  - \(FIB(n-1)\) and \(FIB(n-2)\)
  - They can be called in either order
  - The computation of one doesn't affect the other
  - They can be run in parallel
- **Augment the pseudocode to indicate parallelism**
  - Add the *concurrency keywords* **spawn** and **sync**

- Pseudocode rewritten to use dynamic multithreading

- Deleting the concurrency keywords (spawn and sync) from P-FIB
  - Produces the FIB pseudocode
  - This is known as the **serialization**
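
As an illustrative sketch (not the book's code), the spawn/sync structure of P-FIB can be mimicked in Python with one thread per spawn; under CPython's GIL this shows the shape of the parallelism, not actual speedup:

```python
import threading

def p_fib(n):
    """Sketch of CLRS P-FIB: a thread plays the spawned child; join() plays sync."""
    if n <= 1:
        return n
    result = {}
    # "spawn": the child computes P-FIB(n-1) while the parent keeps going
    child = threading.Thread(target=lambda: result.update(x=p_fib(n - 1)))
    child.start()
    y = p_fib(n - 2)      # parent computes P-FIB(n-2) in parallel with the child
    child.join()          # "sync": wait for the spawned child before using its result
    return result['x'] + y

print(p_fib(10))  # 55, same as the serialization FIB(10)
```

Deleting the thread machinery (start/join) and calling `p_fib(n - 1)` directly yields exactly the serial version, which is the point of the serialization.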

- *Serialization* of a multithreaded algorithm
  - The serial algorithm resulting from deleting the multithreading keywords
    - spawn, sync, and parallel

- *Nested parallelism*
  - When the *spawn* keyword precedes a *procedure call*
  - e.g., as in line 3
- When we make a call with *spawn*
  - Similar to a *fork* in Unix
  - The *parent* may continue execution in parallel with the spawned subroutine, the *child*
  - It doesn't wait for the child to complete in order to continue with its own execution
  - The child computes \(P-FIB(n-1)\)
  - The parent computes \(P-FIB(n-2)\) in parallel with the spawned child
  - Because \(P-FIB\) is recursive, the two subroutine calls themselves create nested parallelism
    - And so do their children
  - This creates a tree of subcomputations, all executing in parallel

- The concurrency keywords express the *logical parallelism* of the computation
  - They indicate which parts of the code may proceed in parallel
- At runtime, the *scheduler* determines which subcomputations run concurrently
  - By assigning them to available processors

- The **sync** keyword
  - The parent procedure executes the **sync** statement
    - In order to safely use the results returned by its children
  - **Sync** indicates that the parent procedure must wait for all its spawned children to complete before proceeding
  - In \(P-FIB\), the procedure calls **sync** before the **return** statement

- A **multithreaded computation** can be seen as
  - The set of runtime instructions executed by processors on behalf of a multithreaded program
- A **multithreaded computation** can be modeled as a **directed acyclic graph** \(G = (V, E)\), called a *computation dag*

- A Computation dag for the \(P-FIB\) procedure

- **Computation dag**
  - Vertices \(V\) are instructions
  - Edges \(E\) represent dependencies between instructions
  - \((u,v) \in E\) means instruction \(u\) must execute before instruction \(v\)

- If a chain of instructions contains no call to a parallel control keyword, we group them into a single **strand**
  - A **strand** represents one or more instructions
  - Instructions involving parallel control are not included in *strands* but appear in the structure of the dag
  - If a strand has two successors, one of them must have been spawned
  - A strand with multiple predecessors means that they were joined by a **sync** statement
- In general:
  - \(V\) represents the strands
  - \(E\) represents the dependencies between strands induced by parallel control

- **Computation dag**
  - If \(G\) has a directed path from strand \(u\) to strand \(v\)
    - The two strands are **(logically) in series**
  - Otherwise, strands \(u\) and \(v\) are **(logically) in parallel**
- We picture a multithreaded computation as a **dag** of strands embedded in a tree of procedure instances

- Tree of procedure instances for \(P-FIB(6)\); the detailed structure of strands is not shown

- Zooming in on a section of the previous tree shows the strands of each procedure
- Directed edges connecting strands run either
- Within a procedure
- Along the procedure tree

- Edges of a computation dag indicate a type of dependency between strands
  - *Continuation edge* \((u, u')\), drawn horizontally
    - Connects a strand \(u\) to its successor \(u'\) within the same procedure instance
  - *Spawn edge* \((u, v)\), points downward in the figure
    - When a strand \(u\) spawns a strand \(v\)
  - *Call edges*, also pointing downward
    - Represent normal procedure calls
  - *Return edge* \((u, x)\), points upward
    - When a strand \(u\) returns to its calling procedure
    - \(x\) is the strand immediately following the next **sync**
  - *Initial strand*: starts a computation; *final strand*: ends a computation

- Difference between a spawn edge and a call edge
  - Strand \(u\) spawning strand \(v\)
    - Induces a horizontal continuation edge from \(u\) to the strand \(u'\) following \(u\) in its procedure
    - \(u'\) is free to execute at the same time as \(v\)
  - \(u\) calling \(v\)
    - Doesn't induce such an edge

- We assume our multithreaded algorithms execute on an *ideal parallel computer*
  - A set of processors and a *sequentially consistent* shared memory
- Sequentially consistent
  - Memory preserves the individual orders in which each processor executes its instructions
  - For dynamic multithreaded computations
    - Memory behaves as if the instructions were interleaved to produce a linear order that preserves the partial order of the computation dag
    - The order might differ from one run to another (depending on scheduling) but is always consistent with the computation dag

- Performance assumptions
- Each processor has equal computing power
- Ignores the cost of scheduling
- With sufficient parallelism the overhead of scheduling is minimal in practice

- Performance measures
  - **Work**
    - The total time to execute the entire computation on one processor
    - The sum of the times taken by each of the strands
    - In a computation dag in which each strand takes unit time, the work is the number of vertices in the dag
  - **Span**
    - The longest time to execute the strands along any path in the dag
    - If each strand takes unit time: the number of vertices on a longest or *critical path* in the dag
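
Under the unit-time-strand assumption, both measures are easy to compute from a dag. A minimal sketch (the dag below is a made-up example, not one from the text):

```python
from functools import lru_cache

# Hypothetical unit-time computation dag: strand -> list of successor strands.
dag = {
    'a': ['b', 'c'],  # a spawns b and continues to c
    'b': ['d'],
    'c': ['d'],       # b and c join at d (a sync)
    'd': [],
}

def work(dag):
    """Work = number of vertices when every strand takes unit time."""
    return len(dag)

def span(dag):
    """Span = number of vertices on a longest (critical) path."""
    @lru_cache(maxsize=None)
    def longest_from(u):
        return 1 + max((longest_from(v) for v in dag[u]), default=0)
    return max(longest_from(u) for u in dag)

print(work(dag), span(dag))  # 4 3, so the parallelism is 4/3
```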

- The actual running time of a multithreaded computation depends on
- Work and span
- How many processors are available
- How the scheduler allocates strands to processors

- We denote running time on \(P\) processors using the subscript \(P\)
  - Running time of an algorithm on \(P\) processors: \(T_P\)
  - Work is \(T_1\): the running time on a single processor
  - Span is \(T_{\infty}\): corresponds to running each strand on its own processor
    - As if we had an unlimited number of processors

- Work and span provide lower bounds on the running time \(T_P\) of a multithreaded computation on \(P\) processors
- **Work law**: \(T_P \geq T_1/P\)
  - An ideal parallel computer with \(P\) processors can do at most \(P\) units of work per time step
  - So in \(T_P\) time it can perform at most \(PT_P\) work, and the total work is \(T_1\)
  - We have \(PT_P \geq T_1\); dividing by \(P\) gives the *work law*
- **Span law**: \(T_P \geq T_{\infty}\)
  - A \(P\)-processor ideal computer can't run faster than a machine with an unlimited number of processors
  - Because a machine with unlimited processors can emulate a \(P\)-processor machine by using just \(P\) of its processors

- **Speedup** of a computation on \(P\) processors: the ratio \(T_1/T_P\)
  - How many times faster the computation runs on \(P\) processors than on 1 processor
  - By the work law \(T_P \geq T_1/P\), so \(T_1/T_P \leq P\)
  - Thus the speedup on \(P\) processors can be at most \(P\)
- **Linear speedup**
  - The speedup is linear in the number of processors: \(T_1/T_P = \Theta(P)\)
- **Perfect linear speedup**
  - \(T_1/T_P = P\)
- **Parallelism** of a multithreaded computation
  - The ratio of the work to the span: \(T_1/T_{\infty}\)

- Three perspectives on parallelism
  - As a ratio: the average amount of work that can be performed in parallel for each step along the critical path
  - As an upper bound: the maximum possible speedup that can be achieved on any number of processors
  - As a limit: on the possibility of attaining perfect linear speedup

- Example: computation of \(P-FIB(4)\), assume each strand takes unit time
- Work is \(T_1 = 17\)
- Span is \(T_{\infty} = 8\)
- Parallelism is \(T_1/T_{\infty} = 17/8 = 2.125\)
- This means that achieving more than double the speedup is impossible (for the specific example of \(P-FIB(4)\))
- No matter how many processors we use

- *Parallel slackness* of a multithreaded computation executed on an ideal parallel computer with \(P\) processors
  - \((T_1/T_{\infty})/P = T_1/(PT_{\infty})\)
  - The factor by which the parallelism of the computation exceeds the number of processors in the machine

- Multithreaded algorithms require efficient scheduling
  - In the multithreading programming model, there is no way to specify which strand executes on which processor
  - We rely on the concurrency platform's scheduler

- How it works
  - The scheduler maps strands to static threads
  - The OS schedules the threads onto the processors
- For our analysis, we assume that the scheduler maps strands directly to processors

- We assume an on-line centralized scheduler
  - **On-line**: has no advance knowledge of when strands will be spawned or when they will complete
  - **Centralized**: knows the global state of the computation at any given time
- *Greedy schedulers*
  - Assign as many strands to processors as possible in each time step
  - **Complete step**: there are at least \(P\) strands ready to execute during a time step
  - **Incomplete step**: there are fewer than \(P\) strands ready to execute

- Theorem 27.1
*On an ideal parallel computer with \(P\) processors, a greedy scheduler executes a multithreaded computation with work \(T_1\) and span \(T_{\infty}\) in time*- \(T_P \leq T_1/P + T_{\infty}\)
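
The bound is easy to check empirically. A minimal simulation of a greedy scheduler on a unit-time dag (the dag and helper names are ours, for illustration):

```python
def greedy_schedule(dag, P):
    """Simulate a greedy scheduler: each time step runs up to P ready unit-time strands."""
    preds = {u: 0 for u in dag}          # count of unfinished predecessors
    for u in dag:
        for v in dag[u]:
            preds[v] += 1
    ready = [u for u, c in preds.items() if c == 0]
    steps = 0
    while ready:
        steps += 1
        running, ready = ready[:P], ready[P:]   # complete step if >= P ready, else incomplete
        for u in running:
            for v in dag[u]:
                preds[v] -= 1
                if preds[v] == 0:
                    ready.append(v)
    return steps

# Hypothetical dag with work T1 = 4 and span Tinf = 3 (critical path a -> b -> d).
dag = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
T1, Tinf = 4, 3
for P in (1, 2, 4):
    assert greedy_schedule(dag, P) <= T1 / P + Tinf   # the Theorem 27.1 bound holds
```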

- Corollary 27.2
*The running time \(T_P\) of any multithreaded computation scheduled by a greedy scheduler on an ideal parallel computer with \(P\) processors is within a factor of 2 of optimal*

- Corollary 27.3
*Let \(T_P\) be the running time of a multithreaded computation produced by a greedy scheduler on an ideal parallel computer with \(P\) processors, and let \(T_1\) and \(T_{\infty}\) be the work and span of the computation, respectively. Then, if \(P \ll T_1/T_{\infty}\), we have \(T_P \approx T_1/P\), or equivalently, a speedup of approximately \(P\)*.

- Analyzing the work
  - Relatively straightforward
  - It is the running time of an ordinary serial algorithm: the serialization of the multithreaded algorithm

- Analyzing the span
  - More interesting/complicated
  - For two computations in series, their spans add to form the span of their composition
  - For two computations in parallel, the span of their composition is the maximum of the spans of the two subcomputations

- When analyzing \(P-FIB(n)\)
  - The spawned call to \(P-FIB(n-1)\) in line 3 runs in parallel with the call to \(P-FIB(n-2)\) in line 4
  - We express the span of \(P-FIB(n)\) as a recurrence
    - \(T_{\infty}(n) = \max(T_{\infty}(n-1), T_{\infty}(n-2)) + \Theta(1)\)
    - \(= T_{\infty}(n-1) + \Theta(1)\)
    - With solution \(T_{\infty}(n) = \Theta(n)\)
- The parallelism of \(P-FIB(n)\) is \(T_1(n)/T_{\infty}(n) = \Theta(\phi^n/n)\)
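
Evaluating the work and span recurrences numerically (assigning unit cost to the \(\Theta(1)\) terms, an assumption of ours) shows the parallelism growing very quickly:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def t1(n):
    """Work: T1(n) = T1(n-1) + T1(n-2) + 1, with unit cost for the Theta(1) term."""
    return 1 if n <= 1 else t1(n - 1) + t1(n - 2) + 1

@lru_cache(maxsize=None)
def t_inf(n):
    """Span: T_inf(n) = max(T_inf(n-1), T_inf(n-2)) + 1, which equals T_inf(n-1) + 1."""
    return 1 if n <= 1 else max(t_inf(n - 1), t_inf(n - 2)) + 1

for n in (10, 20, 30):
    print(n, t1(n) / t_inf(n))   # parallelism ~ phi^n / n: grows roughly exponentially
```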

- Many algorithms have loops whose iterations can run in parallel
  - We can parallelize such loops using **spawn** and **sync**
  - But it is more convenient to specify directly that the iterations of the loop can run concurrently
  - We use the **parallel** concurrency keyword
    - It precedes the **for** keyword in a **for** loop statement

- Example: multiplying an \(n \times n\) matrix \(A = (a_{ij})\) by an \(n\)-vector \(x = (x_j)\)
  - The result is an \(n\)-vector \(y = (y_i)\), where \(y_i = \sum\limits_{j=1}^n a_{ij}x_j\) for \(i = 1, 2, \dots, n\)
  - We can compute all the entries of \(y\) in parallel as follows:

- The **parallel for** keywords indicate that the iterations of the loop may run concurrently
- A compiler can implement each **parallel for** loop as a divide-and-conquer subroutine using nested parallelism
  - e.g., a call to \(MAT-VEC-MAIN-LOOP(A, x, y, n, 1, n)\)