# High Peformance Computing BE Computer (2012 Course) Insem 2016

### High Performance Computing

#### Time: 1 hour                                                            Marks: 30

Q1) a) Explain Control Structure of Parallel Platform in detail.

• First, we understand the Term Course and Fine Granularity.
• Course Granularity: When a system is divided into a smaller number of large parts then it is Course  Granularity. It refers smaller no of subtasks.

Fine Granularity: When a system is divided into a large number of small parts then it is Fine Granularity.

2) In Parallel Computing Granularity is used at different layers. Granularity describes splitting of tasks into no of smaller subtasks.

3) A Parallel program is a collection of different programs or instructions which are to be executed in Parallel.

4) The working of processing units in Parallel computers is divided into two ways they are:

1) SIMD

2) MIMD

Let discuss this two in detail

1) SIMD:

In Single Instruction Multiple Data Stream, one instruction works on multiple data streams simultaneously with the help of processing elements (PE).

The structure of SIMD is as:

Fig. SIMD architecture

The single control unit is used to send a different instruction to different PE. SIMD is used to read instruction by program counter after it decodes them and send the signals to processing elements.

Fig. Processor array

All units are connected to each other by an interconnection network which also manages the data transmission. Memory is used to take data. The number of the data path depends upon the number of PE.

2) MIMD

In Multiple Instruction Multiple Data Stream, multiple instructions are executed on multiple data streams simultaneously. It can execute multiple programs at same time. It is used in Parallel machines. In MIMD every processor fetches its own instruction and work on its own data.

Fig. MIMD architecture

Q1) b) Explain the Basic working principle of Super-Scalar Processor. (6 marks)

• In Superscalar Processor architecture, one central register file is used for reading operand and writing results to it by the execution unit which becomes visible to each execution unit in next cycle. Through this different unit of operations can be executed.
• Let discuss different unit in detail:

1) ILP (Instruction-level Parallelism) in superscalar processors

Bypassing hardware is used for minimizing the delay between dependent instructions. It forwards the output of an instruction to all execution units.

2) Instruction Issue Logic:

Instruction Issue Logic is used to store instruction of programs which supply instructions to each unit in Parallel.

3) Hardware used:

Instruction Issue Logic is used to find out which instruction is provided in the cycle.

4) ILP strengths and weakness:

Parallel execution helps in achieving speedup while executing programs. Improving performance depends upon limitation imposed by instruction dependencies.

5) Addition of Execution Units to the processor:

A number of addition of execution units minimizes the instruction execution times. For example, the switching from one instruction unit to another minimizes execution time but as the execution units increases, most of them remain idle.

6) Multiple Issue Processors

A superscalar processor is also used in microprocessors which may support greater than four issue superscalars for achieving Instruction-level Parallelism.

7) Pipeline in Superscalar

Fig. superscalar pipeline

IF: Instruction Fetch

ID: Instruction Decode

Branch Prediction Logic provides control independencies for instructions. Register renaming is also used. There are two sections in superscalar pipeline those are in- order section and out of order section in which instruction is executed in an order which is different from one mentioned in the program.

OR

Q2) a) Differentiate between Write- Invalidate and Write-Update protocol. (6 marks)

Write-Invalidate:

1) The processor which writing the data which result in copies of caches of all processors that to be rendered invalid before it changes its local copy.

2) It has the limitation of false sharing. After cache copy invalidation, data stored on the local machine is possible to update until other processor requests it.

3) It is profitable in a situation where we want multiple writes to the same block with single processor migratory is sharing from one processor to another.

4) Write invalidates takes less bandwidth. On cache block basis and block ownership, Cache Coherency is maintained.

Write Update:

1) Writing process broadcasts the new data on the bus then those caches that have copies of data are updated.

2) It is good for Inter-Processor Contention.

3) Write update takes more bandwidth. It creates for the local copies for a write operation.

Fig. Before Update

Fig. after write back

Q2) b) Describe the merits of multi-threading over pre-fetching technique. (4 marks)

1) Multithreading requires no software support.

2) Multithreading can deal with complex access patterns in a condition where it is difficult to predict accesses ahead of time because it only gives a response to misses instead of predicting them.

3) Multithreading tolerates the latency trying to overlap latency of one context by the computation of another context.

Q3) a) Explain Recursive Decomposition with suitable example. (5 marks)

Ans: Recursive Concurrency is based on giving concurrency in problems which handle in divide and conquer strategy. First, the problem is divided into different subproblems then each subproblem is again divided further and solved individually using recursion.

Example:

1) A function can call itself in programming. This is called as recursion.

public class raw1 {

public static void main (String [] args) {

a ();

}

static void a (){

System.out.println(“example”);

a();

}

}

• Merge sort: In merge sort, first an array is divided into two parts then sorting is carried out and these two parts again divided further recursively. Finally, all sort units are merged to get the required final array which is our output.

Q3) b) Explain graph partitioning with suitable example (5 marks)

• For divide and conquer algorithms, graph partition first started as a preprocessing step where tasks are divided into the same size of small tasks. When we need to cluster the vertices into logical parts for storage, we can use graph partition.
• Let consider the problem of multiplying a sparse matrix with vector where we partition the graph to assign equal numbers of nodes to every process with decreasing edge count of partition.
• For graph partition, first, develop the initial partition of vertices then sweep through the vertices, deciding whether the size increased or decreased if we moved the vertex to another side.
• Example: here we take the dispersion of water contaminant in the lake which includes computing the level of contamination at every vertex of mesh at different intervals of time.
• At input, we take a weighted graph G= (V, E) and integer a, b, c and we partition the vertices in a (a is an integer) subset such a way that each one has the size at most b and cost of edge bounded by c.

OR

Q4) a) Differentiate between Spatial and Temporal locality of reference. Enlist application where it is useful. (6 marks)

Spatial Locality:

• When line size increases, cache miss traffic does not increase.
• If the memory location is referenced in particular time then maybe neighboring memory locations will be referenced in future.
• Neighbouring items of the item which is referenced have a high chance of reference in future.
• Also, Resources which are near previously referenced resources have a high chance of to be referred.
• If data dependency does not exist between two nodes then they can be executed concurrently.

Temporal Locality:

• When size increases, cache miss traffic decreases.
• Temporal locality is a locality over time. At a particular time if a memory location is referenced then there is a chance of referencing it again in future.
• Also, if a resource is referenced one time then it may reference in future.
• Temporal Parallelism results from pipelining waves of computation through the graph.

Q4) b) Write a note on NVIDIA Tesla GPU. (4 marks)

1) NVIDIA Tesla GPU is based on scalable processor array. GPU is used along with central processing unit for faster performance of the application. In GPU, thousands of cores are present as compared to CPU. GPU is also used for 3D applications.

2) GeForce 8800 GPU contain total 128 SP (streaming processor) which organized as 16 SM (streaming multiprocessor) in total 8 units they are texture/processor clusters (TPCs).

3) External DRAM control and fixed function raster operation processors perform depth and color operation into the memory.

4) Workflow of GPU is top to bottom with initial phase is host interface and it goes up to system PCI-express bus.

Q5) a) Illustrate MPI routines. (4 marks)

• We can define Syntax and semantics of library routines by MPI standard. All Parallel computers contain vendor implementation of MPI.
• We can write message passing programs with six routines.
1. a) MPI_Init:
2. b) MPI_Send
3. c) MPI_Recv
4. d) MPI_Comm_rank
5. e) MPI_finalize
6. f) MPI_Comm_size

Let discuss all one by one:

1. MPI_Init: It is used for initialization.

Syntax: int mpi_Init(int *argc,char **args)

1. b) MPI_Send: Use to send the message.
2. c) MPI_Recv: Use to receive the message.
3. d) MPI_Comm_rank: Use to decide label of calling process.

Syntax: MPI_Comm_rank(MPI_COMM_WORLD, &rank);

1. e) MPI_finalize: Use for termination.
2. f) MPI_Comm_size: Use to decide number of processes.

Syntax: MPI_Comm_size (MPI_COMM_WORLD, &size);

Q5) b) Write the note on Topologies and Embedding. (6 marks)

• We can represent process topology by graphs. 1, 2, 3D topologies are used in message passing programs which are denoted as Cartesian topologies.
• Processes are connected with their neighboring virtual grid and they can be recognized by Cartesian coordinates.
• Function for creating Cartesian topology is:

Int MPI_cart_create (MPI_Comm comm_old, int ndims, int *dims, int *periods, int recorder, MPI_Comm *comm_cart). First, it takes all the processes into old communicator then creates a new communicator having ndims dimensions. Through this, now we can identify every processor in new Cartesian topology.

• MPI sees processes are arranged in linear order while Parallel programs in high dimension communicate naturally. We can find out such mapping by interaction pattern and topologies of the machine.
• MPI enables programmers assemble processors in logical n-dimension meshes.
• Topology coordinates are used to identify every process in Cartesian topology. By using ranks in MPI we can specify source and destination of processes.
• To convert ranks to Cartesian coordinate and vice versa, MPI provides some routines to do so.
• MPI_Cart_shift function in MPI is used to calculate the rank of the source and destination processes.

OR

Q6) a) Explain Non-blocking communications using MPI. (4 marks)

1) Sender and receiver operations do not block the involved processes till their completion, for this to be carried out we use non-blocking communication.

2) First sender process invokes send () but it does not wait until completion of send operation and continue towards the next task.

3) Non-blocking send returns control to calling process so that other parts of the system can take control. After starting send () operation sender process needs to handle send complete call. When sending returns then sender continue its other tasks while in this period send perform carry operation on data.

4) Sender process verifies that data has been copied from send buffer or not, this task is handled by sending complete call.

5) MPI_Isend () and MPI_Irecv () function are used in non-blocking which start their operation but immediately returns control before task completion. I stand for initiate in above functions.

6) MPI_Test () is used for checking completion status and MPI_Wait () is used for giving waiting status until the operation completes its execution.

Q6) b) What are principles of message passing programming? (6 marks)

• Message passing programs generally are written using:
1. Asynchronous Paradigm: All concurrent tasks make use of asynchronous paradigm which helps in implementation of Parallel Algorithms. In asynchronous paradigm, Programs are non-deterministic in behavior because of race condition.
2. Loosely Synchronous Programs: In this, the subset of tasks synchronizes to perform interactions and in between these, tasks execute asynchronously. Many Parallel algorithms are implemented using loosely synchronous programs.
• Numbers of Sequential Programming Paradigms are considered together for message passing paradigm.
• Here processors have their own memory. Programs modules are designed in such a way that they can run on each processor.
• Communication is needed if modules are running on different processors which occur through messages.

Fig. system based for message passing paradigm

• In this, memory is not shareable thus processors cannot access each other’s memory. When there is data movement from one module to another then message transfer occurs.
• For providing information required for such message transfer, each processor needs to communicate with each other.
• Every process has permission for performing computation on its own data in local phase while for synchronization and communication in the processor, the global phase is allowed but computation in any form is not allowed in global phase.

### One Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.