CUDA Programming/ProjectDescription


Evaluating the Performance of GPGPUs and Their Use in Scientific Computing

Introduction

Computational science (or scientific computing) is the field of study concerned with constructing mathematical models and quantitative analysis techniques and using computers to analyse and solve scientific problems. Scientists and engineers develop computer programs (application software) that model the systems being studied, and run these programs with various sets of input parameters.[0] Applications from scientific computing often require a large amount of execution time due to large system sizes or a large number of iteration steps. The execution time can be significantly reduced by parallel execution on a suitable parallel or distributed execution platform.[1]

Historically, scientists used supercomputers or computer grids to carry out these computations. However, with the advancements in computer graphics, graphics processing units (GPUs) became much more efficient and powerful. Because of the nature of graphical data, GPUs became more specialized in handling complex matrix calculations and performing massive numbers of mathematical computations. As the processing power of GPUs has increased, so has their demand for electrical power. This problem has led researchers to look for alternative solutions, and parallel programming has been adopted by many scientists to further optimize performance. Nowadays, the GPU is especially well suited to problems that can be expressed as data-parallel computations with high arithmetic intensity. Many applications that process large data sets such as arrays or volumes can use a data-parallel programming model to speed up computations. These applications include, for example[2]:

  • Seismic simulations
  • Computational biology
  • Option risk calculations in finance
  • Medical Imaging
  • Pattern recognition
  • Signal processing
  • Physical simulation

Ackermann et al. [3] have developed a computational approach for massively parallel simulation of biological molecular networks that leverages the computing power of modern graphics cards. They demonstrated that parallelization on the GPU gave a speedup of about a factor of 59 compared to a CPU implementation executed on a standard PC. Davis et al. [4] carried out water simulations on GPUs and compared the performance of the GPU version against the same simulation on a single CPU and on multiple CPUs. According to their results, the GPU implementation runs ~7x faster than on a single CPU. Another study on data normalization, by Rodríguez et al. [5], suggests that their GPU implementation of a quantile-based normalization method for high-density oligonucleotide array data, based on variance and bias, achieves a speedup factor exceeding 7x over the counterpart methods implemented on CPUs.

Research Description

Purpose

The purpose of this research project is to illustrate the performance gain of using GPUs for general-purpose computing compared to CPUs.

Problem

The problem I work on to test the hardware is “cluster analysis of gene expressions”. Cluster analysis, or clustering, is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense [6]. A gene is a segment of DNA that contains the formula for the chemical composition of one particular protein. The large majority of abundantly expressed genes are associated with common functions, such as metabolism, and hence are expressed in all cells. However, there will be differences between the expression profiles of different cells, and even in a single cell, expression will vary with time, in a manner dictated by external and internal signals that reflect the state of the organism and the cell itself [7]. A natural basis for organizing gene expression data is to group together genes with similar patterns of expression. For any series of measurements, a number of sensible measures of similarity in the behavior of two genes can be used [8]. Experts in the biological sciences can then use this grouping to gather further knowledge in the area. This makes cluster analysis the best candidate for extracting information from gene expression data.

Methodology

For testing purposes, I used three different clustering programs: one is a single-threaded, ANSI C compliant program, and the other two use the CUDA[9] and OpenCL[10] parallel programming APIs respectively. For the C program, I used the Cluster 3.0 [11] software. I wrote the CUDA and OpenCL implementations myself.

The clustering algorithm used in this project is hierarchical clustering with Euclidean distance[12] as a distance metric and single linkage[13] as a linkage method.
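To make the algorithm concrete, here is a minimal single-threaded sketch of hierarchical clustering with Euclidean distance and single linkage. It is written in Python for readability and is not the code of Cluster 3.0 or of the CUDA and OpenCL implementations; the function names `euclidean` and `single_linkage` and the returned merge format are assumptions made for illustration. The naive search for the closest pair of clusters is deliberately simple and only practical for small inputs.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length rows of measurements."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(rows):
    """Naive agglomerative clustering (illustrative sketch only).

    Returns a list of merges (cluster_a, cluster_b, distance). The original
    rows are clusters 0..n-1; each merge creates a new cluster id n, n+1, ...
    """
    n = len(rows)
    # Pairwise row-to-row distances; every entry is independent of the
    # others, which is the kind of work a data-parallel device handles well.
    dist = [[euclidean(rows[i], rows[j]) for j in range(n)] for i in range(n)]

    clusters = {i: [i] for i in range(n)}  # cluster id -> member row indices
    merges = []
    next_id = n
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for p in range(len(ids)):
            for q in range(p + 1, len(ids)):
                a, b = ids[p], ids[q]
                # Single linkage: distance between two clusters is the
                # smallest distance between any pair of their members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges
```

Calling `single_linkage` on a list of expression rows (one row per gene, one column per sample) returns the merge order, from which a dendrogram can be built. A GPU version would most likely parallelize the distance matrix and the closest-pair search, but how the project's implementations divide that work is not described here.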

The gene data was gathered from Gene Expression Omnibus Data Set Record 3345 [14]. From it, data sets of the following sizes (rows x columns) were generated: 4096x16, 8192x16, 16384x16, 4096x32, 8192x32, 16384x32, 4096x64, 8192x64, 16384x64.

Evaluation

Evaluation of the work is based on performance metrics used to evaluate processing units (CPUs and GPUs). Please see the Benchmarking Tools section of the wiki for more detailed information.

Results

The results show that the program written using the CUDA API performed significantly better than both the OpenCL program and Cluster 3.0. The speedup of CUDA was between 2 and 8 times compared to OpenCL, and between 3 and 20 times compared to Cluster 3.0. It can be argued that the performance difference between CUDA and OpenCL comes from the fact that the OpenCL library is merely a wrapper around the CUDA library.
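For reference, a speedup figure like the ones above is conventionally the ratio of the baseline run time to the run time of the faster program on the same input. The sketch below assumes that definition; the helper name `speedup` and the timings in the example call are placeholders for illustration, not measurements from this project.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline run time to accelerated run time (same input)."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated run time must be positive")
    return baseline_seconds / accelerated_seconds


# Placeholder timings, chosen only to show the arithmetic.
example = speedup(baseline_seconds=10.0, accelerated_seconds=2.0)
```

A value above 1 means the accelerated program finished sooner than the baseline; a value of 1 means no gain.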

References: