A solution has been proposed to improve the performance of pclust protein clustering algorithm.
This solution makes use of OpenMP library and involves the following steps: i) Identifying the contention spots in the algorithm. ii) Determining how parallelization can be used to reduce or eliminate contention. iii) Applying OpenMP constructs to the algorithm. iv) Testing the parallelized algorithm for errors such as race condition and comparing its performance to the serial algorithm. v) Verifying the results produced by the parallelized algorithm. The algorithm involves a 2-pass Shingling process. The main idea of the Shingling algorithm is as follows: Intuitively, two vertices sharing a shingle. The algorithm seeks to group such vertices together and use them as building blocks for dense sub graphs *.
The input to the algorithm is a FASTA file with n sequences, variables s and c. Variables s and c denote the size of shingle and the number of trials respectively. Larger the value of s, lesser the probability that two vertices share a shingle. The parameter c is intended to create the opposite e?ect*. We start the parallelization process by modifying the init_vars function which is used to allocate memory to different variables. In the following code, allocation of one variable is completely independent from the allocation of other variables so rather than executing these statements serially, they can be run parallelly on different processors using the section construct.
Consider the following code:#pragma omp parallel #pragma omp sections #pragma omp section vidmap = emalloc(gN*(sizeof *vidmap)); #pragma omp section gA = emalloc(gC*(sizeof *gA)); #pragma omp section gB = emalloc(gC*(sizeof *gB)); #pragma omp section mfSglCnt = gN*gC; gFSgl = ecalloc(mfSglCnt, sizeof *gFSgl); #pragma omp section n2gidHash = ecalloc(gN, sizeof(*n2gidHash)); //end of sections Next we parallelize the free_vars function which is used to deallocate the variables. Here we are using the same approach as init_vars. However instead of using separate sections for each free (memory deallocation) statement, we put four free statements inside one section. This will schedule four free statements to a single processor.
We do this because the deallocation process is relatively less time taking. So if we schedule each free statement to single processor, the overhead increases which is undesirable. Consider the following code: #pragma omp parallel #pragma omp sections #pragma omp section free(vidmap); free(gA); free(gB); free_union(uSet); #pragma omp section free_sgl(gFSgl, fSglCnt); free_sgl(gSSgl, msSglCnt); free_hash(); free_gid_hash(n2gidHash, gNN); //end of sections We are only adding OpenMP constructs to the code and not modifying the logic of the algorithm until required. It is also important to note that some parts of the algorithm cannot be parallelized due to presence of I/O bound statements. Parallelization can only be performed on CPU bound statements. For example, consider the function shingle which adds a lot to the total overhead due to presence of many for loops which are highly dependent on I/O. It is very evident that for loops are the major contention spots in a program. Optimizing these loops can help to improve the run-time performance of the code.
One method of optimizing them can be by splitting the iterations and scheduling them on multiple processors. Functions like free_hash(), free_gid_hash(), free_adjList(), free_sgl(), init_union(), init_vidmap() have for loops with CPU bound statements which can effectively be parallelized by using #pragma omp parallel for directive. Parallelizing these loops effectively reduce the time for which these loops run thus improving the overall performance.
Consider the following for loops:1) FREE_HASH (): ….. #pragma omp parallel for schedule(dynamic,500) num_threads(4) shared(i) for(i=0; i parent = i; ufSeti.rank = 0; //End of pragma for return ufSet;In both the code snippets, shared(i) has been used because variable i has to be shared among all the threads. Schedule(dynamic, n) means that n iterations will be dynamically allocated to any one of the available processors. Apart from the for construct, we also used constructs like task
parent = i; ufSeti.rank = 0; //End of pragma for return ufSet;In both the code snippets, shared(i) has been used because variable i has to be shared among all the threads. Schedule(dynamic, n) means that n iterations will be dynamically allocated to any one of the available processors. Apart from the for construct, we also used constructs like task