Saturday 5 January 2013

Floyd-Warshall (min graph distance)

In today's post I will present my results for the Aparapi GPU parallel version of the classical Floyd-Warshall min graph distance algorithm. We will test on 3 complete graphs with 1024, 2048 and 4096 nodes.
Coding the parallel version was very easy because the algorithm is data parallel by nature, so all I had to do was execute the 2 inner loops in a kernel N times (N = number of nodes).
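Here is a rough sketch of what that looks like, assuming the com.amd.aparapi Kernel API (this is just the idea, not my exact code; the class and variable names are mine): the distance matrix is flattened into one array, the kernel does the work of the two inner loops for one (i, j) pair, and the host loop launches it once for every intermediate node k.

import com.amd.aparapi.Kernel;

public class FloydWarshallSketch {
    public static void main(String[] args) {
        final int n = 1024;                    // number of nodes
        final int[] dist = new int[n * n];     // dist[i * n + j] = current best distance i -> j
        // ... fill dist with edge weights; use a large value like 1 << 29 for missing
        // edges so that sums of two entries cannot overflow an int ...
        final int[] k = new int[1];            // the current intermediate node, updated before each launch

        Kernel kernel = new Kernel() {
            @Override public void run() {
                int id = getGlobalId();        // one work item per (i, j) pair
                int i = id / n;
                int j = id % n;
                int viaK = dist[i * n + k[0]] + dist[k[0] * n + j];
                if (viaK < dist[i * n + j]) {
                    dist[i * n + j] = viaK;    // relax the path i -> k -> j
                }
            }
        };

        // the two inner loops run inside the kernel; the outer loop launches it N times
        for (int kk = 0; kk < n; kk++) {
            k[0] = kk;
            kernel.execute(n * n);
        }
        kernel.dispose();
    }
}

For a fixed k the updates of the different (i, j) cells are independent of each other, which is exactly why the two inner loops can be turned into a single kernel launch.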
We will compare the classical iterative version with the Aparapi OpenCL version executed on the CPU and on the GPU. Let's see the results.

For a graph with 1024 nodes (this algorithm has an N^3 complexity, so 1024 is a pretty large number):


The first thing that I noticed is that the iterative version is very fast: in total there are 1,073,741,824 iterations to be done and the iterative version finished in 1406 ms, which means about 763,685 iterations per ms. Taking into account that just starting a kernel (without memory copy or OpenCL translation) takes around 0.6 ms on my system, which is worth roughly 0.6 * 763,685 ≈ 460,000 iterations of the iterative version, I think that running a kernel to perform anything less than 500,000 operations will be slower than the iterative version (this number clearly differs from one system to another, but I think it is in this range).
Because of this, the speed increase for the GPU version is just 2X.
One strange thing is the very bad performance of the Aparapi CPU version. I am not sure what is causing this; I currently assume it has to do with the algorithm not being very cache friendly when run on the CPU.

 
For the graph with 2048 nodes the gap between the iterative version and the Aparapi version has increased to 6.6X.

For the graph with 4096 nodes the speed increase grows again, to around 10X.

Conclusions

This test proved that even if we need to run the same kernel multiple times we can still get a very good speed increase (10X). The only condition is that when we start a kernel we need to have a lot of data to process: I would say at least a couple of million operations for a marginal speed increase and around billions for a really big increase in performance.

Wednesday 2 January 2013

Levenshtein distance

In this test I tried to adapt the classic Levenshtein distance algorithm to Aparapi. What I did was basically to calculate in parallel the values for each color, i.e. each anti-diagonal of the distance matrix.
Let's say:
    A = the first string,
    B = the second string
then my algorithm runs the same kernel N = (len(A) + len(B) - 1) times (this number equals the number of different colors).
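Here is a rough sketch of that launching pattern, assuming the com.amd.aparapi Kernel API (not my actual code, just the idea; the class and variable names are mine, and it keeps the whole distance matrix and re-copies it on every launch, so it only illustrates the structure, not the performance):

import com.amd.aparapi.Kernel;

public class LevenshteinDiagonalSketch {
    public static void main(String[] args) {
        final int[] a = "kitten".chars().toArray();        // first string (example input)
        final int[] b = "sitting".chars().toArray();       // second string (example input)
        final int cols = b.length + 1;
        final int[] d = new int[(a.length + 1) * cols];    // d[i * cols + j] = distance of the prefixes
        for (int i = 0; i <= a.length; i++) d[i * cols] = i;   // boundary column
        for (int j = 0; j <= b.length; j++) d[j] = j;          // boundary row
        final int[] diag = new int[2];                     // diag[0] = current diagonal, diag[1] = first row on it

        Kernel kernel = new Kernel() {
            @Override public void run() {
                int i = diag[1] + getGlobalId();           // row index of this cell
                int j = diag[0] - i;                       // column index, so i + j stays on the same diagonal
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                int del = d[(i - 1) * cols + j] + 1;
                int ins = d[i * cols + (j - 1)] + 1;
                int sub = d[(i - 1) * cols + (j - 1)] + cost;
                int best = del < ins ? del : ins;
                d[i * cols + j] = sub < best ? sub : best;
            }
        };

        // one launch per anti-diagonal: len(A) + len(B) - 1 launches in total
        for (int dg = 2; dg <= a.length + b.length; dg++) {
            int iMin = Math.max(1, dg - b.length);
            int iMax = Math.min(a.length, dg - 1);
            diag[0] = dg;
            diag[1] = iMin;
            kernel.execute(iMax - iMin + 1);
        }
        kernel.dispose();
        System.out.println("distance = " + d[a.length * cols + b.length]);
    }
}

Each launch only computes the cells of one diagonal, because those are the only cells whose three neighbours (left, top, top-left) are already known.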
 


I executed the algorithm on some randomly generated strings and here are the results:



The GPU time is 37X slower than the iterative version for the 1024*4 length strings and 14X slower for the 1024*14 length strings. Why is this?
I did some testing and the 2 most probable causes are:
    a) Kernel start time: if you run the same kernel multiple times, even if it does nothing, each run takes around 0.6 ms in GPU mode and 0.125 ms in CPU mode (the 2 times are without buffer copy and Java to OpenCL code translation); a rough way to measure this is sketched after this list.
    b) Low number of operations executed by each kernel run. The maximum number of operations executed at one time is len(B).
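This is roughly how the per-launch overhead in (a) can be measured (a sketch, assuming the com.amd.aparapi Kernel API; the do-almost-nothing kernel and the names are mine):

import com.amd.aparapi.Kernel;

public class KernelStartOverhead {
    public static void main(String[] args) {
        final int[] dummy = new int[1024];

        Kernel kernel = new Kernel() {
            @Override public void run() {
                dummy[getGlobalId()] = getGlobalId();       // trivial body, just so the kernel does something
            }
        };
        kernel.setExecutionMode(Kernel.EXECUTION_MODE.GPU); // change to EXECUTION_MODE.CPU for the CPU-mode time
        kernel.execute(dummy.length);                       // the first call also pays the Java -> OpenCL translation

        int launches = 1000;
        long start = System.nanoTime();
        for (int i = 0; i < launches; i++) {
            kernel.execute(dummy.length);
        }
        double msPerLaunch = (System.nanoTime() - start) / 1_000_000.0 / launches;
        System.out.println("average launch time: " + msPerLaunch + " ms");
        kernel.dispose();
    }
}

The first execute() is kept out of the measurement because it also pays for the Java to OpenCL translation; the remaining launches show the fixed cost you pay every time, no matter how little work the kernel does.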

What did I learn from this? When you start a kernel, make sure it has a lot of data to process: at least millions, if not billions, of operations. If you start a kernel to perform 10,000, 100,000 or even 1,000,000 operations it will be slower than running that code on the CPU.

 Results
 Java code

Tuesday 1 January 2013

Large prime calculator


Today we will test how fast we can check if a large number (19 digits) is prime. I wrote a classical sequential algorithm and an Aparapi parallel version, and here are the results:



There is a massive 50X improvement over the sequential algorithm and a very good 14X improvement over the parallel version run on the CPU. I don't know why we didn't see the same 40X improvement as in the previous 2 algorithms, but 14X is still very good.
The sequential version just checks, for each number from 3 to sqrt(n), whether it divides n.
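A minimal sketch of that check (the 19-digit example value, the class name and the step of 2 over odd divisors are extras of mine; a 19-digit number still fits in a signed 64-bit long, which tops out at about 9.2 * 10^18):

public class SequentialPrimeCheck {
    // trial division from 3 up to sqrt(n)
    static boolean isPrime(long n) {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;
        long limit = (long) Math.sqrt((double) n) + 1;   // +1 guards against floating point rounding
        for (long d = 3; d <= limit; d += 2) {           // even divisors cannot divide the odd n we have here
            if (n % d == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrime(9223372036854775783L));   // a 19-digit example value
    }
}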
The parallel version splits the interval [3, sqrt(n)] into 128*1028 equal intervals and the numbers from each interval are checked by a "GPU thread" (I am not sure what the right terminology for this is).
You probably wonder why 128*1028 intervals? I tested with different numbers of intervals and this offered the best performance. This is probably caused by the fact that the GPU is very efficient at switching between and allocating resources to threads, and a larger number of threads will maximize resource utilization (this is just my empirical explanation).
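Here is a rough sketch of the parallel idea, assuming the com.amd.aparapi Kernel API (not my actual code; the candidate number, class and variable names are examples of mine): each work item walks its own chunk of [3, sqrt(n)] and raises a flag if it finds a divisor.

import com.amd.aparapi.Kernel;

public class ParallelPrimeCheckSketch {
    public static void main(String[] args) {
        final long n = 9223372036854775783L;       // a 19-digit example candidate
        final int chunks = 128 * 1028;             // the interval count that gave the best performance in my tests
        final long limit = (long) Math.sqrt((double) n) + 1;
        final long chunkSize = (limit - 3) / chunks + 1;
        final int[] foundDivisor = new int[1];     // set to 1 by any work item that finds a divisor

        Kernel kernel = new Kernel() {
            @Override public void run() {
                long from = 3 + (long) getGlobalId() * chunkSize;
                long to = from + chunkSize;
                for (long d = from; d < to && d <= limit; d++) {
                    if (n % d == 0) {
                        foundDivisor[0] = 1;       // benign race: every writer stores the same value
                    }
                }
            }
        };
        kernel.execute(chunks);
        kernel.dispose();
        System.out.println(n + (foundDivisor[0] == 0 ? " is prime" : " is not prime"));
    }
}

Writing the same value 1 from several work items is a benign race, so no synchronization is needed to get the correct answer.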

Results (spreadsheet) and Java code