Saturday 5 January 2013

Floyd-Warshall (min graph distance)

In today's post I will present the results of my Aparapi GPU parallel version of the classical Floyd-Warshall min graph distance algorithm. We will test on 3 complete graphs with 1024, 2048 and 4096 nodes.
The coding of the parallel version was very easy because the algorithm is data parallel by nature, so all I had to do was execute the 2 inner loops in a kernel N times (N = number of nodes), as sketched below.
We will compare the classical iterative version with the Aparapi OpenCL version executed on the CPU and on the GPU. Let's see the results.
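Here is a minimal sketch of the idea (simplified from the actual code I used, which is linked in the comments below; the class and method names here are just for illustration): the outer k loop stays on the host, and the two inner loops over (i, j) become one kernel over N*N work items.

```java
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

// The N*N distance matrix is flattened row-major: dist[i * n + j].
public class FloydWarshallKernel extends Kernel {

    final float[] dist;
    final int n;
    int k; // current intermediate node, updated on the host between launches

    public FloydWarshallKernel(float[] dist, int n) {
        this.dist = dist;
        this.n = n;
    }

    @Override
    public void run() {
        // One work item per (i, j) pair: the 2 inner loops of the
        // classical algorithm, executed in parallel for a fixed k.
        int id = getGlobalId();
        int i = id / n;
        int j = id % n;
        float viaK = dist[i * n + k] + dist[k * n + j];
        if (viaK < dist[i * n + j]) {
            dist[i * n + j] = viaK;
        }
    }

    public void computeAllPairs() {
        Range range = Range.create(n * n);
        setExplicit(true); // manage buffer transfers manually
        put(dist);         // copy the matrix to the device once
        for (k = 0; k < n; k++) {
            execute(range); // launch the kernel N times, once per k
        }
        get(dist);         // copy the result back once at the end
        dispose();
    }
}
```

Updating dist in place is safe here because row k and column k do not change during iteration k (a path that already starts or ends at node k cannot be shortened by going through node k).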

For a graph with 1024 nodes (this algorithm has N^3 complexity, so 1024 is a pretty large number)
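For reference, the iterative baseline is just the classic triple loop over the same flattened matrix (again a sketch, mirroring the kernel above):

```java
// The classical iterative Floyd-Warshall: N^3 relaxations in total,
// which for N = 1024 is 1024^3 = 1,073,741,824 iterations.
static void floydWarshallIterative(float[] dist, int n) {
    for (int k = 0; k < n; k++) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                float viaK = dist[i * n + k] + dist[k * n + j];
                if (viaK < dist[i * n + j]) {
                    dist[i * n + j] = viaK;
                }
            }
        }
    }
}
```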


The first thing that I noticed is that the iterative version is very fast: in total there are 1,073,741,824 iterations to be done and the iterative version finished in 1406 ms, which means about 763,685 iterations per ms. Taking into account that just starting a kernel (without memory copy or OpenCL translation) takes around 0.6 ms on my system, I think that running a kernel for anything less than 500,000 operations will be slower than the iterative version: in 0.6 ms of launch overhead the CPU can already do roughly 0.6 × 763,685 ≈ 460,000 iterations (this number clearly differs from one system to another, but I think it is in this range).
Because of this, the speed increase for the GPU version is only 2X.
One strange thing is the very bad performance of the Aparapi CPU version. I am not sure what is causing this; currently I assume it has to do with the algorithm not being very cache friendly when run on the CPU.
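If you want to estimate the kernel launch overhead on your own system, a rough sketch along these lines will do (warm the kernel up first so the one-time OpenCL translation is excluded; the 0.6 ms above is of course specific to my machine):

```java
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class LaunchOverhead {
    public static void main(String[] args) {
        final int[] out = new int[1];
        Kernel tiny = new Kernel() {
            @Override
            public void run() {
                out[0] = getGlobalId(); // trivial body, we only time the launch
            }
        };
        tiny.setExplicit(true);        // avoid a buffer copy on every launch
        tiny.put(out);
        tiny.execute(Range.create(1)); // first call pays the OpenCL translation

        int launches = 1000;
        long t0 = System.nanoTime();
        for (int i = 0; i < launches; i++) {
            tiny.execute(Range.create(1)); // bare launches: no copy, no translation
        }
        long t1 = System.nanoTime();
        System.out.printf("~%.3f ms per kernel launch%n", (t1 - t0) / 1e6 / launches);
        tiny.dispose();
    }
}
```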

 
For the graph with 2048 nodes, the gap between the iterative version and the Aparapi GPU version has increased to 6.6X.

For the graph with 4096 nodes, there is another speed increase, to around 10X over the iterative version.

Conclusions

This test proved that even if we need to run the same kernel multiple times we can still get a very good speed increase (10X). The only condition is that every time we start a kernel we need to have a lot of data to process: I would say at least a couple of million operations per launch for a marginal speed increase, and billions of operations in total for a really big increase in performance. Here each launch processes N^2 node pairs, which is about 1 million for 1024 nodes, 4.2 million for 2048 and 16.8 million for 4096, and this lines up with the 2X, 6.6X and 10X results.

3 comments:

  1. My name is Gary Frost. I am the initial inventor/developer of Aparapi at AMD. I just saw this page. You have some great work here. I plan on linking to this page from the Aparapi home page (if that is OK).

    The OpenCL CPU times are curious here. Can I ask what CPU you are using? I have seen OpenCL CPU mode do very, very well in some applications; I am not sure what is happening here.

    I would be happy to help diagnose, if you are prepared to share your code.

    Great work.

    Gary

  2. Hi,
    I'm honored that you put the link to my blog on the Aparapi home page. I think that Aparapi is a great piece of technology that makes it very easy to write/debug/deploy GPU code. I'm excited that I can make my ideas run on the GPU as easily as writing a method in Java.

    My CPU is an AMD Phenom II X4 960T Black Edition and my video card is an AMD Radeon™ HD 7770 GHz Edition. The code that I used for this experiment can be downloaded from https://docs.google.com/open?id=0By7kYbiSKbj3ako2VHItMHowd00 (there is also a link at the bottom of the page; I shared the code for all my experiments but I didn't make the link very visible).
    I think that something is wrong with how the cache is used when running in OpenCL CPU mode, and that this is what causes the very bad performance; if you can shed some light on this I would be very curious and happy :)

    Thanks,
    Dan.

  3. Hello Vortex, I am just learning parallelization and have seen the LargePrimeNumber program.
    Can you please explain a little how to convert a sequential program to a parallel one by taking an example? Please let me know whether there are any online resources that explain the same.
