Friday 28 December 2012

What does 40x mean?

 So you created your algorithm, tested and you got least say a 40x improvement ( like i did :).  What does this mean? How much more data you can process in the same amount of time?

 The answer of this questions is related with the complexity of your algorithm. Least say you have a O(n^2) edit distance algorithm and you can compare 2 string of length 1000 in about 1 second using your CPU and 40 times faster using your GPU, the question is what is the length of the string that you can process in 1 second using your CPU?

 The answer to this question it turns our to be rather simple:
more precisely 6324. So basically you will be able to process strings 6 times bigger in the same amount of time.

 How i calculated this? i will let you figure it out :). As a help you can use the following speed gain table.


Algorithm complexity Speed gain ( D = speedup factor )


Thursday 27 December 2012

Matrix multiplication

Today i tested matrix multiplication using Aparapi and my 7770 card. The results are great, another 40X improvement over my quad core 960T cpu.


The interesting part is that the shape of the graphs is the same. This means that we can't change the complexity of the algorithm by using a gpu but only make it a lot run faster.

The results 

and also the Java Code
 import java.util.Random;  
 import com.amd.aparapi.Kernel;  
 public class MatrixMultiplication {  
   public static void main ( String [] arg ){  
     test( 500,500,500 );   
     test( 1000,1000,1000 );   
     test( 1500,1500,1500 );   
     test( 2000,2000,2000 );   
     test( 2500,2500,2500 );   
   }  
   public static void test ( int a, int b, int c ){  
     Random rnd = new Random();   
     int [] A = new int[a*b];  
     int [] B = new int[b*c];  
     for ( int i=0; i<A.length; i++ ){  
       A[i] = rnd.nextInt( 1000 );   
     }  
     for ( int i=0; i<B.length; i++ ){  
       B[i] = rnd.nextInt( 1000 );   
     }  
     long min = Integer.MAX_VALUE;  
     for (int i = 0; i < 2; i++) {  
       long start = System.currentTimeMillis();  
       int[] rezultat = mult(A, B, a, b, c);  
       long end = System.currentTimeMillis();  
       min = Math.min( min, end - start );   
     }  
 //    printMatrix( A, a, b );   
 //    System.out.println();   
 //    printMatrix( B, b, c );   
 //    System.out.println();   
 //    printMatrix( rezultat, a, c );  
     System.out.println( "[" + a + "," + b +"]X[" + b + "," + c + "] " + min + " ms" );   
   }  
   public static void printMatrix( int [] m, int a, int b ){  
     for ( int i=0; i<a; i++ ){  
       for ( int j=0; j<b; j++ ){  
         System.out.print( m[i*b+j] + " " );   
       }  
       System.out.println();   
     }  
   }  
   public static int [] mult( final int [] A, final int [] B, final int a, final int b, final int c ) {  
     final int [] rezultat = new int [ a * c ];   
     Kernel kernel = new Kernel() {  
       @Override  
       public void run() {  
         int k = getGlobalId();  
         int linie = k/c;  
         int coloana = k%c;  
         for ( int i=0; i<b; i++ ){  
           rezultat[k] += A[ linie*b + i ] * B[ i*c + coloana ];   
         }  
       }  
     };  
     kernel.setExecutionMode( Kernel.EXECUTION_MODE.CPU );  
     kernel.execute( rezultat.length );  
     return rezultat;   
   }  
 }  

Wednesday 26 December 2012

Number of divisors

 I have found this: http://code.google.com/p/aparapi/, a java library that will translate java code to OpenCL binary code ( or something like that ), and i though to take it for a spin to see what i can do with it.

 For the first test we will calculate the number of divisors for the fist N natural numbers, will test with N = {1048576, 2097152, 3145728, 4194304, 5242880    6291456, 7340032, 8388608, 9437184 }. 
The complexity of the algorithm that i have implemented is n*sqrt(n) so normally it will not scale very well with large values of N


But before we begin let me introduce you to my rig:
 CPU: Phenom X4 960T
 GPU: Radeon 7770 GHZ Edition

 The price of the CPU was almost identical to that of the GPU so i think this is a fair comparison to see how much more performance you can get for the same amount of money if you use a aparapi and a GPU to solve it. 

 The performance gain is from 17X for the small instances up to 41X for the largest ones. This is almost unbelievable if you take into account how easy was to modify the algorithm into an Aparapi kernel and how cheap is to buy one of this cards.

 In the next weeks i will test this technology with some other problems to see how will perform.

Below are the results and the java code i have used for this test:

Results


1:  import com.amd.aparapi.Kernel;  
2:  public class NumberOfDivisors {  
3:    public static void main(String[] arg) {  
4:      for (int k = 1; k <= 9; k++) {  
5:        long min = Long.MAX_VALUE;  
6:        for (int t = 0; t < 4; t++) {  
7:          final int[] a;  
8:          a = new int[1024 * 1024 * k];  
9:          for (int i = 0; i < a.length; i++) {  
10:            a[i] = 0;  
11:          }  
12:          Kernel kernel = new Kernel() {  
13:            @Override  
14:            public void run() {  
15:              int k = getGlobalId();  
16:              for (int i = 2; i * i <= k; i++) {  
17:                if (k % i == 0) {  
18:                  a[k]++;  
19:                }  
20:              }  
21:            }  
22:          };  
23:          kernel.setExecutionMode(Kernel.EXECUTION_MODE.JTP);  
24:          long start = System.currentTimeMillis();  
25:          kernel.execute(a.length);  
26:          long end = System.currentTimeMillis();  
27:          min = Math.min(min, end - start);  
28:          // System.out.println(end - start + " ms");  
29:        }  
30:        System.out.println(k + " " + min + ",");  
31:      }  
32:    }  
33:  }