Parallel Support Vector Machines

By: Sanjna Rao

Guided by: Thorsten Joachims for CS790

The aim of this project was to implement a parallel version of Support Vector Machines (SVMs). SVMs are a relatively new technique in Machine Learning and, like Neural Networks, can be used for pattern recognition or non-linear regression. Although SVM training is often considered faster than training Neural Networks, it can still be computationally intensive on large datasets: given a large enough dataset, training can take days or even weeks. In this project we have attempted to overcome this problem by parallelizing the algorithm to reduce training times. We have tested our algorithm on a 4-processor machine with moderately sized datasets to check for convergence and feasibility, and we are now moving on to testing large datasets in a distributed environment.

 

A more detailed look into the project:

The idea of a Support Vector Machine is based on two mathematical operations:

1. Non-linear mapping of an input vector into a high-dimensional feature space that is hidden from both the input and the output.

2. Construction of an optimal hyperplane from features discovered in Step 1.

This hyperplane is a decision surface constructed so that it separates members of the different classes while maximizing the margin between them.
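For concreteness, this "maximum margin" goal leads to the standard soft-margin SVM formulation (this is the textbook form, not anything specific to our implementation):

    \min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
    \quad \text{subject to} \quad y_i \big( w^\top \phi(x_i) + b \big) \ge 1 - \xi_i, \;\; \xi_i \ge 0,

where \phi is the non-linear feature map from Step 1. In practice one solves the equivalent dual problem, a convex quadratic program in the variables \alpha_i:

    \max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
    \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C,

with the kernel K(x_i, x_j) = \phi(x_i)^\top \phi(x_j).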

Finding this "optimal" hyperplane presents itself as a convex optimization problem. The simplest way to solve such a problem is a Gradient Ascent approach that follows the steepest-ascent path to the optimal solution. A more efficient way is to use the Chunking and Decomposition algorithms. The basic idea behind our parallel algorithm is derived from these two approaches: the dataset is distributed to different processors, and their results are aggregated until convergence, as sketched below.
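To make the distribute-and-aggregate idea concrete, here is a minimal Python sketch. It is an illustration only, not our actual implementation: it uses scikit-learn's SVC as the per-chunk solver and the multiprocessing module for parallelism, and the names (train_chunk, parallel_svm_train), the number of workers, and the simple stopping rule are all assumptions made for the example. Each worker trains on its chunk, only the support vectors it finds are kept, and the aggregated support vectors are redistributed until the set stops changing.

import numpy as np
from multiprocessing import Pool
from sklearn.svm import SVC


def train_chunk(args):
    # Train an SVM on one chunk of the data and return only that chunk's
    # support vectors (the examples that actually define the decision surface).
    X_chunk, y_chunk = args
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X_chunk, y_chunk)
    sv = clf.support_
    return X_chunk[sv], y_chunk[sv]


def parallel_svm_train(X, y, n_workers=4, max_iters=10, seed=0):
    # Distribute the data across workers, keep each worker's support vectors,
    # and repeat on the aggregated set until it stops shrinking.  The real
    # algorithm would test the KKT conditions instead of this simple proxy.
    rng = np.random.default_rng(seed)
    X_pool, y_pool = X, y
    prev_size = -1
    for _ in range(max_iters):
        perm = rng.permutation(len(y_pool))          # shuffle so every chunk
        X_pool, y_pool = X_pool[perm], y_pool[perm]  # sees both classes
        X_chunks = np.array_split(X_pool, n_workers)
        y_chunks = np.array_split(y_pool, n_workers)
        with Pool(n_workers) as pool:
            results = pool.map(train_chunk, list(zip(X_chunks, y_chunks)))
        X_pool = np.vstack([r[0] for r in results])
        y_pool = np.concatenate([r[1] for r in results])
        if len(y_pool) == prev_size:                 # aggregated set unchanged
            break
        prev_size = len(y_pool)
    # Train the final model on the aggregated support vectors.
    final = SVC(kernel="rbf", C=1.0, gamma="scale")
    final.fit(X_pool, y_pool)
    return final

As a usage example, one could load a two-class UCI dataset into numpy arrays X and y and call parallel_svm_train(X, y, n_workers=4); on platforms that spawn worker processes, the call should sit under an if __name__ == "__main__": guard.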

Convergence is determined by whether the Karush-Kuhn-Tucker (KKT) conditions are satisfied. We have found that the parallel algorithm always converges, satisfying these conditions even though training is distributed.
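For reference, the stopping test uses the standard KKT conditions for the SVM dual: for every training example i,

    \alpha_i = 0 \;\Rightarrow\; y_i f(x_i) \ge 1, \qquad
    0 < \alpha_i < C \;\Rightarrow\; y_i f(x_i) = 1, \qquad
    \alpha_i = C \;\Rightarrow\; y_i f(x_i) \le 1,

where f(x) = \sum_j \alpha_j y_j K(x_j, x) + b is the current decision function. Training stops once every example satisfies its condition, in practice up to a small tolerance.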

The algorithm has been tested on a few datasets from the UCI Machine Learning Repository. On these datasets the results were favorable, with convergence in 3-4 iterations on average and relatively short training times. We have not yet tested extensively across many datasets and conditions, but we hope to do so soon.

I am glad to have an opportunity to present the above ideas and results to my fellow students and faculty, and I am especially thankful to Prof. Thorsten Joachims for his wonderful guidance throughout this project.