Hashing for Large Scale Search and Learning



Asymmetric Locality Sensitive Hashing (ALSH)

We formally show that the popular framework of Locality Sensitive Hashing (LSH) is insufficient for obtaining sub-linear time solutions to Maximum Inner Product Search (MIPS). We develop a new hashing framework, asymmetric LSH (ALSH), which is strictly more powerful than LSH. Within this framework we construct provably sub-linear query time algorithms for MIPS. In practice, the new hashing scheme leads to substantial computational savings in item recommendation on the Movielens (10M) and Netflix datasets.
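As a concrete illustration, here is a small numerical sketch in the spirit of the asymmetric construction: database vectors are rescaled and padded with powers of their norms, while queries are normalized and padded with constants, after which maximum inner product corresponds (up to a vanishing term) to minimum Euclidean distance. The parameter choices (scaling constant 0.83, m = 5) and all names are illustrative, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess(X, m=5, U=0.83):
    # Rescale so every row norm is below U < 1, then append
    # ||x||^2, ||x||^4, ..., ||x||^(2^m) to each row.
    X = X * (U / np.max(np.linalg.norm(X, axis=1)))
    sq = np.sum(X * X, axis=1, keepdims=True)            # ||x||^2 per row
    return np.hstack([X] + [sq ** (2 ** i) for i in range(m)])

def query_transform(q, m=5):
    # Normalize the query and pad with constants 1/2: deliberately a
    # different transform from the one applied to the database (asymmetry).
    return np.concatenate([q / np.linalg.norm(q), np.full(m, 0.5)])

X = rng.normal(size=(200, 10))
q = rng.normal(size=10)
best_ip = int(np.argmax(X @ q))                          # true MIPS answer
P, Q = preprocess(X), query_transform(q)
best_l2 = int(np.argmin(np.linalg.norm(P - Q, axis=1)))  # NN after transform
print(best_ip, best_l2)
```

The residual term in the distance shrinks like ||x||^(2^(m+1)), so for moderate m the transformed nearest-neighbor problem recovers the inner-product maximizer.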

Anshumali Shrivastava and Ping Li. Asymmetric LSH (ALSH) for Sub-linear Time Maximum Inner Product Search (MIPS). [pdf][slides][video]
In Neural Information Processing Systems (NIPS) 2014. Best Paper Award
Anshumali Shrivastava and Ping Li. Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment. [pdf]
International World Wide Web Conference (WWW) 2015.
Anshumali Shrivastava and Ping Li. Improved ALSH for Maximum Inner Product Search (MIPS). [arxiv]
Technical Report 2014.


Efficient Replacements for Minwise Hashing

Large scale data processing applications require computing several minwise hashes of the data matrix as a primitive operation, which is computationally very expensive. We propose a new way of generating multiple hashes: first bin the data, compute local minwise-like hashes within each bin, and finally densify the empty bins via rotation. As a result, we can generate thousands (or more) distinct hashes with the required collision probability at the cost of one vanilla minhash, which yields an algorithmic query time improvement over traditional, widely adopted methods. For very sparse data, adding coin flips during the densification step provably improves the variance thanks to the extra randomization. We report experiments on the WEBSPAM, NEWS20, MNIST, RCV1 and URL datasets.
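A simplified sketch of the bin-and-densify idea (the function name, bin layout, and offset constant are illustrative; the papers give the exact scheme and its variance analysis):

```python
import numpy as np

def one_perm_densified(nonzeros, D, k, seed=0):
    # One random permutation of the universe [0, D), split into k bins
    # (assumes D divisible by k and a non-empty input set).
    rng = np.random.default_rng(seed)
    perm = rng.permutation(D)
    bin_size = D // k
    mins = [None] * k
    for i in nonzeros:
        b, offset = divmod(int(perm[i]), bin_size)   # bin and position inside it
        if mins[b] is None or offset < mins[b]:
            mins[b] = offset
    # Densification: an empty bin borrows from the next non-empty bin to its
    # right (circularly), shifted so borrowed values stay distinguishable.
    C = bin_size + 1
    out = list(mins)
    for b in range(k):
        if out[b] is None:
            hop, j = 1, (b + 1) % k
            while mins[j] is None:
                hop, j = hop + 1, (j + 1) % k
            out[b] = mins[j] + hop * C
    return out

A = one_perm_densified(range(0, 100), D=1024, k=32, seed=0)
B = one_perm_densified(range(50, 150), D=1024, k=32, seed=0)
match = sum(a == b for a, b in zip(A, B)) / 32
print(match)   # each coordinate collides with probability ~= resemblance 1/3
```

All k hashes come from a single permutation, which is where the cost saving over k independent minhashes comes from.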

Anshumali Shrivastava and Ping Li. Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search. [pdf][slides][video]
International Conference on Machine Learning (ICML) 2014.
Anshumali Shrivastava and Ping Li. Improved Densification of One Permutation Hashing. [pdf]
Conference on Uncertainty in Artificial Intelligence (UAI) 2014.


The Right Hash Function

It was taken for granted that the two popular hashing schemes, minhash and simhash, are theoretically incomparable, and that the choice between them should be decided by whether the desired notion of similarity is cosine similarity or resemblance. We show that it is possible to compare these two hashing schemes theoretically. In particular, we provide a framework for comparing two locality sensitive hashing (LSH) schemes designed for different similarity measures. Our theoretical and empirical results show the counter-intuitive conclusion that for sparse data minhash is the preferred choice even when the desired measure is cosine similarity. We report experiments on the MNIST, NEWS20, NYTIMES, RCV1, URL and WEBSPAM datasets.
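The comparison rests on the two schemes' collision probabilities: a minhash collides with probability equal to the resemblance, while a simhash bit (sign random projection) collides with probability 1 - theta/pi. A small sanity-check sketch of both facts (set sizes, dimension, and hash counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
D, n_hashes = 500, 5000

A = np.zeros(D)
A[:30] = 1            # indicator vector of the set {0, ..., 29}
B = np.zeros(D)
B[15:45] = 1          # indicator vector of the set {15, ..., 44}

inter = np.sum((A > 0) & (B > 0))
union = np.sum((A > 0) | (B > 0))
resemblance = inter / union                       # 15 / 45 = 1/3
cosine = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))

# Minhash: two sets collide with probability equal to their resemblance.
hits = 0
for _ in range(n_hashes):
    p = rng.permutation(D)
    hits += np.flatnonzero(A[p])[0] == np.flatnonzero(B[p])[0]
minhash_coll = hits / n_hashes

# Simhash (sign random projection): collision probability is 1 - theta / pi.
W = rng.normal(size=(n_hashes, D))
simhash_coll = np.mean(np.sign(W @ A) == np.sign(W @ B))

print(minhash_coll, resemblance)                      # should be close
print(simhash_coll, 1 - np.arccos(cosine) / np.pi)    # likewise
```

Since the two probabilities are monotone in different similarity measures, comparing the schemes requires relating resemblance and cosine similarity, which is exactly what the framework in the paper does.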

Anshumali Shrivastava and Ping Li. In Defense of Minhash over Simhash. [pdf] [slides]
International Conference on Artificial Intelligence and Statistics (AISTATS) 2014.
Anshumali Shrivastava and Ping Li. Fast Near Neighbor Search in High-Dimensional Binary Data. [pdf] [slides]
European Conference on Machine Learning (ECML) 2012. Top few papers invited for journal submission.


Beyond Pairwise: Higher Order Similarity Search

Some notions of similarity are not pairwise; for instance, 3-way resemblance is a similarity between three points. Such similarity measures naturally model many real scenarios such as group recommendation and clustering. However, k-way similarity measures lead to a combinatorial blow-up in computational complexity. We formalize the notion of search under k-way similarity measures and provide provably fast hashing algorithms to solve it. We demonstrate the applicability of 3-way and higher-order similarity measures in retrieval tasks and on "Google Sets".
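For 3-way resemblance, |A ∩ B ∩ C| / |A ∪ B ∪ C|, ordinary minwise hashing already gives an unbiased estimator, which is the kind of structure the fast algorithms exploit. A toy sketch (the sets, universe size, and trial count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D, trials = 200, 4000

A, B, C = set(range(0, 60)), set(range(30, 90)), set(range(45, 105))
exact = len(A & B & C) / len(A | B | C)   # 3-way resemblance = 15/105

def minhash(S, p):
    return min(p[i] for i in S)

# Under a random permutation the overall minimum is a uniform element of the
# union, so all three minhashes coincide exactly when it falls in A ∩ B ∩ C.
hits = 0
for _ in range(trials):
    p = rng.permutation(D)
    hits += minhash(A, p) == minhash(B, p) == minhash(C, p)
print(exact, hits / trials)
```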

Anshumali Shrivastava and Ping Li. Beyond Pairwise: Provably Fast Algorithms for Approximate k-Way Similarity Search. [pdf] [slides]
In Neural Information Processing Systems (NIPS) 2013.


Hash Kernel Features for Large Scale Learning

We show how to convert a locality sensitive hashing scheme into efficient features for linear kernel learning. The large range space of typical hash functions blows up the dimension for linear learning. We propose b-bit minwise hashing and efficient quantizations of random projections, which yield informative hash functions with small range spaces, thereby allowing efficient kernel feature extraction. Experimental results show that such quantizations are very effective and outperform other state-of-the-art hash kernels, including the hashing scheme of VW (Vowpal Wabbit).
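A minimal sketch of the b-bit minwise feature construction, assuming a one-hot encoding of the retained bits (parameter values and the helper's name are illustrative):

```python
import numpy as np

def bbit_minhash_features(nonzeros, D, k=50, b=2, seed=0):
    # Compute k minwise hashes, keep only the lowest b bits of each, and
    # one-hot encode every b-bit value into 2^b slots: a linear kernel on
    # these features then counts b-bit hash collisions.
    rng = np.random.default_rng(seed)
    feat = np.zeros(k * (1 << b))
    for j in range(k):
        perm = rng.permutation(D)
        h = min(int(perm[i]) for i in nonzeros)     # vanilla minhash
        feat[j * (1 << b) + (h & ((1 << b) - 1))] = 1.0
    return feat

fa = bbit_minhash_features(range(0, 40), 512, seed=0)     # resemblance of the
fb = bbit_minhash_features(range(20, 60), 512, seed=0)    # two sets is 1/3
# Per-hash collision rate; roughly R + (1 - R) / 2^b for a large universe.
print(fa @ fb / 50)
```

Keeping only b bits shrinks the range space from D to 2^b per hash, which is what makes the expanded feature vector small enough for linear learners.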

Ping Li, Michael Mitzenmacher and Anshumali Shrivastava. Codings for Random Projections. [pdf]
International Conference on Machine Learning (ICML) 2014.
Ping Li, Anshumali Shrivastava, Joshua Moore and Christian Konig. Hashing Algorithms for Large Scale Learning. [pdf]
In Neural Information Processing Systems (NIPS) 2011.


Graph and Social Network Mining



Social Networks and Chemical Compound Classification

Correlations between the vectors obtained from power iterations of the adjacency matrix have good discriminative properties. We propose replacing a graph with a covariance matrix whose entries are correlations between vectors from a truncated power iteration. Such a covariance matrix can be computed in O(|E|) time, making the proposal scalable to giant graphs. A direct advantage is that we can feed positive semi-definite covariance matrices, rather than combinatorially hard graph structures, to standard machine learning algorithms for classification, clustering, etc. As a result, we can classify the sub-field of a researcher from his/her local (ego) collaboration network structure. The same technique can classify the chemical activity of compounds in an anti-cancer screen. Results show that the new scheme outperforms other state-of-the-art graph classification approaches.
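A small sketch of the embedding, assuming a random starting vector and normalized iterates (the papers specify the exact iteration and truncation; names here are illustrative):

```python
import numpy as np

def graph_covariance(edges, n, steps=4, seed=0):
    # Run a truncated power iteration of the adjacency matrix from a random
    # start; each sparse matrix-vector product costs O(|E|).
    rng = np.random.default_rng(seed)
    x = rng.random(n)
    iterates = [x / np.linalg.norm(x)]
    for _ in range(steps):
        y = np.zeros(n)
        for u, v in edges:                 # undirected edge contributes both ways
            y[u] += x[v]
            y[v] += x[u]
        x = y
        iterates.append(x / np.linalg.norm(x))
    # Correlations between the iterates give a fixed-size p.s.d. summary of
    # the graph that any vector-based learner can consume directly.
    return np.corrcoef(np.array(iterates))

path = [(0, 1), (1, 2), (2, 3), (3, 4)]    # toy graph: a 5-node path
print(graph_covariance(path, 5).shape)
```

The output is a (steps+1) x (steps+1) matrix regardless of graph size, so graphs with different node counts become directly comparable.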

Anshumali Shrivastava and Ping Li. A New Space for Comparing Graphs. [pdf][slides]
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) 2014. Best Paper Award
Anshumali Shrivastava and Ping Li. Graph Kernels via Functional Embedding. [arxiv]
Technical Report 2014.