Alignment Constrained Sampling

Abstract

We present the ALICO (ALIgnment COnstrained) null set generator, a framework for generating randomized versions of an input multiple sequence alignment that preserve some of its crucial features, including its dependence structure. In particular, we show that, on average, ALICO samples approximately preserve the PIDs (percent identities) between every pair of input sequences. At the same time, our examples demonstrate that the average k-mer composition of each sampled sequence closely resembles the k-mer composition of our genomic training data. Notably, ALICO requires only pairwise alignment training data rather than multiple alignment training data.
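To make the PID-preservation property concrete, the following minimal Python sketch computes the percent identity for every pair of rows in an alignment and averages those values over a collection of sampled alignments. It is an illustration only, not the ALICO implementation. The function names (percent_identity, pairwise_pids, mean_pids) are hypothetical, and the PID convention used here (matches divided by the number of columns in which neither sequence has a gap) is an assumption, since several PID definitions are in common use.

```python
from itertools import combinations

GAP = "-"  # assumed gap character


def percent_identity(a, b):
    """PID of two aligned rows, counted over columns where neither row has a gap.

    This denominator is one of several common conventions (an assumption here).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == GAP or y == GAP:
            continue  # skip columns with a gap in either row
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pairwise_pids(alignment):
    """Map each row-index pair (i, j), i < j, to the PID of those two rows."""
    return {
        (i, j): percent_identity(alignment[i], alignment[j])
        for i, j in combinations(range(len(alignment)), 2)
    }


def mean_pids(samples):
    """Average each pair's PID over a list of alignments with the same rows."""
    totals = {}
    for sample in samples:
        for pair, pid in pairwise_pids(sample).items():
            totals[pair] = totals.get(pair, 0.0) + pid
    return {pair: total / len(samples) for pair, total in totals.items()}
```

Under these assumptions, comparing pairwise_pids(input_alignment) with mean_pids(sampled_alignments) pair by pair gives a direct check of how closely a set of randomized alignments preserves the input PIDs on average.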

We demonstrate the utility of ALICO in predicting the correct results returned by the "homology-aware" motif finders PhyloCon, MEME with conservation prior, and PRIORITY-C, as well as by our naive finder GibbsMarkov, applied to the MacIsaac orthologous yeast data. Finally, we show that using p-values derived from ALICO sampling to combine the results of multiple finders often outperforms the best individual finder.

Supplementary for ALICO paper