Using topic models to discover bacterial communities in the human microbiome
Data from sequencing bacterial communities are formalized as contingency tables whose columns correspond to different biological samples (specimens). The row features are a random collection of Amplicon Sequence Variants (ASVs, in the case of 16S rRNA amplicon sequencing) or gene fragments (in the case of metagenomics). In both cases, these entities are defined only after the data are collected, which calls for a nonparametric framework. There are usually more rows (features) than columns (samples), which makes regularization necessary, here through the use of Bayesian priors.
However, classical Dirichlet-multinomial models are insufficient to account for the strong associations (or exclusions) between certain bacteria. Recent hierarchical models such as latent Dirichlet allocation (topic models) provide a more flexible framework that allows mixed membership and is more appropriate for these non-Gaussian data. We will show how these hierarchical topic models can enhance our understanding of both longitudinal dependencies between samples and biological dependencies between taxa, despite differences in sampling depth and in sources of variability.
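To make the mixed-membership idea concrete, here is a minimal sketch of the textbook latent Dirichlet allocation generative model applied to a features-by-samples count table. It is an illustration only, not necessarily the exact model used in the work described here. The function name `simulate_lda_counts`, the hyperparameters `alpha` and `beta`, and all default sizes are assumptions chosen for the example. Each sample mixes several "topics" (bacterial communities), and sampling depth enters only as the multinomial total, so the mixing proportions remain comparable across samples of different depth.

```python
import numpy as np

def simulate_lda_counts(n_samples=6, n_features=20, n_topics=3,
                        depths=None, alpha=0.5, beta=0.5, seed=0):
    """Simulate a features-by-samples count table from a standard LDA model.

    All names and defaults are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    if depths is None:
        # Unequal sampling depths (total reads per sample).
        depths = rng.integers(500, 5000, size=n_samples)
    depths = np.asarray(depths, dtype=int)

    # Each topic (community) is a probability vector over features (e.g. ASVs).
    topics = rng.dirichlet(np.full(n_features, beta), size=n_topics)

    # Each sample has mixed membership: a probability vector over topics.
    memberships = rng.dirichlet(np.full(n_topics, alpha), size=len(depths))

    # Per-sample feature probabilities are mixtures of the topics.
    probs = memberships @ topics
    probs /= probs.sum(axis=1, keepdims=True)  # guard against rounding error

    # Counts: one multinomial draw per sample, with that sample's depth.
    counts = np.stack(
        [rng.multinomial(n, p) for n, p in zip(depths, probs)], axis=1
    )
    return counts, memberships, topics  # counts has shape (features, samples)

counts, memberships, topics = simulate_lda_counts()
print(counts.shape)        # rows are features, columns are samples
print(counts.sum(axis=0))  # column totals equal the sampling depths
```

In practice one would go the other way and estimate `memberships` and `topics` from an observed table. The simulation is only meant to show which quantities the model ties together.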
This includes joint work with Kris Sankaran, Pratheepa Jeganathan, Laura Symul, Ben Callahan, and David Relman.