Testing variability of multinomial datasets with applications to text analysis
Multinomial testing is a classical topic in statistics that has recently attracted interest in the high-dimensional setting. We consider the problem of testing equality of the group means in a multinomial dataset, allowing for heterogeneity in both the frequency vectors and the numbers of trials. To address this task, we propose the DELAC test (debiased and length-adjusted chi-squared test) and show that its test statistic is asymptotically normal and that the test is minimax optimal. We apply our methodology to two real-world datasets, identifying movies with heterogeneous reviews on Amazon and authors with diverse abstracts in top statistics journals.