Room 202, Caldwell

With the rapid development of biotechnology, increasingly complex data sets are being generated to address extremely complex biological problems, and developing new statistical methods to analyze such data is challenging. In this thesis, I propose a nonparametric hypothesis test and two statistical learning methods to solve biological problems arising from epigenomics, metagenomics, and neuroimaging. First, the proposed test assesses the significance of the interaction term in a bivariate smoothing spline ANOVA model. The derived asymptotic distribution of the test statistic unveils a new version of the Wilks phenomenon, and the power of the test is minimax optimal in the sense of Ingster. The performance of the proposed test is demonstrated by discovering differentially methylated regions in a genome-wide DNA methylation study. Second, I propose a statistical learning method that simultaneously identifies microbial species and estimates their abundances without using reference genomes. I show that the proposed method achieves high accuracy on both simulated data and real metagenomic data related to inflammatory bowel disease (IBD), type 2 diabetes (T2D), and obesity. Third, I develop a model-based dictionary learning (MDL) method that provides an effective and flexible framework for different types of data: continuous, discrete, and categorical. It also provides a general framework for modeling data with spatial or temporal correlation. The performance of the MDL method is demonstrated in studying brain connectivity and in learning cell-type-specific expression profiles through spatial transcriptomic imaging.