

Jung Ae Lee

Poultry Science Building, Room 240
PhD Candidate, Statistics

This dissertation consists of two parts on the topic of sample integrity in high dimensional data. The first part focuses on batch effects in gene expression data. Batch bias has been found in many microarray studies that involve multiple batches of samples. Currently available methods for batch effect removal are mainly based on gene-by-gene analysis. There has been relatively little development of multivariate approaches to batch adjustment, mainly because of the analytical difficulty that originates from the high dimensional nature of gene expression data. We propose a multivariate batch adjustment method that effectively eliminates inter-gene batch effects. The proposed method utilizes high dimensional sparse covariance estimation based on a factor model and a hard-thresholding technique. We study theoretical properties of the proposed estimator. Another important aspect of the proposed method is that if there exists an ideally obtained batch, other batches can be adjusted so that they resemble the target batch. We demonstrate the effectiveness of the proposed method with real data as well as a simulation study. Our method is compared with other approaches in terms of both homogeneity of adjusted batches and cross-batch prediction performance.

The second part deals with outlier identification for high dimension, low sample size (HDLSS) data. The outlier detection problem has hardly been addressed in spite of the enormous popularity of high dimensional data analysis. We introduce three types of distances in order to measure the "outlyingness" of each observation relative to the other data points: centroid distance, ridge Mahalanobis distance, and maximal data piling distance. Some asymptotic properties of the distances are studied in relation to the outlier detection problem. Based on these distance measures, we propose an outlier detection method utilizing the parametric bootstrap. The proposed method can also be regarded as an HDLSS version of the quantile-quantile plot. Furthermore, the masking phenomenon, which might be caused by multiple outliers, is discussed in the HDLSS setting.
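
As a rough illustration of the three distance measures named in the abstract, the following Python sketch computes a leave-one-out version of each. The abstract does not give formulas, so the definitions here are assumptions: centroid distance is taken as the Euclidean distance from an observation to the mean of the remaining points, ridge Mahalanobis distance uses the covariance of the remaining points plus a ridge term, and maximal data piling distance is taken as the distance from an observation to the affine subspace spanned by the remaining points. The function name `leave_one_out_distances` and the `ridge` parameter are hypothetical, and the parametric bootstrap step of the proposed method is not covered.

```python
import numpy as np


def leave_one_out_distances(X, ridge=1.0):
    """Leave-one-out outlyingness distances for an (n, d) data matrix X.

    Illustrative definitions only (assumed, not taken from the dissertation).
    Requires n >= 3 so the covariance of the remaining points is defined.
    """
    n, d = X.shape
    centroid = np.empty(n)
    ridge_mahalanobis = np.empty(n)
    data_piling = np.empty(n)

    for j in range(n):
        rest = np.delete(X, j, axis=0)       # all observations except j
        mean_rest = rest.mean(axis=0)
        diff = X[j] - mean_rest
        centered = rest - mean_rest

        # Centroid distance: Euclidean distance to the mean of the others.
        centroid[j] = np.linalg.norm(diff)

        # Ridge Mahalanobis distance: the ridge term keeps the d x d
        # covariance invertible even when d is much larger than n.
        cov = centered.T @ centered / (n - 2)
        solved = np.linalg.solve(cov + ridge * np.eye(d), diff)
        ridge_mahalanobis[j] = np.sqrt(diff @ solved)

        # Maximal data piling distance (assumed form): length of the part
        # of diff orthogonal to the span of the centered remaining points.
        coef, *_ = np.linalg.lstsq(centered.T, diff, rcond=None)
        data_piling[j] = np.linalg.norm(diff - centered.T @ coef)

    return {
        "centroid": centroid,
        "ridge_mahalanobis": ridge_mahalanobis,
        "data_piling": data_piling,
    }


# Synthetic HDLSS example: 20 observations in 100 dimensions,
# with observation 0 shifted away from the rest.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))
X[0] += 10.0
distances = leave_one_out_distances(X)
```

In this synthetic example the shifted observation should stand out under all three measures. Deciding whether a given distance is unusually large would still require a reference distribution, which is the role the abstract assigns to the parametric bootstrap.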

