Abstract book May 8-10, 2018 Abstracts of Invited Talks (Listed in the order of presentation) Session 1: 9:05 am - 10:25 am, May 9 Chair: Michael Akritas (Penn State) Title: Estimation of the Boundary of a Variable Observed with Symmetric Error Speaker: Ingrid Van Keilegom (KU Leuven, Belgium) Abstract: Consider the model $Y = X + \varepsilon$ with $X = \tau + Z$, where $\tau$ is an unknown constant (the boundary of $X$), $Z$ is a random variable defined on $R^+$, $\varepsilon$ is a symmetric error, and $\varepsilon$ and $Z$ are independent. Based on an iid sample of $Y$ we aim at identifying and estimating the boundary $\tau$ when the law of $\varepsilon$ is unknown (apart from symmetry) and in particular its variance is unknown. We propose an estimation procedure based on a minimum distance approach that makes use of Laguerre polynomials. Asymptotic results as well as finite sample simulations are shown. The paper also proposes an extension to stochastic frontier analysis, where the model is conditional on observed variables. The model becomes $Y = \tau(w_1,w_2) + Z + \varepsilon$, where $Y$ is a cost, $w_1$ are the observed outputs and $w_2$ represents the observed values of other conditioning variables, and $Z$ is the cost inefficiency. Some simulations illustrate again how the approach works in finite samples. Title: Wild residual bootstrap inference for penalized quantile regression with heteroscedastic errors Speaker: Lan Wang (University of Minnesota) Abstract: We consider a heteroscedastic regression model in which some of the regression coefficients are zero but it is not known which ones. Penalized quantile regression is a useful approach for analyzing such heterogeneous data. By allowing different covariates to be relevant for modeling conditional quantile functions at different quantile levels, it permits a more realistic sparsity assumption and provides a more complete picture of the conditional distribution of a response variable. Existing work on penalized quantile regression has mostly focused on point estimation; estimating the standard error remains challenging. Although bootstrap procedures have recently been demonstrated effective for making inference for penalized mean regression, they are not directly applicable to penalized quantile regression with heteroscedastic errors. We prove that a wild residual bootstrap procedure recently proposed by Feng et al. (2011) for unpenalized quantile regression is asymptotically valid for approximating the distribution of a penalized quantile regression estimator with an adaptive L1 penalty; and that a modified version of this wild residual bootstrap procedure can be used to approximate the distribution of the L1-penalized quantile regression estimator. We establish the bootstrap consistency theory, demonstrate appropriate finite sample performance through a simulation study, and illustrate its application using an ozone effects data set. The new methods do not need to estimate the unknown error density function. (Joint work with Ingrid van Keilegom and Adam Maidman) Title: A two-sample test of equality of means in high dimensional data Speaker: Haiyan Wang (Kansas State University) Abstract: This research addresses testing the equality of two sample means in high dimensional data, in which the sample sizes may be much less than the dimension.
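To make the difficulty concrete: the classical Hotelling's $T^2$ statistic requires inverting the pooled sample covariance matrix, which is singular whenever the combined sample size is smaller than the dimension. The short sketch below is illustrative only (made-up sample sizes, not the test proposed in this talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 20, 25, 100          # hypothetical sizes with p much larger than n1 + n2
X = rng.normal(size=(n1, p))     # sample 1
Y = rng.normal(size=(n2, p))     # sample 2

# Pooled sample covariance matrix used by Hotelling's T^2
S = ((n1 - 1) * np.cov(X, rowvar=False) +
     (n2 - 1) * np.cov(Y, rowvar=False)) / (n1 + n2 - 2)

# Its rank is at most n1 + n2 - 2 = 43 < p, so S is singular and T^2 is undefined
print(np.linalg.matrix_rank(S))  # 43, far below p = 100
print(np.linalg.cond(S))         # enormous: S cannot be stably inverted
```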
Improvement can still be achieved despite significant effort in the recent literature that modifies Hotelling's $T^2$ statistic by either bypassing the estimation of high dimensional covariance matrices (cf. Chen & Qin 2010 Annals of Stat., Srivastava et al. 2013 JMVA, Gregory et al. 2015 JASA) or estimating the precision matrix after imposing a sparseness condition (Cai et al. 2014 JRSSB). Here we present a new test statistic that is particularly powerful when the correlation between components of the data vector decreases as the separation of the component indices increases. The limiting distribution of the test statistic and the power of the test are studied. Simulation results will be presented to show the numerical performance of the test and to compare with other tests in the literature. Title: Bivariate Tests for Location Based on Data Depth Speaker: Thomas Hettmansperger (Penn State University) Abstract: Starting with the ideas contained in the data depth paper by Regina Liu (AOS, 1990), we develop the simplicial data depth concept into a bivariate test of location that is distribution free under the assumption of angular symmetry. The test statistic counts the number of data triangles that contain the null hypothesized value. A straightforward method for computing this statistic is provided. The exact null distribution is discussed, as well as the asymptotic null distribution along with a formula for an approximate critical value. The method is illustrated on a data set. Session 2: 10:40 am - 12:00 noon, May 9 Chair: Bing Li (Penn State) Title: Multilayer tensor factorization with applications to recommender systems Speaker: Annie Qu (University of Illinois at Urbana-Champaign) Abstract: Recommender systems have been widely adopted by electronic commerce and entertainment industries for individualized prediction and recommendation, which benefit consumers and improve business intelligence. In this article, we propose an innovative method, namely the recommendation engine of multilayers (REM), for tensor recommender systems. The proposed method utilizes the structure of a tensor response to integrate information from multiple modes, and creates an additional layer of nested latent factors to accommodate between-subjects dependency. One major advantage is that the proposed method is able to address the ``cold-start'' issue in the absence of information from new customers, new products or new contexts. Specifically, it provides more effective recommendations through sub-group information. To achieve scalable computation, we develop a new algorithm for the proposed method, which incorporates a maximum block improvement strategy into the cyclic block-wise coordinate descent algorithm. In theory, we investigate both algorithmic properties for global and local convergence, along with the asymptotic consistency of estimated parameters. Finally, the proposed method is applied to simulations and to IRI marketing data with 116 million observations of product sales. Numerical studies demonstrate that the proposed method outperforms existing competitors in the literature. This is joint work with Xuan Bi and Xiaotong Shen. Title: On asymmetric dependence in ordinal categorical data: a subcopula-based regression approach Speaker: Daeyoung Kim (University of Massachusetts) Abstract: For the analysis of ordinal contingency tables, a new asymmetric association measure is developed.
The proposed method uses a nonparametric and model-free approach, the bilinear extension copula-based regression between ordinal categorical variables, to measure the asymmetric predictive powers of the variables of interest. Unlike the existing measures of asymmetric association, the proposed bilinear extension copula-based measure is able to capture nonlinear pattern and the magnitude of the proposed measure can be interpreted as the degree of asymmetric association in the ordinal contingency table. The theoretical properties of the proposed asymmetric association measure are investigated. We illustrate the performance and advantages of the proposed measure using simulation studies and real data examples. Title : Clustering Data on the Sphere: State of the art and a Poisson kernel-based Algorithm Speaker: Markatou, Marianthi (University at Buffalo) Abstract: Model-based clustering of directional data has been widely used by many authors using mixtures of different directional distributions such as von Mises-Fisher, inverse stereographic projections of multivariate normal and Watson distributions. We discuss a clustering method based on mixtures of Poisson kernel based distributions on the sphere, derive the estimates of the parameters and describe the corresponding clustering algorithm. We discuss convergence of the algorithm and study the role of initialization on its performance, where performance is measured by ARI, macro-precision and macro-recall. A comparison study shows that Poisson kernel based clustering performs, in many cases, superior to the state of the art, mixture of von Mises-Fisher distribution. We describe an algorithm for generating high-dimensional directional data from the Poisson kernel based distribution for a simulation-based comparison, and propose a new method for estimating the number of clusters in this setting. Title: Normalization of transcript degradation improves accuracy in RNA-seq analysis Speaker: Jiping Wang (Northwestern University) 4 Abstract: RNA-sequencing (RNA-seq) is a powerful high-throughput tool to profile transcriptional activities in cells. The observed read counts can be biased by various factors such that they do not accurately represent the true relative abundance of mRNA transcript abundance. Normalization is a critical step to ensure unbiased comparison of gene expression between samples or conditions. Here we show that the gene-specific heterogeneity of transcript degradation pattern across samples presents a common and major source of unwanted variation, and it may substantially bias the results in gene expression analysis. Most existing normalization approaches focused on global adjustment of systematic bias are ineffective to correct for this bias. We propose a novel method based on matrix factorization over-approximation that allows quantification of RNA degradation of each gene within each sample. The estimated degradation index scores are used to build a pipeline named DeGNorm (stands for degradation normalization) to adjust read count for RNA degradation heterogeneity on a gene-by-gene basis while simultaneously controlling sequencing depth. The robust and effective performance of this method is demonstrated in an extensive set of real RNA-seq data and simulated data. 
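As a toy illustration of the kind of gene-by-gene adjustment such a normalization pipeline performs (the variable names and the simple scaling rule below are invented for exposition; the actual DegNorm procedure estimates degradation via a matrix over-approximation and is considerably more involved):

```python
import numpy as np

# counts: genes x samples matrix of raw read counts (toy data)
counts = np.array([[100., 80.], [500., 300.], [50., 90.]])
# di: hypothetical gene-by-sample degradation index scores in [0, 1),
# where larger values mean a larger fraction of the transcript signal was lost
di = np.array([[0.1, 0.4], [0.0, 0.3], [0.2, 0.1]])

# inflate each count by its estimated degradation loss ...
degr_adj = counts / (1.0 - di)
# ... then rescale each sample to a common sequencing depth
size_factors = degr_adj.sum(axis=0) / degr_adj.sum(axis=0).mean()
normalized = degr_adj / size_factors

print(normalized.round(1))
```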
Session 3: 9:00 am - 10:20 am, May 10 Chair: Bharath Sriperumbudur (Penn State) Title: New computational methods for nonparametric quantile regression and its related models Speaker: Bo Kai (College of Charleston) Abstract: Quantile regression aims at estimating the conditional quantiles of the response variable. Compared to least squares regression, quantile regression provides a more comprehensive picture of the relationship between the response and its covariates. The optimization for quantile regression is challenging because the objective function is nondifferentiable. In this work, we focus on the optimization problems in nonparametric quantile regression and its related models. Existing algorithms may yield estimates that are not very smooth or stable. To address these issues, we propose a new class of algorithms which produce smoother and stabler estimates in nonparametric quantile regression models. The finite sample performance of the proposed algorithms is investigated in several numerical studies. Title : Hyperrectangular Tolerance and Prediction Regions for Setting Multivariate Reference Regions in Laboratory Medicine Speaker: Derek Young (University of Kentucky) Abstract: Reference regions are widely used in clinical chemistry and laboratory medicine to interpret the results of biochemical or physiological tests of patients. There are well-established methods in the literature for reference limits for univariate measurements, however, only limited methods are available for the construction of multivariate reference regions. This is because 5 traditional multivariate statistical regions (e.g., confidence, prediction, and tolerance regions) are not constructed based on a hyperrectangular geometry. We address this problem by developing multivariate hyperrectangular nonparametric tolerance regions for setting the reference regions. Our approach utilizes statistical data depth to determine which points to trim and then the extremes of the trimmed dataset are used as the faces of the hyperrectangular region. Extensive coverage results show the favorable performance of our algorithm provided a minimum sample size criterion is met. Our procedure is used to obtain reference regions for insulin-like growth factor concentrations in the serum of healthy adults. Title : Temporal Exponential-Family Random Graph Models with Time-Evolving Latent Block Structure for Dynamic Networks Speaker: Kevin Lee (Western Michigan University) Abstract: Model-based clustering of dynamic networks has emerged as an important research topic in statistical network analysis. It is critical to effectively and efficiently model the timeevolving latent block structure of dynamic networks in practice. However, the focus of most existing methods is on the static or temporally invariant block structure. We present a principled statistical clustering of large-scale dynamic networks through the temporal exponential-family random graph models with a hidden Markov structure. The hidden Markov structure is used to effectively infer the time-evolving block structure of dynamic networks. We prove the identification conditions for both network parameters and transition matrix in our proposed model-based clustering. We propose an effective model selection criterion based on the integrated classification likelihood to choosing an appropriate number of clusters. We develop a scalable variational expectation-maximization algorithm to efficiently solve the approximate maximum likelihood estimate. 
The numerical performance of our proposed method is demonstrated in simulation studies and two real data applications to dynamic international trade networks and dynamic email networks of a large institute. Title: Causal Inference via Balancing Covariates Speaker: Yeying Zhu (University of Waterloo, Canada) Abstract: An important goal in estimating the causal effect is to achieve balance in the covariates. We propose using kernel distance to measure balance across different treatment groups and propose a new propensity score estimator by setting the kernel distance to be zero. Compared to other balance measures, such as absolute standardized mean difference (ASMD) and Kolmogorov Smirnov (KS) statistic, Kernel distance is one of the best bias indicators in estimating the causal effect. The estimating equations are solved by generalized method of moments. Simulation studies are conducted across different scenarios varying in the degree of nonlinearity in both the propensity score model and the outcome model. The proposed approach outperforms many existing approaches including the well-known covariate balance propensity score (CBPS) approach when the propensity score model is mis-specified. An application to data from International Tobacco Control (ITC) policy evaluation project is provided. 6 Session 4: 10:35 am-11:55 am, May 10 Chair: Le Bao (Penn State) Title : Transformed Variable Selection in Sufficient Dimension Reduction Speaker: Yuexiao Dong (Temple University) Abstract: In this paper, we combine variable transformation with sufficient dimension reduction to achieve model-free variable selection. Existing model-free variable selection methods via sufficient dimension reduction requires a critical assumption that the predictor distribution is elliptically contoured. We suggest a nonparametric variable transformation method after which the predictors become normal. Variable selection is then performed based on the marginally transformed predictors. Asymptotic theory is established to support the proposed method. The desirable variable selection performance of the proposed method is demonstrated through simulation studies and a real data analysis. Title : Central Quantile Subspace Speaker: Eliana Christou (University of North Carolina at Charlotte) Abstract: Existing dimension reduction techniques focus on the conditional distribution of the response given the covariates, where specific interest focuses on statistical functionals of the distribution, such as the conditional mean, conditional variance and conditional quantile. We introduce a new method for inferring about the conditional quantile of the response given the covariates and we introduce the notion of the Central Quantile Subspace (CQS). The purpose of this paper is threefold. First, we focus on cases where the tau-th conditional quantile, for tau in (0,1), depends on the predictor X through a single linear combination B'X and we show that we can estimate B consistently up to a multiplicative scalar, even though the estimate might be based on a misspecified link function. Second, we extend the result to tau-th conditional quantiles that depend on X through a d - dimensional linear combination B'X, where B is a p x d matrix, d>1, and propose an iterative procedure to produce more vectors in the tau-th CQS, which are shown to be root n consistent. 
Third, we extend our proposed methodology by considering any statistical functional of the conditional distribution and estimating the fewest linear combinations of X that contain all the information on that functional. Title: A joint learning of multiple precision matrices with sign consistency Speaker: Yuan Huang (University of Iowa) Abstract: The Gaussian graphical model is a popular tool for inferring the relationships among random variables, where the precision matrix has a natural interpretation of conditional independence. With high-dimensional data, sparsity of the precision matrix is often assumed, and various regularization methods have been applied for estimation. Under quite a few important scenarios, it is desirable to conduct the joint estimation of multiple precision matrices. In joint estimation, entries corresponding to the same element of multiple precision matrices form a group, and group regularization methods have been applied for estimation and identification of the sparsity structures. For many practical examples, it can be difficult to interpret the results when parameters within the same group have conflicting signs. Unfortunately, the existing methods lack an explicit mechanism concerning the sign consistency of group parameters. To tackle this problem, we develop a novel regularization method for the joint estimation of multiple precision matrices. It effectively promotes the sign consistency of group parameters and hence can lead to more interpretable results, while still allowing for conflicting signs to achieve full flexibility. Its consistency properties are rigorously established. Simulation shows that the proposed method outperforms the competing alternatives under a variety of settings. Title: Decoding the Perception of Music Genres with High-resolution fMRI Data Speaker: Han Hao (University of North Texas) Abstract: Recent studies have demonstrated a close relationship between computational acoustic features and neural brain activities, and have largely advanced our understanding of auditory information processing in the human brain. However, differentiating music genres requires mental processing that is sensitive to specific auditory and schematic information, the precise features of which, as well as their cortical organization, are yet to be properly understood. We developed a multivariate clustering approach for fMRI data based on stimulus encoding models. Analysis of fMRI data from music-listening tasks yielded significant clusters and revealed geometric patterns of music cognition. Session 5: 2:00 pm - 3:20 pm, May 10 Chair: Aleksandra Slavkovic (Penn State) Title: Improving Small-Area Estimates of Disability: Combining the American Community Survey with the Survey of Income and Program Participation Speaker: Jerry Maples (U.S. Census Bureau) Abstract: The Survey of Income and Program Participation (SIPP) is designed to make national level estimates of changes in income, eligibility for and participation in transfer programs, household and family composition, labor force behavior, and other associated events. Used cross-sectionally, the SIPP is the source for commonly accepted estimates of disability prevalence, having been cited in the findings clause of the Americans with Disabilities Act. Because of its sample size, SIPP is not designed to produce highly reliable estimates for individual states.
The American Community Survey (ACS) is a large sample survey designed to support estimates of characteristics at the state and county level; however, the questions about disability in the ACS are not as comprehensive and detailed as in SIPP. We propose combining the information from the SIPP and ACS surveys to improve, i.e., lower the variances of, state estimates of disability (as defined by SIPP). Speaker: Yanling Zuo (Minitab Inc.) Title: Helping Users to Make High-Quality Products -- My Wonderful Professional Life at Minitab Inc Abstract: My presentation contains the following aspects: 1) Provide background information on Minitab Inc., Six Sigma methodology, and myself; 2) Use a sample project on developing stability study commands for estimating the shelf-life of a drug for the pharmaceutical industry to illustrate my work at Minitab; 3) By reflecting on my work experience, promote strong collaborations among statisticians, computer scientists, and subject-knowledge experts to solve concrete problems using big data. I sincerely hope this type of collaboration can be built into the PSU statistics department's undergraduate and graduate education programs. Title: Use of Advanced Statistical Techniques in Banking Speakers: Xiaoyu Liu & Hanyu Yang (Wells Fargo) Abstract: There is a great deal of interest in the use of advanced statistical methodologies in the finance industry. In the presentation, we will give an overview of the statistical approaches we developed for predictive modeling in credit and operational risk management, including varying-coefficient loss forecast models and fractional response modeling techniques. We will also introduce the use of machine learning techniques such as ensemble algorithms and neural networks in loss prediction, as well as diagnostic and interpretability tools for opening up the "black box" of machine learning techniques. If time permits, we will describe the quantitative communities at Wells Fargo as well as employment opportunities. Title: A novel method for estimating the causal effects of latent classes in complex survey data Speaker: Joseph Kang (U.S. Centers for Disease Control and Prevention) Abstract: In the behavioral sciences literature, latent class analysis (LCA) has been used to effectively cluster multiple survey items. Statistical inference with an exposure variable, which is identified by the LCA model, is challenging because 1) the exposure variable is essentially missing and harbors the uncertainty of estimating parameters in the LCA model and 2) confounding bias adjustments need relevant propensity score models. In addition to these challenges, complex survey design features and survey weights will have to be accounted for if they are present. Our solutions to these issues are to 1) assess point estimates with the design-based estimating function approach described in Binder (1983) and 2) obtain variance estimates with the jackknife technique. Using the NHANES data set, our LCA model identified a latent class for men who have sex with men (MSM) and built new propensity score weights that adjusted the prevalence rates of Herpes Simplex Virus Type 2 (HSV-2) for MSM. Abstracts of Posters (Ordered by first name) Ann Johnston (Penn State) An Algebraic Approach to Categorical Data Fusion for Population Size Estimation Information from two or more administrative registries can be combined to estimate (via capture-recapture) the size of a population. As a preliminary step, this requires assignment of between-registry record linkage.
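For readers unfamiliar with capture-recapture, the simplest two-registry version is the Lincoln-Petersen estimator: if registry A records n_A individuals, registry B records n_B, and record linkage identifies m individuals appearing in both, the population size is estimated by N = n_A * n_B / m. A minimal sketch with made-up counts:

```python
def lincoln_petersen(n_a: int, n_b: int, m: int) -> float:
    """Classical two-registry capture-recapture estimate of population size.

    n_a, n_b: counts recorded by registries A and B
    m: number of individuals linked to both registries
    """
    if m == 0:
        raise ValueError("no overlap between registries; estimate undefined")
    return n_a * n_b / m

# Hypothetical registries: 480 and 520 records, 130 linked to both
print(lincoln_petersen(480, 520, 130))   # 1920.0
```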
Often, errors are present in the fused data, with the record linkage assignment process that was used suggesting an error mechanism (e.g., ghost record creation, failure to match, band reading error). Recent work has used a latent multinomial model (in a Bayesian framework) to incorporate record linkage error mechanisms into the population size estimation process. Given observed capture histories, the associated fiber of all possible latent capture histories can be explored via a block Metropolis-Hastings algorithm, with chain irreducibility guaranteed by proposing moves from a Markov basis appropriate to the error mechanism. Existing work has assumed independence between registries. We relax this assumption, studying the fiber of latent capture histories under minimal odds ratio assumptions. Further, we use algebraic ideas to extend this approach to the setting of registries equipped with covariates, where the problem becomes the more general problem of data fusion. Ardalan Mirshani (Penn State) Adaptive Function-on-Scalar Smoothing Elastic Net We propose a new methodology, called AFSSEN, to simultaneously select important variables and produce smooth estimates of their parameters in a function-on-scalar linear model with sub-Gaussian errors and high-dimensional predictors. In our model, the data live in a general real separable Hilbert space, H, but an arbitrary linear operator of the parameters is enforced to lie in an RKHS, K, so that the parameter estimates inherit properties from the RKHS kernel, such as smoothness and periodicity. We use a regularization method that exploits an adaptive Elastic Net penalty where the L1 and L2 norms are introduced in H and K, respectively. Using two norms, we are able to better control both smoothing and variable selection. AFSSEN is illustrated via a simulation study and microbiome data using a very fast algorithm for computing the estimators based on a functional coordinate descent whose interface is written in R, but with the backend written in C++. Arun Srinivasan (Penn State) Compositional Knockoff Filter for FDR Control in Microbiome Regression Analysis A critical task in microbiome analysis is to identify microbial taxa that are associated with a response of interest. Most existing statistical methods examine the association between the response and one microbiome feature at a time, followed by multiple testing adjustment such as false discovery rate (FDR) control. Despite their feasibility, these methods are often underpowered due to some unique characteristics of microbiome data, such as high dimensionality, the compositional constraint, and complex correlation structure. In this paper, we adapt the use of the knockoff filter to provide strong finite sample false discovery rate control in the context of linear log-contrast models for regression analysis of microbiome compositional data. As an alternative to applying multiple testing correction to a large number of individual p-values, our framework achieves FDR control in a regression model that jointly analyzes the whole microbiome community. By imposing an l1-regularization in the regression model, a subset of bugs is selected as related to the response under a pre-specified FDR threshold. The proposed method is demonstrated via simulation studies and its usefulness is illustrated by an application to a microbiome study relating microbiome composition to host gene expression.
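To convey the mechanics of a knockoff filter in a regression setting, here is a generic sketch (not the compositional construction proposed in this poster: building valid knockoff copies and handling the log-contrast constraint are the hard parts and are assumed away below):

```python
import numpy as np
from sklearn.linear_model import Lasso

def knockoff_select(X, X_knockoff, y, q=0.1, alpha=0.05):
    """Generic knockoff filter: fit the lasso on [X, X~], compare each
    feature's coefficient magnitude with that of its knockoff copy, and
    threshold the resulting statistics to control FDR at level q."""
    n, p = X.shape
    fit = Lasso(alpha=alpha).fit(np.hstack([X, X_knockoff]), y)
    beta, beta_k = fit.coef_[:p], fit.coef_[p:]
    W = np.abs(beta) - np.abs(beta_k)            # knockoff statistics

    # knockoff+ threshold: smallest t with estimated FDP <= q
    for t in np.sort(np.abs(W[W != 0])):
        fdp = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp <= q:
            return np.where(W >= t)[0]           # indices of selected features
    return np.array([], dtype=int)               # nothing survives the threshold
```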
Beomseok Seo (Penn State) Computing Mean Partition and Assessing Uncertainty for Clustering Analysis In scientific data analysis, clusters identified computationally often substantiate existing or motivate new hypotheses. Due to the combinatorial nature of the clustering result, which is a partition rather than a set of parameters or a function, the notions of mean and variance are not clear-cut. This intrinsic difficulty hinders the development of methods to improve clustering by aggregation or to assess the uncertainty of clusters generated. We overcome the barrier by aligning clusters via soft matching solved by optimal transport. Equipped with this technique, we propose a new algorithm to enhance clustering by any baseline method using bootstrap samples. In addition, the cluster alignment enables us to quantify variation in the clustering result at the levels of both overall partitions and individual clusters. Topological relationships between clusters such as match, split, and merge can be revealed. A confidence point set for each cluster, a concept akin to the confidence interval, is proposed. The tools we have developed here will help address the crucial question of whether any cluster is an intrinsic or spurious pattern. Experimental results on both simulated and real data sets are provided. Changcheng Li (Penn State) Optimal Projection Tests in High Dimensionality Hypothesis testing, such as one-sample and two-sample mean testing, is of great importance in multivariate statistics. When the dimensionality of a population is high, classical methods such as the likelihood ratio test and Hotelling's $T^2$ test become infeasible due to the singularity of the sample covariance matrix. In this article, we propose a framework called the $U$-projection test to deal with hypothesis testing of high-dimensional multivariate linear regression coefficients. We first discuss projection tests for the hypothesis testing of linear regression coefficients and obtain results about the optimal projection direction. The $U$-projection test utilizes the information provided by the covariance in the construction of the test in an optimal way, while it avoids the power loss caused by sample splitting in the sample-splitting projection test. It gives us flexibility in the estimation of the projection direction, and different estimation schemes for projection directions can lead to different tests. In fact, the framework of the $U$-projection test gives us a way to extend various existing test statistics from mean testing problems to general linear regression coefficient testing problems. This flexibility makes the $U$-projection test applicable to various alternative hypotheses (whether they are sparse or not) in various covariance structure situations. Various properties of the $U$-projection test and its connection to some existing tests are studied, and numerical studies are also carried out. We show that the proposed $U$-projection test is asymptotically equivalent to some existing tests in low-correlation cases and is superior in some high-correlation ones. Christian Schmid (Penn State) Exponential random graph models with big networks: Maximum pseudolikelihood estimation and the parametric bootstrap With the growth of interest in network data across fields, the Exponential Random Graph Model (ERGM) has emerged as the leading approach to the statistical analysis of network data. ERGM parameter estimation requires the approximation of an intractable normalizing constant.
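For context, an ERGM places probability on an observed network y through a vector of sufficient statistics g(y), and the normalizing constant sums over every possible network on the same nodes:

```latex
P_\theta(Y = y) \;=\; \frac{\exp\{\theta^\top g(y)\}}{\kappa(\theta)},
\qquad
\kappa(\theta) \;=\; \sum_{y' \in \mathcal{Y}} \exp\{\theta^\top g(y')\}.
```

For an undirected network on $n$ nodes the sum runs over $2^{n(n-1)/2}$ graphs, which is what makes direct evaluation of $\kappa(\theta)$ infeasible.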
Simulation methods represent the state-of-the-art approach to approximating the normalizing constant, leading to estimation by Monte Carlo maximum likelihood (MCMLE). MCMLE is accurate when a large sample of networks is used to approximate the normalizing constant. However, MCMLE is computationally expensive, and may be prohibitively so if the size of the network is on the order of 1,000 nodes (i.e., one million potential ties) or greater. When the network is large, one option is maximum pseudolikelihood estimation (MPLE). The standard MPLE is simple and fast, but generally underestimates standard errors. We show that a resampling method-the parametric bootstrap-results in accurate coverage probabilities for confidence intervals. We find that bootstrapped MPLE can be run in 1/5th the time of MCMLE. We study the relative performance of MCMLE and MPLE with simulation studies, and illustrate the two different approaches by applying them to a network of bills introduced in the United State Senate. Claire Kelling (Penn State) Combining Geographic and Social Proximity to Model Urban Domestic and Sexual Violence In order to understand the dynamics of crime in urban areas, it is important to investigate the socio-demographic attributes of the communities as well as the interactions between neighborhoods. If there are strong social ties between two neighborhoods, they may be more likely to transfer ideas, customs, and behaviors between them. This implies that not only crime itself but also crime prevention and interventions could be transferred along these social ties. Most studies on crime rate inference use spatial statistical models such as spatially weighted regression to take into account spatial correlation between neighborhoods. However, in order to obtain a more flexible model for how crime may be related across communities, one must take into account social proximity in addition to geographic proximity. In this paper, we develop techniques to combine geographic and social proximity in spatial generalized linear mixed models in order to estimate domestic and sexual violence in Detroit, Michigan and Arlington 12 County, Virginia. The analysis relies on combining data from local and federal data sources such as the Police Data Initiative and American Community Survey. By comparing three types of CAR models, we conclude that adding information on social proximity to spatial models, we create more accurate estimation of crime in communities. Debmalya Nandy (Penn State) Covariate Information Number for Feature Screening in Ultrahigh Dimension Modern technological advances in various scientific fields generate ultrahigh-dimensional supervised data with sparse signals, i.e. a limited number of samples (n) each with a very large number of covariates (p >> n), only a small share of which is truly associated with the response. In these settings, major concerns on computational burden, algorithmic stability, and statistical accuracy call for substantially reducing the feature space by eliminating redundant covariates before the application of any sophisticated statistical analysis. Following the development of Sure Independence Screening (Fan and Lv, 2008, JRSS-B) and other model- and correlationbased feature screening methods, we propose a model-free procedure called the Covariate Information Screening (CIS). CIS uses a marginal utility built upon Fisher Information, possesses the sure screening property, and is applicable to any type of response. 
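The screening mechanics common to this family of methods can be sketched as follows; the marginal utility is left as a pluggable function because the specific Fisher-information-based utility that defines CIS is not reproduced here:

```python
import numpy as np

def screen(X, y, utility, keep=None):
    """Rank covariates by a marginal utility and retain the top ones.

    X: (n, p) covariate matrix with p >> n; y: response of any type
    utility: callable (x_j, y) -> nonnegative score for one covariate
    keep: number of covariates to retain (default n / log(n), a common choice)
    """
    n, p = X.shape
    keep = keep or int(n / np.log(n))
    scores = np.array([utility(X[:, j], y) for j in range(p)])
    return np.argsort(scores)[::-1][:keep]

# Example utility: absolute marginal correlation (this particular choice
# recovers SIS-style screening, not CIS itself)
abs_corr = lambda xj, y: abs(np.corrcoef(xj, y)[0, 1])
```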
An extensive simulation study and an application to transcriptomic data in rats reveal CIS's comparative strengths over some popular feature screening methods. Elena Hadjicosta (Penn State) Consistent Goodness-of-Fit Tests for Gamma Distributions Based on Empirical Hankel Transforms In recent years, integral transforms have been used widely in statistical inference, especially in goodness-of-fit testing. Gamma distributions, with known shape parameters, are frequently used as models in many research areas, such as queueing theory, neuroscience, reliability theory and life testing. In this talk, we apply Hankel transform theory to propose an integral-type test statistic for goodness-of-fit tests for gamma distributions with known shape parameters. We prove that the null asymptotic distribution of the test statistic is a weighted sum of independent chi-square distributed random variables. Further, we show that the proposed test is consistent against each fixed alternative distribution, and we derive the non-null asymptotic distribution of the test statistic under a sequence of contiguous alternatives. Ge Zhao (Penn State) Mean residual life modeling and estimation with censoring and many covariates: An application in kidney transplant for renal failure patients We propose a flexible and general mean residual life model to predict an individual's residual life given his/her covariates. The prediction is based on an efficient semiparametric estimation of the covariate effect and a nonparametric estimation of the mean residual life function. This allows us to quantify the benefit that a renal failure patient would obtain from a potential kidney transplant by comparing the difference between the expected residual lives of the patient if s/he receives the transplant and if s/he does not. It is a rational decision to allocate a kidney to the patient who would have the largest residual life increment among all those that are eligible for the transplant. Our analysis of the kidney transplant data from the U.S. Scientific Registry of Transplant Recipients indicates that the most important factor in making such a decision is the waiting time for the transplant. We provide a clear formula that can be used to predict the potential gain of a patient given his/her covariate information and his/her waiting time. Generally speaking, a patient who has waited a shorter time for a kidney transplant has a larger potential gain. We also identified an index which serves as an important predictor of a patient's gain from receiving a kidney transplant if the waiting time is approximately between 1.5 and 3 years. Our general modeling and analysis strategies can be adopted to study other organ transplant problems. Hanyu Yang & Xiaoyu Liu (Wells Fargo Bank) Statistical Applications in Bank Risk Management Statistical methodologies are extensively used in credit risk and operational risk management in the banking industry. In this presentation, we briefly describe statistical methodologies developed for risk modeling at Wells Fargo. The applications will include varying-coefficient loss forecast models, machine learning algorithms for benchmarking, feature engineering and interpretation tools. The presentation is joint work of Xiaoyu Liu and Hanyu Yang.
Hyun Bin Kang (Penn State) A Functional Approach to Manifold Data Analysis with an Application to High-Resolution 3D Facial Imaging Many scientific areas are faced with the challenge of extracting information from large, complex, and highly structured data sets. A great deal of modern statistical work focuses on developing tools for handling such data. We present a new subfield of functional data analysis, FDA, which we call Functional Manifold Data Analysis, or FMDA. FMDA is concerned with the statistical analysis of samples where one or more variables measured on each unit is a manifold, thus resulting in as many manifolds as we have units. We propose a framework that converts manifolds into functional objects, a 2-step functional principal component method, and a manifold-on-scalar regression model. This work is motivated by and thus described with an anthropological application involving 3D facial imaging data. The proposed framework is used to understand how individual characteristics, such as age and genetic ancestry, influence the shape of the human face. Jiawei Wen (Penn State) Parallel multiblock ADMM for large scale optimization problems In recent years, the alternating direction method of multipliers (ADMM) has drawn considerable attention due to its applicability to massive optimization problems. For distributed learning problems, the multiblock ADMM and its variations have been rigorously studied in the literature. The idea is to partition the original problem into subproblems, each containing a subset of training samples or the learning parameters. At each iteration, the worker processors solve the subproblems and send the up-to-date variables to the master, who summarizes and broadcasts the 14 results to the workers. Hence, a given large-scale learning problem can be solved in a parallel and distributed way. In this paper, we apply the multiblock ADMM algorithm to parameter estimation in ultrahigh dimensional problems. The algorithm is based on a convergent 3-block ADMM algorithm. We restrict our attention to four important statistical problems: Dantzig selector, l1 penalized quantile regression, sparse linear discriminant analysis, and l1 norm support vector machine. A number of numerical experiments are performed to demonstrate the high efficiency and accuracy of the proposed method in high dimensions. Jordan Awan (Penn State) Structure and Sensitivity in Differential Privacy: Comparing K-Norm Mechanisms Limiting the disclosure risk of statistical analyses is a long-standing problem, and as more data are collected and shared, concerns about confidentiality arise and accumulate. Differential Privacy (DP) is a rigorous framework of quantifying the disclosure risk of statistical procedures computed on sensitive data. DP methods/mechanisms require the introduction of additional randomness beyond sampling in order to limit the disclosure risk. However, implementations often introduce excessive noise, reducing the utility and validity of statistical results, especially in finite samples. We study the class of K-Norm Mechanisms with the goal of releasing a statistic T with minimum finite sample variance. The comparison of these mechanisms is naturally related to the geometric structure of T. We introduce the adjacent output space S_T of T, which allows for the formal comparison of K-Norm Mechanisms, and the derivation of the uniformly-minimum-variance mechanism as a function of S_T. 
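As a point of reference, the most familiar member of the K-norm family is the Laplace mechanism, which privatizes a statistic by adding noise scaled to its L1 sensitivity; the sketch below privatizes the mean of values known to lie in [0, 1] (a standard textbook example, not the mechanisms developed in this poster):

```python
import numpy as np

def private_mean(x, epsilon, lower=0.0, upper=1.0, rng=None):
    """Release a differentially private mean via the Laplace mechanism.

    For n values clipped to [lower, upper], changing one record moves the
    mean by at most (upper - lower) / n, which is the L1 sensitivity.
    """
    rng = rng or np.random.default_rng()
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(x)
    noise = rng.laplace(scale=sensitivity / epsilon)
    return x.mean() + noise

print(private_mean([0.2, 0.9, 0.4, 0.7], epsilon=1.0))
```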
Using our methods, we extend the Objective Perturbation and the Functional Mechanisms, and apply them to Logistic and Linear Regression, allowing for private releases of statistical results. We compare the performance through simulations, and on a housing price dataset, demonstrating that our proposed methodology offers a substantial improvement in utility for the same level of risk. Joshua Snoke (Penn State) Differentially Private Synthetic Data with Maximal Distributional Similarity This paper concerns methods for the release of Differentially Private synthetic data sets. In many cases, data contains sensitive values which cannot be released in their original form in order to protect individuals' privacy. Synthetic data is a traditional protection method that releases alternative values in place of the original ones, and Differential Privacy is a formal guarantee for quantifying how disclosive any release may be. Our method maximizes the accuracy of the synthetic data over a standard measure of distributional similarity, the pMSE, relative to the original data, subject to the constraint of Differential Privacy. It also improves on previous methods by relaxing some assumptions concerning the type or range of the original data. We provide theoretical results for the privacy guarantee and simulations for the empirical failure rate of the theoretical results under typical computational limitations. We also give simulations for the accuracy of statistics generated from the synthetic data compared with the accuracy of non-Differentially Private synthetic data and other previous Differentially Private methods. As an added benefit, our theoretical results also extend previous results on performing classification with Classification and Regression Tree (CART) models under the Differential Privacy setting, enabling the use of CART models with continuous predictors. Justin Petrovich (Penn State) Functional Regression for Sparse and Irregular Functional Data In this work we present a new approach to fitting functional data models with sparsely and irregularly sampled data. The limitations of current methods have created major challenges in fitting more complex nonlinear models. Indeed, currently many models cannot be consistently estimated unless one assumes that the number of observed points per curve grows sufficiently quickly with the sample size. In contrast, we demonstrate an approach that has the potential to produce consistent estimates without such an assumption. Just as importantly, our approach propagates the uncertainty of not having completely observed curves, allowing for a more accurate assessment of uncertainty of parameter estimates, something that most methods currently cannot accomplish. This work is motivated by a longitudinal study on macrocephaly, in which electronic medical records allow for the collection of a great deal of data. However, the sampling is highly variable from child to child. Using our new approach, we are able to clearly demonstrate that the development of pathologies related to macrocephaly is driven by both the overall head circumference of the children and the velocity of their head growth. Kevin Quinlan (Penn State) The Construction of ε-Bad Covering Arrays Covering arrays are commonly used in software testing since testing all possible combinations is often impossible. A t-covering array covers all factor level combinations when projecting the design into any t factors.
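A tiny example of a strength-2 covering array: for three binary factors, the four runs below already contain every pair of levels in every pair of columns. The brute-force check is purely illustrative and is not part of the construction discussed in this poster:

```python
from itertools import combinations, product

runs = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]   # 4 runs, 3 binary factors

def is_covering_array(runs, t=2, levels=(0, 1)):
    k = len(runs[0])
    for cols in combinations(range(k), t):             # every projection onto t factors
        seen = {tuple(r[c] for c in cols) for r in runs}
        if seen != set(product(levels, repeat=t)):
            return False
    return True

print(is_covering_array(runs))   # True: all pairs covered with 4 of the 2^3 = 8 possible runs
```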
In high-cost scenarios, testing 100% of t-factor combinations is infeasible. In this work, the assumption that all factor level combinations for any projection must be covered is relaxed. An ε-bad covering array covers only 100(1-ε)% of the factor level combinations required for a t-covering array across the entire design. Some theoretical bounds for constructions of this type exist, but an explicit construction method is lacking. This work presents the first construction method for ε-bad arrays and results in higher coverage than a given known result. The construction extends to an infinite number of factors, any fixed number of factor levels, and any strength t, where no systematic method had previously existed. The calculation of exact values of ε is detailed, as well as the values when k → ∞. When the cost of running additional experiments is high relative to the cost of missing errors, or only some initial results are needed, these designs would be preferred over traditional t-covering arrays. Follow-up experiments are done using a known construction method to obtain full t-coverage starting from an ε-bad array for some cases. Intermediate steps of this construction can additionally be used as partial covering arrays. Finally, a case study shows the practical use of designs of this type in a hardware security testing application. Kyongwon Kim (Penn State) On Post Dimension Reduction Statistical Inference In contrast to the rapid advances in Sufficient Dimension Reduction (SDR) methodologies, there has been a lack of development in post dimension reduction inference. The outcome of SDR is a set of sufficient predictors, but this is not the end of a typical data analysis process. In most applications, the end product is an estimated statistical model, such as a generalized linear model or a nonlinear regression model, furnished with procedures to construct confidence intervals and test statistical hypotheses. However, to our knowledge, there has not been a systematic framework to perform these tasks after sufficient predictors are obtained. The central issue for post dimension reduction inference is to take into account both the statistical error produced by model estimation and the statistical error of the underlying dimension reduction method. Our idea is to use the influence functions of statistical functionals as a vehicle to achieve this generality. Our post dimension reduction framework is designed such that one can input the influence functions of any dimension reduction method coupled with any estimation method to produce the asymptotic distributions taking both processes into account. Ling Zhang (Penn State) Feature Screening in Ultrahigh Dimensional Varying-coefficient Cox Model In this paper, we propose a two-stage feature screening procedure for the varying-coefficient Cox model with ultrahigh dimensional covariates. The varying-coefficient model is flexible and powerful for modeling the dynamic effects of coefficients. In the literature, the screening methods for the varying-coefficient Cox model are mainly limited to marginal measurements. In contrast to marginal screening, the proposed screening procedure is based on the joint partial likelihood of all predictors. Through this, the proposed procedure can effectively identify active predictors that are jointly dependent on, but marginally independent of, the response. In order to carry out the proposed procedure, we propose an efficient algorithm and establish the ascent property of the proposed algorithm.
We further prove that the proposed procedure possesses the sure screening property: with probability tending to one, the selected variable set includes the actual active predictors. Monte Carlo simulation is conducted to evaluate the finite sample performance of the proposed procedure, with a comparison to SIS procedure and sure joint screening (SJS) for Cox model. The proposed methodology is also illustrated through an empirical analysis of one real data example. Mengyan Li (Penn State) Semiparametric Regression for Measurement Error Model with Heteroscedastic Error Covariate measurement error is a common problem in many different studies. Improper treatment of measurement errors may affect the quality of estimation and the accuracy of inference. There has been extensive literature on the homoscedastic measurement error models. However, heteroscedastic measurement error issue is considered to be a difficult problem with less research available. In this paper, we consider a general parametric regression model allowing covariate measured with heteroscedastic error. We allow both the variance function of the measurement errors and the conditional density function of the error-prone covariate given the error-free covariates to be completely unspecified. We treat the variance function using Bsplines approximation and propose a semiparametric estimator based on efficient score functions to deal with the heteroscedasticity of measurement error. The resulting estimator is consistent and enjoys good inference properties. Its finite sample performance is demonstrated through simulation studies and a real data example. Michelle Pistner (Penn State) 17 Synthetic Data via Quantile Regression Statistical privacy in heavy-tailed data is a common and difficult problem, yet it has not been extensively studied. To this end, we investigated the effectiveness of frequentist and Bayesian quantile regression for generating heavy-tailed synthetic data as a possible privacy method in terms of both data utility and disclosure risk. We compared these syntheses to other commonly used models for synthetic data through simulations and applications to two census data sources. Simulations suggest that quantile regression can outperform other methods on the basis of utility in heavy-tailed data. Applications to real data sources suggest that quantile regression can generate data of high utility, yet maintain some privacy in the tails of the distributions. Sayali Phadke (Penn State) Causal Inference on Networked Data: Applications to Social Sciences Networked interactions among units of interest can lead to the effect of a treatment spreading to untreated units - via their connections to treated units. The Stable Unit Treatment Value Assumption (SUTVA) is held in conventional approaches to causal inference. It breaks down in the presence of such spreading. Many methods of estimating treatment effect presented in the literature incorporate the observed network of connections through a deterministic function of the network. We will enrich this structure by modeling a stochastic function; where whether there is spillover from one unit to the other will probabilistically depend on the structure of the network, their treatment status, and covariates. Our approach to modeling is to break the overall model for causal inference into two separate models; a spillover model and an outcome model. We will use an application from the Political Science domain to illustrate the proposed methodology. 
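A stylized version of that two-model decomposition, with an invented logistic spillover model and a linear outcome model purely to fix ideas (the poster's actual specification may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
A = (rng.random((n, n)) < 0.02).astype(int)      # toy undirected network
A = np.triu(A, 1); A = A + A.T
Z = rng.integers(0, 2, n)                        # treatment assignment
X = rng.normal(size=n)                           # a covariate

# Spillover model: an untreated unit is "exposed" with probability that
# increases with its number of treated neighbours (logistic link, toy coefficients)
treated_nbrs = A @ Z
p_spill = 1 / (1 + np.exp(-(-2.0 + 0.8 * treated_nbrs)))
S = rng.binomial(1, p_spill) * (1 - Z)           # spillover only affects untreated units

# Outcome model: direct effect of Z, spillover effect of S, plus covariate
Y = 1.0 + 2.0 * Z + 1.2 * S + 0.5 * X + rng.normal(size=n)
```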
Trinetri Ghosh (Penn State) Flexible, Feasible and Robust Estimator of Average Causal Effect Existing methods to estimate the average treatment effect are often based on parametric assumptions on the treatment response models or the propensity score function. In practice, misspecification of these models can bias the estimated average treatment effect. Hence, we propose robust methods that are less affected when either a plausible treatment response model or the propensity score function is not available. We use the semiparametric efficient score functions to estimate the propensity score function and the treatment response models. We also study the asymptotic properties of these robust estimators. We conduct simulation studies when both the propensity score function and the treatment response models are correctly specified. Then we compare the behavior of different estimators when only the propensity score function is correctly specified, and the behavior of estimators when the propensity score function is misspecified. We also study the performance of the estimators when all models are misspecified. Wanjun Liu (Penn State) Statistical methods for evaluating the correlation between timeline follow-back data and daily process data: results from a randomized controlled trial Retrospective timeline follow-back data and prospective daily process data have been frequently collected in psychology research to characterize behavioral patterns. Although previous validity studies have demonstrated high correlations between these two types of data, the conventional method adopted in these studies was based on summary measures that may lose critical information and on Pearson's correlation coefficient, which has an undesirable property. This study introduces the functional concordance correlation coefficient to address these issues and provides a new R package to implement it. We use real data collected from a randomized controlled trial to demonstrate the applications of this proposed method and compare its analytical results with those of the conventional method. The results of this real data example indicate that the correlations estimated by the conventional method tend to be higher than those estimated by the proposed method. A simulation study that was designed based on the observed real data and analytical results shows that the magnitude of overestimation associated with the conventional method is greatest when the true correlation is of medium size. The findings of the real data example also imply that daily assessments are particularly beneficial for characterizing more variable behaviors like alcohol use, whereas weekly assessments may be sufficient for low variation events such as marijuana use. Xiufan Yu (Penn State) Revisiting Sufficient Forecasting: Nonparametric Estimation and Predictive Inference Sufficient forecasting (Fan et al., 2017) provides an effective nonparametric forecasting procedure to estimate sufficient indices from high-dimensional predictors in the presence of a possible nonlinear forecast function. In this paper, we first revisit sufficient forecasting and explore its underlying connections to Fama-MacBeth regression and partial least squares. Then, we develop an inferential theory of sufficient forecasting within the high-dimensional framework with large cross sections, a large time dimension, and a diverging number of factors.
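For orientation, the sufficient forecasting framework of Fan et al. (2017) is usually written with a factor model for the predictors and a possibly nonlinear forecast function of a few sufficient indices of the latent factors (the notation below is generic and may differ slightly from the paper's):

```latex
x_{it} = b_i^\top f_t + u_{it}, \qquad i = 1,\dots,p,\; t = 1,\dots,T,
\qquad
y_{t+1} = h\!\left(\phi_1^\top f_t, \dots, \phi_L^\top f_t, \varepsilon_{t+1}\right),
```

where the $\phi_\ell$ are the sufficient forecasting directions estimated from the data.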
We derive the rate of convergence of the estimated factors and loadings and characterize the asymptotic behavior of the estimated sufficient forecasting directions without requiring the restricted linearity condition. The predictive inference of the estimated nonparametric forecasting function is obtained with nonparametrically estimated sufficient indices. We further demonstrate the power of the sufficient forecasting in an empirical study of financial markets. Xuening Zhu (Penn State) Network Vector Autoregression We consider here a large-scale social network with a continuous response observed for each node at equally spaced time points. The responses from different nodes constitute an ultra-high dimensional vector, whose time series dynamic is to be investigated. In addition, the network structure is also taken into consideration, for which we propose a network vector autoregressive (NAR) model. The NAR model assumes each node has response at a given time point as a linear combination of (a) its previous value, (b) the average of its connected neighbors, (c) a set of node-specific covariates, and (d) an independent noise. The corresponding coefficients are referred to as the momentum effect, the network effect, and the nodal effect respectively. Conditions for strict stationarity of the NAR models are obtained. In order to estimate the NAR model, an ordinary least squares type estimator is developed, and its asymptotic properties are 19 investigated. We further illustrate the usefulness of the NAR model through a number of interesting potential applications. Simulation studies and an empirical example are presented. Yuan Ke (Penn State) Robust factor models with covariates We study factor models when the latent factors can be explained by observed covariates. With those covariates, both the factors and loadings are identifiable up to a rotation matrix under finite dimensions, and can be estimated with a faster rate of convergence. To incorporate the explanatory power of these covariates, we propose a two-step estimation procedure: (i) regress the data onto the observables, and (ii) take the principal components of the fitted data to estimate the loadings and factors. The identification and estimation rely on the PCA properties of spiked low-rank matrices, which refer to a class of matrices that are low-rank with fast-diverging eigenvalues. We construct PCA estimators of the spiked low-rank matrix. The proposed estimator is robust to possibly heavy-tailed distributions, which are encountered in many applications for factor analysis. Empirically, our method leads to a substantial improvement on the out-of-sample forecast on the US bond excess return data. Yuji Samizo (Penn State) Secure Statistical Analyses on Distributed Databases Integrating multiple databases that are distributed among different data owners can be beneficial in numerous contexts of biomedical research. But the actual sharing of data is often impeded by concerns about data confidentiality. A situation like this require tools that can produce correct results while preserving data privacy. In recent years, many "secure" protocols have been proposed to solve specific statistical problems such as linear regression and classification. However, factors such as the complexity of these protocols, inability to assess model fit, and the lack of a platform to handle necessary data exchange have all prevented them from actually being used in real-life situations. 
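To give a flavor of how such "secure" computation can work without pooling raw data, here is a minimal additive secret-sharing sketch for computing a sum across parties; it is illustrative only and is not the approach presented in this poster:

```python
import numpy as np

def share(value, n_parties, rng):
    """Split a number into n random shares that sum exactly to the value."""
    shares = rng.normal(scale=1e3, size=n_parties - 1)
    return np.append(shares, value - shares.sum())

rng = np.random.default_rng(7)
private_values = [12.5, 3.0, 40.25]              # one confidential number per party

# Each party distributes one share to every party (including itself) ...
all_shares = np.array([share(v, len(private_values), rng) for v in private_values])
# ... each party sums the shares it received, and only these sums are revealed
partial_sums = all_shares.sum(axis=0)
print(partial_sums.sum())                        # about 55.75, no raw value disclosed
```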
We present practical approaches to perform statistical analyses securely on data held separately by multiple parties, without actually combining the data. Zeng Li (Penn State) On testing for high dimensional white noise Testing for white noise is a classical yet important problem in statistics, especially for diagnostic checks in time series modeling and linear regression. For high-dimensional time series, in the sense that the dimension p is large in relation to the sample size T, the popular omnibus tests, including the multivariate Hosking and Li-McLeod tests, are extremely conservative, leading to substantial power loss. To develop more relevant tests for high-dimensional cases, we propose a portmanteau-type test statistic which is the sum of squared singular values of the first q lagged sample autocovariance matrices. It therefore encapsulates all the serial correlations (up to time lag q) within and across all component series. Using tools from random matrix theory, we derive the normal limiting distributions, as both p and T diverge to infinity, for this test statistic. As the actual implementation of the test requires the knowledge of three characteristic constants of the population cross-sectional covariance matrix and the value of the fourth moment of the standardized innovations, nontrivial estimators are proposed for these parameters, and their integration leads to a practically usable test. Extensive simulation confirms the excellent finite-sample performance of the new test, with accurate size and satisfactory power for a large range of finite (p, T) combinations, therefore ensuring wide applicability in practice. In particular, the new test is consistently superior to the traditional Hosking and Li-McLeod tests. Zheye Yuan (Penn State) Nonlinear Support Vector Machine for Multivariate Functional Data with Applications to fMRI and EEG Data Analysis We propose estimation procedures for the nonlinear support vector machine (SVM) in which the predictor is a vector of random functions and the response is a label. The relation between the response and predictor can be nonlinear, and the sets of observed time points can vary from subject to subject. The functional and nonlinear nature of the problem leads to the construction of two functional spaces: the first representing the functional data, assumed to be a Hilbert space, and the second characterizing nonlinearity, assumed to be a reproducing kernel Hilbert space. A particularly attractive feature of our construction is that the two spaces are nested, in the sense that the kernel for the second space is determined by the inner product of the first. We propose multiple estimators of varying complexity that are effective in different situations. We apply this method to data sets on electroencephalogram (EEG) measurements for potentially alcoholic patients and on functional magnetic resonance imaging (fMRI) measurements for potentially attention deficit hyperactivity disorder (ADHD) patients.
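A minimal sketch of the nesting idea described above, in which a scalar kernel is composed with the Hilbert-space inner product of the curves (illustrative only; the poster's estimators are more general and handle subject-specific time grids):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 50)                      # common grid for this toy example
n = 60
labels = rng.integers(0, 2, n)
# toy curves: class-dependent mean function plus noise
curves = np.sin(2 * np.pi * t) * labels[:, None] + rng.normal(scale=0.5, size=(n, len(t)))

# L2 (first Hilbert space) inner products, approximated on the uniform grid
dt = t[1] - t[0]
G = curves @ curves.T * dt
# Gaussian kernel on the induced distances defines the second (RKHS) space
sq_dist = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G
K = np.exp(-sq_dist)                           # bandwidth fixed at 1 for simplicity

clf = SVC(kernel="precomputed").fit(K, labels)
print(clf.score(K, labels))                    # in-sample accuracy on the toy data
```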