Abstract book
May 8-10, 2018
Abstracts of Invited Talks
(Listed in order of presentation)
Session 1: 9:05 am - 10:25 am, May 9
Chair: Michael Akritas (Penn State)
Title: Estimation of the Boundary of a Variable Observed with Symmetric Error
Speaker: Ingrid Van Keilegom (KU Leuven, Belgium)
Abstract: Consider the model $Y = X + \varepsilon$ with $X = \tau + Z$, where $\tau$ is an
unknown constant (the boundary of $X$), $Z$ is a random variable defined on $R^+$,
$\varepsilon$ is a symmetric error, and $\varepsilon$ and $Z$ are independent. Based on an iid
sample of $Y$, we aim at identifying and estimating the boundary $\tau$ when the law of
$\varepsilon$ is unknown (apart from symmetry) and in particular its variance is unknown. We
propose an estimation procedure based on a minimum distance approach that makes use of
Laguerre polynomials. Asymptotic results as well as finite sample simulations are shown. The
paper also proposes an extension to stochastic frontier analysis, where the model is conditional on
observed variables. The model becomes $Y = \tau(w_1,w_2) + Z + \varepsilon$, where $Y$ is a
cost, $w_1$ are the observed outputs and $w_2$ represents the observed values of other
conditioning variables, so $Z$ is the cost inefficiency. Some simulations illustrate again how the
approach works in finite samples.
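As a small illustration of the data-generating mechanism above (a hypothetical simulation sketch, not the authors' estimation procedure; the exponential and normal choices are assumptions for illustration only):

```python
import numpy as np

# Sketch of the boundary model Y = tau + Z + eps, with Z >= 0 and symmetric error eps.
rng = np.random.default_rng(0)
n, tau = 500, 2.0
Z = rng.exponential(scale=1.0, size=n)    # nonnegative component above the boundary
eps = rng.normal(scale=0.5, size=n)       # symmetric error; its variance is unknown in practice
Y = tau + Z + eps

# Without measurement error, min(Y) would estimate tau; symmetric noise pushes the
# sample minimum below tau, which is the bias an estimation procedure must correct.
print(Y.min(), tau)
```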
Title: Wild residual bootstrap inference for penalized quantile regression with heteroscedastic
errors
Speaker: Lan Wang (University of Minnesota)
Abstract: We consider a heteroscedastic regression model in which some of the regression
coefficients are zero but it is not known which ones. Penalized quantile regression is a useful
approach for analyzing such heterogeneous data. By allowing different covariates to be relevant
for modeling conditional quantile functions at different quantile levels, it permits a more realistic
sparsity assumption and provides a more complete picture of the conditional distribution of a
response variable. Existing work on penalized quantile regression has been mostly focused on
point estimation. It is challenging to estimate the standard error. Although bootstrap procedures
have recently been demonstrated effective for making inference for penalized mean regression,
they are not directly applicable to penalized quantile regression with heteroscedastic errors. We
prove that a wild residual bootstrap procedure recently proposed by Feng et al. (2011) for
unpenalized quantile regression is asymptotically valid for approximating the distribution of a
penalized quantile regression estimator with an adaptive L1 penalty, and that a modified version
of this wild residual bootstrap procedure can be used to approximate the distribution of the L1-
penalized quantile regression estimator. We establish the bootstrap consistency theory, demonstrate
appropriate finite sample performance through a simulation study, and illustrate its application
using an ozone effects data set. The new methods do not need to estimate the unknown error
density function. (Joint work with Ingrid van Keilegom and Adam Maidman)
Title: A two-sample test of equality of means in high dimensional data
Speaker: Haiyan Wang (Kansas State University)
Abstract: This research concerns testing the equality of two sample means in high dimensional
data in which the sample sizes may be much less than the dimension. Improvement can still be
achieved despite significant effort in the recent literature that modifies Hotelling's T2 statistic
by either bypassing the estimation of high dimensional covariance matrices (cf. Chen & Qin
2010 Annals of Stat., Srivastava et al. 2013 JMVA, Gregory et al. 2015 JASA) or estimating the
precision matrix after imposing sparseness condition (Cai et al. 2014 JRSSB). Here we present a
new test statistic that is particularly powerful when the correlation between components of the
data vector reduces as the separation of the component indices increases. The limiting
distribution of the test statistic and power of the test are studied. Simulation results will be
presented to show the numerical performance of the test and to compare with other tests in the
literature.
Title: Bivariate Tests for Location Based on Data Depth
Speaker: Thomas Hettmansperger (Penn State University)
Abstract: Starting with the ideas contained in the data depth paper by Regina Liu (AOS, 1990),
we develop the simplicial data depth concept into a bivariate test of location that is distribution
free under the assumption of angular symmetry. The test statistic counts the number of data
triangles that contain the null hypothesized value. A straightforward method for computing this
statistic is provided. The exact null distribution is discussed, as well as the asymptotic null
distribution along with a formula for an approximate critical value. The method is illustrated on a
data set.
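The counting statistic can be illustrated with a brute-force sketch (hypothetical code, not the straightforward computing method described in the talk); under angular symmetry about the true center, the proportion of data triangles containing that center is about 1/4:

```python
import numpy as np
from itertools import combinations

def contains(p, a, b, c):
    """True if point p lies in the closed triangle with vertices a, b, c (cross-product signs)."""
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    s1, s2, s3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (s1 >= 0 and s2 >= 0 and s3 >= 0) or (s1 <= 0 and s2 <= 0 and s3 <= 0)

def triangle_count_statistic(data, theta0):
    """Proportion of data triangles (all triples of observations) containing theta0."""
    n = len(data)
    hits = sum(contains(theta0, data[i], data[j], data[k])
               for i, j, k in combinations(range(n), 3))
    return hits / (n * (n - 1) * (n - 2) / 6)

rng = np.random.default_rng(1)
x = rng.normal(size=(30, 2))                      # angularly symmetric about the origin
print(triangle_count_statistic(x, np.zeros(2)))   # should be near 1/4
```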
Session 2: 10:40 am - 12:00 noon, May 9
Chair: Bing Li (Penn State)
Title: Multilayer tensor factorization with applications to recommender systems
Speaker: Annie Qu (University of Illinois at Urbana-Champaign)
Abstract: Recommender systems have been widely adopted by electronic commerce and
entertainment industries for individualized prediction and recommendation, which benefit
consumers and improve business intelligence. In this article, we propose an innovative method,
namely the recommendation engine of multilayers (REM), for tensor recommender systems. The
proposed method utilizes the structure of a tensor response to integrate information from
multiple modes, and creates an additional layer of nested latent factors to accommodate between-subjects dependency. One major advantage is that the proposed method is able to address the
"cold-start" issue in the absence of information from new customers, new products or new
contexts. Specifically, it provides more effective recommendations through sub-group
information. To achieve scalable computation, we develop a new algorithm for the proposed
method, which incorporates a maximum block improvement strategy into the cyclic block-wise coordinate descent algorithm. In theory, we investigate both algorithmic properties for global
and local convergence, along with the asymptotic consistency of estimated parameters. Finally,
the proposed method is applied in simulations and IRI marketing data with 116 million
observations of product sales. Numerical studies demonstrate that the proposed method
outperforms existing competitors in the literature. This is joint work with Xuan Bi and Xiaotong
Shen.
Title: On asymmetric dependence in ordinal categorical data: a subcopula-based regression
approach
Speaker: Daeyoung Kim (University of Massachusetts)
Abstract: For the analysis of ordinal contingency tables, a new asymmetric association measure
is developed. The proposed method uses a nonparametric and model-free approach, the bilinear
extension copula-based regression between ordinal categorical variables, to measure the
asymmetric predictive powers of the variables of interest. Unlike the existing measures of
asymmetric association, the proposed bilinear extension copula-based measure is able to capture
nonlinear patterns, and the magnitude of the proposed measure can be interpreted as the degree of
asymmetric association in the ordinal contingency table. The theoretical properties of the
proposed asymmetric association measure are investigated. We illustrate the performance and
advantages of the proposed measure using simulation studies and real data examples.
Title: Clustering Data on the Sphere: State of the art and a Poisson kernel-based Algorithm
Speaker: Marianthi Markatou (University at Buffalo)
Abstract: Model-based clustering of directional data has been widely used by many authors using
mixtures of different directional distributions such as von Mises-Fisher, inverse stereographic
projections of multivariate normal and Watson distributions. We discuss a clustering method
based on mixtures of Poisson kernel based distributions on the sphere, derive the estimates of the
parameters and describe the corresponding clustering algorithm. We discuss convergence of the
algorithm and study the role of initialization on its performance, where performance is measured
by ARI, macro-precision and macro-recall. A comparison study shows that Poisson kernel based
clustering performs, in many cases, better than the state-of-the-art mixture of von Mises-Fisher
distributions. We describe an algorithm for generating high-dimensional directional data from the
Poisson kernel based distribution for a simulation-based comparison, and propose a new method
for estimating the number of clusters in this setting.
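For background, one common parameterization of a Poisson kernel-based density on the unit sphere $S^{d-1}$ (stated here as context; the talk's exact parameterization and normalization may differ) is

$$ f(x;\, \mu, \rho) = \frac{1-\rho^2}{\omega_d\, \|x - \rho\mu\|^{d}}, \qquad x \in S^{d-1},\ \|\mu\| = 1,\ 0 \le \rho < 1, $$

where $\omega_d$ is the surface area of $S^{d-1}$, $\mu$ acts as a mean direction, and $\rho$ as a concentration parameter; the clustering method fits a finite mixture of such components.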
Title: Normalization of transcript degradation improves accuracy in RNA-seq analysis
Speaker: Jiping Wang (Northwestern University)
Abstract: RNA-sequencing (RNA-seq) is a powerful high-throughput tool to profile
transcriptional activities in cells. The observed read counts can be biased by various factors such
that they do not accurately represent the true relative abundance of mRNA transcripts.
Normalization is a critical step to ensure unbiased comparison of gene expression between
samples or conditions. Here we show that the gene-specific heterogeneity of transcript
degradation pattern across samples presents a common and major source of unwanted variation,
and it may substantially bias the results in gene expression analysis. Most existing normalization
approaches focused on global adjustment of systematic bias are ineffective in correcting for this bias.
We propose a novel method based on matrix factorization over-approximation that allows
quantification of RNA degradation of each gene within each sample. The estimated degradation
index scores are used to build a pipeline named DeGNorm (stands for degradation normalization)
to adjust read count for RNA degradation heterogeneity on a gene-by-gene basis while
simultaneously controlling sequencing depth. The robust and effective performance of this
method is demonstrated in an extensive set of real RNA-seq data and simulated data.
Session 3: 9:00 am - 10:20 am, May 10
Chair: Bharath Sriperumbudur (Penn State)
Title: New computational methods for nonparametric quantile regression and its related models
Speaker: Bo Kai (College of Charleston)
Abstract: Quantile regression aims at estimating the conditional quantiles of the response
variable. Compared to least squares regression, quantile regression provides a more
comprehensive picture of the relationship between the response and its covariates. The
optimization for quantile regression is challenging because the objective function is nondifferentiable. In this work, we focus on the optimization problems in nonparametric quantile
regression and its related models. Existing algorithms may yield estimates that are not very
smooth or stable. To address these issues, we propose a new class of algorithms which produce
smoother and more stable estimates in nonparametric quantile regression models. The finite sample
performance of the proposed algorithms is investigated in several numerical studies.
Title: Hyperrectangular Tolerance and Prediction Regions for Setting Multivariate Reference
Regions in Laboratory Medicine
Speaker: Derek Young (University of Kentucky)
Abstract: Reference regions are widely used in clinical chemistry and laboratory medicine to
interpret the results of biochemical or physiological tests of patients. There are well-established
methods in the literature for reference limits for univariate measurements; however, only limited
methods are available for the construction of multivariate reference regions. This is because
traditional multivariate statistical regions (e.g., confidence, prediction, and tolerance regions) are
not constructed based on a hyperrectangular geometry. We address this problem by developing
multivariate hyperrectangular nonparametric tolerance regions for setting the reference regions.
Our approach utilizes statistical data depth to determine which points to trim and then the
extremes of the trimmed dataset are used as the faces of the hyperrectangular region. Extensive
coverage results show the favorable performance of our algorithm provided a minimum sample
size criterion is met. Our procedure is used to obtain reference regions for insulin-like growth
factor concentrations in the serum of healthy adults.
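A minimal sketch of the trim-then-take-extremes construction (hypothetical code; Mahalanobis depth and the 5% trimming rule are stand-ins for the depth notion and calibration actually used):

```python
import numpy as np

def hyperrectangular_region(X, trim_frac=0.05):
    """Trim the shallowest points by a depth measure, then use coordinate-wise
    extremes of the remaining points as the faces of the hyperrectangle."""
    center = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = X - center
    maha = np.einsum('ij,jk,ik->i', d, cov_inv, d)    # squared Mahalanobis distances
    depth = 1.0 / (1.0 + maha)                        # Mahalanobis depth: larger = deeper
    keep = depth >= np.quantile(depth, trim_frac)     # drop the trim_frac shallowest points
    return X[keep].min(axis=0), X[keep].max(axis=0)   # lower and upper faces

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0, 0, 0], cov=np.eye(3), size=200)
lower, upper = hyperrectangular_region(X)
print(lower, upper)
```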
Title: Temporal Exponential-Family Random Graph Models with Time-Evolving Latent Block
Structure for Dynamic Networks
Speaker: Kevin Lee (Western Michigan University)
Abstract: Model-based clustering of dynamic networks has emerged as an important research
topic in statistical network analysis. It is critical to effectively and efficiently model the time-evolving latent block structure of dynamic networks in practice. However, the focus of most
existing methods is on the static or temporally invariant block structure. We present a principled
statistical clustering of large-scale dynamic networks through the temporal exponential-family
random graph models with a hidden Markov structure. The hidden Markov structure is used to
effectively infer the time-evolving block structure of dynamic networks. We prove the
identification conditions for both network parameters and transition matrix in our proposed
model-based clustering. We propose an effective model selection criterion based on the
integrated classification likelihood to choose an appropriate number of clusters. We develop a
scalable variational expectation-maximization algorithm to efficiently solve the approximate
maximum likelihood estimate. The numerical performance of our proposed method is
demonstrated in simulation studies and two real data applications to dynamic international trade
networks and dynamic email networks of a large institute.
Title: Causal Inference via Balancing Covariates
Speaker: Yeying Zhu (University of Waterloo, Canada)
Abstract: An important goal in estimating the causal effect is to achieve balance in the covariates.
We propose using kernel distance to measure balance across different treatment groups and
propose a new propensity score estimator by setting the kernel distance to be zero. Compared to
other balance measures, such as absolute standardized mean difference (ASMD) and
the Kolmogorov-Smirnov (KS) statistic, kernel distance is one of the best bias indicators in
estimating the causal effect. The estimating equations are solved by generalized method of
moments. Simulation studies are conducted across different scenarios varying in the degree of
nonlinearity in both the propensity score model and the outcome model. The proposed approach
outperforms many existing approaches including the well-known covariate balance propensity
score (CBPS) approach when the propensity score model is mis-specified. An application to data
from the International Tobacco Control (ITC) policy evaluation project is provided.
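To illustrate the idea of a kernel distance as a covariate-balance measure (a hypothetical sketch; the authors go further and choose the propensity score so that this distance is zero, solving the estimating equations by GMM), one can compute a weighted MMD-type distance between treated and control groups:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def weighted_kernel_distance(X, treat, w, gamma=1.0):
    """Squared MMD-type kernel distance between weighted treated and control covariates."""
    wt = w[treat == 1] / w[treat == 1].sum()
    wc = w[treat == 0] / w[treat == 0].sum()
    Xt, Xc = X[treat == 1], X[treat == 0]
    return (wt @ rbf_kernel(Xt, Xt, gamma) @ wt
            - 2 * wt @ rbf_kernel(Xt, Xc, gamma) @ wc
            + wc @ rbf_kernel(Xc, Xc, gamma) @ wc)

# Toy example with (hypothetical) inverse-propensity weights, which typically shrink the distance.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
p = 1 / (1 + np.exp(-X[:, 0]))                 # true propensity depends on the first covariate
treat = rng.binomial(1, p)
w = np.where(treat == 1, 1 / p, 1 / (1 - p))
print(weighted_kernel_distance(X, treat, np.ones(len(X))),
      weighted_kernel_distance(X, treat, w))
```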
Session 4: 10:35 am-11:55 am, May 10
Chair: Le Bao (Penn State)
Title: Transformed Variable Selection in Sufficient Dimension Reduction
Speaker: Yuexiao Dong (Temple University)
Abstract: In this paper, we combine variable transformation with sufficient dimension reduction
to achieve model-free variable selection. Existing model-free variable selection methods via
sufficient dimension reduction require a critical assumption that the predictor distribution is
elliptically contoured. We suggest a nonparametric variable transformation method after which
the predictors become normal. Variable selection is then performed based on the marginally
transformed predictors. Asymptotic theory is established to support the proposed method. The
desirable variable selection performance of the proposed method is demonstrated through
simulation studies and a real data analysis.
Title: Central Quantile Subspace
Speaker: Eliana Christou (University of North Carolina at Charlotte)
Abstract: Existing dimension reduction techniques focus on the conditional distribution of the
response given the covariates, where specific interest lies in statistical functionals of the
distribution, such as the conditional mean, conditional variance and conditional quantile. We
introduce a new method for inference about the conditional quantile of the response given the
covariates and introduce the notion of the Central Quantile Subspace (CQS). The purpose of this
paper is threefold. First, we focus on cases where the tau-th conditional quantile, for tau in (0,1),
depends on the predictor X through a single linear combination B'X and we show that we can
estimate B consistently up to a multiplicative scalar, even though the estimate might be based on a
misspecified link function. Second, we extend the result to tau-th conditional quantiles that depend
on X through a d - dimensional linear combination B'X, where B is a p x d matrix, d>1, and propose
an iterative procedure to produce more vectors in the tau-th CQS, which are shown to be root n
consistent. Third, we extend our proposed methodology by considering any statistical functional of
the conditional distribution and estimate the fewest linear combinations of X that contain all the
information on that functional.
Title: A joint learning of multiple precision matrices with sign consistency
Speaker: Yuan Huang (University of Iowa)
Abstract: The Gaussian graphical model is a popular tool for inferring the relationships among
random variables, where the precision matrix has a natural interpretation of conditional
independence. With high-dimensional data, sparsity of the precision matrix is often assumed, and
various regularization methods have been applied for estimation. Under quite a few important
scenarios, it is desirable to conduct the joint estimation of multiple precision matrices. In joint
estimation, entries corresponding to the same element of multiple precision matrices form a
group, and group regularization methods have been applied for estimation and identification of
the sparsity structures. For many practical examples, it can be difficult to interpret the results
when parameters within the same group have conflicting signs. Unfortunately, the existing
methods lack an explicit mechanism concerning the sign consistency of group parameters.
To tackle this problem, we develop a novel regularization method for the joint estimation of
multiple precision matrices. It effectively promotes the sign consistency of group parameters and
hence can lead to more interpretable results, while still allowing for conflicting signs to achieve
full flexibility. Its consistency properties are rigorously established. Simulation shows that the
proposed method outperforms the competing alternatives under a variety of settings.
Title: Decoding the Perception of Music Genres with High-resolution fMRI Data
Speaker: Han Hao (University of North Texas)
Abstract: Recent studies have demonstrated a close relationship between computational acoustic
features and neural brain activities, and have largely advanced our understanding of auditory
information processing in the human brain. However, differentiating music genres requires
mental processing that is sensitive to specific auditory and schematic information, the precise
features of which, as well as their cortical organization, are yet to be properly understood. We
developed a multivariate clustering approach for fMRI data based on stimulus encoding models.
Analysis of fMRI data from music-listening tasks yielded significant clusters and revealed
geometric patterns of music cognition.
Session 5: 2:00 pm - 3:20 pm, May 10
Chair: Aleksandra Slavkovic (Penn State)
Title: Improving Small-Area Estimates of Disability: Combining the American
Community Survey with the Survey of Income and Program Participation
Speaker: Jerry Maples (U.S. Census Bureau)
Abstract: The Survey of Income and Program Participation (SIPP) is designed to make national
level estimates of changes in income, eligibility for and participation in transfer programs,
household and family composition, labor force behavior, and other associated events. Used
cross-sectionally, the SIPP is the source for commonly accepted estimates of disability
prevalence, having been cited in the findings clause of the Americans with Disabilities Act.
Because of its sample size, SIPP is not designed to produce highly reliable estimates for
individual states. The American Community Survey (ACS) is a large sample survey which is
designed to support estimates of characteristics at the state and county level; however, the
questions about disability in the ACS are not as comprehensive and detailed as in SIPP. We
propose combining the information from the SIPP and ACS surveys to improve (i.e., lower the
variances of) state estimates of disability (as defined by SIPP).
Title: Helping Users to Make High-Quality Products -- My Wonderful Professional Life at
Minitab Inc
Speaker: Yanling Zuo (Minitab Inc)
Abstract: My presentation covers the following aspects: 1) background information on Minitab
Inc, the Six Sigma methodology, and myself; 2) a sample project on developing stability study
commands for estimating the shelf life of a drug for the pharmaceutical industry, which
illustrates my work at Minitab; and 3) a reflection on my work experience that promotes strong
collaborations among statisticians, computer scientists, and subject-knowledge experts to solve
concrete problems using big data. I sincerely hope this type of collaboration can be built into the
PSU statistics department's undergraduate and graduate education programs.
Title: Use of Advanced Statistical Techniques in Banking
Speakers: Xiaoyu Liu & Hanyu Yang (Wells Fargo)
Abstract: There is a great deal of interest in the use of advanced statistical methodologies in
the finance industry. In the presentation, we will give an overview of the statistical approaches we
developed for predictive modeling in credit and operational risk management, including varying-coefficient loss forecast models and fractional response modeling techniques. We will also
introduce the use of machine learning techniques such as ensemble algorithms and neural
networks in loss prediction, as well as the diagnostics and interpretable tools for opening up the
"black box" of machine learning techniques. If time permits, we will describe the quantitative
communities at Wells Fargo as well as employment opportunities.
Title: A novel method for estimating the causal effects of latent classes in complex survey data.
Speaker: Joseph Kang (US Centers for Disease Control and Prevention)
Abstract: In the literature of behavioral sciences, latent class analysis (LCA) has been used to
effectively cluster multiple survey items. Statistical inference with an exposure variable, which is
identified by the LCA model, is challenging because 1) the exposure variable is essentially
missing and harbors the uncertainty of estimating parameters in the LCA model and 2)
confounding bias adjustments need relevant propensity score models. In addition to these
challenges, complex survey design features and survey weights will have to be accounted
for if they are present. Our solutions to these issues are to 1) assess point estimates with the
design-based estimating function approach which was described in Binder (1983) and 2) obtain
variance estimates with the Jackknife technique. Using the NHANES data set, our LCA model
identified a latent class for men who have sex with men (MSM) and built new propensity score
weights that adjusted the prevalence rates of Herpes Simplex Virus Type 2 (HSV-2) for MSM.
Abstracts of Posters
(Ordered by first name)
Ann Johnston (Penn State)
An Algebraic Approach to Categorical Data Fusion for Population Size Estimation
Information from two or more administrative registries can be combined to estimate (via capture-recapture) the size of a population. As a preliminary step, this requires assignment of between-registry record linkage. Often, errors are present in the fused data, with the record linkage
assignment process that was used suggesting an error mechanism (e.g., ghost record creation, failure to match,
band reading error). Recent work has used a latent multinomial model (in a Bayesian framework)
to incorporate record linkage error mechanisms into the population size estimation process.
Given observed capture histories, the associated fiber of all possible latent capture histories can
be explored via a block Metropolis-Hastings algorithm, with chain irreducibility guaranteed by
proposing moves from a Markov basis appropriate to the error mechanism. Existing work has
assumed independence between registries. We relax this assumption, studying the fiber of latent
capture histories under minimal odds ratio assumptions. Further, we use algebraic ideas to extend
this approach to the setting of registries equipped with covariates, where the problem becomes
the more general problem of data fusion.
Ardalan Mirshani (Penn State)
Adaptive Function-on-Scalar Smoothing Elastic Net
We propose a new methodology, called AFSSEN, to simultaneously select important variables
and produce smooth estimates of their parameters in a function-on-scalar linear model with sub-Gaussian errors and high-dimensional predictors. In our model, the data live in a general real
separable Hilbert space, H, but an arbitrary linear operator of the parameters is enforced to lie
in an RKHS, K, so that the parameter estimates inherit properties from the RKHS kernel, such as
smoothness and periodicity. We use a regularization method that exploits an adaptive Elastic Net
penalty where the L1 and L2 norms are introduced in H and K respectively. Using two norms
we are able to better control both smoothing and variable selection. AFSSEN is illustrated via a
simulation study and microbiome data using a very fast algorithm for computing the estimators
based on a functional coordinate descent whose interface is written in R, but with the backend
written in C++.
Arun Srinivasan (Penn State)
Compositional Knockoff Filter for FDR Control in Microbiome Regression Analysis
A critical task in microbiome analysis is to identify microbial taxa that are associated with a
response of interest. Most existing statistical methods examine the association between the
response and one microbiome feature at a time, followed by a multiple testing adjustment
such as false discovery rate (FDR) control. Despite their feasibility, these methods are often
underpowered due to some unique characteristics of microbiome data, such as high dimensionality, compositional constraints and complex correlation structure. In this paper, we
adapt the use of the knockoff filter to provide strong finite sample false discovery rate control in
the context of linear log-contrast models for regression analysis of microbiome compositional
data. As an alternative to applying a multiple testing correction to a large number of individual p-values,
our framework achieves FDR control in a regression model that jointly analyzes the whole
microbiome community. By imposing l1-regularization in the regression model, a subset of
bugs is selected as related to the response under a pre-specified FDR threshold. The proposed
method is demonstrated via simulation studies and its usefulness is illustrated by an application
to a microbiome study relating microbiome composition to host gene expression.
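For context, the linear log-contrast model referenced above is typically written (generic notation) as

$$ y_i = \sum_{j=1}^{p} \beta_j \log x_{ij} + \varepsilon_i, \qquad \text{subject to } \sum_{j=1}^{p} \beta_j = 0, $$

where the $x_{ij}$ are the compositional abundances of the $p$ taxa in sample $i$ and the zero-sum constraint makes the model invariant to the compositional scale; the knockoff filter is then applied to select the nonzero $\beta_j$ while controlling the FDR.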
Beomseok Seo (Penn State)
Computing Mean Partition and Assessing Uncertainty for Clustering Analysis
In scientific data analysis, clusters identified computationally often substantiate existing or
motivate new hypotheses. Due to the combinatorial nature of the clustering result, which is a
partition rather than a set of parameters or a function, the notions of mean and variance are not
clear-cut. This intrinsic difficulty hinders the development of methods to improve clustering by
aggregation or to assess the uncertainty of clusters generated. We overcome the barrier by
aligning clusters via soft matching solved by optimal transport. Equipped with this technique,
we propose a new algorithm to enhance clustering by any baseline method using bootstrap
samples. In addition, the cluster alignment enables us to quantify variation in the clustering result
at the levels of both overall partitions and individual clusters. Topological relationships between
clusters such as match, split, and merge can be revealed. A confidence point set for each cluster,
a concept akin to the confidence interval, is proposed. The tools we have developed here will help
address the crucial question of whether any cluster is an intrinsic or spurious pattern.
Experimental results on both simulated and real data sets are provided.
Changcheng Li (Penn State)
Optimal Projection Tests in High Dimensionality
Hypothesis testing is of great importance in multivariate statistics; examples include one-sample
mean testing and two-sample mean testing problems. When the dimensionality of a population is high,
classical methods like the likelihood ratio test and Hotelling's $T^2$ test become infeasible due to
the singularity of the sample covariance matrix. In this article, we propose a framework called the $U$-projection test to deal with hypothesis testing of high-dimensional multivariate linear regression
coefficients. We first discuss projection tests for the hypothesis testing of linear regression
coefficients and obtain results about the optimal projection direction.
The $U$-projection test utilizes the information provided by the covariance in the construction of the test
in an optimal way, while avoiding the power loss caused by sample splitting in the sample-splitting projection test. It gives us flexibility in the estimation of the projection direction, and
different estimation scheme of projection directions can lead to different tests. In fact, the
framework of $U$-projection test gives us a way to extend various existing test statistics to
general linear regression coefficient testing problems from mean testing problems. This
flexibility makes $U$-projection test applicable to various alternative hypotheses (whether they
are sparse or not) in various covariance structure situations. Various properties of $U$-projection
test and its connection to some existing tests are studied and numerical studies are also carried
out. We show that the proposed $U$-projection test is asymptotically equivalent to some existing
tests in low-correlation cases and is superior in some high-correlation ones.
Christian Schmid (Penn State)
Exponential random graph models with big networks: Maximum pseudolikelihood estimation
and the parametric bootstrap
With the growth of interest in network data across fields, the Exponential Random Graph Model
(ERGM) has emerged as the leading approach to the statistical analysis of network data. ERGM
parameter estimation requires the approximation of an intractable normalizing constant.
Simulation methods represent the state-of-the-art approach to approximating the normalizing
constant, leading to estimation by Monte Carlo maximum likelihood (MCMLE). MCMLE is
accurate when a large sample of networks is used to approximate the normalizing constant.
However, MCMLE is computationally expensive, and may be prohibitively so if the size of the
network is on the order of 1,000 nodes (i.e., one million potential ties) or greater. When the
network is large, one option is maximum pseudolikelihood estimation (MPLE). The standard
MPLE is simple and fast, but generally underestimates standard errors. We show that a
resampling method, the parametric bootstrap, results in accurate coverage probabilities for
confidence intervals. We find that bootstrapped MPLE can be run in 1/5th the time of MCMLE.
We study the relative performance of MCMLE and MPLE with simulation studies, and illustrate
the two different approaches by applying them to a network of bills introduced in the United
States Senate.
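For readers less familiar with MPLE, the pseudolikelihood replaces the intractable ERGM likelihood with a product of conditional tie probabilities (generic form shown here, with the sum running over dyads), which amounts to a logistic regression on the change statistics:

$$ \ell_{PL}(\theta) = \sum_{(i,j)} \log P_\theta\big(Y_{ij} = y_{ij} \mid Y_{-ij} = y_{-ij}\big), \qquad \operatorname{logit} P_\theta\big(Y_{ij} = 1 \mid Y_{-ij}\big) = \theta^{\top} \delta_{ij}(y), $$

where $\delta_{ij}(y)$ is the vector of change statistics obtained by toggling the tie $(i,j)$; the parametric bootstrap then simulates networks from the fitted model and re-estimates $\theta$ on each to form confidence intervals.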
Claire Kelling (Penn State)
Combining Geographic and Social Proximity to Model Urban Domestic and Sexual Violence
In order to understand the dynamics of crime in urban areas, it is important to investigate the
socio-demographic attributes of the communities as well as the interactions between
neighborhoods. If there are strong social ties between two neighborhoods, they may be more
likely to transfer ideas, customs, and behaviors between them. This implies that not only crime
itself but also crime prevention and interventions could be transferred along these social ties.
Most studies on crime rate inference use spatial statistical models such as spatially weighted
regression to take into account spatial correlation between neighborhoods. However, in order to
obtain a more flexible model for how crime may be related across communities, one must take
into account social proximity in addition to geographic proximity. In this paper, we develop
techniques to combine geographic and social proximity in spatial generalized linear mixed
models in order to estimate domestic and sexual violence in Detroit, Michigan and Arlington
County, Virginia. The analysis relies on combining data from local and federal data sources such
as the Police Data Initiative and American Community Survey. By comparing three types of
CAR models, we conclude that adding information on social proximity to spatial models
creates more accurate estimates of crime in communities.
Debmalya Nandy (Penn State)
Covariate Information Number for Feature Screening in Ultrahigh Dimension
Modern technological advances in various scientific fields generate ultrahigh-dimensional
supervised data with sparse signals, i.e. a limited number of samples (n) each with a very large
number of covariates (p >> n), only a small share of which is truly associated with the response.
In these settings, major concerns on computational burden, algorithmic stability, and statistical
accuracy call for substantially reducing the feature space by eliminating redundant covariates
before the application of any sophisticated statistical analysis. Following the development of
Sure Independence Screening (Fan and Lv, 2008, JRSS-B) and other model- and correlation-based feature screening methods, we propose a model-free procedure called Covariate
Information Screening (CIS). CIS uses a marginal utility built upon Fisher Information,
possesses the sure screening property, and is applicable to any type of response. An extensive
simulation study and an application to transcriptomic data in rats reveal CIS’s comparative
strengths over some popular feature screening methods.
Elena Hadjicosta (Penn State)
Consistent Goodness-of-Fit Tests for Gamma Distributions Based on Empirical Hankel
Transforms
In recent years, integral transforms have been used widely in statistical inference, especially in
goodness-of-fit testing. Gamma distributions, with known shape parameters, are frequently used
as models in many research areas, such as queueing theory, neuroscience, reliability theory and
life testing. In this talk, we apply Hankel transform theory to propose an integral-type test
statistic for goodness-of-fit tests for gamma distributions with known shape parameters. We
prove that the null asymptotic distribution of the test statistic is a weighted sum of independent
chi-square distributed random variables. Further, we show that the proposed test is consistent
against each fixed alternative distribution, and we derive the non-null asymptotic distribution of
the test statistic under a sequence of contiguous alternatives.
Ge Zhao (Penn State)
Mean residual life modeling and estimation with censoring and many covariates: An application
in kidney transplant for renal failure patients
We propose a flexible and general mean residual life model to predict an individual's residual life
given his/her covariates. The prediction is based on an efficient semiparametric estimation of the
covariate effect and a nonparametric estimation of the mean residual life function. This allows us
to quantify the benefit that a renal failure patient would obtain from a potential kidney transplant
by comparing the difference between the expected residual lives of the patient if s/he receives the
transplant and if s/he does not. It is a rational decision to allocate a kidney to the patient who
would have the largest residual life increment among all those that are eligible for the transplant.
Our analysis on the kidney transplant data from the U.S. Scientific Registry of Transplant
Recipients indicates that the most important factor in making such a decision is the waiting time
for the transplant. We provide a clear formula that can be used to predict the potential gain of a
patient given his/her covariate information and his/her waiting time. Generally speaking, a
patient who has waited a shorter time for a kidney transplant has a larger potential gain. We
also identified an index which serves as an important predictor of a patient's gain from receiving a
kidney transplant if the waiting time is approximately between 1.5 and three years. Our
general modeling and analysis strategies can be adopted to study other organ transplant problems.
Hanyu Yang & Xiaoyu Liu (Wells Fargo Bank)
Statistical Applications in Bank Risk Management
Statistical methodologies are extensively used in credit risk and operational risk management in
the banking industry. In this presentation, we briefly describe statistical methodologies developed
for risk modeling in Wells Fargo. The applications will include varying-coefficient loss forecast
models, machine learning algorithms for benchmarking, feature engineering and interpretation
tools. The presentation will be a joint work of Xiaoyu Liu and Hanyu Yang.
Hyun Bin Kang (Penn State)
A Functional Approach to Manifold Data Analysis with an Application to High-Resolution 3D
Facial Imaging
Many scientific areas are faced with the challenge of extracting information from large, complex,
and highly structured data sets. A great deal of modern statistical work focuses on developing
tools for handling such data. We present a new subfield of functional data analysis (FDA), which
we call Functional Manifold Data Analysis, or FMDA. FMDA is concerned with the statistical
analysis of samples where one or more variables measured on each unit is a manifold, thus
resulting in as many manifolds as we have units. We propose a framework that converts
manifolds into functional objects, a 2-step functional principal component method, and a
manifold-on-scalar regression model. This work is motivated by and thus described with an
anthropological application involving 3D facial imaging data. The proposed framework is used
to understand how individual characteristics, such as age and genetic ancestry, influence the
shape of the human face.
Jiawei Wen (Penn State)
Parallel multiblock ADMM for large scale optimization problems
In recent years, the alternating direction method of multipliers (ADMM) has drawn considerable
attention due to its applicability to massive optimization problems. For distributed learning
problems, the multiblock ADMM and its variations have been rigorously studied in the literature.
The idea is to partition the original problem into subproblems, each containing a subset of
training samples or the learning parameters. At each iteration, the worker processors solve the
subproblems and send the up-to-date variables to the master, who summarizes and broadcasts the
results to the workers. Hence, a given large-scale learning problem can be solved in a parallel
and distributed way. In this paper, we apply the multiblock ADMM algorithm to parameter
estimation in ultrahigh dimensional problems. The algorithm is based on a convergent 3-block
ADMM algorithm. We restrict our attention to four important statistical problems: Dantzig
selector, l1 penalized quantile regression, sparse linear discriminant analysis, and l1 norm
support vector machine. A number of numerical experiments are performed to demonstrate the
high efficiency and accuracy of the proposed method in high dimensions.
Jordan Awan (Penn State)
Structure and Sensitivity in Differential Privacy: Comparing K-Norm Mechanisms
Limiting the disclosure risk of statistical analyses is a long-standing problem, and as more data
are collected and shared, concerns about confidentiality arise and accumulate. Differential
Privacy (DP) is a rigorous framework of quantifying the disclosure risk of statistical procedures
computed on sensitive data. DP methods/mechanisms require the introduction of additional
randomness beyond sampling in order to limit the disclosure risk. However, implementations
often introduce excessive noise, reducing the utility and validity of statistical results, especially
in finite samples. We study the class of K-Norm Mechanisms with the goal of releasing a
statistic T with minimum finite sample variance. The comparison of these mechanisms is
naturally related to the geometric structure of T. We introduce the adjacent output space S_T of
T, which allows for the formal comparison of K-Norm Mechanisms, and the derivation of the
uniformly-minimum-variance mechanism as a function of S_T. Using our methods, we extend
the Objective Perturbation and the Functional Mechanisms, and apply them to Logistic and
Linear Regression, allowing for private releases of statistical results. We compare the
performance through simulations, and on a housing price dataset, demonstrating that our
proposed methodology offers a substantial improvement in utility for the same level of risk.
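For background, a K-Norm Mechanism in one standard generic form (notation here is ours, not necessarily the poster's) releases the statistic $T(D)$ with noise whose density decays in a chosen norm:

$$ f(z \mid D) \propto \exp\!\left( -\frac{\epsilon}{\Delta_K}\, \| z - T(D) \|_K \right), $$

where $\|\cdot\|_K$ is the norm with unit ball $K$ and $\Delta_K$ is the sensitivity of $T$ measured in that norm; taking $K$ to be the $\ell_1$ ball recovers the familiar Laplace mechanism, and the adjacent output space $S_T$ mentioned above governs which choice of $K$ yields the smallest finite sample variance.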
Joshua Snoke (Penn State)
Differentially Private Synthetic Data with Maximal Distributional Similarity
This paper concerns methods for the release of Differentially Private synthetic data sets. In many
cases, data contains sensitive values which cannot be released in their original form in order to
protect individuals’ privacy. Synthetic data is a traditional protection method that releases
alternative values in place of the original ones, and Differential Privacy is a formal guarantee for
quantifying how disclosive any release may be. Our method maximizes the accuracy of the
synthetic data over a standard measure of distributional similarity, the pMSE, relative to the
original data, subject to the constraint of Differential Privacy. It also improves on previous
methods by relaxing some assumptions concerning the type or range of the original data.
We provide theoretical results for the privacy guarantee and simulations for the empirical failure
rate of the theoretical results under typical computational limitations. We also give simulations
for the accuracy of statistics generated from the synthetic data compared with the accuracy of
non-Differentially Private synthetic data and other previous Differentially Private methods. As
an added benefit, our theoretical results also extend previous results on performing classification
with Classification and Regression Tree (CART) models under the Differential Privacy setting,
enabling the use of CART models with continuous predictors.
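For reference, the pMSE utility measure used above is commonly computed (generic form) by stacking the original and synthetic records, fitting a model for the probability that each record is synthetic, and averaging the squared deviations of these propensities from their null value:

$$ pMSE = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{p}_i - c\right)^2, \qquad c = \frac{n_{syn}}{n_{orig} + n_{syn}}, \quad N = n_{orig} + n_{syn}, $$

so values near zero indicate synthetic data that are hard to distinguish from the original.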
Justin Petrovich (Penn State)
Functional Regression for Sparse and Irregular Functional Data
In this work we present a new approach to fitting functional data models with sparsely and
irregularly sampled data. The limitations of current methods have created major challenges in
fitting more complex nonlinear models. Indeed, currently many models cannot be consistently
estimated unless one assumes that the number of observed points per curve grows sufficiently
quickly with the sample size. In contrast, we demonstrate an approach that has the potential to
produce consistent estimates without such an assumption. Just as importantly, our approach
propagates the uncertainty of not having completely observed curves, allowing for a more
accurate assessment of uncertainty of parameter estimates, something that most methods
currently cannot accomplish. This work is motivated by a longitudinal study on macrocephaly, in
which electronic medical records allow for the collection of a great deal of data. However, the
sampling is highly variable from child to child. Using our new approach we are able to clearly
demonstrate that the development of pathologies related to macrocephaly is driven by both the
overall head circumference of the children as well as the velocity of their head growth.
Kevin Quinlan (Penn State)
The Construction of ε-Bad Covering Arrays
Covering Arrays are commonly used in software testing since testing all possible combinations is
often impossible. A t-covering array covers all factor level combinations when projecting the
design into any t factors. In high-cost scenarios, testing 100% of t-factor combinations is
infeasible. In this work, the assumption that all factor level combinations for any projection must
be covered is relaxed. An ε-bad Covering Array covers only a (1-ε) fraction of the factor level
combinations required for a t-covering array across the entire design. Some theoretical bounds
for constructions of this type exist, but an explicit construction method is lacking. This work
presents the first construction method for ε-bad arrays and results in a higher coverage than a
given known result. The construction extends to an infinite number of factors, any fixed number
of factor levels, and any strength t, where no systematic method had previously existed. The
calculation of exact values of ε is detailed as well as the values when k → ∞. When the cost
of running additional experiments is high relative to the cost of missing errors, or only some
initial results are needed, these designs would be preferred over traditional t-covering arrays.
Follow up experiments are done using a known construction method to obtain full t-coverage
starting from an ε-bad array for some cases. Intermediate steps of this construction can
additionally be used as Partial Covering Arrays. Finally, a case study shows the practical use of
designs of this type in a hardware security testing application.
Kyongwon Kim (Penn State)
On Post Dimension Reduction Statistical Inference
In contrast to the rapid advances in Sufficient Dimension Reduction (SDR) methodologies, there
has been a lack of development in post dimension reduction inference. The outcome of SDR is a
set of sufficient predictors, but this is not the end of a typical data analysis process. In most
applications, the end product is an estimated statistical model, such as a Generalized Linear
Model or a nonlinear regression model, furnished with procedures to construct confidence
intervals and test statistical hypotheses. However, to our knowledge, there has not been a
systematic framework to perform these tasks after sufficient predictors are obtained. The central
issue for post dimension reduction inference is to take into account both the statistical error
produced by model estimation and the statistical error produced by the underlying dimension reduction method. Our
idea is to use the influence functions of statistical functionals as a vehicle to achieve this
generality. Our post dimension reduction framework is designed such that one can input the
influence functions of any dimension reduction method coupled with any estimation method to
produce the asymptotic distributions taking both processes into account.
Ling Zhang (Penn State)
Feature Screening in Ultrahigh Dimensional Varying-coefficient Cox Model
In this paper, we propose a two-stage feature screening procedure for varying-coefficient Cox
model with ultrahigh dimensional covariates. The varying-coefficient model is flexible and
powerful for modeling the dynamic effects of coefficients. In the literature, the screening
methods for the varying-coefficient Cox model are mainly limited to marginal measurements.
In contrast to marginal screening, the proposed screening procedure is based on the
joint partial likelihood of all predictors. Through this, the proposed procedure can effectively
identify active predictors that are jointly dependent on, but marginally independent of, the
response. In order to carry out the proposed procedure, we propose an efficient algorithm and
establish the ascent property of the proposed algorithm. We further prove that the proposed
procedure possesses the sure screening property: with probability tending to one, the selected
variable set includes the actual active predictors. Monte Carlo simulation is conducted to
evaluate the finite sample performance of the proposed procedure, with a comparison to the SIS
procedure and sure joint screening (SJS) for the Cox model. The proposed methodology is also
illustrated through an empirical analysis of one real data example.
Mengyan Li (Penn State)
Semiparametric Regression for Measurement Error Model with Heteroscedastic Error
Covariate measurement error is a common problem in many different studies. Improper
treatment of measurement errors may affect the quality of estimation and the accuracy of
inference. There has been extensive literature on the homoscedastic measurement error models.
However, the heteroscedastic measurement error issue is considered a difficult problem with
less research available. In this paper, we consider a general parametric regression model
allowing a covariate to be measured with heteroscedastic error. We allow both the variance function of
the measurement errors and the conditional density function of the error-prone covariate given
the error-free covariates to be completely unspecified. We treat the variance function using a B-spline approximation and propose a semiparametric estimator based on efficient score functions
to deal with the heteroscedasticity of measurement error. The resulting estimator is consistent
and enjoys good inference properties. Its finite sample performance is demonstrated through
simulation studies and a real data example.
Michelle Pistner (Penn State)
Synthetic Data via Quantile Regression
Statistical privacy in heavy-tailed data is a common and difficult problem, yet it has not been
extensively studied. To this end, we investigated the effectiveness of frequentist and Bayesian
quantile regression for generating heavy-tailed synthetic data as a possible privacy method in
terms of both data utility and disclosure risk. We compared these syntheses to other commonly
used models for synthetic data through simulations and applications to two census data sources.
Simulations suggest that quantile regression can outperform other methods on the basis of utility
in heavy-tailed data. Applications to real data sources suggest that quantile regression can
generate data of high utility, yet maintain some privacy in the tails of the distributions.
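A minimal sketch of quantile-regression-based synthesis (hypothetical code; the package, quantile grid, and sampling rule are illustrative choices, not the authors' exact procedure, and carry no privacy guarantee by themselves):

```python
import numpy as np
import statsmodels.api as sm

def synthesize_qr(y, X, taus=np.linspace(0.05, 0.95, 19), seed=0):
    """Synthesize y by drawing a random quantile level per record and returning
    the fitted conditional quantile from a quantile regression at that level."""
    rng = np.random.default_rng(seed)
    Xc = sm.add_constant(X)
    fits = {tau: sm.QuantReg(y, Xc).fit(q=tau) for tau in taus}        # one fit per level
    u = rng.uniform(taus.min(), taus.max(), size=len(y))               # random level per record
    nearest = taus[np.abs(taus[None, :] - u[:, None]).argmin(axis=1)]  # snap to the grid
    return np.array([fits[t].predict(Xc[i:i + 1])[0] for i, t in enumerate(nearest)])

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 1))
y = 1 + 2 * X[:, 0] + rng.standard_t(df=3, size=200)   # heavy-tailed errors
print(synthesize_qr(y, X)[:5])
```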
Sayali Phadke (Penn State)
Causal Inference on Networked Data: Applications to Social Sciences
Networked interactions among units of interest can lead to the effect of a treatment spreading to
untreated units via their connections to treated units. The Stable Unit Treatment Value
Assumption (SUTVA) is maintained in conventional approaches to causal inference but breaks down in
the presence of such spreading. Many methods of estimating the treatment effect presented in the
literature incorporate the observed network of connections through a deterministic function of
the network. We will enrich this structure by modeling a stochastic function, where whether
there is spillover from one unit to another will depend probabilistically on the structure of the
network, their treatment statuses, and covariates. Our approach to modeling is to break the overall
model for causal inference into two separate models; a spillover model and an outcome model.
We will use an application from the Political Science domain to illustrate the proposed
methodology.
Trinetri Ghosh (Penn State)
Flexible, Feasible and Robust Estimator of Average Causal Effect
Existing methods to estimate the average treatment effect are often based on parametric
assumptions on the treatment response models or the propensity score function. In reality,
misspecification of these models can bias the estimated average treatment effect. Hence we
propose robust methods which are less affected when either plausible treatment response
models or a plausible propensity score function is not available. We use the semiparametric efficient
score functions to estimate the propensity score function and the treatment response models. We
also study the asymptotic properties of these robust estimators. We conduct simulation studies
when both the propensity score function and the treatment response models are correctly
specified. We then compare the behavior of different estimators when only the propensity score
function is correctly specified and the behavior of the estimators when the propensity score function
is misspecified. We also study the performance of the estimators when all models are
misspecified.
Wanjun Liu (Penn State)
Statistical methods for evaluating the correlation between timeline follow-back data and daily
process data: results from a randomized controlled trial
Retrospective timeline follow-back data and prospective daily process data have been frequently
collected in psychology research to characterize behavioral patterns. Although previous validity
studies have demonstrated high correlations between these two types of data, the conventional
method adopted in these studies was based on summary measures that may lose critical
information and on Pearson's correlation coefficient, which has an undesirable property. This
study introduces the functional concordance correlation coefficient to address these issues and
provides a new R package to implement it. We use real data collected from a randomized
controlled trial to demonstrate the applications of this proposed method and compare its
analytical results with those of the conventional method. The results of this real data example
indicate that the correlations estimated by the conventional method tend to be higher than those
estimated by the proposed method. A simulation study that was designed based on the observed
real data and analytical results shows that the magnitude of overestimation associated with the
conventional method is greatest when the true correlation is of medium size. The findings
of the real data example also imply that daily assessments are particularly beneficial for
characterizing more variable behaviors like alcohol use, whereas weekly assessments may be
sufficient for low variation events such as marijuana use.
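For orientation, the classical scalar concordance correlation coefficient that the functional version generalizes is

$$ \rho_c = \frac{2\,\sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2}, $$

which penalizes both lack of correlation and systematic location or scale differences between two measurements of the same quantity; broadly speaking, the functional concordance correlation coefficient applies this idea to entire trajectories rather than to summary measures.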
Xiufan Yu (Penn State)
Revisiting Sufficient Forecasting: Nonparametric Estimation and Predictive Inference
The sufficient forecasting (Fan et al., 2017) provides an effective nonparametric forecasting
procedure to estimate sufficient indices from high-dimensional predictors in the presence of a
possible nonlinear forecast function. In this paper, we first revisit the sufficient forecasting and
explore its underlying connections to Fama-MacBeth regression and partial least squares. Then,
we develop an inferential theory of sufficient forecasting within the high-dimensional framework
with large cross sections, a large time dimension and a diverging number of factors. We derive
the rate of convergence of the estimated factors and loadings and characterize the asymptotic
behavior of the estimated sufficient forecasting directions without requiring the restricted
linearity condition. The predictive inference of the estimated nonparametric forecasting function
is obtained with nonparametrically estimated sufficient indices. We further demonstrate the
power of the sufficient forecasting in an empirical study of financial markets.
Xuening Zhu (Penn State)
Network Vector Autoregression
We consider here a large-scale social network with a continuous response observed for each
node at equally spaced time points. The responses from different nodes constitute an ultra-high
dimensional vector, whose time series dynamic is to be investigated. In addition, the network
structure is also taken into consideration, for which we propose a network vector autoregressive
(NAR) model. The NAR model assumes that each node's response at a given time point is a linear
combination of (a) its previous value, (b) the average of its connected neighbors, (c) a set of
node-specific covariates, and (d) an independent noise. The corresponding coefficients are
referred to as the momentum effect, the network effect, and the nodal effect respectively.
Conditions for strict stationarity of the NAR models are obtained. In order to estimate the NAR
model, an ordinary least squares type estimator is developed, and its asymptotic properties are
investigated. We further illustrate the usefulness of the NAR model through a number of
interesting potential applications. Simulation studies and an empirical example are presented.
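In symbols, the NAR model sketched above can be written (notation ours) as

$$ Y_{it} = \beta_0 + \beta_1 Y_{i,t-1} + \beta_2\, n_i^{-1} \sum_{j=1}^{N} a_{ij} Y_{j,t-1} + Z_i^{\top}\gamma + \varepsilon_{it}, $$

where $a_{ij} = 1$ if node $i$ is connected to node $j$ (and $0$ otherwise), $n_i = \sum_j a_{ij}$, $Z_i$ collects the node-specific covariates, and $\beta_1$, $\beta_2$, and $\gamma$ are the momentum, network, and nodal effects, respectively.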
Yuan Ke (Penn State)
Robust factor models with covariates
We study factor models when the latent factors can be explained by observed covariates. With
those covariates, both the factors and loadings are identifiable up to a rotation matrix under finite
dimensions, and can be estimated with a faster rate of convergence. To incorporate the
explanatory power of these covariates, we propose a two-step estimation procedure: (i) regress
the data onto the observables, and (ii) take the principal components of the fitted data to estimate
the loadings and factors. The identification and estimation rely on the PCA properties of spiked
low-rank matrices, which refer to a class of matrices that are low-rank with fast-diverging
eigenvalues. We construct PCA estimators of the spiked low-rank matrix. The proposed
estimator is robust to possibly heavy-tailed distributions, which are encountered in many
applications for factor analysis. Empirically, our method leads to a substantial improvement on
the out-of-sample forecast on the US bond excess return data.
Yuji Samizo (Penn State)
Secure Statistical Analyses on Distributed Databases
Integrating multiple databases that are distributed among different data owners can be beneficial
in numerous contexts of biomedical research. However, the actual sharing of data is often impeded by
concerns about data confidentiality. A situation like this requires tools that can produce correct
results while preserving data privacy. In recent years, many "secure" protocols have been
proposed to solve specific statistical problems such as linear regression and classification.
However, factors such as the complexity of these protocols, inability to assess model fit, and the
lack of a platform to handle necessary data exchange have all prevented them from actually
being used in real-life situations. We present practical approaches to perform statistical analyses
securely on data held separately by multiple parties, without actually combining the data.
Zeng Li (Penn State)
On testing for high dimensional white noise
Testing for white noise is a classical yet important problem in statistics, especially for diagnostic
checks in time series modeling and linear regression. For high-dimensional time series in the
sense that the dimension p is large in relation to the sample size T, the popular
omnibus tests including the multivariate Hosking and Li-McLeod tests are extremely
conservative, leading to substantial power loss. To develop more relevant tests for highdimensional cases, we propose a portmanteau-type test statistic which is the sum of squared
singular values of the first q lagged sample autocovariance matrices. It, therefore, encapsulates
all the serial correlations (upto the time lag q) within and across all component series. Using the
tools from random matrix theory, we derive the normal limiting distributions, as both p and T
diverge to infinity, for this test statistic. As the actual implementation of the test requires the
knowledge of three characteristic constants of the population cross-sectional covariance matrix
and the value of the fourth moment of the standardized innovations, nontrivial estimators are
proposed for these parameters, and their integration leads to a practically usable test. Extensive
simulation confirms the excellent finite-sample performance of the new test with accurate size
and satisfactory power for a large range of finite (p, T) combinations, therefore ensuring wide
applicability in practice. In particular, the new tests are consistently superior to the traditional
Hosking and Li-McLeod tests.
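In symbols, the portmanteau-type statistic described above is (notation ours)

$$ T_q = \sum_{k=1}^{q} \sum_{i} \sigma_i^2\big(\widehat{\Sigma}_k\big) = \sum_{k=1}^{q} \operatorname{tr}\big(\widehat{\Sigma}_k \widehat{\Sigma}_k^{\top}\big), $$

where $\widehat{\Sigma}_k$ is the lag-$k$ sample autocovariance matrix of the $p$-dimensional series and $\sigma_i(\cdot)$ denotes a singular value; summing the squared singular values over the first $q$ lags aggregates the within- and cross-series serial correlations up to lag $q$.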
Zheye Yuan (Penn State)
Nonlinear Support Vector Machine for Multivariate Functional Data with Applications to fMRI
and EEG Data Analysis
We propose estimation procedures for the nonlinear support vector machine (SVM), where the
predictor is a vector of random functions and the response is a label. The relation between the
response and predictor can be nonlinear and the sets of observed time points can vary from
subject to subject. The functional and nonlinear nature of the problem leads to a construction of
two functional spaces: the first representing the functional data, assumed to be a Hilbert space,
and the second characterizing nonlinearity, assumed to be a reproducing kernel Hilbert space. A
particularly attractive feature of our construction is that the two spaces are nested, in the sense
that the kernel for the second space is determined by the inner product of the first. We propose
multiple estimators of varying complexities with respective effectiveness in differing situations.
We apply this method to data sets on electroencephalogram (EEG) measurements for potentially
alcoholic patients and on functional magnetic resonance imaging (fMRI) measurements for
potentially attention deficit hyperactivity disorder (ADHD) patients.