
Archived Talks and Seminars

Reverse Chronological Order

Statistics and Data Science Seminar
Wednesday, November 13, 2024; 11:00am in GMCS 405
Speaker: Dan Gillen, Department of Statistics, University of California, Irvine
Title: Censoring Robust Estimation in the Nested Case-Control Study Design with Applications to Biomarker Development in AD
Abstract

Biomarkers play a critical role in the early diagnosis of disease and can serve as targets for disease interventions. As such, biomarker discovery is of primary scientific interest in many disease settings. One example of this occurs in Alzheimer’s disease (AD), a neurodegenerative disease that affects memory, thinking, and behavior. Amyloid beta (Aβ) and phosphorylated tau (p-tau) are protein biomarkers that have become key to the early diagnosis of AD. In fact, the first novel therapy since 2003, Aduhelm, recently received accelerated approval from the US FDA based on demonstrated changes in Aβ. Despite this recent success, Aβ and p-tau are not perfect discriminators of disease and, hence, biomarker discovery for time-to-progression of disease remains a primary objective in AD research. Analysis of time-to-event data using Cox's proportional hazards (PH) model is ubiquitous in the discovery process. Most commonly, a sample is taken from the population of interest and covariate information is collected on everyone. If the event of interest is rare and it is difficult or not feasible to collect full covariate information for all study participants, the nested case-control design reduces costs with minimal impact on inferential precision. However, no work has been done to investigate the performance of the nested case-control design under model mis-specification. In this talk we show that outside of the semi-parametric PH assumption, the statistical estimand under the nested case-control design will depend not only on the censoring distribution, but also on the number of controls sampled at each event time. This is true in the case of a binary covariate when the proportional hazards assumption is not satisfied, and in the case of a continuous covariate where the functional form is mis-specified. We propose estimators that allow us to recover the statistic that would have been computed under the full cohort data as well as a censoring-robust estimator. 
Asymptotic distributional theory for the estimators is provided, along with empirical simulation results to assess finite sample properties of the estimators. We conclude with examples considering common biomarkers for AD progression using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI).
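As an illustration of the sampling scheme the design is built on (not of the estimators proposed in the talk), here is a minimal sketch of nested case-control sampling: at each observed event time, a fixed number of controls is drawn at random from the risk set. The function name and toy data are illustrative.

```python
import random

def nested_case_control(times, events, m, seed=0):
    """For each observed event, sample m controls at random from the
    risk set (subjects whose observed time is at least the event time)."""
    rng = random.Random(seed)
    sampled_sets = []
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # only events anchor a sampled risk set
        # risk set at this event time, excluding the case itself
        risk_set = [j for j in range(len(times)) if times[j] >= t_i and j != i]
        controls = rng.sample(risk_set, min(m, len(risk_set)))
        sampled_sets.append((i, controls))
    return sampled_sets

# Toy cohort: observed times with event indicators (1 = event, 0 = censored).
sets = nested_case_control([5, 3, 8, 6, 2], [1, 0, 1, 0, 0], m=2)
print(sets)
```

Varying `m`, the number of controls per event time, is exactly the design knob the talk shows the estimand depends on under model mis-specification.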

 

Statistics and Data Science Seminar
Wednesday, October 16, 2024; 11:00am in GMCS 405
Speaker: Loki Natarajan, Professor Emerita of Public Health, University of California San Diego
Title: Wearable sensors to monitor physical activity: statistical approaches and challenges
Abstract

Physical activity and sedentary behavior are known to impact health and well-being. Wearable sensors, such as accelerometers, are widely used for tracking human movement and provide estimates of activity every minute (or at even finer granularity, e.g., 30 Hz). Most research studies aggregate these activity records into daily or weekly summary statistics. However, this aggregation can result in a loss of information. Statistical methods for leveraging the full spectrum of accelerometer time series have been the focus of much recent research. In this talk, we will discuss two specific approaches, functional data analysis and machine learning classification methods, for modeling accelerometer-derived activity profiles, and demonstrate their applications in public health.
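To make the aggregation point concrete, here is a minimal sketch (on simulated data) contrasting a one-number daily summary with the minute-level functional profile that the talk's methods retain. The Poisson counts and array sizes are purely illustrative.

```python
import numpy as np

# Toy minute-level activity counts: 7 days x 1440 minutes.
rng = np.random.default_rng(0)
activity = rng.poisson(lam=5.0, size=(7, 1440)).astype(float)

# Common practice: collapse each day to a single summary statistic.
daily_totals = activity.sum(axis=1)    # one number per day

# Functional view: keep the full minute-by-minute daily curve,
# here summarized as the mean profile across days.
mean_profile = activity.mean(axis=0)   # one curve of length 1440

print(daily_totals.shape, mean_profile.shape)
```

The daily totals discard when during the day activity occurred; the 1440-point profile preserves that timing information for functional data analysis.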

 

Statistics and Data Science Seminar
Wednesday, October 2, 2024; 11:00am
Speaker: Rob Malouf, Professor of Linguistics, San Diego State University
Title: Constructions in language change: the case of English help/help to
Abstract

This talk explores the application of statistical methods to the study of language as a complex system, focusing on linguistic variation and change through the specific case of "help" and "help to" constructions in English. Viewing language as a complex adaptive system characterized by interactions between speakers, linguistic structures, and social contexts, we employ a large historical corpus and Bayesian mixed-effects regression to identify factors influencing how these constructions have changed over time. The results reveal two distinct patterns of change in the usage of these constructions, demonstrating the value of modern quantitative methods in the study of language dynamics and the mechanisms of language change.
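The talk fits a Bayesian mixed-effects regression to a historical corpus; as a much simpler stand-in, the sketch below fits a plain maximum-likelihood logistic regression to simulated variant-choice data (1 = "help to V", 0 = "help V"), just to show the kind of binary-choice model involved. The predictor, coefficient values, and data are entirely invented.

```python
import numpy as np

# Simulated binary outcome: which variant a writer chooses, with one
# predictor (think: a standardized year-of-writing covariate).
rng = np.random.default_rng(1)
year = rng.normal(size=500)
true_slope = -1.5  # illustrative: the "to" variant declining over time
p = 1.0 / (1.0 + np.exp(-(0.2 + true_slope * year)))
y = rng.binomial(1, p)

# Maximum-likelihood logistic fit by gradient ascent on the log-likelihood.
X = np.column_stack([np.ones_like(year), year])
beta = np.zeros(2)
for _ in range(2000):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += 0.01 * X.T @ (y - mu) / len(y)

print(beta)  # intercept and slope estimates
```

A mixed-effects version would add, e.g., per-verb or per-author random intercepts on top of this fixed-effects core.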

 

Statistics and Data Science Seminar
Wednesday, September 11, 2024; 11:00am
Speaker: Hajar Homayouni, Professor of Computer Science, San Diego State University
Title: Anomaly Detection and Interpretation from Tabular Data Using Transformer Architecture
Abstract

This project introduces a novel anomaly detection and interpretation approach utilizing a transformer-based architecture that reduces preprocessing needs by converting tabular data rows into a sentence-like structure and generates explanations for anomalies in the form of rule violations. Our approach consists of two main components: the Anomaly Detector and the Anomaly Interpreter. The Anomaly Detector utilizes a Transformer model with a customized embedding layer tailored for tabular data structures. While attention weights do not directly explain the model's reasoning, they can provide valuable insights when interpreted carefully. The Anomaly Interpreter uses attention weights to identify potential issues in anomalous data by comparing these weights with patterns in normal data. When the model labels a row as anomalous, the Interpreter examines closely related columns and compares their associations with those in the normal dataset using a benchmark association matrix. Deviations from typical associations are flagged as potential rule violations, highlighting unusual column pair relationships. We evaluated our approach against conventional methods, such as Multi-Layer Perceptrons (MLPs) and Long Short-Term Memory (LSTM) networks, using labeled data from the Outlier Detection DataSets (ODDS). Our evaluation, which included standard metrics along with a novel mutation analysis technique, demonstrates that our method achieves accuracy comparable to existing techniques, while additionally providing interpretations of detected anomalies.
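The exact serialization used in the project is not specified in the abstract; the sketch below shows one common way to turn a tabular row into a sentence-like token sequence for a text-style Transformer. The helper name and example row are illustrative.

```python
def row_to_sentence(row, columns):
    """Serialize one tabular row into a sentence-like token sequence,
    in the spirit of feeding tabular data to a Transformer."""
    return " ".join(f"{col} is {row[col]}" for col in columns)

row = {"age": 34, "country": "US", "amount": 120.5}
print(row_to_sentence(row, ["age", "country", "amount"]))
# -> "age is 34 country is US amount is 120.5"
```

Passing the column order explicitly keeps the serialization deterministic, so the same row always produces the same token sequence.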

 

Statistics and Data Science Seminar
Wednesday, September 4, 2024; 11:00am
Speaker: Armin Schwartzman, Professor of Statistics, UC San Diego
Title: Estimating the fraction of variance of traits explained by high-dimensional genetic predictors
Abstract

The fraction of variance explained (FVE) by a model is a measure of the total amount of information for an outcome contained in the predictor variables. In Genome-Wide Association Studies (GWAS), the FVE of a trait by the SNPs (genetic loci) in the study is called SNP heritability. Because the number of predictors is much larger than the number of subjects, a classical regression model cannot be fitted and the effects of specific loci are difficult to identify. In this talk I give an overview of the main existing methods (others' and our own) to estimate the FVE in these high-dimensional regression models.
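As a minimal illustration of the estimand itself (not of the high-dimensional estimators surveyed in the talk), the FVE can be computed directly when predictions are available: FVE = 1 - Var(residuals)/Var(outcome). The simulated data below are illustrative.

```python
import numpy as np

def fraction_variance_explained(y, y_hat):
    """FVE = 1 - Var(residuals) / Var(outcome)."""
    resid = y - y_hat
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)  # signal variance 4, noise variance 1
fve = fraction_variance_explained(y, 2.0 * x)
print(round(fve, 2))  # close to the true 4/5 = 0.8
```

The GWAS difficulty is that `y_hat` cannot be obtained this way: with far more SNPs than subjects, the regression cannot be fitted directly, which is what the specialized heritability estimators address.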

 

Statistics and Data Science Seminar
Wednesday, April 24, 2024; 11:00am
Speaker: Daniel R. Jeske, Professor and Vice Provost, University of California, Riverside
Title: On Combining Two Estimators with an Application to the Estimation of a Responder/Non-Responder Treatment Effect
Abstract

It is well known how to combine two independent unbiased estimators of the same parameter. The independence of the estimators is often satisfied because the two estimators are calculated from separate studies. In this paper, the performance of an optimal combined estimator when the estimators are correlated and biased is compared to the simple arithmetic average of the two estimators, and the effect of estimating the optimal combining weight is investigated. The advantage of combining two dependent and biased estimators is demonstrated in the context of estimating a responder/non-responder treatment effect in a randomized clinical trial.
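For the variance-minimizing combination w*T1 + (1-w)*T2 of two (possibly correlated) unbiased estimators, the classical optimal weight is w = (Var(T2) - Cov(T1,T2)) / (Var(T1) + Var(T2) - 2 Cov(T1,T2)). A quick simulation sketch, with illustrative numbers, checks that it beats the simple average; the talk's biased-estimator analysis is not reproduced here.

```python
import numpy as np

def optimal_weight(var1, var2, cov12):
    """Weight w minimizing Var(w*T1 + (1 - w)*T2)."""
    return (var2 - cov12) / (var1 + var2 - 2.0 * cov12)

# Simulate correlated estimators of the same (zero) parameter.
rng = np.random.default_rng(0)
draws = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 2.0]], size=100_000)
t1, t2 = draws[:, 0], draws[:, 1]

w = optimal_weight(1.0, 2.0, 0.3)
combined = w * t1 + (1 - w) * t2
average = 0.5 * (t1 + t2)
print(combined.var() < average.var())  # the optimal combination wins
```

In practice the variances and covariance must themselves be estimated, and the effect of plugging in those estimates is one of the questions the talk investigates.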

 

Statistics Seminar
Wednesday, March 20, 2024; 11:00am
Speaker: Xin Zhang, Assistant Professor, SDSU Computer Science Department
Title: Enabling Urban Intelligence by Harnessing Human-Generated Spatial-Temporal Data
Abstract

Technology advancements in mobile sensing and communication have enabled a massive amount of mobility data to be generated by human decision-makers, which we call human-generated spatial-temporal data (HSTD). Applying HSTD to extract the unique decision-making strategies of human agents and to design human-centered urban intelligent systems (e.g., self-driving ride services) has transformative potential. Such systems can not only promote the individual well-being of gig workers and improve the service quality and revenue of transportation service providers, but also enable downstream applications in smart transit planning, efficient gig-work dispatching, safe autonomous vehicle (AV) routing, and so on.

However, analyzing human decision strategies from HSTD is a challenging task. Human behaviors are complex and vary across geographical locations (i.e., spatial challenge), and the quality of the learned strategies also depends on the model's expressibility (i.e., theoretical challenge). In addition, practical gaps remain in leveraging human decisions for human-centered smart city services.

This talk presents an overview of my work on human behavior analysis from HSTD based on imitation learning, along with its downstream applications. The work tackles the above challenges by addressing the following research questions: (1) How can we capture unique human decision-making strategies by leveraging HSTD? (2) How can we design human-centered smart city services that leverage human decisions? In answering these questions, a series of works including cGAIL, f-GAIL, and CAC is introduced, with novel designs in problem formulation, model architecture, and algorithms. Extensive experiments support the effectiveness of the proposed models for human behavior analysis and self-driving decision-making from HSTD, and show superior performance over state-of-the-art methods.

 

Statistics Seminar
Wednesday, March 6, 2024; 11:00am
Speaker: Hossein Shirazi, Assistant Professor, SDSU Management and Information Systems
Title: Seeing Should Probably Not Be Believing: The Role of Deceptive Support in COVID-19 Misinformation on Twitter
Abstract

With the spread of SARS-CoV-2, enormous amounts of information about the pandemic are disseminated through social media platforms such as Twitter. Social media posts often leverage the trust readers have in prestigious news agencies and cite news articles as a way of gaining credibility. Nevertheless, it is not always the case that the cited article supports the claim made in the social media post. We present a cross-genre ad hoc pipeline to identify whether the information in a Twitter post (i.e., a "Tweet") is indeed supported by the cited news article. Our approach is empirically based on a corpus of over 46.86 million Tweets and is divided into two tasks: (i) developing models to detect Tweets that contain claims worth fact-checking and (ii) verifying whether the claims made in a Tweet are supported by the newswire article it cites. Unlike previous studies that detect unsubstantiated information by post hoc analysis of the patterns of propagation, we seek to identify reliable support (or the lack of it) before the misinformation begins to spread. We discover that nearly half of the Tweets (43.4%) are not factual and hence not worth checking, a significant filter given the sheer volume of social media posts on a platform such as Twitter. Moreover, we find that among the Tweets that contain a seemingly factual claim while citing a news article as supporting evidence, at least 1% are not supported by the cited news and are hence misleading.

 

Statistics Seminar
Wednesday, February 28, 2024; 11:00am
Speaker: Mary Meyer, Ph.D., Professor of Statistics, Colorado State University
Title: Applications of constrained spline density estimation
Abstract

Density estimation methods often involve kernels, but there are advantages to using splines. Especially if the shape of the density is known to be decreasing, or unimodal, or bimodal, or if the shape of the density is the research question, splines allow the shape assumptions to be readily implemented. In addition, spline estimators enjoy a faster convergence rate compared to kernel density estimators. Applications include testing unimodal versus multimodal density, estimating a deconvolution density, robust regression, and testing for sampling bias.
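The talk's constrained spline estimators are not reproduced here; as a simpler relative of shape-constrained density estimation, the sketch below projects histogram heights onto the decreasing cone with the pool-adjacent-violators algorithm (a Grenander-type estimate). The function name and toy counts are illustrative.

```python
def decreasing_density(counts, bin_width):
    """Histogram heights made monotone decreasing by pooling adjacent
    violators, then normalized to integrate to 1 (equal-width bins)."""
    blocks = []  # each block: (total count, number of pooled bins)
    for c in counts:
        blocks.append((float(c), 1))
        # pool whenever a later block's mean exceeds an earlier one's
        while len(blocks) > 1 and blocks[-1][0] / blocks[-1][1] > blocks[-2][0] / blocks[-2][1]:
            c2, n2 = blocks.pop()
            c1, n1 = blocks.pop()
            blocks.append((c1 + c2, n1 + n2))
    heights = []
    for c, nb in blocks:
        heights.extend([c / nb] * nb)
    total = sum(counts) * bin_width
    return [h / total for h in heights]

dens = decreasing_density([5, 7, 3, 1], bin_width=0.5)
print(dens)  # non-increasing, integrates to 1
```

Splines improve on this step-function estimate by giving a smooth density with the same kind of shape constraint built in.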

 

Statistics Seminar
Tuesday, February 13, 2024; 11:00am
Speaker: Veronica Berrocal, Ph.D., Statistics, UC Irvine
Title: Bayesian non-stationary spatial modeling using shrinkage priors
Abstract

A spatial statistical analysis often starts with a decision about how to model the spatial dependence structure: can the spatial process be thought of as stationary or non-stationary? While most parametric covariance functions assume stationarity, in the case of non-stationarity a modeling choice could be to envision the process as globally non-stationary but locally stationary. A drawback of this choice is that identifying regions of stationarity remains challenging, at least from a computational point of view. In this talk, we present two approaches that allow us to identify regions of local stationarity by redefining and repurposing the Multi-Resolution Approximation (MRA) of Katzfuss (2017), which was introduced to lessen the computational burden encountered when analyzing massive data. Both methods use the representation of the spatial process as a linear combination of appropriate basis functions (the MRA basis functions), but differ in the shrinkage prior specification adopted for the basis function weights. Inference on the basis function weights, and on the spatial variability in the number of levels of resolution needed, provides information on whether the process can be considered stationary or not. We showcase the ability of these methods to correctly capture regions of local stationarity through simulation experiments. We also apply them to identify regions with different strengths of spatial dependence for two soil-related variables that are very important for climate science: soil organic carbon and soil moisture.
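To make the basis-function representation concrete, here is a toy one-dimensional sketch of a process written as a linear combination of local basis functions, with some weights shrunk toward zero so the process is nearly flat (locally simpler) in part of the domain. It is only a cartoon of the MRA idea; all numbers, bump widths, and knot placements are illustrative.

```python
import numpy as np

# A 1-D "spatial" process as a linear combination of local Gaussian bumps.
grid = np.linspace(0.0, 1.0, 200)
knots = np.linspace(0.0, 1.0, 10)
basis = np.exp(-0.5 * ((grid[:, None] - knots[None, :]) / 0.08) ** 2)

rng = np.random.default_rng(0)
weights = rng.normal(size=10)
# Shrinking the weights over half the domain toward zero mimics a
# shrinkage prior "turning off" local structure there.
weights[5:] *= 0.05
process = basis @ weights
print(process.shape)
```

In the talk's setting, posterior inference on which weights are shrunk away is what flags regions where the process behaves as locally stationary.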

 

Statistics Seminar
Wednesday, November 29, 2023; 11:00am
Speaker: Professor Qingyun (Serena) Zhu, Management Information Systems, SDSU
Title: How Loud is Consumer Voice in Product Deletion Decisions? Retail Analytic Insights
Abstract

This study examines the role of online consumer reviews in product deletion decisions. Building upon product portfolio management theory, we integrate consumer voice, represented by online consumer review behavior, into strategic product deletion decision-making. The study also informs demand management: findings suggest that products with lower attribute ratings, and with comments having less relevance to lower-ranked attributes, are more likely to be deleted. The linguistic retail-analytic characteristics of online reviews also provide insights for product deletion decisions: products whose reviews have higher subjectivity, shorter length, and lower readability are more likely to be deleted. Pre-purchase consumer voice, the perceived helpfulness or unhelpfulness of online reviews, also informed product deletion decisions. A general conclusion is that online reviews can provide important retail analytics for smarter retail operations planning at the strategic and tactical levels when it comes to product planning and portfolio management through product deletion.

 

Statistics Seminar
Wednesday, November 1, 2023; 11:00am
Speaker: Dr. Johanna Hardin, Mathematics and Statistics Department, Pomona College
Title: Technical Conditions in Normalizing ChIP-Seq Data
Abstract

ChIP-Seq (Chromatin immunoprecipitation followed by sequencing) data is widely used for studying the behavior of genome-wide protein-DNA interactions. One important biological question addressed by ChIP-Seq experiments is: does the amount of bound protein (at a particular region on the genome) change for different experimental conditions, i.e., is the region differentially bound? Standard statistical methods for finding differentially bound regions derive from well-known two sample tests (think: t-test), but statistical analyses require samples to be pre-normalized. In this talk, I will discuss the challenge of normalization, the methods for normalizing, and the technical conditions required for the normalization method to work. Simulation studies back up our work on deriving technical conditions from the format of the normalization methods.
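The normalization methods analyzed in the talk are not reproduced here; the sketch below shows only the simplest total-count (library-size) scaling, the kind of pre-normalization step whose technical conditions are at issue. The function name and toy counts are illustrative.

```python
import numpy as np

def library_size_normalize(counts):
    """Scale each sample (row) so all samples have the same total read
    count: one simple pre-normalization used before two-sample tests."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals * totals.mean()

raw = np.array([[10.0, 30.0, 60.0],    # sample sequenced deeply
                [ 5.0, 15.0, 30.0]])   # same composition, half the depth
norm = library_size_normalize(raw)
print(norm)  # the two rows now agree region by region
```

When samples differ only in sequencing depth, this scaling makes them comparable; the talk asks under what technical conditions such schemes are actually valid.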

 

Statistics Seminar
Wednesday, October 11, 2023; 11:00am
Speaker: Dr. Ronghui (Lily) Xu, Department of Mathematics and School of Public Health, UC San Diego
Title: Doubly robust estimation for time-to-event outcomes
Abstract

In this talk we review our work on doubly robust estimation for time-to-event outcomes, including the popular marginal structural Cox model and dependently left-truncated data. A common theme across these works is the well-known semiparametric theory, and a notable feature is rate double robustness, which allows machine learning or nonparametric approaches to be applied in order to estimate the nuisance parameters or functions. The latter circumvents compatibility issues surrounding nonlinear models such as the proportional hazards model. Our main estimand of interest is a treatment effect, with or without randomization.
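The talk's estimators target censored time-to-event outcomes; as a sketch of the doubly robust idea in the simpler uncensored setting, here is the augmented inverse-probability-weighted (AIPW) estimator of a mean difference, shown recovering the truth under a deliberately misspecified outcome model. The simulated data and effect size are illustrative.

```python
import numpy as np

def aipw(y, z, e_hat, m1_hat, m0_hat):
    """Augmented IPW (doubly robust) estimate of E[Y(1)] - E[Y(0)]
    for a binary treatment z, given propensity and outcome models."""
    mu1 = np.mean(z * (y - m1_hat) / e_hat + m1_hat)
    mu0 = np.mean((1 - z) * (y - m0_hat) / (1 - e_hat) + m0_hat)
    return mu1 - mu0

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))             # true propensity score
z = rng.binomial(1, e)
y = 1.0 * z + x + rng.normal(size=n)     # true treatment effect = 1

# Correct propensity model, deliberately wrong (all-zero) outcome model:
est = aipw(y, z, e_hat=e, m1_hat=np.zeros(n), m0_hat=np.zeros(n))
print(round(est, 2))  # near 1.0 despite the misspecified outcome model
```

The estimator is consistent if either the propensity model or the outcome model is correct; the symmetric check (wrong propensity, correct outcome model) works the same way.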

 

Statistics Seminar
Wednesday, September 27, 2023; 11:00am
Speaker: Dr. Zhe Fei, Department of Statistics, UC Riverside
Title: U-learning for Prediction Inference: With Applications to LASSO and Deep Neural Networks
Abstract

Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional predictors, presents challenges. We introduce a new U-learning approach for making ensemble predictions and constructing prediction intervals for continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the Hajek projection to derive asymptotic properties and yield consistent variance estimates. We apply our approach with two commonly used predictive algorithms, the Lasso and deep neural networks (DNNs), and illustrate valid prediction inference with extensive numerical examples. We also apply the methods to predict the DNA methylation age of patients from different tissue samples, which may better characterize the aging process and inform anti-aging interventions.
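The Hajek-projection variance theory is not reproduced here; the sketch below shows only the ensemble mechanics of averaging base predictions, each fit on a random subsample, with the spread across members as a crude variance estimate. The base learner (a least-squares line), function names, and all numbers are illustrative.

```python
import numpy as np

def ensemble_predict(x_train, y_train, x_new, n_sub, size, seed=0):
    """U-statistic-style ensemble: average many base predictions, each
    fit on a random size-`size` subsample without replacement."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_sub):
        idx = rng.choice(len(y_train), size=size, replace=False)
        # base learner: simple least-squares line on the subsample
        slope, intercept = np.polyfit(x_train[idx], y_train[idx], 1)
        preds.append(intercept + slope * x_new)
    preds = np.array(preds)
    return preds.mean(), preds.var()

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 300)
y = 2.0 * x + rng.normal(scale=0.1, size=300)
mean_pred, var_pred = ensemble_predict(x, y, 0.5, n_sub=200, size=50)
print(round(mean_pred, 2))  # near the true value 2.0 * 0.5 = 1.0
```

Framing these overlapping-subsample averages as generalized U-statistics is what lets the talk derive valid prediction intervals rather than the naive spread used here.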

 

Statistics Seminar
Wednesday, September 13, 2023; 11:00am
Speaker: Dr. Xin Wang, Assistant Professor, Department of Mathematics and Statistics, SDSU
Title: Clustered coefficient regression models for Poisson process with an application to seasonal warranty claim data
Abstract

Motivated by a product warranty claims data set, we propose clustered coefficient regression models in a non-homogeneous Poisson process for recurrent event data. The proposed method, referred to as CLUPP, can estimate the group structure and the parameters simultaneously, using a penalized regression approach to identify the group structure. Numerical studies show that the proposed approach identifies the group structure well and outperforms traditional methods such as hierarchical clustering and K-means. We also establish theoretical properties, showing that the proposed estimators converge to the true parameters with high probability. Finally, we apply the proposed methods to the product warranty claims data set, where they achieve better prediction than state-of-the-art methods.
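CLUPP itself learns the groups through a penalized regression; the sketch below implements only the K-means baseline the abstract compares against, applied to noisy scalar coefficient estimates drawn from two latent groups. The helper name and all numbers are illustrative.

```python
import numpy as np

def kmeans_1d(values, k=2, n_iter=50):
    """Tiny k-means for grouping scalar coefficient estimates."""
    values = np.asarray(values, dtype=float)
    centers = np.quantile(values, np.linspace(0.1, 0.9, k))
    for _ in range(n_iter):
        # assign each value to its nearest center, then recenter
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels, centers

# Noisy per-unit coefficient estimates from two latent groups.
rng = np.random.default_rng(0)
coefs = np.concatenate([rng.normal(0.5, 0.1, 20), rng.normal(2.0, 0.1, 20)])
labels, centers = kmeans_1d(coefs, k=2)
print(np.sort(centers))  # near the true group means 0.5 and 2.0
```

A two-step pipeline like this (estimate coefficients, then cluster them) is what CLUPP's simultaneous estimation is designed to improve on.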