Archived Talks and Seminars
Reverse Chronological Order
Statistics and Data Science Seminar
Wednesday, September 11, 2024; 11:00am
Speaker | Hajar Homayouni, Professor of Computer Science, San Diego State University
Title | Anomaly Detection and Interpretation from Tabular Data Using Transformer Architecture
Abstract | This project introduces a novel anomaly detection and interpretation approach utilizing a transformer-based architecture that reduces preprocessing needs by converting tabular data rows into a sentence-like structure and generates explanations for anomalies in the form of rule violations. Our approach consists of two main components: the Anomaly Detector and the Anomaly Interpreter. The Anomaly Detector utilizes a Transformer model with a customized embedding layer tailored for tabular data structures. While attention weights do not directly explain the model's reasoning, they can provide valuable insights when interpreted carefully. The Anomaly Interpreter uses attention weights to identify potential issues in anomalous data by comparing these weights with patterns in normal data. When the model labels a row as anomalous, the Interpreter examines closely related columns and compares their associations with those in the normal dataset using a benchmark association matrix. Deviations from typical associations are flagged as potential rule violations, highlighting unusual column pair relationships. We evaluated our approach against conventional methods, such as Multi-Layer Perceptrons (MLPs) and Long Short-Term Memory (LSTM) networks, using labeled data from the Outlier Detection DataSets (ODDS). Our evaluation, which included standard metrics along with a novel mutation analysis technique, demonstrates that our method achieves accuracy comparable to existing techniques, while additionally providing interpretations of detected anomalies.
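As a rough illustration of two ideas mentioned in the abstract (serializing a tabular row into a sentence-like sequence, and flagging column pairs whose attention-derived associations deviate from a benchmark association matrix built from normal data), here is a minimal Python sketch. All column names, matrices, and thresholds are hypothetical assumptions for illustration, not the speaker's implementation.

    import numpy as np

    def row_to_sentence(row):
        # Serialize a tabular row (dict of column -> value) into "column=value"
        # tokens, a common way to feed tabular records to a transformer model.
        return [f"{col}={val}" for col, val in row.items()]

    def flag_rule_violations(assoc, benchmark, columns, threshold=0.2):
        # assoc: column-by-column association matrix derived from attention
        #        weights for the row flagged as anomalous
        # benchmark: association matrix estimated from normal data
        # Flags column pairs whose association deviates strongly from normal.
        deviation = np.abs(assoc - benchmark)
        flags = []
        for i in range(len(columns)):
            for j in range(i + 1, len(columns)):
                if deviation[i, j] > threshold:
                    flags.append((columns[i], columns[j], float(deviation[i, j])))
        return flags

    # Hypothetical example with three columns
    columns = ["age", "income", "purchases"]
    print(row_to_sentence({"age": 23, "income": 250000, "purchases": 1}))
    benchmark = np.array([[1.0, 0.6, 0.4], [0.6, 1.0, 0.5], [0.4, 0.5, 1.0]])
    assoc = np.array([[1.0, 0.1, 0.4], [0.1, 1.0, 0.5], [0.4, 0.5, 1.0]])
    print(flag_rule_violations(assoc, benchmark, columns))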
Statistics and Data Science Seminar
Wednesday, September 4, 2024; 11:00am
Speaker | Armin Schwartzman, Professor of Statistics, UC San Diego
Title | Estimating the fraction of variance of traits explained by high-dimensional genetic predictors
Abstract | The fraction of variance explained (FVE) by a model is a measure of the total amount of information for an outcome contained in the predictor variables. In Genome-Wide Association Studies (GWAS), the FVE of a trait by the SNPs (genetic loci) in the study is called SNP heritability. Because the number of predictors is much larger than the number of subjects, a classical regression model cannot be fitted and the effects of specific loci are difficult to identify. In this talk I give an overview of the main existing methods (others' and our own) to estimate the FVE in these high-dimensional regression models.
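As background on the estimand, the fraction of variance explained in a linear model y = X beta + epsilon can be written (a generic definition, not any particular estimator from the talk) as

    \mathrm{FVE} \;=\; \frac{\operatorname{Var}(X\beta)}{\operatorname{Var}(y)} \;=\; 1 - \frac{\operatorname{Var}(\varepsilon)}{\operatorname{Var}(y)},

and in GWAS, with X the SNP genotypes, this quantity is what is meant by SNP heritability.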
Statistics and Data Science Seminar
Wednesday, April 24, 2024; 11:00am
Speaker | Daniel R. Jeske, Professor and Vice Provost, University of California, Riverside
Title | On Combining Two Estimators with an Application to the Estimation of a Responder/Non-Responder Treatment Effect
Abstract | It is well known how to combine two independent unbiased estimators of the same parameter. The independence of the estimators often follows from the fact that the two estimators are calculated from separate studies. In this paper, the performance of an optimal combined estimator when the estimators are correlated and biased is compared to the simple arithmetic average of the two estimators, and the effect of estimating the optimal combining weight is investigated. The advantage of combining two dependent and biased estimators is demonstrated in the context of estimating a responder/non-responder treatment effect in a randomized clinical trial.
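The classical background result referenced in the first sentence can be stated as follows: for two unbiased estimators T1 and T2 of a parameter theta, with variances sigma_1^2 and sigma_2^2 and covariance sigma_12, the combined estimator w*T1 + (1-w)*T2 has minimum variance at

    w^{*} \;=\; \frac{\sigma_2^{2} - \sigma_{12}}{\sigma_1^{2} + \sigma_2^{2} - 2\sigma_{12}},

which reduces to inverse-variance weighting, w* = sigma_2^2 / (sigma_1^2 + sigma_2^2), when the estimators are independent. The talk concerns what happens when bias and correlation are present and when the optimal weight must itself be estimated.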
Statistics Seminar
Wednesday, March 20, 2024; 11:00am
Speaker | Xin Zhang, Assistant Professor, SDSU Computer Science Department
Title | Enabling Urban Intelligence by Harnessing Human-Generated Spatial-Temporal Data
Abstract | Advances in mobile sensing and communication technology have enabled a massive amount of mobility data to be generated by human decision-makers, which we call human-generated spatial-temporal data (HSTD). Applying HSTD to extract the unique decision-making strategies of human agents and to design human-centered urban intelligent systems (e.g., self-driving ride services) has transformative potential. Such systems can not only promote the individual well-being of gig workers and improve the service quality and revenue of transportation service providers, but also enable downstream applications in smart transit planning, efficient gig-work dispatching, safe autonomous vehicle (AV) routing, and so on. However, analyzing human decision strategies from HSTD is a challenging task. Human behaviors are complex and vary across geographical locations (the spatial challenge), and the quality of the learned strategies depends on the expressiveness of the model (the theoretical challenge). In addition, there remain practical gaps in leveraging human decisions for human-centered smart cities. This talk presents an overview of my work on human behavior analysis from HSTD based on imitation learning, together with its downstream applications. These works tackle the above challenges by addressing two research questions: (1) how can the unique decision-making strategies of humans be captured from HSTD, and (2) how can human-centered smart city services be designed to leverage those decisions? In answering these questions, a series of works including cGAIL, f-GAIL, and CAC is introduced, with novel designs in problem formulation, model architecture, and algorithms. Extensive experiments demonstrate the effectiveness of the proposed models for human behavior analysis and self-driving decision-making from HSTD, with superior performance over state-of-the-art methods.
Statistics Seminar
Wednesday, March 6, 2024; 11:00am
Speaker | Hossein Shirazi, Assistant Professor, SDSU Management and Information Systems
Title | Seeing Should Probably Not Be Believing: The Role of Deceptive Support in COVID-19 Misinformation on Twitter
Abstract | With the spread of SARS-CoV-2, enormous amounts of information about the pandemic have been disseminated through social media platforms such as Twitter. Social media posts often leverage the trust readers have in prestigious news agencies and cite news articles as a way of gaining credibility. Nevertheless, it is not always the case that the cited article supports the claim made in the social media post. We present a cross-genre ad hoc pipeline to identify whether the information in a Twitter post (i.e., a "Tweet") is indeed supported by the cited news article. Our approach is empirically based on a corpus of over 46.86 million Tweets and is divided into two tasks: (i) developing models to detect Tweets that contain claims and are worth fact-checking, and (ii) verifying whether the claims made in a Tweet are supported by the newswire article it cites. Unlike previous studies that detect unsubstantiated information by post hoc analysis of propagation patterns, we seek to identify reliable support (or the lack of it) before the misinformation begins to spread. We discover that nearly half of the Tweets (43.4%) are not factual and hence not worth checking, which is a significant filter given the sheer volume of social media posts on a platform such as Twitter. Moreover, we find that among the Tweets that contain a seemingly factual claim while citing a news article as supporting evidence, at least 1% are not supported by the cited news and are hence misleading.
Statistics Seminar
Wednesday, February 28, 2024; 11:00am
Speaker | Mary Meyer, Ph.D., Professor of Statistics, Colorado State University
Title | Applications of constrained spline density estimation
Abstract | Density estimation methods often involve kernels, but there are advantages to using splines. In particular, if the shape of the density is known to be decreasing, unimodal, or bimodal, or if the shape of the density is itself the research question, splines allow the shape assumptions to be imposed readily. In addition, spline estimators enjoy a faster convergence rate than kernel density estimators. Applications include testing a unimodal versus a multimodal density, estimating a deconvolution density, robust regression, and testing for sampling bias.
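A schematic version of how a shape constraint can enter a spline density estimate (our notation, not necessarily the speaker's formulation) is to model the log-density with a spline basis B_1, ..., B_k and impose the shape as linear inequality constraints on the coefficients,

    \log f(x) \;=\; \sum_{j=1}^{k} \beta_j B_j(x) - c(\beta), \qquad \text{subject to } A\beta \ge 0,

where c(beta) normalizes f to integrate to one and the rows of A encode the shape, for example a nonpositive derivative at the knots for a decreasing density. The constrained estimate maximizes the log-likelihood over this constraint set.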
Statistics Seminar
Tuesday, February 13, 2024; 11:00am
Speaker | Veronica Berrocal, Ph.D., Statistics, UC Irvine
Title | Bayesian non-stationary spatial modeling using shrinkage priors
Abstract | A spatial statistical analysis typically starts with a decision regarding how to model the spatial dependence structure: can the spatial process be thought of as stationary or non-stationary? While most parametric covariance functions assume stationarity, in the non-stationary case one modeling choice is to envision the process as globally non-stationary but locally stationary. A drawback of this choice is that identifying regions of non-stationarity remains challenging, at least from a computational point of view. In this talk, we present two approaches that identify regions of local stationarity by redefining and repurposing the Multi-Resolution Approximation (MRA) of Katzfuss (2017), which was introduced to lessen the computational burden of analyzing massive spatial data. Both methods represent the spatial process as a linear combination of appropriate basis functions (the MRA basis functions), but differ in the shrinkage prior specification adopted for the basis function weights. Inference on the basis function weights and on the spatial variability in the number of resolution levels needed provides information on whether the process can be considered stationary or not. We showcase the ability of these methods to correctly capture regions of local stationarity through simulation experiments. We also apply them to identify regions with different strengths of spatial dependence for two soil-related variables that are very important for climate science: soil organic carbon and soil moisture.
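A schematic version of the model class being described (our notation, hedged) is a basis-function expansion with shrinkage priors on the multi-resolution weights,

    y(s) \;=\; \sum_{m=1}^{M} \sum_{j} w_{m,j}\,\varphi_{m,j}(s) + \varepsilon(s), \qquad w_{m,j} \sim N\!\left(0,\; \tau_m^{2}\lambda_{m,j}^{2}\right),

where phi_{m,j} are MRA basis functions at resolution level m and the global and local shrinkage parameters tau_m and lambda_{m,j} can effectively switch off the finer resolutions in subregions where the process behaves as stationary.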
Statistics Seminar
Wednesday, November 29, 2023; 11:00am
Speaker | Professor Qingyun (Serena) Zhu, Management Information Systems, SDSU
Title | How Loud is Consumer Voice in Product Deletion Decisions? Retail Analytic Insights
Abstract | This study examines the role of online consumer reviews in product deletion decisions. Building upon product portfolio management theory, we integrate consumer voice, represented by online consumer review behavior, into organizational voice, namely strategic product deletion decision-making. The study also informs demand management: findings suggest that products with lower attribute ratings, and with comments that are less relevant to lower-ranked attributes, are more likely to be deleted. The linguistic retail analytic characteristics of online reviews also provide insights for product deletion decisions: products whose reviews have higher subjectivity, shorter length, and lower readability are more likely to be deleted. Pre-purchase consumer voice, i.e., the perceived helpfulness or unhelpfulness of online reviews, also informs product deletion decisions. A general conclusion is that online reviews can provide important retail analytics for smarter retail operations planning at the strategic and tactical levels when it comes to product planning and portfolio management through product deletion.
Statistics Seminar
Wednesday, November 1, 2023; 11:00am
Speaker | Dr. Johanna Hardin, Mathematics and Statistics Department, Pomona College
Title | Technical Conditions in Normalizing ChIP-Seq Data
Abstract | ChIP-Seq (chromatin immunoprecipitation followed by sequencing) data are widely used for studying the behavior of genome-wide protein-DNA interactions. One important biological question addressed by ChIP-Seq experiments is: does the amount of bound protein (at a particular region of the genome) change across experimental conditions, i.e., is the region differentially bound? Standard statistical methods for finding differentially bound regions derive from well-known two-sample tests (think: t-test), but these analyses require the samples to be normalized first. In this talk, I will discuss the challenge of normalization, the methods for normalizing, and the technical conditions required for a normalization method to work. Simulation studies support our derivations of the technical conditions from the form of the normalization methods.
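As a toy illustration of why normalization must precede the two-sample comparison (a generic library-size scaling on simulated counts, not one of the specific normalization methods analyzed in the talk), consider the following Python sketch, in which one condition was simply sequenced more deeply:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Hypothetical read counts over 100 regions, 3 replicates per condition.
    # Condition B has the same biology but twice the sequencing depth.
    counts_a = rng.poisson(lam=50, size=(3, 100))
    counts_b = rng.poisson(lam=100, size=(3, 100))

    def library_size_normalize(counts):
        # Rescale each replicate so its total count equals the mean library size.
        lib_sizes = counts.sum(axis=1, keepdims=True)
        return counts * lib_sizes.mean() / lib_sizes

    norm_all = library_size_normalize(np.vstack([counts_a, counts_b]))
    norm_a, norm_b = norm_all[:3], norm_all[3:]

    # Per-region t-tests: raw counts make almost every region look
    # differentially bound; normalized counts do not.
    raw_p = stats.ttest_ind(counts_a, counts_b, axis=0).pvalue
    norm_p = stats.ttest_ind(norm_a, norm_b, axis=0).pvalue
    print((raw_p < 0.05).mean(), (norm_p < 0.05).mean())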
Statistics Seminar
Wednesday, October 11, 2023; 11:00am
Speaker | Dr. Ronghui (Lily) Xu, Department of Mathematics and School of Public Health, UC San Diego
Title | Doubly robust estimation for time-to-event outcomes
Abstract | In this talk we review our work on doubly robust estimation for time-to-event outcomes, including the popular marginal structural Cox model and dependently left-truncated data. A common theme across these works is the well-known semiparametric theory, and a notable feature is rate double robustness, which allows machine learning or nonparametric approaches to be applied to estimate the nuisance parameters or functions. The latter circumvents compatibility issues surrounding nonlinear models such as the proportional hazards model. Our main estimand of interest is a treatment effect, with or without randomization.
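For orientation only, the double-robustness idea is easiest to state in the simpler uncensored, point-treatment setting (background, not the time-to-event estimators from the talk). With binary treatment A, outcome Y, propensity score e(X), and outcome regression m_1(X) = E[Y | A = 1, X], the augmented inverse-probability-weighted estimator of the treated-arm mean,

    \hat{\mu}_1 \;=\; \frac{1}{n}\sum_{i=1}^{n}\left[\hat{m}_1(X_i) + \frac{A_i}{\hat{e}(X_i)}\bigl(Y_i - \hat{m}_1(X_i)\bigr)\right],

is consistent if either nuisance model is correctly specified, and its bias involves the product of the two nuisance estimation errors. Rate double robustness refers to this product structure, which allows both nuisances to be estimated at slower, machine-learning rates while the treatment-effect estimator remains root-n consistent.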
Statistics Seminar
Wednesday, September 27, 2023; 11:00am
Speaker | Dr. Zhe Fei, Department of Statistics, UC Riverside
Title | U-learning for Prediction Inference: With Applications to LASSO and Deep Neural Networks
Abstract | Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (cytosine-phosphate-guanine) sites within the genome. However, making valid inferences on predicted epigenetic ages, or more broadly on predictions derived from high-dimensional predictors, presents challenges. We introduce a new U-learning approach for making ensemble predictions and constructing prediction intervals for continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the Hajek projection to derive their asymptotic properties and yield consistent variance estimates. We apply our approach with two commonly used predictive algorithms, the Lasso and deep neural networks (DNNs), and illustrate valid prediction inference with extensive numerical examples. We also apply the methods to predict the DNA methylation age of patients from different tissue samples, which may help characterize the aging process and inform anti-aging interventions.
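Schematically (our notation; a generic subsample-ensemble form rather than the exact estimator in the talk), an ensemble prediction of this kind can be written as a generalized U-statistic over size-m subsamples S of the n training observations,

    \hat{y}(x) \;=\; \binom{n}{m}^{-1} \sum_{|S| = m} f_{\hat{\theta}(S)}(x),

where f_{theta-hat(S)} is the base predictor (e.g., the Lasso or a DNN) fitted on subsample S. The Hajek projection of this statistic onto individual observations yields an asymptotic normal approximation and a consistent variance estimate, which is what makes prediction intervals possible without classical parametric asymptotics.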
Statistics Seminar
Wednesday, September 13, 2023; 11:00am
Speaker | Dr. Xin Wang, Assistant Professor, Department of Mathematics and Statistics, SDSU
Title | Clustered coefficient regression models for Poisson process with an application to seasonal warranty claim data
Abstract | Motivated by a product warranty claims data set, we propose clustered coefficient regression models in a non-homogeneous Poisson process for recurrent event data. The proposed method, referred to as CLUPP, estimates the group structure and the model parameters simultaneously, using a penalized regression approach to identify the group structure. Numerical studies show that the proposed approach identifies the group structure well and outperforms traditional methods such as hierarchical clustering and K-means. We also establish theoretical properties showing that the proposed estimators converge to the true parameters with high probability. Finally, we apply the proposed method to the product warranty claims data set, where it achieves better prediction than state-of-the-art methods.
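A schematic version of the estimation idea (our notation, hedged; details differ in the paper) is a penalized likelihood for the non-homogeneous Poisson process with subject-specific coefficients that are fused toward one another,

    \max_{\beta_1,\dots,\beta_n}\; \sum_{i=1}^{n} \ell_i(\beta_i) \;-\; \sum_{i<j} p_{\lambda}\!\bigl(\lVert \beta_i - \beta_j \rVert\bigr),

where l_i is the NHPP log-likelihood for subject i's recurrent events and p_lambda is a concave fusion penalty (e.g., SCAD or MCP). Coefficient vectors driven to equality by the penalty define the estimated groups, which is how the group structure and the parameters can be obtained simultaneously.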