Welcome to the Library

Archived Talks and Seminars

Reverse Chronological Order

Statistics and Data Science Seminar
Wednesday, September 10, 2025; 11:00am
Speaker: Joann Chen, Department of Computer Science, SDSU
Title: Envisioning the Future of Digital Privacy
Abstract

In today's digital world, personal data represents both immense opportunity and significant risk. When used responsibly, it can drive breakthroughs in areas such as healthcare and finance, yet its misuse can result in severe privacy breaches. This talk will examine the evolving landscape of data privacy, highlighting Differential Privacy (DP) as a promising approach that offers strong, provable guarantees. It will further explore privacy risks in machine learning and the principles of DP-aware system design, with particular focus on the challenges of integrating DP into diverse real-world systems.
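
As a minimal sketch of the kind of guarantee DP provides (illustrative only; not code from the talk), the classic Laplace mechanism releases a counting query with ε-differential privacy by adding noise calibrated to the query's sensitivity:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value with epsilon-DP by adding Laplace noise
    with scale sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
ages = np.array([34, 45, 29, 61, 52])
# A counting query has sensitivity 1: adding or removing one person
# changes the count by at most 1.
noisy_count = laplace_mechanism((ages > 40).sum(), sensitivity=1.0,
                                epsilon=0.5, rng=rng)
print(f"noisy count of people over 40: {noisy_count:.2f}")
```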


Statistics and Data Science Seminar
Wednesday, September 3, 2025; 11:00am
Speaker: Xiyue Liao, Department of Mathematics and Statistics, SDSU
Title: A Framework for Comprehensive Model and Variable Selection
Abstract

We propose a framework for choosing variables and relationships without assuming additivity or parametric forms. The relationships between the response and each of the continuous predictors are modelled with regression splines and assumed to be smooth and one of the following: increasing, decreasing, convex, concave, or a combination of monotonicity and convexity. The resulting eight shapes cover a wide range of popular parametric functions, such as linear, quadratic, and exponential, and the set of choices is appropriate if the component functions "do not wiggle." An ordinal predictor can be assigned a set of possible orderings, such as increasing, decreasing, tree or umbrella orderings, no ordering, or constant. Interactions between continuous predictors are modelled as multi-dimensional warped-plane spline surfaces, where the same possibilities for shapes are considered. We propose combining stepwise selection methods with information criteria, LASSO-type ideas, and model selection using a genetic algorithm.
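
As a minimal illustration of shape-restricted fitting (not the talk's spline framework), scikit-learn's isotonic regression enforces the simplest of the eight shapes, monotone increasing, without assuming any parametric form; the data below are synthetic:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
# Smooth, increasing truth plus noise ("does not wiggle").
y = np.log1p(x) + rng.normal(scale=0.2, size=x.size)

# Fit under only the shape constraint "nondecreasing in x".
fit = IsotonicRegression(increasing=True).fit(x, y)
y_hat = fit.predict(x)  # monotone fit through the data
```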


Statistics and Data Science Seminar
Wednesday, April 23, 2025; 11:00am
Speakers: Colleen Kelly, PhD, Amy Zimmer, and Sarah Blatt, Kelly Statistical Consulting
Title: A Day in the Life of a Statistical Consultant
Abstract

Kelly Statistical Consulting, founded in 2009, has served local, national, and international clients for over 15 years. This talk offers a glimpse into the dynamic and diverse work of a statistical consultant. Day-to-day activities range from team meetings to discuss project objectives, to analyzing complex datasets, to developing statistical models. Consultants tackle a range of intellectual challenges, working collaboratively to solve problems. We will discuss how our team ensures accurate results, balances multiple projects, communicates findings to various audiences, and keeps up with the latest programming and statistical techniques. Two example projects will be presented to illustrate the diversity of statistical applications. Additionally, we will share our opinions on the skills the job requires and on the best and most challenging aspects of our work.


Statistics and Data Science Seminar
Wednesday, April 9, 2025; 10:30am
Speaker: Amber Puha, Department of Mathematics, Cal State San Marcos
Title: Stationary Behavior of a Diffusion Limit for SRPT Queues with Heavy-Tailed Processing Time Distributions
Abstract

In this talk, we explore the stationary behavior of diffusion limits in shortest remaining processing time (SRPT) queues with heavy-tailed processing times. The SRPT state descriptor is a measure-valued process that at each time has a unit mass at the remaining processing time of each job in the system. Banerjee, Budhiraja, and Puha (2022) have shown that, under proper scaling, this state descriptor converges to a measure-valued stochastic process, characterized by workload processes modeled as reflecting coupled Brownian motions with a specific negative drift function. Motivated by the form of the limit, we study reflecting coupled Brownian motions with a general nondecreasing negative drift function. We analyze the stationary behavior of the distribution of the resulting measure-valued process and its moments through the maximum process of the coupled Brownian motions. Additionally, we derive joint distributions and the covariance structure of the maximum process, offering new insights into the stationary distribution of the queue length in SRPT systems.
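
A minimal sketch of the key object, under a simplifying assumption (a single Brownian motion with constant negative drift rather than the talk's coupled system with a drift function): reflection at zero via the Skorokhod map, together with the running maximum process:

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 10.0, 10_000
dt = T / n
drift = -0.5  # constant stand-in for the talk's negative drift function

# Euler scheme for W(t) = drift * t + B(t).
dW = drift * dt + np.sqrt(dt) * rng.normal(size=n)
W = np.concatenate([[0.0], np.cumsum(dW)])

# Skorokhod reflection at zero: X(t) = W(t) - min(0, inf_{s<=t} W(s)).
X = W - np.minimum(0.0, np.minimum.accumulate(W))  # reflected path, X >= 0
M = np.maximum.accumulate(W)                       # running maximum process
```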


Statistics and Data Science Seminar
Wednesday, March 19, 2025; 11:00am
Speaker: Adir Mancebo, Science Manager, Data Science Alliance
Title: Introduction to the Framework for Responsible Data Science Practices
Abstract

This seminar introduces the Data Science Alliance's Framework for Responsible Data Science Practices, a comprehensive guide developed in collaboration with experienced data professionals from various industries and academia. The presentation establishes fundamental concepts of responsible data science and presents the four guiding principles that form its foundation. It illustrates the importance of ethical considerations in data science and AI applications, demonstrating how these powerful tools can impact society. Through an overview of the data project lifecycle and examination of real-world case studies, the session highlights the consequences of neglecting responsible practices. Interactive discussions encourage participants to consider practical applications of responsible data science principles across different project phases. Attendees will gain valuable insights into implementing responsible practices that maximize benefits while minimizing potential harm in their data science initiatives.


Statistics and Data Science Seminar
Wednesday, March 5, 2025; 11:00am
Speaker: Victoria Delaney, San Diego State University
Title: Using Statistics to Understand AI, and Using AI to Understand Statistics: Teaching and Learning in the Age of AI
Abstract

Today’s youth must understand ideas about statistics, data, algorithms, and complex technologies to a far greater degree than generations before them. Having a coordinated understanding of these ideas will better enable them to participate thoughtfully in society, remain competitive for the job market, and critique AI tools if and when appropriate. However, many youth do not receive this type of statistical, data-driven education, nor are they even aware of how it might benefit them in the first place. It is my belief that math educators must help students see value in learning about statistics, data science, and AI from a much earlier age.

Teaching with and about AI invites new pedagogical and curricular challenges. In my presentation, I will review two. First, I illustrate challenges that high school statistics teachers faced when teaching students about machine learning classification tasks in the context of facial recognition. Students generally understood that machine learning involves modeling with data. They struggled with the freedom they were given when designing classification models, as it involved nonroutine thinking and iterative testing that typical high school math curricula do not incorporate. In my second example, I present cases of college undergraduates solving introductory statistics tasks with AI chatbots. I showcase the example of Nadya, who confidently demonstrated correct probabilistic reasoning on the task until the AI chatbot’s output caused her to second-guess her answer. Both studies illustrate tensions that may arise when teaching or learning about statistics, data, and AI. I argue that today’s teachers must confront these tensions with novel forms of pedagogy, course design, curricular tasks, and deliberate attention to the impact that AI will have on statistics education.


Statistics and Data Science Seminar
Wednesday, February 19, 2025; 11:00am
Speaker: Nadia B. Mendoza, San Diego State University
Title: Imputation Methods for Incomplete Data with an Application to Improve Accuracy When Predicting HIV Status
Abstract

Standard statistical analyses often exclude incomplete observations, which can be particularly problematic when predicting rare outcomes. Taking the missing-data mechanisms into account, imputation methods can be used to recover the incomplete cases and include them in models and analyses. Four imputation methods with different approaches are presented and evaluated in this study: Amelia, Hmisc, mice, and missForest (all available in R). In the linkage-to-HIV-care dataset, there were initially 553 complete HIV-positive cases; an additional 554 cases were added through imputation. Simulations were conducted across various scenarios using the complete data to guide imputation for the full dataset. A random forest model was used to predict HIV status, assessing imputation precision, overall prediction accuracy, and sensitivity. While missForest produced imputed values closer to the observed ones, this did not translate into better predictive models. Hmisc and mice imputations led to higher prediction accuracy and sensitivity, with median accuracy increasing from 64% to 76% and median sensitivity rising from 0.4 to 0.75. Hmisc and Amelia were the fastest imputation methods. Additionally, oversampling the minority class combined with undersampling the majority class did not improve predictions of new HIV-positive cases using only the complete observations. However, increasing the minority-class information through imputation enhanced sensitivity for predicting cases in this class.
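
For readers who want to experiment, here is a rough Python analogue of the impute-then-predict workflow; the study itself used the R packages named above, and the data here are synthetic:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in for HIV status
X[rng.random(X.shape) < 0.2] = np.nan     # ~20% of values missing at random

# Chained-equations-style imputation (in the spirit of mice), then a
# random forest on the completed data.
X_imp = IterativeImputer(random_state=0).fit_transform(X)
clf = RandomForestClassifier(random_state=0).fit(X_imp, y)
```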


Statistics and Data Science Seminar
Wednesday, February 5, 2025; 11:00am
Speaker: Jonathan Helm, Associate Professor of Psychology, San Diego State University
Title: Small-Sample Solutions to Structural Equation Models for Psychology Research
Abstract

In psychology research, structural equation models (SEM) are often used to test hypotheses. Typically, a researcher using SEM will specify a set of hypothesized relations (i.e., a model) amongst a set of variables, assume that those variables follow a multivariate normal distribution, and estimate the relations implied by the model using maximum likelihood. Furthermore, psychology researchers typically compare models using a likelihood ratio test, which compares the difference in the log-likelihoods (multiplied by -2) to a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters across models (i.e., Wilks’ theorem). One drawback of the likelihood ratio test concerns its reliance on asymptotic theory (the chi-square distribution is the limiting distribution as sample size approaches infinity): p-values from the test tend to be underestimated in small samples, leading to inflated Type I error rates. Over the years, psychology researchers have attempted both (a) to provide minimum sample size recommendations and (b) to propose various adjustments to the likelihood ratio test that account for sample size. However, minimum sample size requirements vary depending on the proposed model (i.e., there is no ‘one-size-fits-all’), and the proposed adjustments have mixed results (again, they work well for some, but not all, models).
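
For reference, the large-sample comparison described here is the standard likelihood ratio test between nested models:

```latex
% \ell_0, \ell_1: maximized log-likelihoods of the restricted and full
% models; p_1 - p_0: difference in the number of estimated parameters.
\[
  \lambda = -2\,(\ell_0 - \ell_1)
  \;\xrightarrow{d}\; \chi^2_{p_1 - p_0}
  \quad \text{as } n \to \infty \quad \text{(Wilks' theorem)}
\]
```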

In this talk, I will propose an alternative strategy for both understanding and addressing the small-sample issue that persists for SEM. In particular, I will attempt to generalize the application of Wilks’ Lambda (i.e., the ratio of the determinants of two matrices; wholly separate from Wilks’ theorem above) to SEM, recognizing that the Wilks’ Lambda distribution can account for small samples (when data stem from multivariate normal distributions). The application appears to work well in some simple cases, exactly replicating known results (e.g., t-tests, correlation, regression, and ANOVA). However, confusion arises in more complex cases. In particular, it is not entirely clear how to generalize the concept of degrees of freedom from the simple to the complex cases, and I am hoping the group may have feedback or insights to share.
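
As background, Wilks’ Lambda is the determinant ratio

```latex
% E: error (residual) cross-product matrix; H: hypothesis cross-product
% matrix. Under multivariate normality, the exact small-sample
% distribution of Lambda is known.
\[
  \Lambda = \frac{\det(E)}{\det(E + H)}
\]
```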


Statistics and Data Science Seminar
Wednesday, December 4, 2024; 11:00am (virtual)
Speaker: Gabriela Fernandez, Department of Geography, San Diego State University
Title: Metabolism of Cities Living Lab at San Diego State University
Bio

Dr. Fernandez is the Director of the Metabolism of Cities Living Lab (MOC-LLAB) under the Center for Human Dynamics in the Mobile Age at San Diego State University (SDSU), working to localize the United Nations Sustainable Development Goals in southern California, the Baja California region of Mexico, and globally. She is also a Professor and Graduate Advisor in the Department of Geography’s Master of Science in Big Data Analytics Program at SDSU. She received a Ph.D. in Urban Planning, Design, and Policy from the Department of Architecture and Urban Studies at the Politecnico di Milano in Milan, Italy, as well as a B.S. in Public Administration with an emphasis in City Planning and a Master of City Planning from SDSU. Her research interests include urban metabolism ideologies and material flow analysis of metropolitan cities, identifying urban typologies and socioeconomic indicators in the urban context while promoting public policy, smart cities, big data analytics, circular economy methods, the UN SDGs, and under-represented populations.


Statistics and Data Science Seminar
Wednesday, November 13, 2024; 11:00am in GMCS 405
Speaker: Dan Gillen, Department of Statistics, University of California, Irvine
Title: Censoring-Robust Estimation in the Nested Case-Control Study Design with Applications to Biomarker Development in AD
Abstract

Biomarkers play a critical role in the early diagnosis of disease and can serve as targets for disease interventions. As such, biomarker discovery is of primary scientific interest in many disease settings. One example occurs in Alzheimer’s disease (AD), a neurodegenerative disease that affects memory, thinking, and behavior. Amyloid beta (Aβ) and phosphorylated tau (p-tau) are protein biomarkers that have become key to the early diagnosis of AD. In fact, the first novel therapy since 2003, Aduhelm, recently received accelerated approval from the US FDA based on demonstrated changes in Aβ. Despite this recent success, Aβ and p-tau are not perfect discriminators of disease and, hence, biomarker discovery for time-to-progression of disease remains a primary objective in AD research. Analysis of time-to-event data using Cox's proportional hazards (PH) model is ubiquitous in the discovery process. Most commonly, a sample is taken from the population of interest and covariate information is collected on everyone. If the event of interest is rare and it is difficult or infeasible to collect full covariate information for all study participants, the nested case-control design reduces costs with minimal impact on inferential precision. However, no work has been done to investigate the performance of the nested case-control design under model mis-specification. In this talk, we show that outside of the semi-parametric PH assumption, the statistical estimand under the nested case-control design depends not only on the censoring distribution, but also on the number of controls sampled at each event time. This is true in the case of a binary covariate when the proportional hazards assumption is not satisfied, and in the case of a continuous covariate whose functional form is mis-specified. We propose estimators that allow us to recover the statistic that would have been computed under the full-cohort data, as well as a censoring-robust estimator. Asymptotic distributional theory for the estimators is provided, along with empirical simulation results to assess finite-sample properties of the estimators. We conclude with examples considering common biomarkers for AD progression using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI).
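
For readers unfamiliar with the design, inference in the nested case-control setting typically maximizes the standard partial likelihood over sampled risk sets (the Thomas partial likelihood), in which each case is compared only with its sampled controls; the talk's robustness results concern the estimand this likelihood targets under mis-specification:

```latex
% Each case i contributes a term comparing its covariates Z_i with those
% of its sampled risk set \tilde{R}_i (the case plus m sampled controls).
\[
  L(\beta) = \prod_{i \,\in\, \text{cases}}
    \frac{\exp(\beta^{\top} Z_i)}
         {\sum_{j \in \tilde{R}_i} \exp(\beta^{\top} Z_j)}
\]
```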


Statistics and Data Science Seminar
Wednesday, October 16, 2024; 11:00am in GMCS 405
Speaker: Loki Natarajan, Professor Emerita of Public Health, University of California San Diego
Title: Wearable Sensors to Monitor Physical Activity: Statistical Approaches and Challenges
Abstract

Physical activity and sedentary behavior are known to impact health and well-being. Wearable sensors, such as accelerometers, are widely used for tracking human movement and provide estimates of activity every minute (or at even finer granularity, e.g., 30 Hz). Most research studies aggregate these activity records into daily or weekly summary statistics. However, this aggregation can result in a loss of information. Statistical methods for leveraging the full spectrum of accelerometer time series have been the focus of much recent research. In this talk, we will discuss two specific approaches, functional data analysis and machine learning classification methods, for modeling accelerometer-derived activity profiles, and demonstrate their applications in public health.
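
As a toy illustration of the functional-data view (not code from the talk; all data below are made up), each day can be treated as a 1440-minute activity curve and decomposed into principal component functions via the SVD:

```python
import numpy as np

rng = np.random.default_rng(4)
minutes = np.arange(1440)
base = np.exp(-((minutes - 720) / 180.0) ** 2)      # midday activity bump
curves = base + 0.1 * rng.normal(size=(100, 1440))  # 100 days of "data"

# Functional PCA of the day-by-minute matrix: center, then SVD.
centered = curves - curves.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
pc_functions = Vt[:2]        # first two principal component curves
scores = U[:, :2] * s[:2]    # per-day scores on those components
```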


Statistics and Data Science Seminar
Wednesday, October 2, 2024; 11:00am
Speaker: Rob Malouf, Professor of Linguistics, San Diego State University
Title: Constructions in Language Change: The Case of English help/help to
Abstract

This talk explores the application of statistical methods to the study of language as a complex system, focusing on linguistic variation and change through the specific case of "help" and "help to" constructions in English. Viewing language as a complex adaptive system characterized by interactions between speakers, linguistic structures, and social contexts, we employ a large historical corpus and Bayesian mixed-effects regression to identify factors influencing how these constructions have changed over time. The results reveal two distinct patterns of change in the usage of these constructions, demonstrating the value of modern quantitative methods in the study of language dynamics and the mechanisms of language change.
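
To make the modeling concrete, here is a hypothetical sketch of a Bayesian mixed-effects (varying-intercept) logistic regression in PyMC; the software choice, variable names, grouping factor, and synthetic data are all assumptions, not details from the talk:

```python
import numpy as np
import pymc as pm

# Synthetic stand-in for corpus data: each observation is one use of
# "help", with a binary outcome (bare vs. "to"-infinitive), a
# standardized year, and a grouping factor (e.g., the following verb).
rng = np.random.default_rng(5)
n_obs, n_groups = 500, 20
group = rng.integers(0, n_groups, size=n_obs)
year = rng.uniform(-1.0, 1.0, size=n_obs)
y = rng.integers(0, 2, size=n_obs)

with pm.Model():
    alpha = pm.Normal("alpha", 0.0, 1.0)          # overall intercept
    beta_year = pm.Normal("beta_year", 0.0, 1.0)  # change over time
    u = pm.Normal("u", 0.0, 1.0, shape=n_groups)  # random intercepts
    pm.Bernoulli("obs", logit_p=alpha + beta_year * year + u[group],
                 observed=y)
    idata = pm.sample(draws=500, tune=500, chains=2, random_seed=0)
```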


Statistics and Data Science Seminar
Wednesday, September 11, 2024; 11:00am
Speaker: Hajar Homayouni, Professor of Computer Science, San Diego State University
Title: Anomaly Detection and Interpretation from Tabular Data Using Transformer Architecture
Abstract

This project introduces a novel anomaly detection and interpretation approach built on a transformer-based architecture; it reduces preprocessing needs by converting tabular data rows into a sentence-like structure and generates explanations for anomalies in the form of rule violations. Our approach consists of two main components: the Anomaly Detector and the Anomaly Interpreter. The Anomaly Detector utilizes a Transformer model with a customized embedding layer tailored to tabular data structures. While attention weights do not directly explain the model's reasoning, they can provide valuable insights when interpreted carefully. The Anomaly Interpreter uses attention weights to identify potential issues in anomalous data by comparing these weights with patterns in normal data. When the model labels a row as anomalous, the Interpreter examines closely related columns and compares their associations with those in the normal dataset using a benchmark association matrix. Deviations from typical associations are flagged as potential rule violations, highlighting unusual column-pair relationships. We evaluated our approach against conventional methods, such as Multi-Layer Perceptrons (MLPs) and Long Short-Term Memory (LSTM) networks, using labeled data from the Outlier Detection DataSets (ODDS). Our evaluation, which included standard metrics along with a novel mutation analysis technique, demonstrates that our method achieves accuracy comparable to existing techniques while additionally providing interpretations of detected anomalies.
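
A minimal sketch of the preprocessing idea described above, serializing each tabular row into a sentence-like token sequence that a transformer-style model could consume (the column names, example rows, and tokenization scheme are hypothetical, not the authors' implementation):

```python
# Serialize a row such as {"age": 42, "country": "US"} into tokens
# ["age=42", "country=US"] suitable for an embedding layer.
def row_to_sentence(row: dict) -> list[str]:
    return [f"{col}={val}" for col, val in row.items()]

rows = [
    {"age": 42, "income": 55_000, "country": "US"},
    {"age": -7, "income": 9_900_000, "country": "US"},  # likely anomalous
]
sentences = [row_to_sentence(r) for r in rows]

# Map tokens to integer ids, the usual input to an embedding layer.
vocab = {tok: i for i, tok in enumerate(sorted({t for s in sentences for t in s}))}
token_ids = [[vocab[t] for t in s] for s in sentences]
```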


Statistics and Data Science Seminar
Wednesday, September 4, 2024; 11:00am
Speaker: Armin Schwartzman, Professor of Statistics, UC San Diego
Title: Estimating the Fraction of Variance of Traits Explained by High-Dimensional Genetic Predictors
Abstract

The fraction of variance explained (FVE) by a model is a measure of the total amount of information about an outcome contained in the predictor variables. In Genome-Wide Association Studies (GWAS), the FVE of a trait by the SNPs (genetic loci) in the study is called SNP heritability. Because the number of predictors is much larger than the number of subjects, a classical regression model cannot be fitted, and the effects of specific loci are difficult to identify. In this talk, I give an overview of the main existing methods (others' and our own) for estimating the FVE in these high-dimensional regression models.
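
In the standard linear-model formulation, the quantity being estimated can be written as follows:

```latex
% Linear model y = X\beta + \varepsilon with n subjects and p >> n SNPs;
% SNP heritability is the FVE of the trait by the genotypes X.
\[
  \mathrm{FVE} = h^2_{\mathrm{SNP}}
  = \frac{\operatorname{Var}(X\beta)}{\operatorname{Var}(y)}
  = 1 - \frac{\sigma^2_{\varepsilon}}{\operatorname{Var}(y)}
\]
```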