Welcome to the Library

Archived Talks and Seminars

Reverse Chronological Order

Statistics and Data Science Seminar
Wednesday, September 10, 2025; 11:00am
Speaker: Joann Chen, Department of Computer Science, SDSU
Title: Envisioning the Future of Digital Privacy
Abstract

In today's digital world, personal data represents both immense opportunity and significant risk. When used responsibly, it can drive breakthroughs in areas such as healthcare and finance, yet its misuse can result in severe privacy breaches. This talk will examine the evolving landscape of data privacy, highlighting Differential Privacy (DP) as a promising approach that offers strong, provable guarantees. It will further explore privacy risks in machine learning and the principles of DP-aware system design, with particular focus on the challenges of integrating DP into diverse real-world systems.
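
As a minimal sketch of the kind of guarantee DP provides (illustrative only; not code from the talk), the classic Laplace mechanism releases a counting query with ε-differential privacy by adding noise calibrated to the query's sensitivity:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value with epsilon-DP by adding Laplace noise
    with scale sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
ages = np.array([34, 45, 29, 61, 52])
# A counting query has sensitivity 1: adding or removing one person
# changes the count by at most 1.
noisy_count = laplace_mechanism((ages > 40).sum(), sensitivity=1.0,
                                epsilon=0.5, rng=rng)
print(f"noisy count of people over 40: {noisy_count:.2f}")
```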


Statistics and Data Science Seminar
Wednesday, September 3, 2025; 11:00am
Speaker: Xiyue Liao, Department of Mathematics and Statistics, SDSU
Title: A Framework for Comprehensive Model and Variable Selection
Abstract

We propose a framework for choosing variables and relationships without assuming additivity or parametric forms. The relationships between the response and each of the continuous predictors are modelled with regression splines and assumed to be smooth and one of the following: increasing, decreasing, convex, concave, or a combination of monotonicity and convexity. The resulting eight shapes cover a wide range of popular parametric functions, such as linear, quadratic, and exponential, and the set of choices is appropriate if the component functions "do not wiggle." An ordinal predictor can be assigned a set of possible orderings, such as increasing, decreasing, tree or umbrella orderings, no ordering, or constant. Interactions between continuous predictors are modelled as multi-dimensional warped-plane spline surfaces, where the same possibilities for shapes are considered. We propose combining stepwise selection methods with information criteria, LASSO-type ideas, and model selection using a genetic algorithm.
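
As a minimal illustration of shape-restricted fitting (not the talk's spline framework), scikit-learn's isotonic regression enforces the simplest of the eight shapes, monotone increasing, without assuming any parametric form; the data below are synthetic:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
# Smooth, increasing truth plus noise ("does not wiggle").
y = np.log1p(x) + rng.normal(scale=0.2, size=x.size)

# Fit under only the shape constraint "nondecreasing in x".
fit = IsotonicRegression(increasing=True).fit(x, y)
y_hat = fit.predict(x)  # monotone fit through the data
```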


Statistics and Data Science Seminar
Wednesday, April 23, 2025; 11:00am
Speakers: Colleen Kelly, PhD, Amy Zimmer, and Sarah Blatt, Kelly Statistical Consulting
Title: A Day in the Life of a Statistical Consultant
Abstract

Kelly Statistical Consulting, founded in 2009, has served local, national, and international clients for over 15 years. This talk offers a glimpse into the dynamic and diverse work of a statistical consultant. Day-to-day activities range from team meetings to discuss project objectives, to analyzing complex datasets, to developing statistical models. Consultants tackle a range of intellectual challenges, working collaboratively to solve problems. We will discuss how our team ensures accurate results, balances multiple projects, communicates findings to various audiences, and keeps up with the latest programming and statistical techniques. Two example projects will be presented to illustrate the diversity of statistical applications. Additionally, we will share our opinions on the skills the job requires and on the best and most challenging aspects of our work.


Statistics and Data Science Seminar
Wednesday, April 9, 2025; 10:30am
Speaker: Amber Puha, Department of Mathematics, Cal State San Marcos
Title: Stationary Behavior of a Diffusion Limit for SRPT Queues with Heavy-Tailed Processing Time Distributions
Abstract

In this talk, we explore the stationary behavior of diffusion limits in shortest remaining processing time (SRPT) queues with heavy-tailed processing times. The SRPT state descriptor is a measure-valued process that at each time has a unit mass at the remaining processing time of each job in the system. Banerjee, Budhiraja, and Puha (2022) have shown that, under proper scaling, this state descriptor converges to a measure-valued stochastic process, characterized by workload processes modeled as reflecting coupled Brownian motions with a specific negative drift function. Motivated by the form of the limit, we study reflecting coupled Brownian motions with a general nondecreasing negative drift function. We analyze the stationary behavior of the distribution of the resulting measure-valued process and its moments through the maximum process of the coupled Brownian motions. Additionally, we derive joint distributions and the covariance structure of the maximum process, offering new insights into the stationary distribution of the queue length in SRPT systems.
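
A minimal sketch of the key object, under a simplifying assumption (a single Brownian motion with constant negative drift rather than the talk's coupled system with a drift function): reflection at zero via the Skorokhod map, together with the running maximum process:

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 10.0, 10_000
dt = T / n
drift = -0.5  # constant stand-in for the talk's negative drift function

# Euler scheme for W(t) = drift * t + B(t).
dW = drift * dt + np.sqrt(dt) * rng.normal(size=n)
W = np.concatenate([[0.0], np.cumsum(dW)])

# Skorokhod reflection at zero: X(t) = W(t) - min(0, inf_{s<=t} W(s)).
X = W - np.minimum(0.0, np.minimum.accumulate(W))  # reflected path, X >= 0
M = np.maximum.accumulate(W)                       # running maximum process
```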


Statistics and Data Science Seminar
Wednesday, March 19, 2025; 11:00am
Speaker: Adir Mancebo, Science Manager, Data Science Alliance
Title: Introduction to the Framework for Responsible Data Science Practices
Abstract

This seminar introduces the Data Science Alliance's Framework for Responsible Data Science Practices, a comprehensive guide developed in collaboration with experienced data professionals from various industries and academia. The presentation establishes fundamental concepts of responsible data science and presents the four guiding principles that form its foundation. It illustrates the importance of ethical considerations in data science and AI applications, demonstrating how these powerful tools can impact society. Through an overview of the data project lifecycle and examination of real-world case studies, the session highlights the consequences of neglecting responsible practices. Interactive discussions encourage participants to consider practical applications of responsible data science principles across different project phases. Attendees will gain valuable insights into implementing responsible practices that maximize benefits while minimizing potential harm in their data science initiatives.


Statistics and Data Science Seminar
Wednesday, March 5, 2025; 11:00am
Speaker: Victoria Delaney, San Diego State University
Title: Using Statistics to Understand AI, and Using AI to Understand Statistics: Teaching and Learning in the Age of AI
Abstract

Today’s youth must understand ideas about statistics, data, algorithms, and complex technologies to a far greater degree than generations before them. Having a coordinated understanding of these ideas will better enable them to participate thoughtfully in society, remain competitive for the job market, and critique AI tools if and when appropriate. However, many youth do not receive this type of statistical, data-driven education, nor are they even aware of how it might benefit them in the first place. It is my belief that math educators must help students see value in learning about statistics, data science, and AI from a much earlier age.

Teaching with and about AI invites new pedagogical and curricular challenges. In my presentation, I will review two. First, I illustrate challenges that high school statistics teachers faced when teaching students about machine learning classification tasks in the context of facial recognition. Students generally understood that machine learning involves modeling with data. They struggled with the freedom they were given when designing classification models, as it involved nonroutine thinking and iterative testing that typical high school math curricula do not incorporate. In my second example, I present cases of college undergraduates solving introductory statistics tasks with AI chatbots. I showcase the example of Nadya, who confidently demonstrated correct probabilistic reasoning on the task until the AI chatbot’s output caused her to second-guess her answer. Both studies illustrate tensions that may arise when teaching or learning about statistics, data, and AI. I argue that today’s teachers must confront these tensions with novel forms of pedagogy, course design, curricular tasks, and deliberate attention to the impact that AI will have on statistics education.


Statistics and Data Science Seminar
Wednesday, February 19, 2025; 11:00am
Speaker: Nadia B. Mendoza, San Diego State University
Title: Imputation Methods for Incomplete Data with an Application to Improve Accuracy When Predicting HIV Status
Abstract

Standard statistical analyses often exclude incomplete observations, which can be particularly problematic when predicting rare outcomes. Taking the missing-data mechanisms into account, imputation methods can be used to recover the incomplete cases and include them in models and analyses. Four imputation methods with different approaches are presented and evaluated in this study: Amelia, Hmisc, mice, and missForest (all available in R). In the linkage-to-HIV-care dataset, there were initially 553 complete HIV-positive cases; an additional 554 cases were added through imputation. Simulations were conducted across various scenarios using the complete data to guide imputation for the full dataset. A random forest model was used to predict HIV status, assessing imputation precision, overall prediction accuracy, and sensitivity. While missForest produced imputed values closer to the observed ones, this did not translate into better predictive models. Hmisc and mice imputations led to higher prediction accuracy and sensitivity, with median accuracy increasing from 64% to 76% and median sensitivity rising from 0.4 to 0.75. Hmisc and Amelia were the fastest imputation methods. Additionally, oversampling the minority class combined with undersampling the majority class did not improve predictions of new HIV-positive cases using only the complete observations. However, increasing the minority-class information through imputation enhanced sensitivity for predicting cases in this class.
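
For readers who want to experiment, here is a rough Python analogue of the impute-then-predict workflow; the study itself used the R packages named above, and the data here are synthetic:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in for HIV status
X[rng.random(X.shape) < 0.2] = np.nan     # ~20% of values missing at random

# Chained-equations-style imputation (in the spirit of mice), then a
# random forest on the completed data.
X_imp = IterativeImputer(random_state=0).fit_transform(X)
clf = RandomForestClassifier(random_state=0).fit(X_imp, y)
```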


Statistics and Data Science Seminar
Wednesday, February 5, 2025; 11:00am
Speaker: Jonathan Helm, Associate Professor of Psychology, San Diego State University
Title: Small-Sample Solutions to Structural Equation Models for Psychology Research
Abstract

In psychology research, structural equation models (SEM) are often used to test hypotheses. Typically, a researcher using SEM will specify a set of hypothesized relations (i.e., a model) amongst a set of variables, assume that those variables follow a multivariate normal distribution, and estimate the relations implied by the model using maximum likelihood. Furthermore, psychology researchers typically compare models using a likelihood ratio test, which compares the difference in the log-likelihoods (multiplied by -2) to a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters across models (i.e., Wilks’ theorem). One drawback of the likelihood ratio test concerns its reliance on asymptotic theory (the chi-square distribution is the limiting distribution as sample size approaches infinity): p-values from the test tend to be underestimated in small samples, leading to inflated Type I error rates. Over the years, psychology researchers have attempted both (a) to provide minimum sample size recommendations and (b) to propose various adjustments to the likelihood ratio test that account for sample size. However, minimum sample size requirements vary depending on the proposed model (i.e., there is no ‘one-size-fits-all’), and the proposed adjustments have mixed results (again, they work well for some, but not all, models).
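
For reference, the large-sample comparison described here is the standard likelihood ratio test between nested models:

```latex
% \ell_0, \ell_1: maximized log-likelihoods of the restricted and full
% models; p_1 - p_0: difference in the number of estimated parameters.
\[
  \lambda = -2\,(\ell_0 - \ell_1)
  \;\xrightarrow{d}\; \chi^2_{p_1 - p_0}
  \quad \text{as } n \to \infty \quad \text{(Wilks' theorem)}
\]
```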

In this talk, I will propose an alternative strategy for both understanding and addressing the small-sample issue that persists for SEM. In particular, I will attempt to generalize the application of Wilks’ Lambda (i.e., the ratio of the determinants of two matrices; wholly separate from Wilks’ theorem above) to SEM, recognizing that the Wilks’ Lambda distribution can account for small samples (when data stem from multivariate normal distributions). The application appears to work well in some simple cases, exactly replicating known results (e.g., t-tests, correlation, regression, and ANOVA). However, confusion arises in more complex cases. In particular, it is not entirely clear how to generalize the concept of degrees of freedom from the simple to the complex cases, and I am hoping the group may have feedback or insights to share.
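
As background, Wilks’ Lambda is the determinant ratio

```latex
% E: error (residual) cross-product matrix; H: hypothesis cross-product
% matrix. Under multivariate normality, the exact small-sample
% distribution of Lambda is known.
\[
  \Lambda = \frac{\det(E)}{\det(E + H)}
\]
```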


Statistics and Data Science Seminar
Wednesday, December 4, 2024; 11:00am (virtual)
Speaker: Gabriela Fernandez, Department of Geography, San Diego State University
Title: Metabolism of Cities Living Lab at San Diego State University
Bio

Dr. Fernandez is the Director of the Metabolism of Cities Living Lab (MOC-LLAB) under the Center for Human Dynamics in the Mobile Age at San Diego State University (SDSU), working to localize the United Nations Sustainable Development Goals in southern California, the Baja California region of Mexico, and globally. She is also a Professor and Graduate Advisor in the Department of Geography’s Master of Science in Big Data Analytics Program at SDSU. She received a Ph.D. in Urban Planning, Design, and Policy from the Department of Architecture and Urban Studies at the Politecnico di Milano in Milan, Italy, as well as a B.S. in Public Administration with an emphasis in City Planning and a Master of City Planning from SDSU. Her research interests include urban metabolism ideologies and material flow analysis of metropolitan cities, identifying urban typologies and socioeconomic indicators in the urban context while promoting public policy, smart cities, big data analytics, circular economy methods, the UN SDGs, and under-represented populations.


Statistics and Data Science Seminar
Wednesday, November 13, 2024; 11:00am in GMCS 405
Speaker: Dan Gillen, Department of Statistics, University of California, Irvine
Title: Censoring-Robust Estimation in the Nested Case-Control Study Design with Applications to Biomarker Development in AD
Abstract

Biomarkers play a critical role in the early diagnosis of disease and can serve as targets for disease interventions. As such, biomarker discovery is of primary scientific interest in many disease settings. One example occurs in Alzheimer’s disease (AD), a neurodegenerative disease that affects memory, thinking, and behavior. Amyloid beta (Aβ) and phosphorylated tau (p-tau) are protein biomarkers that have become key to the early diagnosis of AD. In fact, the first novel therapy since 2003, Aduhelm, recently received accelerated approval from the US FDA based on demonstrated changes in Aβ. Despite this recent success, Aβ and p-tau are not perfect discriminators of disease and, hence, biomarker discovery for time-to-progression of disease remains a primary objective in AD research. Analysis of time-to-event data using Cox's proportional hazards (PH) model is ubiquitous in the discovery process. Most commonly, a sample is taken from the population of interest and covariate information is collected on everyone. If the event of interest is rare and it is difficult or infeasible to collect full covariate information for all study participants, the nested case-control design reduces costs with minimal impact on inferential precision. However, no work has been done to investigate the performance of the nested case-control design under model mis-specification. In this talk, we show that outside of the semi-parametric PH assumption, the statistical estimand under the nested case-control design depends not only on the censoring distribution, but also on the number of controls sampled at each event time. This is true in the case of a binary covariate when the proportional hazards assumption is not satisfied, and in the case of a continuous covariate whose functional form is mis-specified. We propose estimators that allow us to recover the statistic that would have been computed under the full-cohort data, as well as a censoring-robust estimator. Asymptotic distributional theory for the estimators is provided, along with empirical simulation results to assess finite-sample properties of the estimators. We conclude with examples considering common biomarkers for AD progression using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI).
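
For readers unfamiliar with the design, inference in the nested case-control setting typically maximizes the standard partial likelihood over sampled risk sets (the Thomas partial likelihood), in which each case is compared only with its sampled controls; the talk's robustness results concern the estimand this likelihood targets under mis-specification:

```latex
% Each case i contributes a term comparing its covariates Z_i with those
% of its sampled risk set \tilde{R}_i (the case plus m sampled controls).
\[
  L(\beta) = \prod_{i \,\in\, \text{cases}}
    \frac{\exp(\beta^{\top} Z_i)}
         {\sum_{j \in \tilde{R}_i} \exp(\beta^{\top} Z_j)}
\]
```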


Statistics and Data Science Seminar
Wednesday, October 16, 2024; 11:00am in GMCS 405
Speaker: Loki Natarajan, Professor Emerita of Public Health, University of California San Diego
Title: Wearable Sensors to Monitor Physical Activity: Statistical Approaches and Challenges
Abstract

Physical activity and sedentary behavior are known to impact health and well-being. Wearable sensors, such as accelerometers, are widely used for tracking human movement and provide estimates of activity every minute (or at even finer granularity, e.g., 30 Hz). Most research studies aggregate these activity records into daily or weekly summary statistics. However, this aggregation can result in a loss of information. Statistical methods for leveraging the full spectrum of accelerometer time series have been the focus of much recent research. In this talk, we will discuss two specific approaches, functional data analysis and machine learning classification methods, for modeling accelerometer-derived activity profiles, and demonstrate their applications in public health.
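
As a toy illustration of the functional-data view (not code from the talk; all data below are made up), each day can be treated as a 1440-minute activity curve and decomposed into principal component functions via the SVD:

```python
import numpy as np

rng = np.random.default_rng(4)
minutes = np.arange(1440)
base = np.exp(-((minutes - 720) / 180.0) ** 2)      # midday activity bump
curves = base + 0.1 * rng.normal(size=(100, 1440))  # 100 days of "data"

# Functional PCA of the day-by-minute matrix: center, then SVD.
centered = curves - curves.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
pc_functions = Vt[:2]        # first two principal component curves
scores = U[:, :2] * s[:2]    # per-day scores on those components
```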


Statistics and Data Science Seminar
Wednesday, October 2, 2024; 11:00am
Speaker: Rob Malouf, Professor of Linguistics, San Diego State University
Title: Constructions in Language Change: The Case of English help/help to
Abstract

This talk explores the application of statistical methods to the study of language as a complex system, focusing on linguistic variation and change through the specific case of "help" and "help to" constructions in English. Viewing language as a complex adaptive system characterized by interactions between speakers, linguistic structures, and social contexts, we employ a large historical corpus and Bayesian mixed-effects regression to identify factors influencing how these constructions have changed over time. The results reveal two distinct patterns of change in the usage of these constructions, demonstrating the value of modern quantitative methods in the study of language dynamics and the mechanisms of language change.
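
To make the modeling concrete, here is a hypothetical sketch of a Bayesian mixed-effects (varying-intercept) logistic regression in PyMC; the software choice, variable names, grouping factor, and synthetic data are all assumptions, not details from the talk:

```python
import numpy as np
import pymc as pm

# Synthetic stand-in for corpus data: each observation is one use of
# "help", with a binary outcome (bare vs. "to"-infinitive), a
# standardized year, and a grouping factor (e.g., the following verb).
rng = np.random.default_rng(5)
n_obs, n_groups = 500, 20
group = rng.integers(0, n_groups, size=n_obs)
year = rng.uniform(-1.0, 1.0, size=n_obs)
y = rng.integers(0, 2, size=n_obs)

with pm.Model():
    alpha = pm.Normal("alpha", 0.0, 1.0)          # overall intercept
    beta_year = pm.Normal("beta_year", 0.0, 1.0)  # change over time
    u = pm.Normal("u", 0.0, 1.0, shape=n_groups)  # random intercepts
    pm.Bernoulli("obs", logit_p=alpha + beta_year * year + u[group],
                 observed=y)
    idata = pm.sample(draws=500, tune=500, chains=2, random_seed=0)
```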


Statistics and Data Science Seminar
Wednesday, September 11, 2024; 11:00am
Speaker: Hajar Homayouni, Professor of Computer Science, San Diego State University
Title: Anomaly Detection and Interpretation from Tabular Data Using Transformer Architecture
Abstract

This project introduces a novel anomaly detection and interpretation approach built on a transformer-based architecture; it reduces preprocessing needs by converting tabular data rows into a sentence-like structure and generates explanations for anomalies in the form of rule violations. Our approach consists of two main components: the Anomaly Detector and the Anomaly Interpreter. The Anomaly Detector utilizes a Transformer model with a customized embedding layer tailored to tabular data structures. While attention weights do not directly explain the model's reasoning, they can provide valuable insights when interpreted carefully. The Anomaly Interpreter uses attention weights to identify potential issues in anomalous data by comparing these weights with patterns in normal data. When the model labels a row as anomalous, the Interpreter examines closely related columns and compares their associations with those in the normal dataset using a benchmark association matrix. Deviations from typical associations are flagged as potential rule violations, highlighting unusual column-pair relationships. We evaluated our approach against conventional methods, such as Multi-Layer Perceptrons (MLPs) and Long Short-Term Memory (LSTM) networks, using labeled data from the Outlier Detection DataSets (ODDS). Our evaluation, which included standard metrics along with a novel mutation analysis technique, demonstrates that our method achieves accuracy comparable to existing techniques while additionally providing interpretations of detected anomalies.
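
A minimal sketch of the preprocessing idea described above, serializing each tabular row into a sentence-like token sequence that a transformer-style model could consume (the column names, example rows, and tokenization scheme are hypothetical, not the authors' implementation):

```python
# Serialize a row such as {"age": 42, "country": "US"} into tokens
# ["age=42", "country=US"] suitable for an embedding layer.
def row_to_sentence(row: dict) -> list[str]:
    return [f"{col}={val}" for col, val in row.items()]

rows = [
    {"age": 42, "income": 55_000, "country": "US"},
    {"age": -7, "income": 9_900_000, "country": "US"},  # likely anomalous
]
sentences = [row_to_sentence(r) for r in rows]

# Map tokens to integer ids, the usual input to an embedding layer.
vocab = {tok: i for i, tok in enumerate(sorted({t for s in sentences for t in s}))}
token_ids = [[vocab[t] for t in s] for s in sentences]
```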


Statistics and Data Science Seminar
Wednesday, September 4, 2024; 11:00am
Speaker: Armin Schwartzman, Professor of Statistics, UC San Diego
Title: Estimating the Fraction of Variance of Traits Explained by High-Dimensional Genetic Predictors
Abstract

The fraction of variance explained (FVE) by a model is a measure of the total amount of information about an outcome contained in the predictor variables. In Genome-Wide Association Studies (GWAS), the FVE of a trait by the SNPs (genetic loci) in the study is called SNP heritability. Because the number of predictors is much larger than the number of subjects, a classical regression model cannot be fitted, and the effects of specific loci are difficult to identify. In this talk, I give an overview of the main existing methods (others' and our own) for estimating the FVE in these high-dimensional regression models.
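
In the standard linear-model formulation, the quantity being estimated can be written as follows:

```latex
% Linear model y = X\beta + \varepsilon with n subjects and p >> n SNPs;
% SNP heritability is the FVE of the trait by the genotypes X.
\[
  \mathrm{FVE} = h^2_{\mathrm{SNP}}
  = \frac{\operatorname{Var}(X\beta)}{\operatorname{Var}(y)}
  = 1 - \frac{\sigma^2_{\varepsilon}}{\operatorname{Var}(y)}
\]
```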