Statistics Seminar | |

Wednesday, April 6, 2022; 11:00am | |

Speaker | David Goldberg, Assistant Professor, SDSU Management and Information Systems |

Title | Leveraging online reviews for product safety surveillance |

Abstract |
Product safety concerns pose enormous risks to consumers across the world. As discussions of consumer experiences have spread online, this research proposes the use of text mining to rapidly screen online media for mentions of safety hazards. The text mining approach in this research identifies unique words and phrases, or “smoke terms,” in online posts that indicate safety hazard-related discussions. In addition, this research shows that text mining-based risk can be analyzed on the product level, ranking products from most risky to least risky. The research has implications for monitoring the potential safety hazards in products already on the market and reacting to potential issues as quickly as possible. |
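
As a rough illustration of the product-level ranking mentioned in the abstract above, the sketch below scores each review by the share of its words that appear in a smoke-term list, then averages those scores per product. The term list, the sample reviews, the function names, and the scoring rule are all hypothetical assumptions made for illustration; the abstract does not specify the speaker's actual terms or scoring method.

```python
from collections import defaultdict

# Hypothetical smoke terms; the abstract does not list the actual terms.
SMOKE_TERMS = {"burn", "fire", "shock", "choke", "recall"}


def review_score(text):
    """Return the fraction of words in a review that are smoke terms."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in SMOKE_TERMS for w in words) / len(words)


def rank_products(reviews):
    """Rank products from most to least risky by mean review score.

    reviews: iterable of (product, review_text) pairs.
    """
    scores = defaultdict(list)
    for product, text in reviews:
        scores[product].append(review_score(text))
    means = {p: sum(s) / len(s) for p, s in scores.items()}
    return sorted(means, key=means.get, reverse=True)


# Made-up reviews, for illustration only.
reviews = [
    ("heater", "Caught fire and left a burn mark!"),
    ("heater", "Works fine."),
    ("lamp", "Nice light, no complaints."),
]
ranking = rank_products(reviews)
```

A real screening pipeline would presumably weight terms and handle phrases rather than single words; this sketch only shows how per-review scores can roll up into a product ranking.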

Statistics Seminar | |

Wednesday, March 16, 2022; 11:00am | |

Speaker | Mark Huber, Professor, Claremont McKenna College |

Title | Robust estimators for Monte Carlo data |

Abstract |
Data coming from Monte Carlo experiments is often analyzed in the same way as data from more traditional sources. The unique nature of Monte Carlo data, where it is easy to take a random number of samples, allows for estimators where the user can control the relative error of the estimate much more precisely than with classical approaches. In this talk I will discuss three such estimators useful in different problems. The first is a user-specified-relative-error (USRE) estimate for the mean of a Bernoulli random variable. This allows us to obtain exact error results while using slightly fewer samples than the CLT approximation. The second is more general, applying to any random variable where a bound on the relative error is known. For this problem we give exact error bounds using a number of samples that is the same (to first order) as the CLT approximation requires. In other words, the new algorithm is the equivalent of always actually having normal data. Finally, we look at the problem of data with unknown variance and develop an algorithm that runs very close to the minimum number of samples established using results of Wald. |
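
For context on the CLT baseline that the abstract above compares against: under the normal approximation, estimating a Bernoulli mean p to within relative error eps with failure probability delta takes roughly z^2 (1 - p) / (eps^2 p) samples, where z is the 1 - delta/2 standard normal quantile. The sketch below computes that count. The function name is hypothetical, and this is only the classical approximation, not any of the estimators presented in the talk.

```python
import math
from statistics import NormalDist


def clt_sample_size(p, eps, delta):
    """Approximate number of Bernoulli(p) samples so that the sample mean
    has relative error at most eps with probability about 1 - delta,
    using the normal (CLT) approximation."""
    if not (0 < p < 1 and eps > 0 and 0 < delta < 1):
        raise ValueError("need 0 < p < 1, eps > 0, 0 < delta < 1")
    z = NormalDist().inv_cdf(1 - delta / 2)
    # Require z * sqrt(p * (1 - p) / n) <= eps * p and solve for n.
    return math.ceil(z * z * (1 - p) / (eps * eps * p))


n = clt_sample_size(p=0.1, eps=0.1, delta=0.05)
```

The formula depends on p, the quantity being estimated, so a fixed sample size chosen this way is only a guide. The abstract's point about taking a random number of samples addresses this kind of limitation.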

Mathematics Colloquium Distinguished Lecture Series | |

Monday, March 7, 2022; 4:00 pm |

Speaker | Mark Alber, Department of Mathematics, UC Riverside |

Title | Combined multiscale mathematical modeling and experimental analysis suggests possible mechanism of shoot meristem maintenance in plants |

Abstract | Stem cell maintenance in multilayered shoot apical meristems (SAMs) of plants requires strict regulation of cell growth and division. Exactly how the complex milieu of chemical (WUSCHEL and cytokinin) and mechanical signals interact to determine cell division plane orientation and shape of the SAM is not well understood. By using a newly developed mathematical model, combined with experiments, three hypothesized mechanisms have been tested for the regulation of cell division plane orientation as well as of cell expansion in the deeper SAM cell layers. Simulations predict that in the apical cell layers, WUSCHEL and cytokinin regulate the direction of anisotropic cell expansion, and cells divide according to tensile stress. In the basal cell layers, simulations also show dual roles for WUSCHEL and cytokinin in regulating both cell division plane orientation and the direction of anisotropic expansion. This layer-specific mechanism maintains the experimentally observed shape and structure of the SAM as well as the distribution of WUSCHEL in the tissue [1]. Moreover, by using a dynamical signaling model, an additional mechanism underlying the robust maintenance of the WUSCHEL gradient through its negative regulator has been identified. A sensitivity analysis and a perturbation study were performed to show the validity of the mechanism across different parameter ranges [2]. Currently, a coupled computational framework is being developed by integrating submodels representing a dynamical signaling network and cell mechanics to explore how the WUSCHEL expression domain and the tissue structure are maintained throughout growth. |

Bio | Professor Mark Alber earned his Ph.D. in mathematics at the University of Pennsylvania under the direction of J. E. Marsden (UC Berkeley and Caltech). He held several positions at the University of Notre Dame including, most recently, the Vincent J. Duncan Family Chair in Applied Mathematics. He is currently Distinguished Professor in the Department of Mathematics and Director of the Center for Quantitative Modeling in Biology, UC Riverside. Dr. Alber was elected a Fellow of the American Association for the Advancement of Science (AAAS) in 2011. He is currently a deputy editor of PLoS Computational Biology and a member of the editorial boards of the Bulletin of Mathematical Biology and Biophysical Journal. His research interests include mathematical and computational multiscale modeling of blood clot formation, plant development and growth, and epithelial tissue growth. |

Resources | Poster |

Statistics Seminar | |

Friday, January 28, 2022; 11:00am | |

Speaker | Hajar Homayouni, Assistant Professor, Computer Science, San Diego State University |

Title | Anomaly detection and explanation in big data |

Abstract |
Data quality tests are used to validate the data stored in databases and data warehouses, and to detect violations of syntactic and semantic constraints. Domain experts struggle to capture all the important constraints and to check that they are satisfied. The constraints are often identified in an ad hoc manner based on the knowledge of the application domain and the needs of the stakeholders. Constraints can exist over single or multiple attributes as well as records involving time series and sequences. The constraints involving multiple attributes can involve both linear and non-linear relationships among the attributes. We propose ADQuaTe as a data quality test framework that automatically (1) discovers different types of constraints from the data, (2) marks records that violate the constraints as suspicious, and (3) explains the violations. Domain knowledge is required to determine whether or not the suspicious records are actually faulty. The framework can incorporate feedback from domain experts to improve the accuracy of constraint discovery and anomaly detection. We instantiate ADQuaTe in two ways to detect anomalies in non-sequence and sequence data. The first instantiation (ADQuaTe2) uses an unsupervised approach, an autoencoder, for constraint discovery in non-sequence data. ADQuaTe2 analyzes records in isolation to discover constraints among the attributes. We evaluate the effectiveness of ADQuaTe2 using real-world non-sequence datasets from the human health and plant diagnosis domains. We demonstrate that ADQuaTe2 can discover new constraints that were previously unspecified in existing data quality tests and can report both previously detected and new faults in the data. We also use non-sequence datasets from the UCI repository to evaluate the improvement in the accuracy of ADQuaTe2 after incorporating ground truth knowledge and retraining the autoencoder model.
The second instantiation (IDEAL) uses an unsupervised LSTM-autoencoder for constraint discovery in sequence data. IDEAL analyzes the correlations and dependencies among data records to discover constraints. We evaluate the effectiveness of IDEAL using datasets from Yahoo servers, NASA Shuttle, and the Colorado State University Energy Institute. We demonstrate that IDEAL can detect previously known anomalies in these datasets. Using mutation analysis, we show that IDEAL can detect different types of injected faults. We also demonstrate that the accuracy of the approach improves after incorporating ground truth knowledge about the injected faults and retraining the LSTM-autoencoder model. The novelty of this research lies in the development of a domain-independent framework that effectively and efficiently discovers different types of constraints from the data, detects and explains anomalous data, and minimizes false alarms through an interactive learning process. |

Statistics Seminar | |

Wednesday, December 8, 2021; 11:00am | |

Speaker | Hao Helen Zhang, Professor, Department of Mathematics, University of Arizona |

Title | Scalable and model-free methods for multiclass probability estimation |

Abstract | Classical approaches for multiclass probability estimation are mostly model-based, such as logistic regression or LDA, by making certain assumptions on the underlying data distribution. We propose a new class of model-free methods to estimate class probabilities based on large-margin classifiers. The method is scalable for high-dimensional data by employing the divide-and-conquer technique, which solves multiple weighted large-margin classifiers and then constructs probability estimates by aggregating multiple classification rules. Without relying on any parametric assumption, the estimates are shown to be consistent asymptotically. Both simulated and real data examples are presented to illustrate the performance of the new procedure. |

Statistics Seminar | |

Wednesday, November 17, 2021; 11:00am | |

Speaker | Ying Zhang, Department of Biostatistics, College of Public Health, University of Nebraska Medical Center |

Title | An efficient method for clustering longitudinal data |

Abstract | Longitudinal data clustering is a challenging task, especially with sparse and irregular observations. The literature lacks reliable methods for clustering complicated longitudinal data, particularly with multiple longitudinal outcomes. In this manuscript, a new agglomerative hierarchical clustering method is developed in conjunction with B-spline curve fitting and the construction of a unique dissimilarity measure for differentiating longitudinal observations. In an extensive simulation study, the proposed method demonstrates superior clustering accuracy and numerical efficiency compared to existing methods. Moreover, the method can be easily extended to multiple-outcome longitudinal data without much additional computational cost, and its results are robust to the complexity of the underlying mixture of longitudinal data. Finally, the method is applied to a data set from the SPRINT Study for validating the intervention efficacy in a Systolic Blood Pressure Intervention Trial and to a 12-year multi-site observational study (PREDICT-HD) for identifying the disease progression patterns of Huntington’s disease (HD). |

Mathematics Colloquium Distinguished Lecture Series | |

Monday, November 15, 2021; 4:00 pm |

Speaker | Mason Porter, Department of Mathematics, UCLA |

Title | Topological data analysis of spatial complex systems |

Abstract | From the venation patterns of leaves to spider webs, roads in cities, social networks, and the spread of COVID-19 infections and vaccinations, the structure of many systems is influenced significantly by space. In this talk, I'll discuss the application of topological data analysis (specifically, persistent homology) to spatial systems. I'll discuss a few examples, such as voting in presidential elections, city street networks, spatiotemporal dynamics of COVID-19 infections and vaccinations, and webs that were spun by spiders under the influence of various drugs. |

Bio | Mason Porter is a professor in the Department of Mathematics at UCLA. He earned a B.S. in Applied Mathematics from Caltech in 1998 and a Ph.D. from the Center for Applied Mathematics at Cornell University in 2002. He held postdoctoral positions at Georgia Tech, the Mathematical Sciences Research Institute, and Caltech. He joined the faculty at University of Oxford in 2007 and moved to UCLA in 2016. Mason is a Fellow of the American Mathematical Society, American Physical Society, and Society for Industrial and Applied Mathematics. In recognition of his mentoring of undergraduate researchers, Mason won the 2017 Council on Undergraduate Research (CUR) Faculty Mentoring Award in the Advanced Career Category in the Mathematics and Computer Science Division. Thus far, 24 students have completed their PhD degrees under Mason's mentorship. Mason has also mentored several postdocs, more than 30 Masters students, and more than 100 undergraduate students on research projects. Mason's research interests lie in theory and (rather diverse) applications of networks, complex systems, and nonlinear systems. |

Resources | Poster, Slides |
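
To give a flavor of the persistent homology mentioned in the abstract above, the sketch below computes only the simplest part of it for a small point cloud: the scales at which connected components merge as a distance threshold grows (degree-0 persistence). It assumes a Vietoris-Rips-style filtration in which two points are joined once the threshold reaches their Euclidean distance; conventions differ, and some use half that distance. The function name and sample points are hypothetical, and the applications in the talk presumably rely on richer constructions than this.

```python
import math
from itertools import combinations


def h0_death_scales(points):
    """Return the distances at which connected components merge.

    Every component is born at scale 0; each returned value is the scale
    at which one component dies by merging into another. Computed with a
    union-find structure over edges sorted by length (Kruskal-style).
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # merge the two components
            deaths.append(d)
    return deaths


# Two well-separated clusters (made-up coordinates).
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
deaths = h0_death_scales(points)
```

With n points there are n - 1 merge scales. A large gap between the last merge and the earlier ones suggests well-separated clusters, which is the kind of multiscale structure persistence is meant to summarize.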

Statistics Seminar | |

Wednesday, November 10, 2021; 11:00am | |

Speaker | Chieko Seto, Operations Research Analyst, County of San Diego |

Title | What does a data analyst do in local government? |

Abstract | This seminar explains what a data analyst does in local government using statistics and data science, illustrated by various projects including the COVID reporting from the San Diego County Emergency Center. The speaker will also explain which tools were used for each project, such as SQL, Tableau, and dashboard development. |

Statistics Seminar | |

Wednesday, November 3, 2021; 11:00am | |

Speaker | Dr. Jun Li, Professor, Department of Statistics, University of California Riverside |

Title | Asymptotic normality of interpoint distances for high-dimensional data with applications to the two-sample problem and change-point detection problem |

Abstract | Recent advances have greatly facilitated the collection of high-dimensional data in many fields. Often the dimension of the data is much larger than the sample size, the so-called high dimension, low sample size setting. Thanks to their simplicity of computation, interpoint distance-based procedures are particularly appealing for analyzing small samples of high-dimensional data. In this talk, we first study the asymptotic distribution of interpoint distances in the high-dimension, low sample size setting, which is shown to be normal under regularity conditions. Using this newly developed asymptotic result, we then show how to construct asymptotic distribution-free procedures for the two-sample problem and change-point detection problem in the high dimension, low sample size setting. Our proposed procedures are proven to be consistent for detecting both location and scale differences. Simulations and real data analysis show that our proposed procedures compare favorably with other existing distance-based methods across a variety of settings. |

Statistics Seminar | |

Wednesday, October 20, 2021; 11:00am | |

Speaker | Afrooz Jahedi, Department of Molecular and Translational Pathology, University of Texas MD Anderson Research Center |

Title | Mixed-effects random forests-based classification algorithm for clustered data - an application to autism spectrum disorders |

Abstract | To date, a variety of classification schemes have been proposed, and the accuracy of classification has reached as high as 95 percent for many disorders, including Autism Spectrum Disorder (ASD). However, to build a reliable and robust classification model for ASD, it is necessary to incorporate a large dataset, which is often obtained from multi-site imaging data. In addition to the extended sample size, including multiple MRI modalities can provide a more comprehensive brain picture. However, two challenges are associated with a multi-site and multi-modal dataset. The first is controlling the source of variation that is imposed by multiple imaging sites. Second, it is necessary to use a dimension reduction algorithm to cope with the computational complexity of multi-modal data. Separating the multi-site variability from the heterogeneous nature of ASD is particularly important in order to build a robust and accurate classification model. We address both concerns by proposing two mixed-effects random forest-based classification algorithms, applicable to multi-site (clustered) data using rs-fMRI and structural MRI (sMRI) modalities. These algorithms control the random effects of the confounding factor of the imaging site. Additionally, the algorithms internally control the fixed effect of phenotypic variables such as age while building the classification model. Moreover, they eliminate the necessity of utilizing a separate dimension reduction algorithm for high-dimensional data such as functional connectivity in a non-linear fashion. In our empirical data study, we use an external validation set including resting-state fMRI and sMRI from the ABIDE dataset. The RFME-based classification algorithms show an accuracy of over 80 percent in distinguishing ASD participants from typically developing (TD) participants. We conclude that the RFME-based classification model is a promising tool to build a more reliable and efficient diagnostic model for multi-site and multi-modal datasets. |

Statistics Seminar | |

Wednesday, September 15, 2021; 11:00am | |

Speaker | Richard Levine, Department of Mathematics and Statistics, San Diego State University |

Title | Making an Impact in an Institutional Research Office: On Data Champions and Machine Learning |

Abstract | As a strategy to support data-informed decision making at SDSU, Analytic Studies & Institutional Research (IR) established a Statistical Modeling Group (SMG) within its operation. SMG is a collaborative team of machine learning experts from the Stat Dept and IR data management and visualization experts tasked with developing and applying predictive analytics methods to solve institutional effectiveness problems. We will highlight the role of SMG on our campus. Focusing on SMG success stories in STEM program retention and graduation success, we will 1) introduce the predictive analytics infrastructure and machine learning methods developed for student success efficacy studies; 2) show novel visualizations and dashboards developed for STEM advisors and campus administrators; 3) outline the Data Champions program instituted to expand University data capabilities and leverage our analytic tools to inform SDSU efforts to improve student success metrics; and 4) present our vision for Statistical Modeling Groups in IR units as an effective strategy for training Statistics/Data Science graduate students and delivering actionable information to campus stakeholders. |