Seminars

Statistics seminars are held on Wednesdays, 14:00–15:00. Everyone is welcome! We gather for coffee/tea and biscuits around 15 minutes before the seminar begins.

The organiser is Ben Baer. Please contact Ben to find out more about the seminars, to suggest a future seminar speaker, or to ask to join a seminar online.

Most of the seminars this year will be held in person, with a few online. The in-person seminars will be held in the Observatory seminar room. Please see below for more details.

Forthcoming statistics seminars 2024-25

  • 23 April: Helen Warren, Queen Mary University of London

This seminar will be online-only. 

Title: Statistical Genetics: my Perspectives; Polygenic risk scores & Pharmacogenetics

Abstract: Dr Helen Warren is a Senior Lecturer in Statistical Genetics and has been a member of research staff at the Centre for Clinical Pharmacology and Precision Medicine at the William Harvey Research Institute, Queen Mary University of London, since 2013.

Her research focuses on the genetics of cardiovascular traits, with applications to genetic discovery (especially for blood pressure & hypertension), pharmacogenetics (e.g. for response to statin therapy and for antihypertensive drug response), polygenic risk scores, and risk prediction.

To be accessible to a wide audience, her talk will have three parts: (i) her Perspective: an insight into the life of a statistical geneticist, covering her own career path and giving an introduction to statistical genetics to provide context; (ii) the use of Polygenic risk scores from genome-wide association study data, with an overview of the different methods and applications being attempted; (iii) Pharmacogenetics, focusing on the important issues of study design, model choice and comparisons, bias, etc.

Past seminars

This academic year

  • 18th September (joint with CREEM): Hannah Worthington, University of St Andrews

Title: Capture-recapture Models: A lifetime expectation perspective

Abstract: Capture-recapture(-recovery) models featuring time- and age-dependent parameters are commonly used to offer biologically reasonable structures for features of a population. In particular, survival probabilities are often strongly linked to age, for example showing high mortality in young and old individuals, or different survival probabilities for different age classes (e.g. first-year, sub-adult, breeding adult, etc.). Unfortunately, fully age-dependent models, which allow for a different probability of survival in each year of life, tend to result in estimating very large numbers of parameters. We propose taking a semi-Markov approach to offer a straightforward mechanism to include an age component in survival whilst requiring far fewer parameters. Instead of considering this problem from the perspective of survival from one year to the next, we instead consider the distribution of the age at death. However, adding additional temporal elements to account for adverse or favourable environmental conditions creates some difficulties. I’ll present our current ideas, which look to embed a random walk structure into the model to overcome these challenges.
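
As a toy illustration of the reparameterisation at the heart of this approach (a sketch only, not the speaker's model: the "bathtub" hazard below is an invented assumption), a distribution for the age at death and a set of age-specific annual survival probabilities carry the same information, so modelling the former with a few parameters implicitly determines all of the latter:

```python
import numpy as np

# Invented "bathtub" hazard: high mortality for young and old individuals.
ages = np.arange(20)                             # age in years
hazard = np.clip(0.4 * np.exp(-ages / 2) + 0.02 * ages, 0, 1)

# Survivor function S(a) = P(alive at start of age a), and the implied
# age-at-death distribution f(a) = S(a) * hazard(a).
S = np.concatenate(([1.0], np.cumprod(1 - hazard)))[:-1]
f = S * hazard

# Age-specific annual survival probabilities follow directly:
phi = 1 - hazard                                 # phi_a = S(a+1) / S(a)
print(np.round(phi[:5], 3))
```

A semi-Markov formulation exploits exactly this: a low-dimensional parametric family for the age-at-death distribution replaces one free survival parameter per year of life.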

  • 25th September: Nguyen Dang, University of St Andrews

Title: Reinforcement Learning for Dynamic Algorithm Configuration

Abstract: Most algorithms have their own parameters that need to be tuned to achieve the best performance. In some cases, instead of finding the best static parameter setting for an algorithm, it is highly beneficial to adapt the parameter values while the algorithm is running. Dynamic Algorithm Configuration (DAC) focuses on developing techniques to solve this task in an automated and data-driven fashion. The aim is to learn a policy that maps from the current state of the algorithm to the best parameter value for that state during the solving process. DAC is an emerging topic and has many potential applications in various domains. Given the dynamic nature of the task, Reinforcement Learning (RL) seems like a suitable family of techniques for tackling DAC problems. However, research on DAC methods is still in its early stages. It is not clear whether RL methods, which were originally developed for other application domains such as robotics and game playing, are effective in DAC contexts. In this talk, I will give a brief introduction to DAC and present our recent study benchmarking a commonly used RL algorithm on DAC problems.
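
As a toy illustration of what learning a policy from algorithm state to parameter value means in practice (everything here, the states, candidate values, and reward, is an invented stand-in, not the benchmark studied in the talk), a tabular Q-learning sketch might look like:

```python
import random
from collections import defaultdict

STATES = range(5)            # coarse progress levels of the running algorithm
ACTIONS = [0.1, 0.5, 1.0]    # candidate values of the tunable parameter

def step(state, action):
    """Stub environment: large steps help early, small steps help late."""
    best = 1.0 if state < 2 else 0.1
    reward = -abs(action - best)         # closer to the ideal value is better
    next_state = min(state + 1, 4)
    return next_state, reward, next_state == 4

Q = defaultdict(float)                   # state-action values
alpha, gamma, eps = 0.1, 0.95, 0.2
for episode in range(2000):
    s, done = 0, False
    while not done:
        if random.random() < eps:        # epsilon-greedy exploration
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda v: Q[(s, v)])
        s2, r, done = step(s, a)
        target = r + gamma * max(Q[(s2, v)] for v in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

policy = {s: max(ACTIONS, key=lambda v: Q[(s, v)]) for s in STATES}
print(policy)                            # learned state -> parameter mapping
```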

  • 23rd October: Rui Borges, University of St Andrews

Title: A rant about mutation models in population genetics

Abstract: Mutations are essential drivers of evolution, and their mathematical modeling in population genetics depends on how we perceive their frequency and the timescales at which they occur. A common assumption is that mutations are rare, and by the time a new mutation arises, the previous one has either been fixed or lost from the population. However, more realistic models should account for reversible or even recurrent mutations. In this talk, I compare different mutation models, focusing on their implications for two very important inferential tasks in evolutionary biology: estimating effective population sizes and reconstructing phylogenies. Finally, I will introduce the concept of the distribution of fitness effects, highlight its fundamental role in molecular evolution in describing the fate of new mutations, and discuss my current approach to inferring this distribution using genomic data.

  • 30th October (joint with CREEM): Simon Wood, University of Edinburgh

Title: Neighbourhood Cross Validation and modelling under spatial correlation without a spatial correlation model

Abstract: Cross validation comes in many varieties, but some of the more interesting flavours require multiple model fits with consequently high cost. This talk shows how the high cost can be side-stepped for a wide range of models estimated using a quadratically penalized smooth loss, with rather low approximation error. Once the computational cost has the same leading order as a single model fit, it becomes feasible to efficiently optimize the chosen cross-validation criterion with respect to multiple smoothing/precision parameters. Interesting applications include cross-validating smooth additive quantile regression models, and the use of leave-out-neighbourhood cross validation for dealing with nuisance short range autocorrelation. The link between cross validation and the jackknife can be exploited to obtain reasonably well calibrated uncertainty quantification in these cases.
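
To make the leave-out-neighbourhood idea concrete, here is a deliberately naive sketch (the basis, penalty, and noise process are illustrative assumptions, and it refits once per point, exactly the cost the talk shows how to avoid): dropping a window of nearby observations before predicting each point stops short-range autocorrelation from rewarding over-wiggly fits.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.sort(rng.uniform(0, 1, n))
e = np.zeros(n)
for i in range(1, n):                    # AR(1)-style short-range correlation
    e[i] = 0.8 * e[i - 1] + rng.normal(scale=0.3)
y = np.sin(2 * np.pi * x) + e

# Polynomial basis with a ridge penalty, standing in for a penalised smoother.
X = np.vander(x, 10, increasing=True)

def fit(Xtr, ytr, lam):
    p = Xtr.shape[1]
    return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ ytr)

def ncv_score(lam, halfwidth=5):
    """Leave-out-neighbourhood CV: exclude the 2*halfwidth+1 points nearest
    each target point before predicting it."""
    err = 0.0
    for i in range(n):
        keep = np.abs(np.arange(n) - i) > halfwidth
        beta = fit(X[keep], y[keep], lam)
        err += (y[i] - X[i] @ beta) ** 2
    return err / n

for lam in [1e-4, 1e-2, 1.0]:
    print(lam, round(ncv_score(lam), 4))
```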

  • 6th November (joint with CREEM): Andrew Solow, Woods Hole Oceanographic Institution

Title: The use of sighting records in ecology

Abstract: This talk presents three examples of the use of sighting records of individual animals to address ecological issues:  population declines in the Yangtze river dolphin, the extinction of the Ivory-billed Woodpecker, and the rediscovery of the polecat in Scotland.  Technical material will be kept to a reasonable minimum. 
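
The abstract does not say which methods will be used, but a classical tool in this literature, due to the speaker (Solow, 1993), illustrates the flavour: under a constant sighting rate, n sightings over an observation window (0, T] with the last at time t_n give p = (t_n / T)^n as a p-value for the null hypothesis that the species is still extant. A minimal sketch with a hypothetical record:

```python
def solow_p(sightings, T):
    """Solow's (1993) sighting-record test: p-value for the null hypothesis
    that the species is extant at time T, assuming a constant sighting rate."""
    n, t_n = len(sightings), max(sightings)
    return (t_n / T) ** n

# Hypothetical record: six sightings early in a 25-year survey window.
print(solow_p([1, 2, 3, 5, 6, 7], T=25))   # small p suggests extinction
```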

  • 13th November (joint with CREEM): Regina Bispo, University of St Andrews

Title: Breezes, Blazes, and Stats: A Research Journey

Abstract: In this talk, I will summarize my research on estimating wildlife fatalities at onshore wind farms and, more recently, on modelling the occurrence of both urban and rural fires. 

Understanding the impact of onshore wind farms on avian and bat populations requires mortality estimation. In this context, we want to estimate the number of deaths driven by collision with the wind farm structures. Mortality assessment is typically based on counting detected carcasses underneath turbines. However, there are several sources of uncertainty, including carcass removal (e.g., by scavengers) and the observers’ detection ability. Moreover, mortality rates vary across space and time, influenced by turbine placement and changing collision risks. 

Urban fires remain a major threat, contributing to property damage, physical injury, and loss of life. High population density and socio-economic factors can further amplify fire risk and firefighting costs. Wildfires, on the other hand, represent a global challenge. In Portugal, despite a declining trend in the number of rural fires, the total burned area has increased in recent years, reaching 110,097 hectares in 2022. This shift is tied to a rise in large, intense fires, which are often linked to climate change and result in more extensive environmental impacts and higher socio-economic costs. I will conclude my presentation by sharing some recent ongoing work on modelling the occurrence and size of rural fires in Portugal.

  • 20th November: Sara Wade, University of Edinburgh (rescheduled to 19 February; see below)

  • 27th November (joint with CREEM): April Zhou, Lancaster University

Title: Using Simulation Optimisation to Solve the Reserve Site Selection Problem

Abstract: The Reserve Site Selection (RSS) problem aims to select a combination of sites from potential sites to assemble a reserve that meets specific conservation goals. Traditionally, it is formulated as a mathematical programming problem, which often fails to capture the complexity of ecosystems. Stochastic simulation models can help capture this complexity, but they are typically used in an exploratory way rather than for finding optimal solutions. Simulation Optimisation (SO) overcomes the challenges of both methods by finding optimal solutions via stochastic simulations.

In our research, we formulate the RSS problem as an SO problem with the goal of finding the best combination of sites that not only minimises cost but also ensures that species survival probabilities meet desired thresholds. We use the grey wolf (Canis lupus) as a case study to examine the performance of SO in solving RSS problems.

This talk will cover the problem formulation, the solution methods, and two enhancements aimed at improving these methods.
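
To fix ideas, here is a deliberately tiny sketch of the SO formulation (the costs, habitat qualities, and persistence simulator are all invented): choose the cheapest subset of sites whose simulated species survival probability clears a threshold. Real SO methods replace the exhaustive enumeration below with guided search and handle the noise in the simulated survival estimates explicitly.

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
n_sites = 8
cost = rng.uniform(1, 5, n_sites)           # hypothetical site costs
quality = rng.uniform(0.3, 0.9, n_sites)    # hypothetical habitat quality

def simulate_survival(subset, n_reps=500):
    """Stub stochastic simulator: persistence probability grows with the
    aggregate habitat quality of the selected sites."""
    q = quality[list(subset)].sum()
    return (rng.random(n_reps) < 1 - np.exp(-q)).mean()

best, best_cost = None, np.inf
subsets = itertools.chain.from_iterable(
    itertools.combinations(range(n_sites), k) for k in range(1, n_sites + 1))
for subset in subsets:
    c = cost[list(subset)].sum()
    if c >= best_cost:
        continue                             # cannot beat the incumbent
    if simulate_survival(subset) >= 0.95:    # conservation threshold
        best, best_cost = subset, c

print(best, round(best_cost, 2))
```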

  • 4th December: Sjoerd Victor Beentjes, University of Edinburgh

Title: Semi-parametric efficient estimation of small genetic effects in large-scale population cohorts

Abstract: We present a unified statistical workflow for the semiparametric efficient and doubly robust estimation of n-point interactions amongst categorical variables in the presence of confounding and weak population dependence. N-point interactions, or Average Interaction Effects (AIEs), are a direct generalisation of the usual average treatment effect (ATE). We estimate AIEs with cross-validated and/or weighted versions of Targeted Minimum Loss-based Estimators (TMLE) and One-Step Estimators (OSE). The effect of dependence amongst units on variance estimates is corrected by utilising sieve plateau variance estimators based on a meaningful notion of unit relatedness.

Our motivating application is the targeted estimation of causal genetic effects on traits, including two-point and higher-order gene-gene and gene-environment interactions, in large-scale genomic databases such as UK Biobank and All of Us. Computing millions of estimates in large cohorts, in which small effect sizes are expected, necessitates minimising model-misspecification bias to control false discoveries. We report on significant findings, both replicated and novel, contradicting overconfident findings from the parametric linear mixed models commonly employed in statistical genomics.

All cross-validated and/or weighted TMLE and OSE for the AIE n-point interaction, as well as ATEs, CATEs and functions thereof, are implemented in the general-purpose Julia package TMLE.jl. For high-throughput applications in population genomics, we provide the open-source Nextflow pipeline and software TarGene, which integrates seamlessly with modern high-performance and cloud computing platforms.
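
For readers unfamiliar with this family of estimators, here is a minimal sketch of its simplest member, a one-step (AIPW) estimator of the ATE, on simulated data. The plain parametric nuisance fits below are a simplifying assumption; the workflow described in the talk uses cross-validated, flexible nuisance estimators and targets the more general AIEs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 5000
W = rng.normal(size=(n, 3))                       # confounders
A = rng.binomial(1, 1 / (1 + np.exp(-W[:, 0])))   # treatment depends on W
Y = 1.0 * A + W[:, 0] + rng.normal(size=n)        # true ATE = 1.0

# Nuisance estimates: outcome regression Q and propensity score g.
Q = LinearRegression().fit(np.column_stack([A, W]), Y)
g = LogisticRegression().fit(W, A).predict_proba(W)[:, 1]
Q1 = Q.predict(np.column_stack([np.ones(n), W]))
Q0 = Q.predict(np.column_stack([np.zeros(n), W]))

# One-step estimator: plug-in corrected by the efficient influence function.
eif = (A / g) * (Y - Q1) - ((1 - A) / (1 - g)) * (Y - Q0) + Q1 - Q0
ate, se = eif.mean(), eif.std(ddof=1) / np.sqrt(n)
print(f"ATE = {ate:.3f} +/- {1.96 * se:.3f}")
```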

  • 11th December (joint with CREEM): Graeme MacGilchrist, University of St Andrews 

Title: Timescales and mechanisms of predictability in marine ecosystems

Abstract: Robust predictions of marine ecosystem health on interannual-to-decadal timescales would be valuable for ecosystem and fisheries management. Previous work has shown that important ecosystem parameters such as ocean temperature, primary production, and dissolved oxygen content can have predictability time horizons of up to several years. Here, we present results from a new suite of perfect model experiments run with GFDL’s ESM4 earth system model to assess the theoretical limits and mechanisms of predictability of the ocean’s biogeochemical state. We find that while the time horizon of predictability is several years in many oceanic regions, it is generally shorter than what was found in previous model generations. For net primary production, for example, the global average predictability time horizon is 14 months, in contrast to the 30+ months found in prior work. Thus, by comparing model generations, we are able to assess the impact of ocean circulation and biogeochemical complexity on the intrinsic variability and predictability of ocean ecosystems. Using ensemble initializations in different months and years, we consider the effect of both the seasonal cycle and modes of atmospheric variability (e.g. ENSO) on biogeochemical predictability. Finally, using high temporal resolution diagnostics, we assess limits on the temporal granularity at which robust predictions can be made, i.e. the sensitivity of predictions to the time-averaging of the target period (daily, weekly, monthly, yearly).

  • 29th January: Stephen Senn, Honorary Professor, University of St Andrews

This seminar will be in the Maths Institute, Lecture Theatre C. 

Title: Questions and answers from randomised clinical trials

Abstract: I consider five types of possible question that might be asked of a clinical trial and the answers that we might reasonably expect of them.

  • Q1. Was there an effect of treatment in this trial?
  • Q2. What was the average effect of treatment in this trial?
  • Q3. Was the treatment effect identical for all patients in the trial?
  • Q4. What was the effect of treatment for different subgroups of patients?
  • Q5. What will be the effect of treatment when used more generally (outside of the trial)?[1]

I consider the role of randomisation in addressing the first two, in particular where it is considered important to blind clinical trials, and the general prospects for answering the other three. I argue that covariate balance is not what randomisation can be expected to deliver, and that if it did, the conventional analyses of clinical trials would be wrong.

I argue that representativeness of clinical trials is largely overplayed and that answers to the fifth type of question have more to do with reasonable mechanistic theory and less to do with “representativeness” than has been claimed[2].

Amongst various historical matters I shall cover are why Fisher was right on blinding and Bradford Hill was wrong, and what Yates and Cochran knew about experiments that we seem to have forgotten. I speculate that in teaching statistics we ought to pay more attention to data generating mechanisms.

References

  1. Senn, S.J., Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine, 2004. 23(24): p. 3729-3753.
  2. Uschner, D., et al., Using Randomization Tests to Address Disruptions in Clinical Trials: A Report from the NISS Ingram Olkin Forum Series on Unplanned Clinical Trial Disruptions. Statistics in Biopharmaceutical Research, 2023: p. 1-9.

  • 5 February: Jan-Ole Koslik, Bielefeld University

This seminar will be in the Maths Institute, Tutorial Room 1A at 13:00 – 14:00. 

Title: Efficient smoothness selection for nonparametric Markov-switching models via quasi restricted maximum likelihood estimation

Abstract: Markov-switching models are powerful tools that allow capturing complex patterns from time series data driven by latent states. Recent work has highlighted the benefits of estimating components of these models nonparametrically, enhancing their flexibility and reducing biases, which in turn can improve state decoding, forecasting, and overall inference. Formulating such models using penalised splines is straightforward, but practically feasible methods for data-driven smoothness selection in these models are still lacking. Traditional techniques, such as cross-validation and information criterion-based selection, suffer from major drawbacks, most importantly their reliance on computationally expensive grid search methods, hampering practical usability. Michelot (2022) suggested treating spline coefficients as random effects with a multivariate normal distribution and using the R package TMB (Kristensen et al., 2015) for marginal likelihood maximisation. While this method avoids grid search and typically results in adequate smoothness selection, it entails a nested optimisation problem, thus being computationally demanding. We propose to exploit the simple structure of penalised splines treated as random effects, thereby greatly reducing the computational burden while potentially improving fixed-effects parameter estimation accuracy. The proposed method offers a reliable and efficient mechanism for smoothness selection, rendering the estimation of Markov-switching models involving penalised splines feasible for complex data structures.
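
A rough sketch of the random-effects view the abstract builds on (illustrative only: a Gaussian radial basis stands in for a penalised spline basis, and the criterion is a plain profiled marginal likelihood rather than the quasi-REML variant the talk proposes): treating the spline coefficients as random effects turns smoothness selection into a smooth one-dimensional optimisation, with no grid search.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
n = 150
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(3 * np.pi * x) + rng.normal(scale=0.3, size=n)

# Gaussian radial basis standing in for a penalised spline basis.
knots = np.linspace(0, 1, 20)
Z = np.exp(-((x[:, None] - knots[None, :]) ** 2) / (2 * 0.05 ** 2))

def neg_profile_loglik(log_lam):
    """Negative marginal log-likelihood when coefficients are random effects,
    b ~ N(0, (sigma^2 / lambda) I), with sigma^2 profiled out."""
    lam = np.exp(log_lam)
    V = Z @ Z.T / lam + np.eye(n)
    _, logdet = np.linalg.slogdet(V)
    quad = y @ np.linalg.solve(V, y)
    return n * np.log(quad) + logdet

# Smooth 1-D optimisation of the criterion: no grid search required.
opt = minimize_scalar(neg_profile_loglik, bounds=(-10, 10), method="bounded")
lam = np.exp(opt.x)
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
print(f"selected smoothing parameter: {lam:.3g}")
```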

  • 12 February: Ioana Colfescu, School of Earth & Environmental Sciences, University of St Andrews

Title: Bridging Disciplines: A conversation on harnessing machine learning and big data for effective climate change research-based solutions

Abstract: The seminar aims to introduce the newly formed group of the National Centre for Atmospheric Science (NCAS) at the University of St Andrews to the Centre for Research into Ecological and Environmental Modelling (CREEM), with the goal of fostering new avenues for joint research between these two St Andrews-based research centres.

The presentation will first outline the nature and role of NCAS within the UK research landscape, emphasizing its commitment to understanding the atmosphere, the changes it undergoes, and the resultant impacts on life on Earth. It will further explore the use of big data and atmospheric science techniques, with a focus on new digital methods (i.e. AI and ML), to tackle various research challenges, highlighting how these methods can facilitate a multidisciplinary approach to addressing climate change impacts, particularly on ecosystems.

  • 19 February: Sara Wade, University of Edinburgh

Title: Understanding uncertainty in Bayesian cluster analysis

Abstract: The Bayesian approach to clustering is often appreciated for its ability to provide uncertainty in the partition structure. However, summarizing the posterior distribution over the clustering structure can be challenging. Wade and Ghahramani (2018) proposed to summarize the posterior samples using a single optimal clustering estimate, which minimizes the expected posterior Variation of Information (VI). In instances where the posterior distribution is multimodal, it can be beneficial to summarize the posterior samples using multiple clustering estimates, each corresponding to a different part of the space of partitions that receives substantial posterior mass. In this work, we propose to find such clustering estimates by approximating the posterior distribution in a VI-based Wasserstein distance sense. An interesting byproduct is that this problem can be seen as using the k-medoids algorithm to divide the posterior samples into different groups, each represented by one of the clustering estimates. Using both synthetic and real datasets, we show that our proposal helps to improve the understanding of uncertainty, particularly when the data clusters are not well separated, or when the employed model is misspecified.
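
A minimal sketch of the two ingredients, VI between partitions and k-medoids over posterior samples under a VI distance matrix (the "posterior samples" below are fabricated for illustration; this is not the authors' implementation):

```python
import numpy as np

def vi(c1, c2):
    """Variation of Information between two partitions (label vectors)."""
    n = len(c1)
    cont = np.array([[np.sum((c1 == a) & (c2 == b)) for b in np.unique(c2)]
                     for a in np.unique(c1)]) / n
    p1, p2 = cont.sum(axis=1), cont.sum(axis=0)
    h1, h2 = -np.sum(p1 * np.log(p1)), -np.sum(p2 * np.log(p2))
    with np.errstate(divide="ignore", invalid="ignore"):
        mi = np.nansum(cont * np.log(cont / np.outer(p1, p2)))
    return h1 + h2 - 2 * mi

def k_medoids(dist, k, iters=20, seed=0):
    """Plain k-medoids on a precomputed distance matrix: each medoid is the
    sample minimising total distance to the other members of its group."""
    medoids = np.random.default_rng(seed).choice(len(dist), k, replace=False)
    for _ in range(iters):
        assign = np.argmin(dist[:, medoids], axis=1)
        for j in range(k):
            members = np.flatnonzero(assign == j)
            if len(members):
                within = dist[np.ix_(members, members)].sum(axis=1)
                medoids[j] = members[np.argmin(within)]
    return medoids, assign

# Fabricated "posterior samples": label vectors concentrated near two modes.
samples = [np.array(s) for s in
           [[0, 0, 0, 1, 1, 1]] * 10 + [[0, 0, 1, 1, 2, 2]] * 10]
samples[3][2] = 1                                # a little posterior noise
D = np.array([[vi(a, b) for b in samples] for a in samples])
medoids, _ = k_medoids(D, k=2)
print([samples[m].tolist() for m in medoids])    # two representative partitions
```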

  • 12 March: William Smith

Title: Doctoral Students as Carbon Accountants: Calculating Carbon Costs of a PhD in Neuroscience

Abstract: PhD students are drivers of innovation in research; however, the carbon intensity of PhD work is often unclear, especially in specialised STEM disciplines. Over 250,000 doctoral students graduate annually across all academic disciplines; empowering this community to engage in carbon accounting could create a generational force for decarbonisation in key areas of production and consumption in research communities. Here, we demonstrate how doctoral students and other researchers can measure the carbon footprint of their work, using one PhD student in a Drosophila neuroscience laboratory as a case study. We propose a common framework for including carbon life-cycle analyses in Carbon Appendices to PhD theses and other publications. We envision doctoral students carrying insights from Carbon Appendices forward into academia and industry to catalyse community-driven decarbonisation of the research sector.
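
As background on the accounting pattern the abstract alludes to, the sketch below shows the basic activity-data-times-emission-factor calculation underlying any such footprint. All quantities and factors are invented placeholders, not figures from the study; real analyses use published life-cycle emission factors.

```python
# Hypothetical annual activity data for one PhD student, with made-up
# emission factors: footprint = sum(activity amount * kgCO2e per unit).
activities = {
    "electricity_kwh": (3500, 0.25),
    "flights_km":      (4000, 0.15),
    "consumables_gbp": (2000, 0.50),
}
total_kg = sum(amount * factor for amount, factor in activities.values())
print(f"annual footprint: {total_kg / 1000:.1f} tCO2e")
```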

  • 9 April: Catriona Harris, CREEM, University of St Andrews

Title: Marine mammals and sonar: The past, present, and future of behavioural response studies

Abstract: Research into the behavioural responses of marine mammals to naval sonar exposure has been funded by US and European Navies for around two decades.  As behavioural response studies (BRS) have evolved over this time period, with new technologies and study designs, the analytical challenges have also evolved.  I will talk through some of our solutions to these challenges, particularly as they relate to animal-borne tag data, the detection of behavioural responses, and relating responses to levels of sound exposure.  I will also give a summary of a recent review of the status of BRS science and highlight some of the outstanding analytical challenges.

  • 16 April: Simon Wood, University of Edinburgh

Title: Covid, Risk and Statistics

Abstract: In many respects the Covid pandemic upended the usual evidence-based approach to public health in favour of a rapidly developed alternative approach to risk communication and assessment that would previously, or in other contexts, have been viewed as unusual and potentially damaging. This talk discusses some of the statistical deficiencies of the Covid consensus that emerged, in particular with regard to Covid and economically mediated health risks, excess deaths, epidemic modelling, and the necessity or otherwise of lockdowns.

Previous academic years

Seminars from previous academic years (since 2022) are listed here.