Seminars
Statistics seminars are held on Wednesdays, 14:00–15:00. Everyone is welcome! We gather for coffee/tea and biscuits around 15 minutes before the seminar begins.
The organiser is Ben Baer. Please contact Ben to find out more about the seminars, to suggest a future seminar speaker, or to ask about joining seminars online.
Most of the seminars this year will be held in person, with a few held online. The in-person seminars will take place in the Observatory seminar room. Please see below for more details.
Forthcoming statistics seminars 2024-25
Semester 1
- 11th December (joint with CREEM): Graeme MacGilchrist, University of St Andrews
Title: Timescales and mechanisms of predictability in marine ecosystems
Abstract: Robust predictions of marine ecosystem health on interannual-to-decadal timescales would be valuable for ecosystem and fisheries management. Previous work has shown that important ecosystem parameters, such as ocean temperature, primary production, and dissolved oxygen content, can have predictability time horizons of up to several years. Here, we present results from a new suite of perfect model experiments run with GFDL’s ESM4 earth system model to assess the theoretical limits and mechanisms of predictability of the ocean’s biogeochemical state. We find that while the time horizon of predictability is several years in many oceanic regions, it is generally shorter than what was found in previous model generations. For net primary production, for example, the global average predictability time horizon is 14 months, in contrast to the 30+ months found in prior work. Thus, by comparing model generations, we are able to assess the impact of ocean circulation and biogeochemical complexity on the intrinsic variability and predictability of ocean ecosystems. Using ensemble initializations in different months and years, we consider the effect of both the seasonal cycle and modes of atmospheric variability (e.g. ENSO) on biogeochemical predictability. Finally, using high temporal resolution diagnostics, we assess limits in the temporal granularity at which robust predictions can be made, i.e. the sensitivity of predictions to the time-averaging of the target period (daily, weekly, monthly, yearly).
Semester 2
- 29th January: Stephen Senn, Honorary Professor, University of St Andrews
Title: Questions and answers from randomised clinical trials
Abstract: I consider five types of question that might be asked of a clinical trial and the answers that we might reasonably expect of them.
- Q1. Was there an effect of treatment in this trial?
- Q2. What was the average effect of treatment in this trial?
- Q3. Was the treatment effect identical for all patients in the trial?
- Q4. What was the effect of treatment for different subgroups of patients?
- Q5. What will be the effect of treatment when used more generally (outside of the trial)?[1]
I consider the role of randomisation in addressing the first two questions, in particular where it is considered important to blind clinical trials, and the general prospects for answering the other three. I argue that covariate balance is not what randomisation can be expected to deliver, and that if it did deliver balance, the conventional analyses of clinical trials would be wrong.
I argue that the representativeness of clinical trials is largely overplayed and that answers to the fifth type of question have more to do with reasonable mechanistic theory and less to do with “representativeness” than has been claimed[2].
Amongst various historical matters I shall cover are why Fisher was right on blinding and Bradford Hill was wrong, and what Yates and Cochran knew about experiments that we seem to have forgotten. I speculate that in teaching statistics we ought to pay more attention to data generating mechanisms.
The lecture will be followed by a reception to welcome Professor Senn to the School.
References
- Senn, S.J., Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine, 2004. 23(24): pp. 3729–3753.
- Uschner, D., et al., Using Randomization Tests to Address Disruptions in Clinical Trials: A Report from the NISS Ingram Olkin Forum Series on Unplanned Clinical Trial Disruptions. Statistics in Biopharmaceutical Research, 2023: pp. 1–9.
Past seminars
This academic year
- 18th September (joint with CREEM): Hannah Worthington, University of St Andrews
Title: Capture-recapture Models: A lifetime expectation perspective
Abstract: Capture-recapture(-recovery) models featuring time- and age-dependent parameters are commonly used to offer biologically reasonable structures for features of a population. In particular, survival probabilities are often strongly linked to age, for example showing high mortality in young and old individuals, or different survival probabilities for different age classes (e.g. first-year, sub-adult, breeding adult, etc.). Unfortunately, fully age-dependent models, which allow for a different probability of survival in each year of life, tend to require estimating very large numbers of parameters. We propose taking a semi-Markov approach to offer a straightforward mechanism for including an age component in survival whilst requiring far fewer parameters. Instead of considering this problem from the perspective of survival from one year to the next, we consider the distribution of the age at death. However, adding further temporal elements to account for adverse or favourable environmental conditions creates some difficulties. I’ll present our current ideas, which look to embed a random walk structure into the model to overcome these challenges.
- 25th September: Nguyen Dang, University of St Andrews
Title: Reinforcement Learning for Dynamic Algorithm Configuration
Abstract: Most algorithms have parameters that need to be tuned to achieve the best performance. In some cases, instead of finding the best static parameter setting for an algorithm, it is highly beneficial to adapt the parameter values while the algorithm is running. Dynamic Algorithm Configuration (DAC) focuses on developing techniques to solve this task in an automated and data-driven fashion. The aim is to learn a policy that maps from the current state of the algorithm to the best parameter value for that state during the solving process. DAC is an emerging topic with many potential applications across domains. Given the dynamic nature of the task, Reinforcement Learning (RL) seems like a suitable family of techniques for tackling DAC problems. However, research on DAC methods is still in its early stages. It is not clear whether RL methods, which were originally developed for other applications such as robotics and game playing, are effective in DAC contexts. In this talk, I will give a brief introduction to DAC and present our recent study on benchmarking a commonly used RL algorithm on DAC.
- 23rd October: Rui Borges, University of St Andrews
Title: A rant about mutation models in population genetics
Abstract: Mutations are essential drivers of evolution, and their mathematical modeling in population genetics depends on how we perceive their frequency and the timescales at which they occur. A common assumption is that mutations are rare, and by the time a new mutation arises, the previous one has either been fixed or lost from the population. However, more realistic models should account for reversible or even recurrent mutations. In this talk, I compare different mutation models, focusing on their implications for two very important inferential tasks in evolutionary biology: estimating effective population sizes and reconstructing phylogenies. Finally, I will introduce the concept of the distribution of fitness effects, highlight its fundamental role in molecular evolution as a description of the fate of new mutations, and discuss my current approach to inferring this distribution using genomic data.
- 30th October (joint with CREEM): Simon Wood, University of Edinburgh
Title: Neighbourhood Cross Validation and modelling under spatial correlation without a spatial correlation model
Abstract: Cross validation comes in many varieties, but some of the more interesting flavours require multiple model fits with consequently high cost. This talk shows how the high cost can be side-stepped for a wide range of models estimated using a quadratically penalized smooth loss, with rather low approximation error. Once the computational cost has the same leading order as a single model fit, it becomes feasible to efficiently optimize the chosen cross-validation criterion with respect to multiple smoothing/precision parameters. Interesting applications include cross-validating smooth additive quantile regression models, and the use of leave-out-neighbourhood cross validation for dealing with nuisance short range autocorrelation. The link between cross validation and the jackknife can be exploited to obtain reasonably well calibrated uncertainty quantification in these cases.
- 6th November (joint with CREEM): Andrew Solow, Woods Hole Oceanographic Institution
Title: The use of sighting records in ecology
Abstract: This talk presents three examples of the use of sighting records of individual animals to address ecological issues: population declines in the Yangtze River dolphin, the extinction of the Ivory-billed Woodpecker, and the rediscovery of the polecat in Scotland. Technical material will be kept to a reasonable minimum.
- 13th November (joint with CREEM): Regina Bispo, University of St Andrews
Title: Breezes, Blazes, and Stats: A Research Journey
Abstract: In this talk, I will summarize my research on estimating wildlife fatalities at onshore wind farms and, more recently, on modelling the occurrence of both urban and rural fires.
Understanding the impact of onshore wind farms on avian and bat populations requires mortality estimation. In this context, we want to estimate the number of deaths caused by collision with wind farm structures. Mortality assessment is typically based on counting detected carcasses underneath turbines. However, there are several sources of uncertainty, including carcass removal (e.g., by scavengers) and the observers’ detection ability. Moreover, mortality rates vary across space and time, influenced by turbine placement and changing collision risks.
Urban fires remain a major threat, contributing to property damage, physical injury, and loss of life. High population density and socio-economic factors can further amplify fire risk and firefighting costs. Wildfires, on the other hand, represent a global challenge. In Portugal, despite a declining trend in the number of rural fires, the total burned area has increased in recent years, reaching 110,097 hectares in 2022. This shift is tied to a rise in large, intense fires, which are often linked to climate change and result in more extensive environmental impacts and higher socio-economic costs. I will conclude my presentation by sharing some recent ongoing work on modelling the occurrence and size of rural fires in Portugal.
- 20th November: Sara Wade (rescheduled)
Abstract: The Bayesian approach to clustering is often appreciated for its ability to provide uncertainty in the partition structure. However, summarizing the posterior distribution over the clustering structure can be challenging. Wade and Ghahramani (2018) proposed to summarize the posterior samples using a single optimal clustering estimate, which minimizes the expected posterior Variation of Information (VI). In instances where the posterior distribution is multimodal, it can be beneficial to summarize the posterior samples using multiple clustering estimates, each corresponding to a different part of the space of partitions that receives substantial posterior mass. In this work, we propose to find such clustering estimates by approximating the posterior distribution in a VI-based Wasserstein distance sense. An interesting byproduct is that this problem can be seen as using the k-medoids algorithm to divide the posterior samples into different groups, each represented by one of the clustering estimates. Using both synthetic and real datasets, we show that our proposal helps to improve the understanding of uncertainty, particularly when the data clusters are not well separated, or when the employed model is misspecified.
- 27th November (joint with CREEM): April Zhou, Lancaster University
Title: Using Simulation Optimisation to Solve the Reserve Site Selection Problem
Abstract: The Reserve Site Selection (RSS) problem aims to select a combination of sites from a set of potential sites to assemble a reserve that meets specific conservation goals. Traditionally, it is formulated as a mathematical programming problem, which often fails to capture the complexity of ecosystems. Stochastic simulation models can help capture this complexity, but they are typically used in an exploratory way rather than for finding optimal solutions. Simulation Optimisation (SO) overcomes the challenges of both methods by finding optimal solutions via stochastic simulations.
In our research, we formulate the RSS problem as an SO problem with the goal of finding the best combination of sites that not only minimises cost but also ensures that species survival probabilities meet desired thresholds. We use the grey wolf (Canis lupus) as a case study to examine the performance of SO in solving RSS problems.
This talk will cover the problem formulation, the solution methods, and two enhancements aimed at improving these methods.
- 4th December: Sjoerd Victor Beentjes, University of Edinburgh
Title: Semi-parametric efficient estimation of small genetic effects in large-scale population cohorts
Abstract: We present a unified statistical workflow for the semiparametric efficient and doubly robust estimation of n-point interactions amongst categorical variables in the presence of confounding and weak population dependence. N-point interactions, or Average Interaction Effects (AIEs), are a direct generalisation of the usual average treatment effect (ATE). We estimate AIEs with cross-validated and/or weighted versions of Targeted Minimum Loss-based Estimators (TMLE) and One-Step Estimators (OSE). The effect of dependence amongst units on variance estimates is corrected by utilising sieve plateau variance estimators based on a meaningful notion of unit relatedness.
Our motivating application is the targeted estimation of causal genetic effects on traits, including two-point and higher-order gene-gene and gene-environment interactions, in large-scale genomic databases such as UK Biobank and All of Us. Computing millions of estimates in large cohorts in which small effect sizes are expected necessitates minimising model-misspecification bias to control false discoveries. We report significant findings, both replicated and novel, that contradict overconfident findings from the parametric linear mixed models commonly employed in statistical genomics.
All cross-validated and/or weighted TMLEs and OSEs for the AIE n-point interactions, as well as ATEs, CATEs and functions thereof, are implemented in the general-purpose Julia package TMLE.jl. For high-throughput applications in population genomics, we provide the open-source Nextflow pipeline and software TarGene, which integrates seamlessly with modern high-performance and cloud computing platforms.
Previous academic years
Seminars from previous academic years (since 2022) are listed here.