publications | Lorenzo Pacchiardi

2025

arXiv
Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents

Irene Testini* , José Hernández-Orallo , and Lorenzo Pacchiardi*

arXiv preprint arXiv:2506.08800, 2025

Data science aims to extract insights from data to support decision-making processes. Recently, Large Language Models (LLMs) have been increasingly used as assistants for data science, suggesting ideas, techniques and small code snippets, or helping with the interpretation of results and reporting. Proper automation of some data-science activities is now promised by the rise of LLM agents, i.e., AI systems powered by an LLM equipped with additional affordances (such as code execution and knowledge bases) that can perform self-directed actions and interact with digital environments. In this paper, we survey the evaluation of LLM assistants and agents for data science. We find (1) a dominant focus on a small subset of goal-oriented activities, largely ignoring data management and exploratory activities; (2) a concentration on pure assistance or fully autonomous agents, without considering intermediate levels of human-AI collaboration; and (3) an emphasis on human substitution, therefore neglecting the possibility of higher levels of automation thanks to task transformation.
@article{testini2025measuringdatascienceautomation,
  title = {{Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents}},
  author = {Testini*, Irene and Hernández-Orallo, José and Pacchiardi*, Lorenzo},
  year = {2025},
  eprint = {2506.08800},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
  journal = {arXiv preprint arXiv:2506.08800},
  url = {https://arxiv.org/abs/2506.08800},
}
arXiv
General Scales Unlock AI Evaluation with Explanatory and Predictive Power

Lexin Zhou , Lorenzo Pacchiardi , Fernando Martínez-Plumed , Katherine M. Collins , Yael Moros-Daval , Seraphina Zhang , Qinlin Zhao , Yitian Huang , Luning Sun , Jonathan E. Prunty , Zongqian Li , Pablo Sánchez-García , Kexin Jiang Chen , Pablo A. M. Casares , Jiyun Zu , John Burden , Behzad Mehrbakhsh , David Stillwell , Manuel Cebrian , Jindong Wang , Peter Henderson , Sherry Tongshuang Wu , Patrick C. Kyllonen , Lucy Cheke , Xing Xie , and José Hernández-Orallo

arXiv preprint arXiv:2503.06378, 2025

Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate. Illustrated for 15 large language models and 63 tasks, high explanatory power is unleashed from inspecting the demand and ability profiles, bringing insights on the sensitivity and specificity exhibited by different benchmarks, and how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, high predictive power at the instance level becomes possible using these demand levels, providing superior estimates over black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead.
@article{zhou2025generalscalesunlockai,
  title = {{General Scales Unlock AI Evaluation with Explanatory and Predictive Power}},
  author = {Zhou, Lexin and Pacchiardi, Lorenzo and Martínez-Plumed, Fernando and Collins, Katherine M. and Moros-Daval, Yael and Zhang, Seraphina and Zhao, Qinlin and Huang, Yitian and Sun, Luning and Prunty, Jonathan E. and Li, Zongqian and Sánchez-García, Pablo and Chen, Kexin Jiang and Casares, Pablo A. M. and Zu, Jiyun and Burden, John and Mehrbakhsh, Behzad and Stillwell, David and Cebrian, Manuel and Wang, Jindong and Henderson, Peter and Wu, Sherry Tongshuang and Kyllonen, Patrick C. and Cheke, Lucy and Xie, Xing and Hernández-Orallo, José},
  year = {2025},
  eprint = {2503.06378},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
  journal = {arXiv preprint arXiv:2503.06378},
  url = {https://kinds-of-intelligence-cfi.github.io/ADELE/},
}
ACL Findings
PredictaBoard: Benchmarking LLM Score Predictability

Lorenzo Pacchiardi , Konstantinos Voudouris , Ben Slater , Fernando Martínez-Plumed , José Hernández-Orallo , Lexin Zhou , and Wout Schellaert

Findings of the Association for Computational Linguistics: ACL 2025, 2025

Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different tolerance errors. As such, PredictaBoard stimulates research into developing better assessors and making LLMs more predictable, not only with a higher average performance. We conduct illustrative experiments using baseline assessors and state-of-the-art LLMs. PredictaBoard highlights the critical need to evaluate predictability alongside performance, paving the way for safer AI systems where errors are not only minimised but also anticipated and effectively mitigated.
@article{pacchiardi2025predictaboardbenchmarkingllmscore,
  title = {{PredictaBoard}: Benchmarking {LLM} Score Predictability},
  author = {Pacchiardi, Lorenzo and Voudouris, Konstantinos and Slater, Ben and Martínez-Plumed, Fernando and Hernández-Orallo, José and Zhou, Lexin and Schellaert, Wout},
  year = {2025},
  eprint = {2502.14445},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  journal = {Findings of the Association for Computational Linguistics: ACL 2025},
  url = {https://predictaboard.github.io/},
  dataset = {https://huggingface.co/collections/kvoudouris/predictaboard-67b6042ee09a99a3b0bbebd0}
}
IJCAI
Paradigms of AI Evaluation: Mapping Goals, Methodologies and Culture

John Burden* , Marko Tešić* , Lorenzo Pacchiardi* , and José Hernández-Orallo

IJCAI 2025 Survey Track, 2025

Research in AI evaluation has grown increasingly complex and multidisciplinary, attracting researchers with diverse backgrounds and objectives. As a result, divergent evaluation paradigms have emerged, often developing in isolation, adopting conflicting terminologies, and overlooking each other’s contributions. This fragmentation has led to insular research trajectories and communication barriers both among different paradigms and with the general public, contributing to unmet expectations for deployed AI systems. To help bridge this insularity, in this paper we survey recent work in the AI evaluation landscape and identify six main paradigms. We characterise major recent contributions within each paradigm across key dimensions related to their goals, methodologies and research cultures. By clarifying the unique combination of questions and approaches associated with each paradigm, we aim to increase awareness of the breadth of current evaluation approaches and foster cross-pollination between different paradigms. We also identify potential gaps in the field to inspire future research directions.
@article{burden2025paradigmsaievaluationmapping,
  title = {Paradigms of {AI} Evaluation: Mapping Goals, Methodologies and Culture},
  author = {Burden*, John and Tešić*, Marko and Pacchiardi*, Lorenzo and Hernández-Orallo, José},
  year = {2025},
  eprint = {2502.15620},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
  journal = {IJCAI 2025 Survey Track},
  url = {https://arxiv.org/abs/2502.15620},
}

2024

arXiv
Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers

Lorenzo Pacchiardi* , Marko Tesic* , Lucy G. Cheke , and José Hernández-Orallo

arXiv preprint arXiv:2410.11672, 2024

The integrity of AI benchmarks is fundamental to accurately assessing the capabilities of AI systems. The internal validity of these benchmarks (i.e., making sure they are free from confounding factors) is crucial for ensuring that they are measuring what they are designed to measure. In this paper, we explore a key issue related to internal validity: the possibility that AI systems can solve benchmarks in unintended ways, bypassing the capability being tested. This phenomenon, widely known in human and animal experiments, is often referred to as the 'Clever Hans' effect, where tasks are solved using spurious cues, often involving much simpler processes than those putatively assessed. Previous research suggests that language models can exhibit this behaviour as well. In several older Natural Language Processing (NLP) benchmarks, individual n-grams like "not" have been found to be highly predictive of the correct labels, and supervised NLP models have been shown to exploit these patterns. In this work, we investigate the extent to which simple n-grams extracted from benchmark instances can be combined to predict labels in modern multiple-choice benchmarks designed for LLMs, and whether LLMs might be using such n-gram patterns to solve these benchmarks. We show how simple classifiers trained on these n-grams can achieve high scores on several benchmarks, despite lacking the capabilities being tested. Additionally, we provide evidence that modern LLMs might be using these superficial patterns to solve benchmarks. This suggests that the internal validity of these benchmarks may be compromised and caution should be exercised when interpreting LLM performance results on them.
@article{pacchiardi2024leavingbarndooropen,
  title = {Leaving the barn door open for {C}lever {H}ans: Simple features predict {LLM} benchmark answers},
  author = {Pacchiardi*, Lorenzo and Tesic*, Marko and Cheke, Lucy G. and Hern\'{a}ndez-Orallo, Jos\'{e}},
  year = {2024},
  eprint = {2410.11672},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  journal = {arXiv preprint arXiv:2410.11672},
  url = {https://arxiv.org/abs/2410.11672},
}
arXiv
100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances

Lorenzo Pacchiardi , Lucy G. Cheke , and José Hernández-Orallo

arXiv preprint arXiv:2409.03563, 2024

Predicting the performance of LLMs on individual task instances is essential to ensure their reliability in high-stakes applications. To do so, a possibility is to evaluate the considered LLM on a set of task instances and train an assessor to predict its performance based on features of the instances. However, this approach requires evaluating each new LLM on a sufficiently large set of task instances to train an assessor specific to it. In this work, we leverage the evaluation results of previously tested LLMs to reduce the number of evaluations required to predict the performance of a new LLM. In practice, we propose to test the new LLM on a small set of reference instances and train a generic assessor which predicts the performance of the LLM on an instance based on the performance of the former on the reference set and features of the instance of interest. We conduct empirical studies on HELM-Lite and KindsOfReasoning, a collection of existing reasoning datasets that we introduce, where we evaluate all instruction-fine-tuned OpenAI models until the January 2024 version of GPT4. When predicting performance on instances with the same distribution as those used to train the generic assessor, we find this achieves performance comparable to the LLM-specific assessors trained on the full set of instances. Additionally, we find that randomly selecting the reference instances performs as well as some advanced selection methods we tested. For out of distribution, however, no clear winner emerges and the overall performance is worse, suggesting that the inherent predictability of LLMs is low.
@article{pacchiardi2024100instancesneedpredicting,
  title = {100 instances is all you need: predicting the success of a new {LLM} on unseen data by testing on a few instances},
  author = {Pacchiardi, Lorenzo and Cheke, Lucy G. and Hern\'{a}ndez-Orallo, Jos\'{e}},
  year = {2024},
  eprint = {2409.03563},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  journal = {arXiv preprint arXiv:2409.03563},
  url = {https://arxiv.org/abs/2409.03563},
}
ICLR 2024
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Lorenzo Pacchiardi* , Alex J Chan* , Sören Mindermann , Ilan Moscovitz , Alexa Y Pan , Yarin Gal , Owain Evans , and Jan Brauner

The Twelfth International Conference on Learning Representations, 2024

Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM’s activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM’s yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting – prompting GPT-3.5 to lie about factual questions – the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
@article{pacchiardi2023catch,
  title = {How to Catch an {AI} Liar: Lie Detection in Black-Box {LLM}s by Asking Unrelated Questions},
  author = {Pacchiardi*, Lorenzo and Chan*, Alex J and Mindermann, S{\"o}ren and Moscovitz, Ilan and Pan, Alexa Y and Gal, Yarin and Evans, Owain and Brauner, Jan},
  journal = {The Twelfth International Conference on Learning Representations},
  year = {2024},
  url = {https://openreview.net/forum?id=567BjxgaTp},
}
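
The detector described in the abstract above (yes/no answers to a fixed set of follow-up questions, fed into a logistic regression classifier) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the number of questions and the answer probabilities are invented, and the plain gradient-descent fit stands in for whatever training procedure the paper actually uses.

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.5, n_steps=2000):
    """Fit weights and bias by gradient descent on the mean log-loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(lie)
        w -= lr * (X.T @ (p - y)) / n
        b -= lr * np.mean(p - y)
    return w, b

def predict_lie_probability(X, w, b):
    """Probability that each row of yes/no answers follows a lie."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Synthetic stand-in for real transcripts (hypothetical numbers):
# each row holds the yes(1)/no(0) answers to 8 follow-up questions.
rng = np.random.default_rng(0)
n_questions, n_per_class = 8, 200
p_yes_after_truth = 0.8  # assumed, for illustration only
p_yes_after_lie = 0.2    # assumed, for illustration only

X_truth = (rng.random((n_per_class, n_questions)) < p_yes_after_truth).astype(float)
X_lie = (rng.random((n_per_class, n_questions)) < p_yes_after_lie).astype(float)
X = np.vstack([X_truth, X_lie])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])  # 1 = lie

w, b = fit_logistic_regression(X, y)
accuracy = np.mean((predict_lie_probability(X, w, b) > 0.5) == y)
print(f"training accuracy on synthetic data: {accuracy:.2f}")
```

In a real setting the rows of `X` would come from querying the LLM, and accuracy would be measured on held-out conversations rather than on the training set.
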
JMLR
Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization

Lorenzo Pacchiardi , Rilwan Adewoyin , Peter Dueben , and Ritabrata Dutta

Journal of Machine Learning Research, 2024

Probabilistic forecasting relies on past observations to provide a probability distribution for a future outcome, which is often evaluated against the realization using a scoring rule. Here, we perform probabilistic forecasting with generative neural networks, which parametrize distributions on high-dimensional spaces by transforming draws from a latent variable. Generative networks are typically trained in an adversarial framework. In contrast, we propose to train generative networks to minimize a predictive-sequential (or prequential) scoring rule on a recorded temporal sequence of the phenomenon of interest, which is appealing as it corresponds to the way forecasting systems are routinely evaluated. Adversarial-free minimization is possible for some scoring rules; hence, our framework avoids the cumbersome hyperparameter tuning and uncertainty underestimation due to unstable adversarial training, thus unlocking reliable use of generative networks in probabilistic forecasting. Further, we prove consistency of the minimizer of our objective with dependent data, while adversarial training assumes independence. We perform simulation studies on two chaotic dynamical models and a benchmark data set of global weather observations; for this last example, we define scoring rules for spatial data by drawing from the relevant literature. Our method outperforms state-of-the-art adversarial approaches, especially in probabilistic calibration, while requiring less hyperparameter tuning.
@article{pacchiardi2021probabilistic,
  author = {Pacchiardi, Lorenzo and Adewoyin, Rilwan and Dueben, Peter and Dutta, Ritabrata},
  title = {Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization},
  journal = {Journal of Machine Learning Research},
  year = {2024},
  volume = {25},
  number = {45},
  pages = {1-64},
  url = {https://jmlr.org/papers/v25/23-0038.html},
}
EJS
Generalized Bayesian likelihood-free inference

Lorenzo Pacchiardi , Sherman Khoo , and Ritabrata Dutta

Electronic Journal of Statistics, 2024

Generalized Bayesian inference replaces the likelihood in the Bayesian posterior with the exponential of a loss function connecting parameter values and observations. As a loss function, it is possible to use Scoring Rules (SRs), which evaluate the match between the observation and the probabilistic model for given parameter values. In this work, we leverage this Scoring Rule posterior for Bayesian Likelihood-Free Inference (LFI). In LFI, we can sample from the model but not evaluate the likelihood; hence, we use the Energy and Kernel SRs in the SR posterior, as they admit unbiased empirical estimates. While traditional Pseudo-Marginal (PM) Markov Chain Monte Carlo (MCMC) can be applied to the SR posterior, it mixes poorly for concentrated targets, such as those obtained with many observations. As such, we propose to use Stochastic Gradient (SG) MCMC, which improves performance over PM-MCMC and scales to higher-dimensional setups as it is rejection-free. SG-MCMC requires differentiating the simulator model; we achieve this effortlessly by implementing the simulator models using automatic differentiation libraries. We compare SG-MCMC sampling for the SR posterior with related LFI approaches and find that the former scales to larger sample sizes and works well on the raw data, while other methods require determining suitable summary statistics. On a chaotic dynamical system from meteorology, our method even allows inferring the parameters of a neural network used to parametrize a part of the update equations.
@article{10.1214/24-EJS2283,
  author = {Pacchiardi, Lorenzo and Khoo, Sherman and Dutta, Ritabrata},
  title = {{Generalized Bayesian likelihood-free inference}},
  volume = {18},
  journal = {Electronic Journal of Statistics},
  number = {2},
  publisher = {Institute of Mathematical Statistics and Bernoulli Society},
  pages = {3628 -- 3686},
  keywords = {generalized Bayes, likelihood-free inference, pseudo-marginal MCMC, scoring rules},
  year = {2024},
  doi = {10.1214/24-EJS2283},
  url = {https://doi.org/10.1214/24-EJS2283},
}
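
The abstract above notes that the Energy scoring rule admits an unbiased empirical estimate from model simulations. A minimal sketch of such an estimator follows, assuming one common convention for the Energy Score, S(P, y) = 2 E||X - y||^beta - E||X - X'||^beta with X, X' independent draws from P. The function name and the toy Gaussian samples are hypothetical, and the paper's own normalisation may differ.

```python
import numpy as np

def energy_score_estimate(samples, y, beta=1.0):
    """Unbiased estimate of 2 E||X - y||^beta - E||X - X'||^beta.

    samples: array of shape (m, d) with m >= 2 draws from the model.
    y: array of shape (d,), the observation.
    Lower values indicate a better match (under this sign convention).
    """
    samples = np.asarray(samples, dtype=float)
    y = np.asarray(y, dtype=float)
    m = samples.shape[0]
    # First term: average distance between simulations and the observation.
    dist_to_obs = np.linalg.norm(samples - y, axis=1) ** beta
    # Second term: average distance over ordered pairs j != k.
    # The diagonal of the pairwise matrix is zero, so summing everything
    # and dividing by m * (m - 1) excludes the j == k terms.
    diffs = samples[:, None, :] - samples[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=2) ** beta
    return 2.0 * dist_to_obs.mean() - pairwise.sum() / (m * (m - 1))

# Toy check with synthetic draws: an observation near the bulk of the
# samples should score lower than one far away.
rng = np.random.default_rng(0)
draws = rng.normal(0.0, 1.0, size=(500, 2))
score_near = energy_score_estimate(draws, np.array([0.0, 0.0]))
score_far = energy_score_estimate(draws, np.array([5.0, 5.0]))
print(score_near, score_far)
```

In a Scoring Rule posterior, an estimate of this kind would take the place of the log-likelihood term, with fresh simulations drawn for each candidate parameter value.
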

2022

arXiv
Likelihood-Free Inference with Generative Neural Networks via Scoring Rule Minimization

Lorenzo Pacchiardi , and Ritabrata Dutta

arXiv preprint arXiv:2205.15784, 2022

Bayesian Likelihood-Free Inference methods yield posterior approximations for simulator models with intractable likelihood. Recently, many works trained neural networks to approximate either the intractable likelihood or the posterior directly. Most proposals use normalizing flows, namely neural networks parametrizing invertible maps used to transform samples from an underlying base measure; the probability density of the transformed samples is then accessible and the normalizing flow can be trained via maximum likelihood on simulated parameter-observation pairs. A recent work [Ramesh et al., 2022] approximated instead the posterior with generative networks, which drop the invertibility requirement and are thus a more flexible class of distributions scaling to high-dimensional and structured data. However, generative networks only allow sampling from the parametrized distribution; for this reason, Ramesh et al. [2022] follows the common solution of adversarial training, where the generative network plays a min-max game against a "critic" network. This procedure is unstable and can lead to a learned distribution underestimating the uncertainty - in extreme cases collapsing to a single point. Here, we propose to approximate the posterior with generative networks trained by Scoring Rule minimization, an overlooked adversarial-free method enabling smooth training and better uncertainty quantification. In simulation studies, the Scoring Rule approach yields better performances with shorter training time with respect to the adversarial framework.
@article{pacchiardi2022likelihood,
  title = {Likelihood-Free Inference with Generative Neural Networks via Scoring Rule Minimization},
  author = {Pacchiardi, Lorenzo and Dutta, Ritabrata},
  journal = {arXiv preprint arXiv:2205.15784},
  year = {2022},
  eprint = {2205.15784},
  archiveprefix = {arXiv},
  url = {https://arxiv.org/abs/2205.15784},
  doi = {10.48550/ARXIV.2205.15784},
  primaryclass = {stat.CO},
  keywords = {Computation (stat.CO), Machine Learning (cs.LG), Methodology (stat.ME), Machine Learning (stat.ML), FOS: Computer and information sciences},
  publisher = {arXiv},
  copyright = {arXiv.org perpetual, non-exclusive license},
}
JMLR
Score Matched Neural Exponential Families for Likelihood-Free Inference

Lorenzo Pacchiardi , and Ritabrata Dutta

Journal of Machine Learning Research, 2022

Bayesian Likelihood-Free Inference (LFI) approaches make it possible to obtain posterior distributions for stochastic models with intractable likelihood, by relying on model simulations. In Approximate Bayesian Computation (ABC), a popular LFI method, summary statistics are used to reduce data dimensionality. ABC algorithms adaptively tailor simulations to the observation in order to sample from an approximate posterior, whose form depends on the chosen statistics. In this work, we introduce a new way to learn ABC statistics: we first generate parameter-simulation pairs from the model independently of the observation; then, we use Score Matching to train a neural conditional exponential family to approximate the likelihood. The exponential family is the largest class of distributions with fixed-size sufficient statistics; thus, we use these statistics in ABC, which is intuitively appealing and has state-of-the-art performance. In parallel, we insert our likelihood approximation in an MCMC for doubly intractable distributions to draw posterior samples. We can repeat that for any number of observations with no additional model simulations, with performance comparable to related approaches. We validate our methods on toy models with known likelihood and a large-dimensional time-series model.
@article{pacchiardi2020score,
  author = {Pacchiardi, Lorenzo and Dutta, Ritabrata},
  title = {Score Matched Neural Exponential Families for Likelihood-Free Inference},
  journal = {Journal of Machine Learning Research},
  year = {2022},
  volume = {23},
  number = {38},
  pages = {1-71},
  url = {http://jmlr.org/papers/v23/21-0061.html},
}

2021

PLOS Comp. Biol.
Using Mobility Data in the Design of Optimal Lockdown Strategies for the COVID-19 Pandemic

Ritabrata Dutta , Susana Gomes , Dante Kalise , and Lorenzo Pacchiardi

PLOS Computational Biology, 2021

A mathematical model for the COVID-19 pandemic spread, which integrates age-structured Susceptible-Exposed-Infected-Recovered-Deceased dynamics with real mobile phone data accounting for the population mobility, is presented. The dynamical model adjustment is performed via Approximate Bayesian Computation. Optimal lockdown and exit strategies are determined based on nonlinear model predictive control, constrained to public-health and socio-economic factors. Through an extensive computational validation of the methodology, it is shown that it is possible to compute robust exit strategies with realistic reduced mobility values to inform public policy making, and we exemplify the applicability of the methodology using datasets from England and France.
@article{dutta2021using,
  title = {Using Mobility Data in the Design of Optimal Lockdown Strategies for the COVID-19 Pandemic},
  author = {Dutta, Ritabrata and Gomes, Susana and Kalise, Dante and Pacchiardi, Lorenzo},
  journal = {PLOS Computational Biology},
  volume = {17},
  number = {8},
  pages = {e1009236},
  year = {2021},
  publisher = {Public Library of Science San Francisco, CA USA},
  url = {https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009236},
}
JSS
ABCpy: A High-Performance Computing Perspective to Approximate Bayesian Computation

Ritabrata Dutta , Marcel Schoengens , Lorenzo Pacchiardi , Avinash Ummadisingu , Nicole Widmer , Pierre Künzli , Jukka-Pekka Onnela , and Antonietta Mira

Journal of Statistical Software, 2021

ABCpy is a highly modular scientific library for approximate Bayesian computation (ABC) written in Python. The main contribution of this paper is to document a software engineering effort that enables domain scientists to easily apply ABC to their research without being ABC experts; using ABCpy they can easily run large parallel simulations without much knowledge about parallelization. Further, ABCpy enables ABC experts to easily develop new inference schemes and evaluate them in a standardized environment and to extend the library with new algorithms. These benefits come mainly from the modularity of ABCpy. We give an overview of the design of ABCpy and provide a performance evaluation concentrating on parallelization. This points us towards the inherent imbalance in some of the ABC algorithms. We develop a dynamic scheduling MPI implementation to mitigate this issue and evaluate the various ABC algorithms according to their adaptability towards high-performance computing.
@article{JSSABCpy,
  title = {ABCpy: A High-Performance Computing Perspective to Approximate Bayesian Computation},
  author = {Dutta, Ritabrata and Schoengens, Marcel and Pacchiardi, Lorenzo and Ummadisingu, Avinash and Widmer, Nicole and K{\"u}nzli, Pierre and Onnela, Jukka-Pekka and Mira, Antonietta},
  volume = {100},
  url = {https://www.jstatsoft.org/index.php/jss/article/view/v100i07},
  doi = {10.18637/jss.v100.i07},
  number = {7},
  journal = {Journal of Statistical Software},
  year = {2021},
  pages = {1--38},
}

2020

Sankhya B
Distance-Learning for Approximate Bayesian Computation to Model a Volcanic Eruption

Lorenzo Pacchiardi , Pierre Künzli , Marcel Schoengens , Bastien Chopard , and Ritabrata Dutta

Sankhya B, 2020

Approximate Bayesian computation (ABC) provides us with a way to infer, from an observation, the parameters of models for which the likelihood function is not available. Using ABC, which depends on many simulations from the considered model, we develop an inferential framework to learn parameters of a stochastic numerical simulator of volcanic eruption. Moreover, the model itself is parallelized using Message Passing Interface (MPI). Thus, we develop a nested-parallelized MPI communicator to handle the expensive numerical model with ABC algorithms. ABC usually relies on summary statistics of the data in order to measure the discrepancy between model output and observation. However, informative summary statistics cannot be found for the considered model. We therefore develop a technique to learn a distance between model outputs based on deep metric-learning. We use this framework to learn the plume characteristics (e.g., initial plume velocity) of the volcanic eruption from the tephra deposits collected by fieldwork associated with the 2450 BP Pululagua (Ecuador) volcanic eruption.
@article{pacchiardi2020distance,
  title = {Distance-Learning for Approximate Bayesian Computation to Model a Volcanic Eruption},
  author = {Pacchiardi, Lorenzo and K{\"u}nzli, Pierre and Schoengens, Marcel and Chopard, Bastien and Dutta, Ritabrata},
  journal = {Sankhya B},
  pages = {1--30},
  year = {2020},
  publisher = {Springer},
  url = {https://link.springer.com/article/10.1007/s13571-019-00208-8},
}
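
As a rough illustration of the ABC idea summarised in the abstract above (simulate from the model, measure a discrepancy to the observation, keep the parameters whose simulations come close), here is a minimal rejection-ABC sketch. It is a textbook variant, not the algorithm or the learned distance from the paper: the Gaussian toy simulator, the uniform prior, the sample-mean distance and the threshold `epsilon` are all assumptions chosen for illustration.

```python
import numpy as np

def rejection_abc(observed, simulate, sample_prior, distance, epsilon, n_draws, rng):
    """Keep prior draws whose simulated data fall within epsilon of the observation."""
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        simulated = simulate(theta, rng)
        if distance(simulated, observed) <= epsilon:
            accepted.append(theta)
    return np.array(accepted)

rng = np.random.default_rng(0)

# Hypothetical toy setup: infer the mean of a unit-variance Gaussian.
true_mean = 2.0
observed = rng.normal(true_mean, 1.0, size=50)

accepted = rejection_abc(
    observed,
    simulate=lambda theta, gen: gen.normal(theta, 1.0, size=50),
    sample_prior=lambda gen: gen.uniform(-5.0, 5.0),
    # Hand-picked distance on a summary statistic (the sample mean);
    # the paper instead learns a distance between model outputs.
    distance=lambda a, b: abs(a.mean() - b.mean()),
    epsilon=0.2,
    n_draws=5000,
    rng=rng,
)
print(len(accepted), accepted.mean())
```

The accepted values approximate the posterior over the mean. Swapping the `distance` argument is the point where a learned metric would plug in.
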