Lorenzo Pacchiardi

Research Associate, University of Cambridge
I am a Research Associate at the Leverhulme Centre for the Future of Intelligence at the University of Cambridge. I lead a research project (funded by Open Philanthropy) on developing a benchmark for measuring the ability of LLMs to perform data science tasks. I am more broadly interested in AI evaluation, particularly in predictability and cognitive evaluation, and I closely collaborate with Prof José Hernández-Orallo and Dr Lucy Cheke. I contribute to the AI evaluation newsletter.
I previously worked on lie detection in large language models with Dr Owain Evans and on technical standards for AI under the EU AI Act at the Future of Life Institute. I am deeply interested in AI policy (particularly at the EU level; I participate in the GPAI code of practice drafting process). I also collaborate with The Unjournal to make impactful research more rigorous.
I obtained a PhD in Statistics and Machine Learning at Oxford, during which I worked on Bayesian simulation-based inference, generative models, and probabilistic forecasting (with applications to meteorology). My supervisors were Prof Ritabrata Dutta (University of Warwick) and Prof Geoff Nicholls (University of Oxford).
Before my PhD studies, I obtained a Bachelor’s degree in Physical Engineering from Politecnico di Torino (Italy) and an MSc in Physics of Complex Systems from Politecnico di Torino and Université Paris-Sud (France). I carried out my MSc thesis at LightOn, a machine learning startup in Paris.
news
| Date | News |
| --- | --- |
| Mar 11, 2025 | Our new preprint shows how to extract the most predictive and explanatory power from AI benchmarks by automatically annotating the demands posed by each question. Check it out! |
| Feb 21, 2025 | Two new arXiv preprints: one surveying AI evaluation and identifying six main paradigms, the other introducing a benchmark for jointly evaluating the performance of LLMs and its predictability on individual instances. |
| Oct 15, 2024 | We have two new preprints on arXiv: one on predicting the performance of LLMs on individual instances, the other on predicting the answers to LLM benchmarks from simple features. |
| Oct 01, 2024 | I have obtained a grant from Open Philanthropy to build a benchmark for measuring the ability of LLMs to perform data science tasks! 🤓 📊 |
| Sep 21, 2024 | Our paper Generalised Bayesian Likelihood-Free Inference (on which I worked during my PhD) is now published in the Electronic Journal of Statistics! |