Research

Latest

Recchia, G., Mangat, C. S., Li, I., & Krishnakumar, G. (2025). FindTheFlaws: Annotated errors for use in scalable oversight research. arXiv:2503.22989. Recently accepted to AAAI 2026. Link

Recchia, G., Mangat, C., Nyachhyon, J., Sharma, M., Canavan, C., Epstein-Gross, D., & Abdulbari, M. (2025). Confirmation bias: A challenge for scalable oversight. Presents results of two sandwiching-like experiments intended to establish baselines for simple approaches to scalable oversight. Recently accepted to AAAI 2026. Link

In preparation

Tan, G., Tsyplenkov, L., Nastase, E., & Recchia, G. Investigating the limits of free-form debate as a scalable oversight strategy.

Contributed to

Schoenegger, P., Salvi, F., Liu, J., Nan, X., Debnath, R., Fasolo, B., Leivada, E., Recchia, G., Günther, F., et al. (2025). Large language models are more persuasive than incentivized human persuaders. Under review at PNAS Nexus. arXiv:2505.09662. Link. Analysis team lead.

Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., … & Krueger, D. (2024). Foundational challenges in assuring alignment and safety of large language models. Transactions on Machine Learning Research, ISSN 2835-8856. Link

Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., … & Verbeken, B. (2025). Humanity’s Last Exam. Link. Co-author on account of contributing question(s) that were selected for the dataset.

McKenzie, I. R., Lyzhov, A., Pieler, M., Parrish, A., Mueller, A., Prabhu, A., … & Perez, E. (2023). Inverse scaling: When bigger isn’t better. Transactions on Machine Learning Research. Link. Co-author on account of submitting a winning task (i.e., a task on which language model performance decreases with scale).