Latest
Recchia, G., Mangat, C. S., Li, I., & Krishnakumar, G. (2025). FindTheFlaws: Annotated errors for use in scalable oversight research. arXiv:2503.22989. Link
Recchia, G., Mangat, C. S., Nyachhyon, J., Sharma, M., Canavan, C., Epstein-Gross, D., & Abdulbari, M. (2025). Confirmation bias: A challenge for scalable oversight. Presents results of two sandwiching-like experiments intended to establish baselines for simple approaches to scalable oversight. Link
Contributed to
Schoenegger, P., Salvi, F., Liu, J., Nan, X., Debnath, R., Fasolo, B., Leivada, E., Recchia, G., Günther, F., et al. (2025). Large language models are more persuasive than incentivized human persuaders. arXiv:2505.09662. Link. Analysis team lead.
Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., … & Krueger, D. (2024). Foundational challenges in assuring alignment and safety of large language models. Transactions on Machine Learning Research. Link
Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., … & Verbeken, B. (2025). Humanity’s Last Exam. Link. Co-author on account of contributing question(s) selected for the dataset.
McKenzie, I. R., Lyzhov, A., Pieler, M., Parrish, A., Mueller, A., Prabhu, A., … & Perez, E. (2023). Inverse scaling: When bigger isn’t better. Transactions on Machine Learning Research. Link. Co-author on account of submitting a winning task (i.e., a task on which language model performance decreases with scale).