{"id":40,"date":"2023-08-31T15:03:01","date_gmt":"2023-08-31T15:03:01","guid":{"rendered":"https:\/\/www.moduloresearch.com\/?page_id=40"},"modified":"2025-11-10T17:43:12","modified_gmt":"2025-11-10T17:43:12","slug":"research","status":"publish","type":"page","link":"https:\/\/www.moduloresearch.com\/index.php\/research\/","title":{"rendered":"Research"},"content":{"rendered":"\n<p><strong>Latest<\/strong><\/p>\n\n\n\n<p><strong>Recchia, G., Mangat, C. S., Li, I., &amp; Krishnakumar, G.<\/strong>\u00a0(2025). FindTheFlaws: Annotated errors for use in scalable oversight research. arxiv:2503.22989. Recently accepted to AAAI 2026.\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2503.22989\"><strong>Link<\/strong><\/a><\/p>\n\n\n\n<p><strong>Recchia, G., Mangat, C., Nyachhyon, J., Sharma, M., Canavan, C., Epstein-Gross, D., and Abdulbari, M.<\/strong>\u00a0(2025) Confirmation bias: A challenge for scalable oversight. Presents results of two\u00a0<a href=\"https:\/\/www.alignmentforum.org\/posts\/PZtsoaoSLpKjjbMqM\/the-case-for-aligning-narrowly-superhuman-models\">sandwiching<\/a>-like experiments intended to establish baselines for simple approaches to scalable oversight. Recently accepted to AAAI 2026. <strong><a href=\"https:\/\/moduloresearch.com\/papers\/Confirmation_bias_A_challenge_for_scalable_oversight.pdf\" data-type=\"link\" data-id=\"https:\/\/moduloresearch.com\/papers\/Confirmation_bias_A_challenge_for_scalable_oversight.pdf\">Link<\/a><\/strong><\/p>\n\n\n\n<p><strong>In preparation<\/strong><\/p>\n\n\n\n<p><strong>Tan, G., Tsyplenkov, L., Nastase, E. &amp; Recchia, G. <\/strong>Investigating the limits of free-form debate as a scalable oversight strategy.<\/p>\n\n\n\n<p><strong>Contributed to<\/strong><\/p>\n\n\n\n<p><strong>Schoenegger, P., Salvi, F., Liu, J., Nan, X., Debnath, R., Fasolo, B., Leivada, E., Recchia, G., G\u00fcnther, F., et al.<\/strong>&nbsp;(2025). Large language models are more persuasive than incentivized human persuaders. Analysis team lead. Under review in PNAS Nexus. arXiv:2505.09662.&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/2505.09662\"><strong>Link<\/strong><\/a><\/p>\n\n\n\n<p><strong>Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., \u2026 &amp; Krueger, D.<\/strong>&nbsp;(2024). Foundational challenges in assuring alignment and safety of large language models.&nbsp;<em>Transactions on Machine Learning Research<\/em>, 2835-8856.&nbsp;<a href=\"https:\/\/openreview.net\/forum?id=oVTkOs8Pka\"><strong>Link<\/strong><\/a><\/p>\n\n\n\n<p><strong>Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., \u2026 &amp; Verbeken, B.<\/strong>&nbsp;(2025). Humanity\u2019s Last Exam.&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/2501.14249\"><strong>Link<\/strong><\/a>.&nbsp;Co-author on account of contributing question(s) that were selected for the dataset.<\/p>\n\n\n\n<p><strong>McKenzie, I. R., Lyzhov, A., Pieler, M., Parrish, A., Mueller, A., Prabhu, A., \u2026 &amp; Perez, E.<\/strong>&nbsp;(2023). Inverse scaling: When bigger isn\u2019t better.&nbsp;<em>Transactions on Machine Learning Research<\/em>.&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/2306.09479\"><strong>Link<\/strong><\/a>.&nbsp;Co-author on account of submitting a winning task (e.g., identifying a task on which language model performance decreases with scale).<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Latest Recchia, G., Mangat, C. S., Li, I., &amp; Krishnakumar, G.\u00a0(2025). FindTheFlaws: Annotated errors for use in scalable oversight research. 