As machine learning models become increasingly capable and are presented with more complex, hard-to-evaluate tasks, it becomes harder to determine whether their outputs are trustworthy. Much of Modulo’s research focuses on extending our ability to evaluate whether models are behaving in ways we prefer, even in cases where doing so is difficult or costly.
Specific questions we have investigated, are currently investigating, or plan to investigate include:
- In Measuring Progress on Scalable Oversight for Large Language Models, humans conversed with language models to answer questions more accurately than either language models alone or humans alone. Can a systematic investigation of the transcripts reveal key factors that differentiated successful from unsuccessful question-answering attempts? (investigation complete; write-up pending)
- ‘Sandwiching’ experiments (defined here): how do we provide meaningful feedback to AI systems that are more capable than we are at specific tasks?
- Can exposure to language models’ arguments for both sides of a question, or other AI-assisted interventions, improve human participants’ ability to judge when a language model is answering a question correctly, even in topic areas where the language model answers questions more accurately than the (unassisted) human? If so, can we embed the language model in an agent architecture that imitates this judgment process, and does doing so improve our ability to scalably extract accurate claims from language models? (initial study complete, write-up pending; further research in progress)
- Can we develop protocols that help improve human participants’ ability to judge when a language model is answering a question correctly, in topic areas where the language model answers questions more accurately than the (unassisted) human, and where the questions are extremely difficult for people who are not experts in the relevant domain to answer—even with access to the internet and external tools? (planned for 2024)
- Can we improve language models’ ability to detect and report flaws or attempts at deception in arguments generated by language models? How well does fine-tuning language models to do this in “easy” domains scale to “hard” domains? (data collection to enable this research in progress)
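To make the judgment protocols above concrete, here is a minimal sketch of a debate-style round: one argument is collected per candidate answer, and a judge (human or model) picks among them. All names here (`debate_round`, `toy_arguer`, `toy_judge`, the fact table) are hypothetical illustrations, not Modulo's actual interface or experimental setup; the toy stubs stand in for language-model calls.

```python
from typing import Callable

# Hypothetical type aliases for the two roles in the protocol.
Arguer = Callable[[str, str], str]             # (question, candidate answer) -> argument
Judge = Callable[[str, dict[str, str]], str]   # (question, {answer: argument}) -> chosen answer


def debate_round(question: str, candidates: list[str],
                 arguer: Arguer, judge: Judge) -> str:
    """Collect one argument per candidate answer, then let the judge pick."""
    arguments = {answer: arguer(question, answer) for answer in candidates}
    return judge(question, arguments)


# Toy stand-ins: the arguer "knows" a small fact table; the judge uses a
# crude proxy for checkable evidence (whether the argument cites a source).
FACTS = {"capital of France": "Paris"}


def toy_arguer(question: str, answer: str) -> str:
    if FACTS.get(question) == answer:
        return f"{answer} is correct; see the fact-table entry for '{question}'."
    return f"{answer} is plausible."


def toy_judge(question: str, arguments: dict[str, str]) -> str:
    # Prefer the answer whose argument points at verifiable evidence.
    return max(arguments, key=lambda a: "see" in arguments[a])


print(debate_round("capital of France", ["Paris", "Lyon"], toy_arguer, toy_judge))
# → Paris
```

In a real experiment the arguer and judge would be language-model calls (or human participants), and the open questions above concern whether judges — especially non-expert ones — reliably favor the correct side.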
Modulo also plans to share lessons learned from the above activities, along with novel datasets that may help facilitate the work of AI safety teams.