Capability Evaluations

Language models’ performance on standard benchmarks fails to adequately capture their real-world capabilities in at least three ways. First, many of these benchmarks do not measure the abilities that truly matter in practical applications. Second, language models are opaque; different probing techniques can elicit competencies that are not apparent when using naïve prompts. Finally, most benchmarks overlook the real-world contexts in which these models will be deployed. External “scaffolding,” such as access to online tools and conditioning on detailed protocols, may either dramatically enhance or barely affect a model’s performance, depending on the context.

These nuances mean that determining a language model’s true capabilities can be complex. Overestimating a model’s competence in a particular area can cause harm when the model is deployed in a context for which it is unsuited. Underestimating capabilities can also be perilous, particularly if dangerous capabilities are exploited by malicious actors or by autonomous systems pursuing seemingly benign objectives. Addressing these issues is vital, and it requires careful information management so that the work does not itself enable nefarious activities.

Modulo Research is pioneering a robust agent architecture to enable a more realistic understanding of a given model’s capabilities, limits, and potential risks. By identifying the shortcomings of current systems and filling in the gaps with a combination of technical innovations and simulation, we aim to provide companies, AI labs, and policymakers with a more accurate view of what may be coming down the pike.