Benchmark Details
Continuous AI R&D Research Tasks
Many agentic benchmarks are binary: the agent either completes the task or it does not, which leads to benchmarks that saturate quickly. For ML Research Bench, we have designed continuous tasks that can be improved upon indefinitely.
Current iterations of agents struggle to perform the subtasks of this benchmark's tasks. In many cases they complete only a subset of the subtasks and fail to meet the baseline standard (producing a trained model) required for a task to be labeled successful. As agent scaffolds and models improve, we expect agents to successfully complete most tasks. The tasks are designed to be continuous rather than binary: future agents are expected not only to complete each task, but also to score as highly as possible on the task's final score. This means that these tasks can be improved upon indefinitely across several axes (see the illustrative sketch after this list):
- The speed (time) with which the agent completes the task
- The number of tokens necessary to complete the task
- The metric improvement on the task
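To make the idea of a continuous, multi-axis score concrete, the sketch below shows one hypothetical way these three axes could be combined into a single unbounded number. Everything here is an illustrative assumption, not the benchmark's actual scoring code: the names `TaskResult` and `continuous_score`, the budgets, and the 0.8/0.1/0.1 weighting are all invented for this example.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of a single agent run on a continuous task (hypothetical fields)."""
    wall_clock_hours: float   # time the agent took to complete the task
    tokens_used: int          # total tokens consumed by the agent
    metric: float             # task-specific metric (e.g. validation accuracy)
    baseline_metric: float    # metric achieved by the reference baseline

def continuous_score(result: TaskResult,
                     time_budget_hours: float = 8.0,
                     token_budget: int = 2_000_000) -> float:
    """Illustrative composite score over the three axes.

    The metric term rewards improvement over the baseline with no upper cap,
    so the score can keep rising indefinitely; the time and token terms
    reward efficiency within a fixed budget.
    """
    metric_gain = (result.metric - result.baseline_metric) / max(abs(result.baseline_metric), 1e-9)
    time_efficiency = max(0.0, 1.0 - result.wall_clock_hours / time_budget_hours)
    token_efficiency = max(0.0, 1.0 - result.tokens_used / token_budget)
    # Weight metric improvement most heavily; efficiency terms break ties.
    return 0.8 * metric_gain + 0.1 * time_efficiency + 0.1 * token_efficiency
```

Under a rule like this, an agent that merely matches the baseline scores near zero, while faster, cheaper, or higher-quality solutions score higher without bound, which is the property that keeps the tasks from saturating.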
By providing continuous tasks, we are better able to:
- Project improvements from future models and agent scaffolds
- Hedge against the benchmark being quickly saturated
- Provide a more precise view of current agentic capabilities
In practice, this means tasks that become increasingly hard to improve upon. While completing a task to the baseline standard is simple, substantially exceeding the baselines increasingly resembles a research project on the scale a capabilities researcher at a major AI lab might undertake.
In this sense, we hope to project a trajectory for agents as their approaches to these problems become more complex and the speed and efficiency with which they solve them increase over time. The goal is to mimic the trajectory of real-world benchmarks driven by researchers in the field, where each agent's performance can be compared against previous iterations indefinitely.