
ML Research Benchmark (MLRB)


Introduction to the ML Research Benchmark

Welcome to the ML Research Benchmark!

The ML Research Benchmark (MLRB) is a benchmark designed to assess AI agents' capabilities in advanced machine learning research across a wide array of AI research areas. This documentation provides an overview of the tasks and objectives involved in evaluating these agents.


High-Level Overview

We provide several packages to perform the benchmark evaluation. Conceptually, the benchmark workflow is simple:

  1. The agent is given a task to complete using agent-tasks.
  2. The model the agent produces is evaluated on a set of metrics using agent-eval.

Quickstart Guide

This guide walks you through setting up the ML Research Benchmark and evaluating your AI agent's performance. Here, we will run the agent locally on a single task and then evaluate the model it produces.

Local Setup

Step 1: Set up the environment

Clone the repository

git clone https://github.com/AlgorithmicResearchGroup/ML-Research-Agent
cd ML-Research-Agent

Install the dependencies

pip install -e .

Create a .env file

Create a .env file in the root directory of the repository with the following content:

OPENAI=sk-...   
ANTHROPIC=sk-ant-api03...
DB_USER=admin
DB_PASSWORD=admin
DB_HOST=my-db-host
DB_PORT=12345
DB_NAME=defaultdb
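
To sanity-check the file before launching a run, you can load it into your shell and confirm each key is set. This is an optional, bash-only check (it uses bash's indirect variable expansion), not part of the benchmark tooling; the agent itself reads the file from the path you pass to run.sh.

# Load .env into the current shell and confirm each key is set,
# without printing the secret values.
set -a          # export every variable defined while sourcing
source .env
set +a

for var in OPENAI ANTHROPIC DB_USER DB_PASSWORD DB_HOST DB_PORT DB_NAME; do
  if [ -n "${!var}" ]; then
    echo "$var is set"
  else
    echo "$var is MISSING"
  fi
done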


Step 2: Run the task

We provide a baseline agent for testing; however, we encourage you to run the benchmark on your own agent. The baseline agent is designed to output several metrics, along with a model for evaluation. You can run the agent with the following command:

bash run.sh ghcr.io/algorithmicresearchgroup/ml-research-agent:latest full_benchmark \
    openai \
    0 \
    edge_llm_training \
    24h \
    /home/ubuntu/.env
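
The arguments to run.sh are positional. The sketch below assigns each one to a named variable so the call is easier to adapt; the labels in the comments are our reading of this example, not an official reference, so treat them as assumptions.

IMAGE=ghcr.io/algorithmicresearchgroup/ml-research-agent:latest  # agent Docker image
SUITE=full_benchmark        # benchmark suite to run (assumed meaning)
PROVIDER=openai             # LLM provider backing the agent (assumed meaning)
GPU_ID=0                    # GPU index to use (assumed meaning)
TASK=edge_llm_training      # benchmark task name
TIME_LIMIT=24h              # wall-clock budget for the task (assumed meaning)
ENV_FILE=/home/ubuntu/.env  # path to the .env file created in Step 1

bash run.sh "$IMAGE" "$SUITE" "$PROVIDER" "$GPU_ID" "$TASK" "$TIME_LIMIT" "$ENV_FILE"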

Step 3: Run the evaluation

The agent produces a model to evaluate. To measure your agent's performance on the task, run the following commands:

docker pull ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest
docker run -it ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest --model_args /path/to/your/model --task llm_efficiency
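
Note that /path/to/your/model must be a path the container can see. A minimal sketch, assuming the agent wrote its model to ./outputs/model on the host (the host path and the /model mount point are illustrative; --model_args and --task are the flags shown above):

docker run -it \
  -v "$(pwd)/outputs/model:/model" \
  ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest \
  --model_args /model \
  --task llm_efficiency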

Step 4: View the results

The evaluation script will output the results of your agent's performance on the specified task. You can view the results in the console or in the generated report file. For example, if you ran the llm_efficiency task, you should see output like the following:

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Metric                   โ”ƒ Value                                                                                                  โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ Total Parameters         โ”‚ 124,439,808                                                                                            โ”‚
โ”‚ Non-Trainable Parameters โ”‚ 0                                                                                                      โ”‚
โ”‚ Trainable Parameters     โ”‚ 124,439,808                                                                                            โ”‚
โ”‚ MMLU Score               โ”‚ {'mmlu': {'accuracy': 0.2578947368421053, 'stderr': 0.01842801316560085}, 'mmlu_humanities':           โ”‚
โ”‚                          โ”‚ {'accuracy': 0.25384615384615383, 'stderr': 0.039139326980701925}, 'mmlu_other': {'accuracy':          โ”‚
โ”‚                          โ”‚ 0.24615384615384617, 'stderr': 0.03768445758127967}, 'mmlu_social_sciences': {'accuracy':              โ”‚
โ”‚                          โ”‚ 0.2916666666666667, 'stderr': 0.04166666666666667}, 'mmlu_stem': {'accuracy': 0.24736842105263157,     โ”‚
โ”‚                          โ”‚ 'stderr': 0.03133433526226623}}                                                                        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

The evaluation report will be printed to the console and saved in the results directory as a .json file.
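
To inspect the saved report from the command line, here is a minimal sketch using Python's built-in json.tool (the exact file name inside results/ depends on the task you ran):

# Pretty-print every report saved so far
for report in results/*.json; do
  echo "=== $report ==="
  python -m json.tool "$report"
done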

Congratulations! You have successfully completed the quickstart guide for the ML Research Benchmark. You are now ready to evaluate your AI agent's performance across a range of challenging tasks. For more detailed instructions and task descriptions, refer to the benchmark documentation.

For more information on the ML Research Benchmark, please refer to the User documentation.

For more comprehensive instructions on setting up the agent, running tasks in parallel, and evaluating your agent's performance, please refer to the full documentation.