ML Research Benchmark (MLRB)
Introduction to the ML Research Benchmark
Welcome to the ML Research Benchmark!
The ML Research Benchmark (MLRB) is designed to assess AI agents across a wide array of AI research areas, focusing on their capabilities in advanced machine learning research. This documentation provides an overview of the tasks and objectives involved in evaluating these agents.
High Level Overview
We provide a set of packages to perform the benchmark evaluation. Conceptually, the benchmark workflow is simple:
- The agent is given a task to complete using agent-tasks
- The model the agent produces is evaluated on a set of metrics using agent-eval
Quickstart Guide
This guide walks you through setting up the ML Research Benchmark and evaluating your AI agent's performance. Here, we will run the agent locally on a single task and evaluate its performance on that task.
Local Setup
Step 1: Set up the environment
- Clone the repository
- Install the dependencies
- Create a .env file

Create a .env file in the root directory of the repository with the following content (a command sketch for the whole step follows the example):
OPENAI=sk-...
ANTHROPIC=sk-ant-api03...
DB_USER=admin
DB_PASSWORD=admin
DB_HOST=my-db-host
DB_PORT=12345
DB_NAME=defaultdb
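As a minimal end-to-end sketch of this step, the commands below show one way to do it. The repository URL and the presence of a requirements.txt are assumptions for illustration; substitute the actual MLRB repository and its documented install method.

# Clone the repository (URL is an assumption; use the actual MLRB repository)
git clone https://github.com/AlgorithmicResearchGroup/ML-Research-Agent.git
cd ML-Research-Agent

# Install the dependencies (assumes a requirements.txt; follow the project's install instructions if they differ)
pip install -r requirements.txt

# Create the .env file in the repository root, then paste in the content shown above
touch .env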
Step 2: Run the task
We provide a baseline agent for testing; however, we encourage you to run the benchmark with your own agent. The baseline agent is designed to output several metrics and a model for evaluation. You can run the agent with the following command:
bash run.sh ghcr.io/algorithmicresearchgroup/ml-research-agent:latest full_benchmark \
openai \
0 \
edge_llm_training \
24h \
/home/ubuntu/.env
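The task runs inside a Docker container, so you can monitor progress with standard Docker commands. This is a minimal sketch; the container name or ID depends on how run.sh launches the agent.

docker ps                       # find the ID of the running agent container
docker logs -f <container_id>   # follow the agent's output while the task runs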
Step 3: Run the evaluation
The agent will produce a model to evaluate. To evaluate the performance of your AI agent, run the following commands:
docker pull ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest
docker run -it ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest --model_args /path/to/your/model --task llm_efficiency
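Note that the path passed to --model_args must be visible inside the evaluation container. If the model was written to the host filesystem, one option is to bind-mount it; the sketch below assumes an illustrative host path of /home/ubuntu/model.

docker run -it \
  -v /home/ubuntu/model:/model \
  ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest \
  --model_args /model --task llm_efficiency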
Step 4: View the results
The evaluation script will output the results of your agent's performance on the specified task. You can view the results in the console or in the generated report file. For example, if you ran the llm_efficiency task, you should see output like the following:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                   ┃ Value                                                                                    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Total Parameters         │ 124,439,808                                                                              │
│ Non-Trainable Parameters │ 0                                                                                        │
│ Trainable Parameters     │ 124,439,808                                                                              │
│ MMLU Score               │ {'mmlu': {'accuracy': 0.2578947368421053, 'stderr': 0.01842801316560085},               │
│                          │ 'mmlu_humanities': {'accuracy': 0.25384615384615383, 'stderr': 0.039139326980701925},   │
│                          │ 'mmlu_other': {'accuracy': 0.24615384615384617, 'stderr': 0.03768445758127967},         │
│                          │ 'mmlu_social_sciences': {'accuracy': 0.2916666666666667, 'stderr': 0.04166666666666667},│
│                          │ 'mmlu_stem': {'accuracy': 0.24736842105263157, 'stderr': 0.03133433526226623}}          │
└──────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────┘
The evaluation report will be printed to the console and saved in the results directory as a .json file.
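Since the report is plain JSON, you can inspect it from the command line. This is a minimal sketch; replace <report> with the actual filename written to the results directory.

ls results/                                 # list the generated report files
python -m json.tool results/<report>.json   # pretty-print a report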
Congratulations! You have successfully completed the quickstart guide for the ML Research Benchmark. You are now ready to evaluate your AI agent's performance across a range of challenging tasks. For more detailed instructions and task descriptions, refer to the benchmark documentation.
For more information on the ML Research Benchmark, please refer to the User documentation.
For more comprehensive instructions on setting up the agent, running tasks in parallel, and evaluating your agent's performance, please refer to the documentation: