Quickstart Guide

This guide walks you through setting up the ML Research Benchmark and evaluating your AI agent's performance. Here, we will run the agent locally on a single task and then evaluate the results.

Local Setup

Step 1: Set up the environment

Clone the repository

git clone https://github.com/AlgorithmicResearchGroup/ML-Research-Agent
cd ML-Research-Agent
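
The agent and evaluation images are pulled from GitHub Container Registry, so Docker needs to be installed and running; most tasks also train models, so a GPU is typically useful. A quick sanity check (our suggestion, assuming a Linux host with an NVIDIA GPU):

# confirm Docker is installed and the daemon is reachable
docker --version
docker info > /dev/null && echo "Docker daemon is running"

# if you plan to run GPU-backed tasks, confirm the NVIDIA driver is visible
nvidia-smi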

Create a .env file

Create a .env file in the root directory of the repository with the following content:

OPENAI=sk-...
ANTHROPIC=sk-ant-api03...
DB_USER=admin
DB_PASSWORD=admin
DB_HOST=my-db-host
DB_PORT=12345
DB_NAME=defaultdb

Step 2: Run the task

We provide a baseline agent for testing; however, we encourage you to run the benchmark on your own agent. The baseline agent is designed to output several metrics and a model for evaluation. You can run the agent with the following command:

bash run.sh ghcr.io/algorithmicresearchgroup/agent full_benchmark \
    openai \
    0 \
    edge_llm_training \
    24h \
    /home/ubuntu/.env
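
run.sh takes positional arguments. The breakdown below reflects how we read the example above; it is our interpretation, so confirm the exact order against run.sh in the repository:

# Positional arguments as we read them from the example above --
# an assumption, so check run.sh before relying on this breakdown.
IMAGE=ghcr.io/algorithmicresearchgroup/agent   # agent image to run
BENCHMARK=full_benchmark                       # benchmark suite
PROVIDER=openai                                # model provider (openai or anthropic)
GPU=0                                          # GPU index to use
TASK=edge_llm_training                         # task to attempt
TIME_LIMIT=24h                                 # wall-clock time limit
ENV_FILE=/home/ubuntu/.env                     # path to the .env file from Step 1

bash run.sh "$IMAGE" "$BENCHMARK" "$PROVIDER" "$GPU" "$TASK" "$TIME_LIMIT" "$ENV_FILE"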

Step 3: Run the evaluation

To evaluate the performance of your AI agent, run the following commands:

docker pull ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest
docker run -it ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest --model_args </path/to/your/model> --task <task_name>
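
Note that --model_args must point at a path that is visible inside the container. If your model lives on the host filesystem, one approach is to mount its directory as a Docker volume. A minimal sketch, assuming the model was saved to /home/ubuntu/output/model on the host (the /model mount point is our choice, not something the image requires):

# mount the host directory holding the trained model into the container,
# then point --model_args at the in-container path
docker run -it \
    -v /home/ubuntu/output/model:/model \
    ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest \
    --model_args /model \
    --task edge_llm_training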

Step 4: View the results

The evaluation script will output the results of your agent's performance on the specified task. You can view the results in the console or in the generated report file.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                   ┃ Value                                                                                                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Total Parameters         │ 124,439,808                                                                                            │
│ Non-Trainable Parameters │ 0                                                                                                      │
│ Trainable Parameters     │ 124,439,808                                                                                            │
│ MMLU Score               │ {'mmlu': {'accuracy': 0.2578947368421053, 'stderr': 0.01842801316560085}, 'mmlu_humanities':           │
│                          │ {'accuracy': 0.25384615384615383, 'stderr': 0.039139326980701925}, 'mmlu_other': {'accuracy':          │
│                          │ 0.24615384615384617, 'stderr': 0.03768445758127967}, 'mmlu_social_sciences': {'accuracy':              │
│                          │ 0.2916666666666667, 'stderr': 0.04166666666666667}, 'mmlu_stem': {'accuracy': 0.24736842105263157,     │
│                          │ 'stderr': 0.03133433526226623}}                                                                        │
└──────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The evaluation report will be printed to the console and saved in the results directory as a .json file.
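
If you want to inspect the report programmatically, the JSON file can be pretty-printed from the shell. For example (the exact filename depends on the task and run, so adjust the path):

# list the generated reports and pretty-print one of them
ls results/
python3 -m json.tool results/<your_report>.json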

Congratulations! You have successfully completed the quickstart guide for the ML Research Benchmark. You are now ready to evaluate your AI agent's performance across a range of challenging tasks. For more detailed instructions and task descriptions, refer to the benchmark documentation.

For more information on the ML Research Benchmark, please refer to the User documentation.

For more comprehensive instructions on setting up the agent, running tasks in parallel, and evaluating your agent's performance, please refer to the documentation.