Quickstart Guide
This guide walks you through setting up the ML Research Benchmark and evaluating your AI agent's performance. Here, we will run the agent locally on a single task and then evaluate its results.
Local Setup
Step 1: Set up the environment
Clone the repository
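Clone the benchmark repository to your machine and change into its directory. The URL below is a placeholder based on the AlgorithmicResearchGroup organization; substitute the actual repository URL:
git clone https://github.com/AlgorithmicResearchGroup/<repository>.git
cd <repository>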
Create a .env file
Create a .env file in the root directory of the repository with the following content, replacing the placeholder values with your own API keys and database credentials:
OPENAI=sk-...
ANTHROPIC=sk-ant-api03...
DB_USER=admin
DB_PASSWORD=admin
DB_HOST=my-db-host
DB_PORT=12345
DB_NAME=defaultdb
Step 2: Run the task
We provide a baseline agent for testing; however, we encourage you to run the benchmark on your own agent. The baseline agent is designed to output several metrics and a model for evaluation. You can run the agent with the following command:
bash run.sh ghcr.io/algorithmicresearchgroup/agent full_benchmark \
openai \
0 \
edge_llm_training \
24h \
/home/ubuntu/.env
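To benchmark your own agent instead of the baseline, you would substitute your agent's container image for the baseline image. This is a sketch; the image name below is a placeholder for wherever your agent's image is published:
bash run.sh <your-registry>/<your-agent-image> full_benchmark \
openai \
0 \
edge_llm_training \
24h \
/home/ubuntu/.env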
Step 3: Run The Evaluation
To evaluate the performance of your AI agent, run the following commands:
docker pull ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest
docker run -it ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest --model_args </path/to/your/model> --task <task_name>
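For example, to evaluate the model produced by the edge_llm_training run above, the invocation might look like the following. The model path is a placeholder; point --model_args at the directory where your agent saved its model:
docker run -it ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest --model_args /home/ubuntu/results/edge_llm_training/model --task edge_llm_training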
Step 4: View The Results
The evaluation script will output the results of your agent's performance on the specified task. You can view the results in the console or in the generated report file.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Total Parameters │ 124,439,808 │
│ Non-Trainable Parameters │ 0 │
│ Trainable Parameters │ 124,439,808 │
│ MMLU Score │ {'mmlu': {'accuracy': 0.2578947368421053, 'stderr': 0.01842801316560085}, 'mmlu_humanities': │
│ │ {'accuracy': 0.25384615384615383, 'stderr': 0.039139326980701925}, 'mmlu_other': {'accuracy': │
│ │ 0.24615384615384617, 'stderr': 0.03768445758127967}, 'mmlu_social_sciences': {'accuracy': │
│ │ 0.2916666666666667, 'stderr': 0.04166666666666667}, 'mmlu_stem': {'accuracy': 0.24736842105263157, │
│ │ 'stderr': 0.03133433526226623}} │
└──────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────┘
The evaluation report will be printed to the console and saved in the results directory as a .json file.
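To inspect a saved report afterwards, you can open the JSON file directly (assuming the default results directory; the exact filename depends on the task):
cat results/*.json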
Congratulations! You have successfully completed the quickstart guide for the ML Research Benchmark. You are now ready to evaluate your AI agent's performance across a range of challenging tasks. For more detailed instructions and task descriptions, refer to the benchmark documentation.
For more information on the ML Research Benchmark, please refer to the User documentation.
For more comprehensive instructions on setting up the agent, running tasks in parallel, and evaluating your agent's performance, please refer to the full documentation.