Agent-Eval

We use lm-evaluation-harness to evaluate many of the tasks; where additional metrics are needed, we use Agent-Eval. Agent-Eval is a command-line tool for evaluating an agent's performance after a task is complete. It takes in a model, calculates evaluation metrics such as perplexity, accuracy, and latency, and generates an evaluation report along with a final score for the task.

To evaluate the agent's performance, follow these steps:

  1. Run the agent on a task.
  2. Execute the CLI to calculate the evaluation metrics.
  3. Review the evaluation report to assess the agent's final score.

Installation

  1. Install the evaluation tool by pulling the Docker image:
docker pull ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest

The tool takes the following arguments (an example using the optional flags follows the list):

  • model_args (required): Path to the model
  • task (required): Task to perform
  • bits (optional): True if the model is quantized with bitsandbytes
  • use_ort (optional): True if the model is quantized with ONNX Runtime (ORT)
  • quantized (optional): True if the model is quantized
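
For instance, to evaluate a model quantized with bitsandbytes you would also set the quantized and bits flags. The flag form below (passing --quantized True --bits True in the same --flag style as model_args and task) is an assumption and may differ from the actual CLI:

# Hypothetical invocation for a bitsandbytes-quantized model (flag form assumed)
docker run -it ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest --model_args /path/to/your/model --task llm_efficiency --quantized True --bits True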

Example

# Pull the latest image
docker pull ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest

# Run the task
docker run -it ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest --model_args /path/to/your/model --task llm_efficiency
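
Note that --model_args must point to a path that is visible inside the container. A minimal sketch, assuming the model checkpoint lives in ./model on the host and is mounted into the container at /model:

# Mount a local model directory into the container and point --model_args at it
docker run -it -v "$(pwd)/model:/model" ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest --model_args /model --task llm_efficiency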

The results of the evaluation are saved to the results directory as a .json report. Running the task also prints the evaluation report to the console, e.g.:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                   ┃ Value                                                                                                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Total Parameters         │ 124,439,808                                                                                            │
│ Non-Trainable Parameters │ 0                                                                                                      │
│ Trainable Parameters     │ 124,439,808                                                                                            │
│ MMLU Score               │ {'mmlu': {'accuracy': 0.2578947368421053, 'stderr': 0.01842801316560085}, 'mmlu_humanities':           │
│                          │ {'accuracy': 0.25384615384615383, 'stderr': 0.039139326980701925}, 'mmlu_other': {'accuracy':          │
│                          │ 0.24615384615384617, 'stderr': 0.03768445758127967}, 'mmlu_social_sciences': {'accuracy':              │
│                          │ 0.2916666666666667, 'stderr': 0.04166666666666667}, 'mmlu_stem': {'accuracy': 0.24736842105263157,     │
│                          │ 'stderr': 0.03133433526226623}}                                                                        │
└──────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────┘

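If you ran the evaluation with docker run as above, the results directory lives inside the container. One way to retrieve the report is docker cp; the in-container path below is an assumption, so adjust it to wherever the image actually writes its results:

# Find the ID of the finished evaluation container
docker ps -a --filter ancestor=ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest

# Copy the results directory out of the container
# (/app/results is an assumed path; check the image's actual layout)
docker cp <container-id>:/app/results ./results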

Development

We welcome contributions to the agent-eval tool!

To contribute to this tool, first check out the code. Then create a new virtual environment:

cd agent-eval
python -m venv venv
source venv/bin/activate

Now install the dependencies, including the test dependencies:

pip install -e '.[test]'
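
The [test] extra suggests the project ships a test suite. Assuming it uses pytest (the runner is not stated here), you can run it with:

# Run the test suite (pytest is an assumption; use the project's documented runner if it differs)
pytest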

Finally, create a new branch and start coding! We welcome pull requests with new features, bug fixes, and improvements to the tool.
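
A typical flow for that last step, assuming the usual fork-and-pull-request workflow:

# Create a feature branch, commit your changes, and push it for review
git checkout -b my-new-feature
git add -A
git commit -m "Describe your change"
git push -u origin my-new-feature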