Agent-Eval
We use lm-evaluation-harness to evaluate many of the tasks. Where we need additional metrics, we leverage Agent-Eval. Agent-Eval is a command-line tool for evaluating the performance of an agent after a task is complete. The tool takes a model as input, calculates evaluation metrics such as perplexity, accuracy, and latency, and generates an evaluation report along with a final score for the task.
To evaluate the agent's performance, follow these steps:
- Run the agent on a task.
- Execute the CLI to calculate the evaluation metrics.
- Review the evaluation report to assess the agent's final score.
Installation
- Install the evaluation library using Docker:
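Pull the published image from the GitHub Container Registry (the same image used in the Example below):
# Pull the latest image
docker pull ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest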
The tool takes the following arguments:
model_args (required)
: Path to the model
task (required)
: Task to perform
bits (optional)
: True if the model is quantized with bitsandbytes
use_ort (optional)
: True if the model is quantized with ORT
quantized (optional)
: True if the model is quantized
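For example, evaluating a bitsandbytes-quantized model might look like the following sketch (the --bits and --quantized flag spellings are assumed from the argument names above, not taken from the tool's documentation):
# Hypothetical invocation for a bitsandbytes-quantized model (flag spellings assumed)
docker run -it ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest \
  --model_args /path/to/your/model --task llm_efficiency \
  --quantized True --bits True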
Example
# Pull the latest image
docker pull ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest
# Run the task
docker run -it ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest --model_args /path/to/your/model --task llm_efficiency
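Note that /path/to/your/model must be visible inside the container. If the model lives on the host, a typical Docker invocation mounts it as a volume; a sketch, assuming the same path on host and container:
# Mount the host model directory into the container, then run the evaluation
docker run -it \
  -v /path/to/your/model:/path/to/your/model \
  ghcr.io/algorithmicresearchgroup/ml-research-agent-evals:latest \
  --model_args /path/to/your/model --task llm_efficiency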
The results of the evaluation will be saved in the results directory, and the evaluation report will be saved there as a .json file. Running the task will also print the evaluation report to the console, e.g.:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Total Parameters │ 124,439,808 │
│ Non-Trainable Parameters │ 0 │
│ Trainable Parameters │ 124,439,808 │
│ MMLU Score │ {'mmlu': {'accuracy': 0.2578947368421053, 'stderr': 0.01842801316560085}, 'mmlu_humanities': │
│ │ {'accuracy': 0.25384615384615383, 'stderr': 0.039139326980701925}, 'mmlu_other': {'accuracy': │
│ │ 0.24615384615384617, 'stderr': 0.03768445758127967}, 'mmlu_social_sciences': {'accuracy': │
│ │ 0.2916666666666667, 'stderr': 0.04166666666666667}, 'mmlu_stem': {'accuracy': 0.24736842105263157, │
│ │ 'stderr': 0.03133433526226623}} │
└──────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Development
We welcome contributions to the agent-eval tool!
To contribute to this tool, first check out the code and create a new virtual environment. Then install the dependencies and test dependencies. Finally, create a new branch and start coding! We welcome pull requests with new features, bug fixes, and improvements to the tool.
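A minimal version of that setup might look like the following sketch (the repository URL and the test extra name are assumptions, not documented commands):
# Hypothetical setup; repository URL and 'test' extra are assumptions
git clone https://github.com/algorithmicresearchgroup/ml-research-agent-evals.git
cd ml-research-agent-evals

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install the package along with its test dependencies (extra name assumed)
pip install -e '.[test]'

# Create a feature branch and start coding
git checkout -b my-feature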