## Using the Agent

We provide a baseline agent in the `agent` directory that can be used to evaluate the benchmark. The agent can interact with either an OpenAI or Anthropic model. We encourage you to run the benchmark with your own agent.
### Designing Your Agent

Your agent should consume a prompt provided by our `agent-tasks` library. You can use the `agent-tasks` library like so:
```python
from agent_tasks import get_task

# Example usage
result = get_task(path="./", benchmark="benchmark", task="mini_baby_lm")

# Print the task name
print(result['name'])

# Print the prompt
print(result['prompt'])
```
Where:

- `path` is the path to the directory where you would like the example task to be downloaded,
- `benchmark` is the name of the benchmark,
- `task` is the name of the task.
This function will copy the task to the specified path and return a dictionary with the task name and prompt.
Your agent should then use the prompt to generate a model, which can then be evaluated using the `agent-eval` library.
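Putting the pieces together, a minimal agent loop might look like the sketch below. Note this is illustrative only: the task dictionary is stubbed in place of a real `get_task` call, and `generate` is a hypothetical stand-in for your OpenAI or Anthropic client, not part of our libraries.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to an OpenAI or Anthropic model."""
    return f"model output for: {prompt}"

# In practice this dict would come from agent_tasks.get_task(...)
task = {"name": "mini_baby_lm", "prompt": "Train a small language model."}

# Hand the task prompt to the model and collect its response
response = generate(task["prompt"])
print(response)
```

In a real agent, the response would drive further steps (writing code, running training) until a model is produced for `agent-eval`.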
That's it!
### Using Our Baseline Agent

The baseline agent that we provide is composed of two main components: the supervisor and the worker agent. The supervisor is responsible for managing the instructions and results of the tasks; the worker agent is responsible for executing the tasks.
```mermaid
sequenceDiagram
    autonumber
    Supervisor->>Worker: task instructions
    loop Model
        Worker->>Worker: Task iteration
        Worker->>Supervisor: task results
    end
```
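The exchange in the diagram can be sketched as a simple loop. The class and method names below are illustrative, not the repository's actual API:

```python
class Worker:
    """Illustrative worker: iterates on a task a fixed number of times."""

    def run(self, instructions: str, iterations: int = 3) -> str:
        result = instructions
        for i in range(iterations):
            # Each iteration would normally invoke the model and its tools
            result = f"{result} -> step {i + 1}"
        return result


class Supervisor:
    """Illustrative supervisor: hands out instructions, collects results."""

    def dispatch(self, worker: Worker, instructions: str) -> str:
        return worker.run(instructions)


supervisor = Supervisor()
print(supervisor.dispatch(Worker(), "train mini_baby_lm"))
```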
We have provided the worker with a set of tools to use to complete each task. These tools are located in the `tools` directory. The tools are designed to leverage the function calling capabilities of current models, and to be used in a modular fashion so that the worker can easily swap out tools as needed.
### Baseline Agent Tools

Our baseline agent has access to the following tools, which leverage the function calling capabilities of the current models:

- `run_python` - Run a Python script
- `run_bash` - Run a Bash script
- `write_code` - Write code to a file
- `insert_code` - Insert code into a file
- `replace_code` - Replace code in a file
- `delete_code` - Delete code from a file
- `scratchpad` - Record important notes about the current task
- `github_get_readme` - Get the README file from a GitHub repository
- `github_list_files` - List the files in a GitHub repository
- `github_get_file_code` - Get the code from a file in a GitHub repository
- `search_papers` - Search for papers on Semantic Scholar
- `get_paper_details` - Get the details of a paper on Semantic Scholar
- `get_paper_citations` - Get the citations of a paper on Semantic Scholar
- `download_paper` - Download a paper from Semantic Scholar
- `thought` - Record a ReAct-like thought
Tools can be developed and added to the agent as needed. All tools should be added to the `tools` directory and registered in the `tools_registry.py` file.
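As a rough illustration of what such a registry might look like, the sketch below maps tool names to callables and simple schemas. The dictionary name, schema shape, and tool body are assumptions, not the repository's actual `tools_registry.py`:

```python
def run_python(path: str) -> str:
    """Illustrative tool: would execute a Python script and return its output."""
    return f"ran {path}"


# Hypothetical registry; each entry carries the callable plus the metadata
# a function-calling model needs to decide when and how to invoke it.
TOOLS_REGISTRY = {
    "run_python": {
        "function": run_python,
        "description": "Run a Python script",
        "parameters": {"path": "Path to the script to run"},
    },
}

# Dispatching a model's tool call by name
tool = TOOLS_REGISTRY["run_python"]
print(tool["function"]("train.py"))
```

Keeping the callable and its schema together in one entry is what makes tools easy to swap in and out without touching the worker's loop.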
### Running Many Agents

We run several worker agents in parallel to complete the tasks. The supervisor provides instructions to many workers and collects the results.
```mermaid
stateDiagram-v2
    state fork_state <<fork>>
    fork_state --> Supervisor
    Supervisor --> Agent_1
    Supervisor --> Agent_2
    Supervisor --> Agent_3
    Supervisor --> Agent_4
    Supervisor --> ....
    state Results <<join>>
    Agent_1 --> Results
    Agent_2 --> Results
    Agent_3 --> Results
    Agent_4 --> Results
    .... --> Results
    Results --> Agent_Eval
    Agent_Eval --> [*]
```
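One way to sketch this fan-out/fan-in pattern in Python is with a thread pool. The `worker` function here is a placeholder, not the baseline agent's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor


def worker(task_instructions: str) -> str:
    """Placeholder worker: a real one would run the full agent loop."""
    return f"result for: {task_instructions}"


instructions = [f"task {i}" for i in range(4)]

# The supervisor fans instructions out to workers and joins the results
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(worker, instructions))

print(results)
```

`pool.map` returns results in submission order, which keeps each result paired with the task it came from before everything is handed to `agent-eval`.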
### Completing a Task

Upon completion of the task, the agent will output the following table, which will be used in the evaluation of the agent's performance on the specified task, where:
- `Task Number` is the task number the agent was asked to complete
- `Run ID` is the unique identifier for the run
- `Submission` is the status of the agent's submission
- `Model Path` is the path to the model
- `Total Tokens` is the total number of tokens used to complete the task
- `Total Turns` is the total number of prompts used to complete the task
- `Time Taken in Seconds` is the time in seconds it took the agent to complete the task
| Metric | Value |
|---|---|
| Task Number | 3 |
| Run ID | 2764317408 |
| Submission | successfully trained the model |
| Model Path | /home/paperspace/Desktop/ai_research_bench/agent/2764317408/model.pt |
| Total Tokens | 3150 |
| Total Turns | 61 |
| Time Taken in Seconds | 625.662373781204224 |