
Using the Agent

We provide a baseline agent in the agent directory that can be used to evaluate the benchmark. The agent can interact with either an OpenAI or an Anthropic model. We encourage you to run the benchmark with your own agent.

Designing Your Agent

Your agent should consume a prompt provided by our agent-tasks library. You can use the agent-tasks library like so:

from agent_tasks import get_task

# Example usage
result = get_task(path="./", benchmark="benchmark", task="mini_baby_lm")

# Print the task name
print(result['name'])

# Print the prompt
print(result['prompt'])

Where:

  • path is the path to the directory where the example task will be downloaded,
  • benchmark is the name of the benchmark,
  • task is the name of the task.

This function will copy the task to the specified path and return a dictionary with the task name and prompt.

{
    "name": str,    # name of the task
    "prompt": str,  # prompt for the task
}

Your agent should then use the prompt to generate a model, which can then be evaluated using the agent-eval library.
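The loop above can be sketched as follows. This is a minimal illustration, not the baseline agent's actual code: call_model is a placeholder for your own OpenAI or Anthropic API call, and the task dict mirrors the {"name": ..., "prompt": ...} shape returned by get_task.

```python
# Minimal sketch of an agent consuming a task prompt.
# `call_model` is a placeholder for a real model API call.

def call_model(prompt: str) -> str:
    # Placeholder: replace with an OpenAI or Anthropic request.
    return f"[model response to: {prompt[:40]}]"

def run_agent(task: dict) -> dict:
    """Consume a task prompt and return a submission record."""
    response = call_model(task["prompt"])
    return {"task": task["name"], "submission": response}

# Example with a stubbed task; in practice, task = get_task(...).
task = {"name": "mini_baby_lm", "prompt": "Train a small language model."}
print(run_agent(task)["task"])
```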

That's it!

Using Our Baseline Agent

The baseline agent that we provide is composed of two main components: the supervisor and the worker agent. The supervisor manages the instructions and results of the tasks; the worker agent executes the tasks.

sequenceDiagram
  autonumber
  Supervisor->>Worker: task instructions
  loop Model
      Worker->>Worker: Task iteration
      Worker->>Supervisor: task results
  end
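The supervisor/worker exchange in the diagram can be sketched in a few lines of Python. The class and method names here are illustrative only, not the baseline agent's actual API, and the worker's "iteration" step is a stand-in for real tool-using work.

```python
# Minimal sketch of the supervisor/worker pattern from the diagram.
# Names are illustrative, not the baseline agent's real API.

class Worker:
    def run(self, instructions: str, max_iterations: int = 3) -> str:
        state = instructions
        for i in range(max_iterations):          # "Task iteration" loop
            state = f"{state} -> step {i + 1}"   # placeholder for real work
        return state                             # task results

class Supervisor:
    def __init__(self, worker: Worker):
        self.worker = worker

    def dispatch(self, instructions: str) -> str:
        # Send task instructions, collect task results.
        return self.worker.run(instructions)

result = Supervisor(Worker()).dispatch("train model")
print(result)  # train model -> step 1 -> step 2 -> step 3
```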

We have provided the worker with a set of tools to complete each task. These tools are located in the tools directory. The tools are designed to leverage the function calling capabilities of current models, and to be used in a modular fashion so that the worker can easily swap out tools as needed.

Baseline Agent Tools

Our baseline agent has access to the following tools:

  • run_python - Run a Python script
  • run_bash - Run a Bash script
  • write_code - Write code to a file
  • insert_code - Insert code into a file
  • replace_code - Replace code in a file
  • delete_code - Delete code from a file
  • scratchpad - Record important notes about the current task
  • github_get_readme - Get the README file from a GitHub repository
  • github_list_files - List the files in a GitHub repository
  • github_get_file_code - Get the code from a file in a GitHub repository
  • search_papers - Search for papers on Semantic Scholar
  • get_paper_details - Get the details of a paper on Semantic Scholar
  • get_paper_citations - Get the citations of a paper on Semantic Scholar
  • download_paper - Download a paper from Semantic Scholar
  • thought - Record a ReAct-style thought

Tools can be developed and added to the agent as needed. All tools should be added to the tools directory, and registered in the tools_registry.py file.
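One common way to register tools is to key plain functions by name so that a model's function call can be dispatched by string lookup. The sketch below assumes that pattern; the actual contents of tools_registry.py may differ.

```python
# Minimal sketch of a tool registry, assuming tools are plain functions
# keyed by name (the real tools_registry.py may differ).

TOOLS = {}

def register_tool(name):
    """Decorator that adds a function to the registry under `name`."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("scratchpad")
def scratchpad(note: str) -> str:
    # Record an important note about the current task.
    return f"noted: {note}"

def call_tool(name: str, **kwargs):
    # Dispatch a model "function call" to the registered tool.
    return TOOLS[name](**kwargs)

print(call_tool("scratchpad", note="dataset downloaded"))
```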

Running Many Agents

We run several worker agents in parallel to complete the tasks. The supervisor provides instructions to many workers, and collects the results.

stateDiagram-v2
  state fork_state <<fork>>
  fork_state --> Supervisor
  Supervisor --> Agent_1
  Supervisor --> Agent_2
  Supervisor --> Agent_3
  Supervisor --> Agent_4
  Supervisor --> ....

  state Results <<join>>
  Agent_1 --> Results
  Agent_2 --> Results
  Agent_3 --> Results
  Agent_4 --> Results
  .... --> Results
  Results --> Agent_Eval
  Agent_Eval --> [*]
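The fan-out/fan-in pattern in the diagram can be sketched with a thread pool: the supervisor maps tasks across workers and collects the results. The worker function here is a stub; the baseline agent's actual parallelism may be implemented differently.

```python
# Minimal sketch of running several workers in parallel and
# collecting their results, as in the diagram above.

from concurrent.futures import ThreadPoolExecutor

def worker(task_id: int) -> dict:
    # Placeholder for a full worker-agent run.
    return {"task": task_id, "status": "done"}

tasks = [1, 2, 3, 4]
with ThreadPoolExecutor(max_workers=4) as pool:
    # The supervisor dispatches instructions and gathers results.
    results = list(pool.map(worker, tasks))

print(results)
```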

Completing a Task

Upon completion of a task, the agent outputs the following table, which is used to evaluate the agent's performance on the specified task, where:

  • Task Number is the task number the agent was asked to complete
  • Run ID is the unique identifier for the run
  • Submission is the status of the agent's submission
  • Model Path is the path to the model
  • Total Tokens is the total number of tokens used to complete the task
  • Total Turns is the total number of turns (model prompts) used to complete the task
  • Time Taken in Seconds is the time in seconds it took the agent to complete the task
Metric                  Value
---------------------   ---------------------------------------------------------------------
Task Number             3
Run ID                  2764317408
Submission              successfully trained the model
Model Path              /home/paperspace/Desktop/ai_research_bench/agent/2764317408/model.pt
Total Tokens            3150
Total Turns             61
Time Taken in Seconds   625.662373781204224