
Using the Agent

We provide a baseline agent in the agent directory that can be used to evaluate the benchmark. The agent can interact with either an OpenAI or an Anthropic model. We encourage you to run the benchmark with your own agent.

Designing Your Agent

Your agent should consume a prompt provided by our agent-tasks library. You can use the agent-tasks library like so:

from agent_tasks import get_task

# Example usage
result = get_task(path="./", benchmark="benchmark", task="mini_baby_lm")

# Print the task name
print(result['name'])

# Print the prompt
print(result['prompt'])

Where:

  • path is the path to the directory where the example task will be downloaded,
  • benchmark is the name of the benchmark,
  • task is the name of the task.

This function will copy the task to the specified path and return a dictionary with the task name and prompt.

{
    "name": str,    # name of the task
    "prompt": str,  # prompt for the task
}

Your agent should then use the prompt to generate a model, which can then be evaluated using the agent-eval library.
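The loop above can be sketched as follows. This is a minimal illustration, not the baseline agent's actual code: call_model is a placeholder for your own OpenAI or Anthropic API call, and the task dict mirrors the {"name": ..., "prompt": ...} shape returned by get_task.

```python
# Minimal sketch of an agent consuming a task prompt.
# `call_model` is a placeholder for a real model API call.

def call_model(prompt: str) -> str:
    # Placeholder: replace with an OpenAI or Anthropic request.
    return f"[model response to: {prompt[:40]}]"

def run_agent(task: dict) -> dict:
    """Consume a task prompt and return a submission record."""
    response = call_model(task["prompt"])
    return {"task": task["name"], "submission": response}

# Example with a stubbed task; in practice, task = get_task(...).
task = {"name": "mini_baby_lm", "prompt": "Train a small language model."}
print(run_agent(task)["task"])
```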

That's it!

Using Our Baseline Agent

The baseline agent that we provide is composed of two main components: the supervisor and the worker agent. The supervisor manages the instructions and results of the tasks; the worker agent executes the tasks.

sequenceDiagram
  autonumber
  Supervisor->>Worker: task instructions
  loop Model
      Worker->>Worker: Task iteration
      Worker->>Supervisor: task results
  end
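The supervisor/worker exchange in the diagram can be sketched in a few lines of Python. The class and method names here are illustrative only, not the baseline agent's actual API, and the worker's "iteration" step is a stand-in for real tool-using work.

```python
# Minimal sketch of the supervisor/worker pattern from the diagram.
# Names are illustrative, not the baseline agent's real API.

class Worker:
    def run(self, instructions: str, max_iterations: int = 3) -> str:
        state = instructions
        for i in range(max_iterations):          # "Task iteration" loop
            state = f"{state} -> step {i + 1}"   # placeholder for real work
        return state                             # task results

class Supervisor:
    def __init__(self, worker: Worker):
        self.worker = worker

    def dispatch(self, instructions: str) -> str:
        # Send task instructions, collect task results.
        return self.worker.run(instructions)

result = Supervisor(Worker()).dispatch("train model")
print(result)  # train model -> step 1 -> step 2 -> step 3
```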

We have provided the worker with a set of tools to complete each task. These tools are located in the tools directory. The tools are designed to leverage the function calling capabilities of current models, and to be used in a modular fashion so that the worker can easily swap out tools as needed.

Baseline Agent Tools

Our baseline agent has access to the following tools:

  • run_python - Run a Python script
  • run_bash - Run a Bash script
  • write_code - Write code to a file
  • insert_code - Insert code into a file
  • replace_code - Replace code in a file
  • delete_code - Delete code from a file
  • scratchpad - Record important notes about the current task
  • github_get_readme - Get the README file from a GitHub repository
  • github_list_files - List the files in a GitHub repository
  • github_get_file_code - Get the code from a file in a GitHub repository
  • search_papers - Search for papers on Semantic Scholar
  • get_paper_details - Get the details of a paper on Semantic Scholar
  • get_paper_citations - Get the citations of a paper on Semantic Scholar
  • download_paper - Download a paper from Semantic Scholar
  • thought - Record a ReAct-style thought

Tools can be developed and added to the agent as needed. All tools should be added to the tools directory, and registered in the tools_registry.py file.
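One common way to register tools is to key plain functions by name so that a model's function call can be dispatched by string lookup. The sketch below assumes that pattern; the actual contents of tools_registry.py may differ.

```python
# Minimal sketch of a tool registry, assuming tools are plain functions
# keyed by name (the real tools_registry.py may differ).

TOOLS = {}

def register_tool(name):
    """Decorator that adds a function to the registry under `name`."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("scratchpad")
def scratchpad(note: str) -> str:
    # Record an important note about the current task.
    return f"noted: {note}"

def call_tool(name: str, **kwargs):
    # Dispatch a model "function call" to the registered tool.
    return TOOLS[name](**kwargs)

print(call_tool("scratchpad", note="dataset downloaded"))
```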

Running Many Agents

We run several worker agents in parallel to complete the tasks. The supervisor provides instructions to many workers, and collects the results.

stateDiagram-v2
  state fork_state <<fork>>
  fork_state --> Supervisor
  Supervisor --> Agent_1
  Supervisor --> Agent_2
  Supervisor --> Agent_3
  Supervisor --> Agent_4
  Supervisor --> ....

  state Results <<join>>
  Agent_1 --> Results
  Agent_2 --> Results
  Agent_3 --> Results
  Agent_4 --> Results
  .... --> Results
  Results --> Agent_Eval
  Agent_Eval --> [*]
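The fan-out/fan-in pattern in the diagram can be sketched with a thread pool: the supervisor maps tasks across workers and collects the results. The worker function here is a stub; the baseline agent's actual parallelism may be implemented differently.

```python
# Minimal sketch of running several workers in parallel and
# collecting their results, as in the diagram above.

from concurrent.futures import ThreadPoolExecutor

def worker(task_id: int) -> dict:
    # Placeholder for a full worker-agent run.
    return {"task": task_id, "status": "done"}

tasks = [1, 2, 3, 4]
with ThreadPoolExecutor(max_workers=4) as pool:
    # The supervisor dispatches instructions and gathers results.
    results = list(pool.map(worker, tasks))

print(results)
```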

Completing a Task

Upon completion of a task, the agent outputs the following table, which is used to evaluate the agent's performance on the specified task, where:

  • Task Number is the task number the agent was asked to complete
  • Run ID is the unique identifier for the run
  • Submission is the status of the agent's submission
  • Model Path is the path to the model
  • Total Tokens is the total number of tokens used to complete the task
  • Total Turns is the total number of turns (model prompts) used to complete the task
  • Time Taken in Seconds is the time in seconds it took the agent to complete the task
Metric                  Value
---------------------   ---------------------------------------------------------------------
Task Number             3
Run ID                  2764317408
Submission              successfully trained the model
Model Path              /home/paperspace/Desktop/ai_research_bench/agent/2764317408/model.pt
Total Tokens            3150
Total Turns             61
Time Taken in Seconds   625.662373781204224