Tasks
Benchmark Tasks
Task Design
The benchmark consists of 7 tasks that the agent performs. The tasks are designed to approximate real-world conference challenges that machine learning researchers undertake.
| Task Name | Metric | Time Limit | Compute Requirements |
| --- | --- | --- | --- |
| LLM Efficiency Challenge | Subset of the MMLU benchmark | 24 hours | One A100 40GB GPU, 128GB RAM, 500GB disk |
| BabyLM Challenge (Strict-Small) | Performance on the BLiMP dataset | 24 hours | One A100 40GB GPU, 128GB RAM, 500GB disk |
| MiniPile Challenge | Performance on SuperGLUE | 24 hours | One A100 40GB GPU |
| LLM-Merging Competition | Performance on the MMLU benchmark | 24 hours (1 hour for merging/fine-tuning and evaluation) | One A100 40GB GPU |
| Edge LLMs Challenge: Compression | Performance on the MMLU benchmark | 24 hours | One A100 40GB GPU |
| Edge LLMs Challenge: Training from Scratch | Performance on the MMLU benchmark | 24 hours | One A100 40GB GPU |
| ICML 2024 Challenges on Automated Math Reasoning: Track 1-1 Autoformalization | Percent compiled, ROUGE-L, BLEU | 24 hours | One A100 40GB GPU |
This table provides a concise overview of each task, including its name, evaluation metric, time limit, and compute requirements.
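For illustration, the specifications in this table map naturally onto a small data structure. The sketch below is an assumption for exposition only; the class and field names (`TaskSpec`, `time_limit_hours`, etc.) are hypothetical and not part of the benchmark's codebase.

```python
# Hypothetical encoding of the task table above as Python dataclasses.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    name: str                # task name, e.g. "MiniPile Challenge"
    metric: str              # evaluation metric from the table
    time_limit_hours: int    # wall-clock budget for the agent
    compute: str             # hardware made available to the agent

TASKS = [
    TaskSpec("LLM Efficiency Challenge", "Subset of the MMLU benchmark", 24,
             "One A100 40GB GPU, 128GB RAM, 500GB disk"),
    TaskSpec("BabyLM Challenge (Strict-Small)", "Performance on BLiMP", 24,
             "One A100 40GB GPU, 128GB RAM, 500GB disk"),
    # ... the remaining five tasks follow the same pattern
]
```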
(Optional) Pre-Completion Rubric
We provide an optional rubric with each task to evaluate the agent's ability to complete subgoals. This is provided for cases where an agent cannot produce a model for evaluation, but we would still like to record its progress on the task. Each task is broken down into subtasks that the agent must complete.
The agent is judged on a binary scale [Success/Failure] for each subtask. A successful run means that the agent performed all subtasks and met or exceeded the naive baseline. The pre-completion rubric is evaluated manually. A minimal scoring sketch is given below.
If a successful run is achieved and the agent does produce a model, the agent is scored with the evaluation rubric and given a final score.
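The following is a minimal sketch of the binary scoring rule described above. The subtask names, the `score_run` helper, and the baseline comparison are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical tally of the pre-completion rubric: a run counts as a
# "Success" only if every subtask passed AND the produced model met or
# exceeded the naive baseline.

def score_run(subtask_results: dict[str, bool],
              final_score: float | None,
              naive_baseline: float) -> str:
    all_subtasks_passed = all(subtask_results.values())
    beat_baseline = final_score is not None and final_score >= naive_baseline
    return "Success" if all_subtasks_passed and beat_baseline else "Failure"

# Example: the agent completed every subtask but scored below baseline.
results = {"load_dataset": True, "train_model": True, "save_checkpoint": True}
print(score_run(results, final_score=0.21, naive_baseline=0.25))  # Failure
```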
Task Prompts
The agent is given a task to perform in the following format:
Your task is to:
{{ task }}
Task Details:
- Your goal is to achieve the best possible score given the rubric below.
- You are working exclusively with the following dataset: {{ dataset }}.
- The dataset is constructed as follows:
{{ dataset_shape }}
- Models: You must use one of the following models - {{ models }}
- Compute: You have access to a {{ compute }}.
- Time: You have a maximum time limit of {{ time_limit }}.
- Result: The task should result in {{ result }}.
- Evaluation: You are to measure your success on this goal with {{ evaluation }}.
- Rules: Here are the rules for the task:
{{ rules }}
(!! Important !!): You are not responsible for running the evaluation. This will be done by the organizers.
(!! Important !!): You are responsible for producing the best possible model for the given task.
The packages numpy, torch, torchvision, datasets, and transformers are installed.
The output of your research will be evaluated under the following conditions:
{{additional_info}}
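The `{{ ... }}` placeholders in the prompt above are Jinja-style. Below is a minimal sketch of how such a prompt might be rendered with the jinja2 library; the abbreviated template and the field values are illustrative assumptions, not real task definitions.

```python
# Render an abbreviated version of the task prompt with jinja2.
from jinja2 import Template

PROMPT_TEMPLATE = """\
Your task is to:
{{ task }}

Task Details:
- You are working exclusively with the following dataset: {{ dataset }}.
- Compute: You have access to a {{ compute }}.
- Time: You have a maximum time limit of {{ time_limit }}.
"""

prompt = Template(PROMPT_TEMPLATE).render(
    task="train the best possible small language model",   # assumed value
    dataset="BabyLM Strict-Small corpus",                   # assumed value
    compute="single A100 40GB GPU",
    time_limit="24 hours",
)
print(prompt)
```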