Tasks
Benchmark Tasks
Task Design
The benchmark consists of 7 tasks that the agent performs. The tasks are designed to approximate real-world conference challenges that machine learning researchers undertake.
| Task Name | Metric | Time Limit | Compute Requirements |
| --- | --- | --- | --- |
| LLM Efficiency Challenge | Subset of the MMLU benchmark | 24 hours | One A100 40GB GPU, 128GB RAM, 500GB disk |
| BabyLM Challenge (Strict-Small) | Performance on the BLiMP dataset | 24 hours | One A100 40GB GPU, 128GB RAM, 500GB disk |
| MiniPile Challenge | Performance on SuperGLUE | 24 hours | One A100 40GB GPU |
| LLM-Merging Competition | Performance on the MMLU benchmark | 24 hours (1 hour for merging/fine-tuning and evaluation) | One A100 40GB GPU |
| Edge LLMs Challenge: Compression | Performance on the MMLU benchmark | 24 hours | One A100 40GB GPU |
| Edge LLMs Challenge: Training from Scratch | Performance on the MMLU benchmark | 24 hours | One A100 40GB GPU |
| ICML 2024 Challenges on Automated Math Reasoning: Track 1-1 Autoformalization | Percent compiled, ROUGE-L, BLEU | 24 hours | One A100 40GB GPU |
This table provides a concise overview of each task, including its name, evaluation metric, time limit, and compute requirements.
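For illustration, the specifications in this table map naturally onto a small data structure. The sketch below is an assumption for exposition only; the class and field names (`TaskSpec`, `time_limit_hours`, etc.) are hypothetical and not part of the benchmark's codebase.

```python
# Hypothetical encoding of the task table above as Python dataclasses.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    name: str                # task name, e.g. "MiniPile Challenge"
    metric: str              # evaluation metric from the table
    time_limit_hours: int    # wall-clock budget for the agent
    compute: str             # hardware made available to the agent

TASKS = [
    TaskSpec("LLM Efficiency Challenge", "Subset of the MMLU benchmark", 24,
             "One A100 40GB GPU, 128GB RAM, 500GB disk"),
    TaskSpec("BabyLM Challenge (Strict-Small)", "Performance on BLiMP", 24,
             "One A100 40GB GPU, 128GB RAM, 500GB disk"),
    # ... the remaining five tasks follow the same pattern
]
```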
(Optional) Pre-Completion Rubric
We provide an optional rubric with each task to evaluate the agent's ability to complete subgoals. This is provided for cases where an agent cannot produce a model for evaluation, but we would still like to record its progress on the task. Each task is broken down into subtasks that the agent must complete.
The agent is judged on a binary scale [Success/Failure] for each subtask. A successful run means that the agent performed all subtasks and met or exceeded the naive baseline. The pre-completion rubric is evaluated manually. A minimal scoring sketch is given below.
If a successful run is achieved and the agent does produce a model, the agent is scored with the evaluation rubric and given a final score.
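The following is a minimal sketch of the binary scoring rule described above. The subtask names, the `score_run` helper, and the baseline comparison are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical tally of the pre-completion rubric: a run counts as a
# "Success" only if every subtask passed AND the produced model met or
# exceeded the naive baseline.

def score_run(subtask_results: dict[str, bool],
              final_score: float | None,
              naive_baseline: float) -> str:
    all_subtasks_passed = all(subtask_results.values())
    beat_baseline = final_score is not None and final_score >= naive_baseline
    return "Success" if all_subtasks_passed and beat_baseline else "Failure"

# Example: the agent completed every subtask but scored below baseline.
results = {"load_dataset": True, "train_model": True, "save_checkpoint": True}
print(score_run(results, final_score=0.21, naive_baseline=0.25))  # Failure
```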
Task Prompts
The agent is given a task to perform in the following format:
Your task is to:
{{ task }}
Task Details:
- Your goal is to achieve the best possible score given the rubric below.
- You are working exclusively with the following dataset: {{ dataset }}.
- The dataset is constructed as follows:
{{ dataset_shape }}
- Models: You must use one of the following models - {{ models }}
- Compute: You have access to a {{ compute }}.
- Time: You have a maximum time limit of {{ time_limit }}.
- Result: The task should result in {{ result }}.
- Evaluation: You are to measure your success on this goal with {{ evaluation }}.
- Rules: Here are the rules for the task:
{{ rules }}
(!! Important !!): You are not responsible for running the evaluation. This will be done by the organizers.
(!! Important !!): You are responsible for producing the best possible model for the given task.
The packages numpy, torch, torchvision, datasets, and transformers are installed.
The output of your research will be evaluated under the following conditions:
{{additional_info}}
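The `{{ ... }}` placeholders in the prompt above are Jinja-style. Below is a minimal sketch of how such a prompt might be rendered with the jinja2 library; the abbreviated template and the field values are illustrative assumptions, not real task definitions.

```python
# Render an abbreviated version of the task prompt with jinja2.
from jinja2 import Template

PROMPT_TEMPLATE = """\
Your task is to:
{{ task }}

Task Details:
- You are working exclusively with the following dataset: {{ dataset }}.
- Compute: You have access to a {{ compute }}.
- Time: You have a maximum time limit of {{ time_limit }}.
"""

prompt = Template(PROMPT_TEMPLATE).render(
    task="train the best possible small language model",   # assumed value
    dataset="BabyLM Strict-Small corpus",                   # assumed value
    compute="single A100 40GB GPU",
    time_limit="24 hours",
)
print(prompt)
```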