LLM Efficiency Challenge

Task Description

Starting from an approved base model, fine-tune a model on a single A100 40GB GPU using only open-source data, so that it performs as well as possible on a wide array of metrics.

Participation Requirements

  • Use only open-source datasets; no closed or proprietary data
  • Training must run on a single GPU and complete within 24 hours
  • You may not train on the MMLU benchmark directly (it is reserved for evaluation)

Datasets

You are welcome to use any open-source dataset. Examples include:

  • Databricks-Dolly-15k
  • OpenAssistant Conversations Dataset (oasst1)
  • The Flan Collection
  • AllenAI Dolma
  • RedPajama-Data-1T
  • LIMA
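Most of the instruction-tuning datasets above share a simple record shape. As an illustration, a Dolly-style record carries instruction, context, and response fields; a minimal prompt formatter for such records might look like the sketch below (the template itself is a common community convention, not one mandated by the challenge):

```python
def format_dolly_record(record: dict) -> str:
    """Turn a Dolly-style record (instruction/context/response)
    into a single training prompt string."""
    instruction = record["instruction"]
    context = record.get("context", "")
    response = record["response"]
    if context:
        return (f"### Instruction:\n{instruction}\n\n"
                f"### Context:\n{context}\n\n"
                f"### Response:\n{response}")
    # Records without context skip the context section entirely.
    return (f"### Instruction:\n{instruction}\n\n"
            f"### Response:\n{response}")
```

Applied over a whole dataset, this yields the flat text examples most fine-tuning scripts expect.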

Approved Base Models

  • ALBERT
  • BART
  • BERT
  • Bloom
  • Cerebras (btlm, GPT)
  • Colossal-LLaMA-2-7b-base
  • DeBERTa
  • DeciLM-6B
  • DistilBERT
  • Electra
  • Falcon
  • GPT2
  • GPT Neo, J, NeoX, Pythia
  • InternLM
  • LLaMA or Llama 2
  • Mistral
  • MPT
  • OpenLLaMA
  • OPT
  • Persimmon
  • Qwen
  • Red Pajama Base (not instruction-tuned models)
  • RoBERTa
  • T5
  • UL2

Evaluation Process

Evaluation will be done on a subset of the MMLU benchmark.

You can evaluate your model locally with EleutherAI's lm-evaluation-harness:

python -m lm_eval --model hf \
--model_args pretrained=<path_to_your_model> \
--tasks mmlu \
--device cuda:0 \
--batch_size 8

Hardware Constraints

  • One A100 40GB GPU
  • 128GB of RAM
  • 500GB of Disk
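The 40GB GPU budget rules out naive full fine-tuning for the larger approved models. A rough back-of-the-envelope estimate (weights + gradients + Adam optimizer states, ignoring activations and framework overhead) can be sketched as follows; the per-parameter byte counts are standard rules of thumb, not official challenge numbers:

```python
def training_memory_gb(n_params_billion: float,
                       bytes_per_weight: int = 2,   # fp16/bf16 weights
                       bytes_per_grad: int = 2,     # fp16 gradients
                       bytes_per_optim: int = 8):   # Adam: two fp32 moments
    """Rough GPU memory needed to fully fine-tune a model,
    excluding activations and framework overhead."""
    n = n_params_billion * 1e9
    total_bytes = n * (bytes_per_weight + bytes_per_grad + bytes_per_optim)
    return total_bytes / 1024**3

# By this estimate a 7B model needs roughly 78 GB for full fine-tuning --
# well over 40 GB, which is why parameter-efficient methods such as LoRA
# are a popular fit for this hardware constraint.
```

The estimate is deliberately coarse, but it is enough to decide early whether full fine-tuning is even on the table for a given base model.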

Time Constraints

  • 24 Hour Time Limit
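The 24-hour limit translates directly into a token budget: measured throughput (tokens per second) times wall-clock seconds. A small planning helper, with a hypothetical throughput figure for illustration:

```python
def token_budget(tokens_per_second: float, hours: float = 24.0,
                 safety_margin: float = 0.9) -> int:
    """How many training tokens fit in the time limit, reserving a
    fraction of the budget for checkpointing and evaluation."""
    return int(tokens_per_second * hours * 3600 * safety_margin)

# e.g. at a hypothetical 3,000 tokens/s, about 233M tokens fit in 24 hours
print(token_budget(3000))  # 233280000
```

Measuring your actual tokens/s on a short warm-up run, then sizing the dataset and epoch count to this budget, avoids discovering a timeout 20 hours in.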

Additional Resources

Starter code: https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge