LLM Efficiency Challenge

Task Description

Starting from an approved base model, fine-tune a model on a single A100 40GB GPU using only open-source data, so that it performs as well as possible on a wide array of metrics.

Participation Requirements

  • Use only open-source datasets; no closed or proprietary data
  • Training must run on a single GPU and complete within 24 hours
  • You may not train on the MMLU benchmark directly (it is reserved for evaluation)

Datasets

You are welcome to use any open-source dataset. Examples include:

  • Databricks-Dolly-15k
  • OpenAssistant Conversations Dataset (oasst1)
  • The Flan Collection
  • AllenAI Dolma
  • RedPajama-Data-1T
  • LIMA
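Most of the instruction-tuning datasets above share a simple record shape. As an illustration, a Dolly-style record carries instruction, context, and response fields; a minimal prompt formatter for such records might look like the sketch below (the template itself is a common community convention, not one mandated by the challenge):

```python
def format_dolly_record(record: dict) -> str:
    """Turn a Dolly-style record (instruction/context/response)
    into a single training prompt string."""
    instruction = record["instruction"]
    context = record.get("context", "")
    response = record["response"]
    if context:
        return (f"### Instruction:\n{instruction}\n\n"
                f"### Context:\n{context}\n\n"
                f"### Response:\n{response}")
    # Records without context skip the context section entirely.
    return (f"### Instruction:\n{instruction}\n\n"
            f"### Response:\n{response}")
```

Applied over a whole dataset, this yields the flat text examples most fine-tuning scripts expect.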

Approved Base Models

  • ALBERT
  • BART
  • BERT
  • Bloom
  • Cerebras (btlm, GPT)
  • Colossal-LLaMA-2-7b-base
  • DeBERTa
  • DeciLM-6B
  • DistilBERT
  • Electra
  • Falcon
  • GPT2
  • GPT Neo, J, NeoX, Pythia
  • InternLM
  • LLaMA or Llama 2
  • Mistral
  • MPT
  • OpenLLaMA
  • OPT
  • Persimmon
  • Qwen
  • Red Pajama Base (not instruction-tuned models)
  • RoBERTa
  • T5
  • UL2

Evaluation Process

Evaluation will be done on a subset of the MMLU benchmark.

You can evaluate your model locally with EleutherAI's lm-evaluation-harness:

python -m lm_eval --model hf \
--model_args pretrained=<path_to_your_model> \
--tasks mmlu \
--device cuda:0 \
--batch_size 8

Hardware Constraints

  • One A100 40GB GPU
  • 128GB of RAM
  • 500GB of Disk
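The 40GB GPU budget rules out naive full fine-tuning for the larger approved models. A rough back-of-the-envelope estimate (weights + gradients + Adam optimizer states, ignoring activations and framework overhead) can be sketched as follows; the per-parameter byte counts are standard rules of thumb, not official challenge numbers:

```python
def training_memory_gb(n_params_billion: float,
                       bytes_per_weight: int = 2,   # fp16/bf16 weights
                       bytes_per_grad: int = 2,     # fp16 gradients
                       bytes_per_optim: int = 8):   # Adam: two fp32 moments
    """Rough GPU memory needed to fully fine-tune a model,
    excluding activations and framework overhead."""
    n = n_params_billion * 1e9
    total_bytes = n * (bytes_per_weight + bytes_per_grad + bytes_per_optim)
    return total_bytes / 1024**3

# By this estimate a 7B model needs roughly 78 GB for full fine-tuning --
# well over 40 GB, which is why parameter-efficient methods such as LoRA
# are a popular fit for this hardware constraint.
```

The estimate is deliberately coarse, but it is enough to decide early whether full fine-tuning is even on the table for a given base model.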

Time Constraints

  • 24 Hour Time Limit
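The 24-hour limit translates directly into a token budget: measured throughput (tokens per second) times wall-clock seconds. A small planning helper, with a hypothetical throughput figure for illustration:

```python
def token_budget(tokens_per_second: float, hours: float = 24.0,
                 safety_margin: float = 0.9) -> int:
    """How many training tokens fit in the time limit, reserving a
    fraction of the budget for checkpointing and evaluation."""
    return int(tokens_per_second * hours * 3600 * safety_margin)

# e.g. at a hypothetical 3,000 tokens/s, about 233M tokens fit in 24 hours
print(token_budget(3000))  # 233280000
```

Measuring your actual tokens/s on a short warm-up run, then sizing the dataset and epoch count to this budget, avoids discovering a timeout 20 hours in.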

Additional Resources

Starter code: https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge