BabyLM Challenge (Strict-Small)

Task Description

Train a language model from scratch on a pretraining corpus of approximately 10M words (the Strict-Small track). Optimize for performance on the BLiMP benchmark.

Participation Requirements

  • Use only the provided pretraining corpus
  • You may not train on BLiMP directly (it's for evaluation only)

Dataset

You can load the dataset with:

from datasets import load_dataset

dataset = load_dataset('AlgorithmicResearchGroup/babylm')

Dataset structure:

DatasetDict({
    train: Dataset({
        features: ['filename', 'content'],
        num_rows: 6
    })
    test: Dataset({
        features: ['filename', 'content'],
        num_rows: 6
    })
    dev: Dataset({
        features: ['filename', 'content'],
        num_rows: 6
    })
})
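Each row corresponds to one file of the corpus. A minimal sketch for inspecting the splits and flattening the training text, assuming the 'content' field holds the raw text of each file:

from datasets import load_dataset

dataset = load_dataset('AlgorithmicResearchGroup/babylm')

# One row per corpus file; 'content' is assumed to hold the raw text.
for row in dataset['train']:
    n_words = len(row['content'].split())
    print(f"{row['filename']}: ~{n_words / 1e6:.1f}M words")

# Flatten the training split into a single string for tokenizer /
# language-model training.
train_text = "\n".join(dataset['train']['content'])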

Evaluation Process

Evaluation will be done on the BLiMP benchmark using the lm-evaluation-harness. You can run the following command to evaluate your model:

python -m lm_eval --model hf \
--model_args pretrained=<path_to_your_model> \
--tasks blimp \
--device cuda:0 \
--batch_size 8
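The same evaluation can also be run from Python. A minimal sketch, assuming a recent lm-evaluation-harness (v0.4+), which exposes lm_eval.simple_evaluate; the model path './babylm-model' is a placeholder for wherever you saved your model:

import json
import lm_eval

# Programmatic equivalent of the CLI call above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./babylm-model",
    tasks=["blimp"],
    device="cuda:0",
    batch_size=8,
)

# Per-paradigm accuracies for the BLiMP sub-tasks.
print(json.dumps(results["results"], indent=2))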

Hardware Constraints

  • One A100 40GB GPU
  • 128GB of RAM
  • 500GB of Disk

Time Constraints

  • 24 Hour Time Limit

Additional Resources

  • Starter code: https://github.com/babylm/evaluation-pipeline-2024
  • Hugging Face Transformers: https://github.com/huggingface/transformers
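
As a starting point, below is a minimal from-scratch pretraining sketch with Hugging Face Transformers. The tokenizer choice, model size, context length, and hyperparameters are illustrative assumptions picked to fit on a single A100 within the time limit, not prescribed values.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    GPT2Config,
    GPT2LMHeadModel,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

block_size = 512  # illustrative context length

# Load the corpus; 'content' is assumed to hold the raw text of each file.
raw = load_dataset('AlgorithmicResearchGroup/babylm')

# Illustrative tokenizer choice; a tokenizer trained on the provided
# corpus can be substituted here.
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

def tokenize_and_chunk(batch):
    # Tokenize each file and pack the token ids into fixed-size blocks.
    ids = []
    for text in batch['content']:
        ids.extend(tokenizer(text)['input_ids'] + [tokenizer.eos_token_id])
    blocks = [ids[i:i + block_size]
              for i in range(0, len(ids) - block_size + 1, block_size)]
    return {'input_ids': blocks}

lm_datasets = raw.map(
    tokenize_and_chunk,
    batched=True,
    remove_columns=['filename', 'content'],
)

# Small GPT-2-style model trained from scratch (illustrative size).
config = GPT2Config(
    vocab_size=len(tokenizer),
    n_positions=block_size,
    n_embd=512,
    n_layer=8,
    n_head=8,
)
model = GPT2LMHeadModel(config)

args = TrainingArguments(
    output_dir='./babylm-model',
    per_device_train_batch_size=32,
    num_train_epochs=10,
    learning_rate=3e-4,
    bf16=True,
    logging_steps=100,
    save_strategy='epoch',
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=lm_datasets['train'],
    eval_dataset=lm_datasets['dev'],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Save in standard Hugging Face format so that
# pretrained=./babylm-model works with the evaluation command above.
trainer.save_model('./babylm-model')
tokenizer.save_pretrained('./babylm-model')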