BabyLM Challenge (Strict-Small)

Task Description

Train a language model from scratch on a pretraining corpus of approximately 10M words (the Strict-Small track). Optimize for performance on the BLiMP benchmark.

Participation Requirements

  • Use only the provided pretraining corpus
  • You may not train on BLiMP directly (it's for evaluation only)

Dataset

You can load the dataset with:

from datasets import load_dataset

dataset = load_dataset('AlgorithmicResearchGroup/babylm')

Dataset structure:

DatasetDict({
    train: Dataset({
        features: ['filename', 'content'],
        num_rows: 6
    })
    test: Dataset({
        features: ['filename', 'content'],
        num_rows: 6
    })
    dev: Dataset({
        features: ['filename', 'content'],
        num_rows: 6
    })
})
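Each row corresponds to one file of the corpus. A minimal sketch for inspecting the splits and flattening the training text, assuming the 'content' field holds the raw text of each file:

from datasets import load_dataset

dataset = load_dataset('AlgorithmicResearchGroup/babylm')

# One row per corpus file; 'content' is assumed to hold the raw text.
for row in dataset['train']:
    n_words = len(row['content'].split())
    print(f"{row['filename']}: ~{n_words / 1e6:.1f}M words")

# Flatten the training split into a single string for tokenizer /
# language-model training.
train_text = "\n".join(dataset['train']['content'])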

Evaluation Process

Evaluation will be done on the BLiMP benchmark using the lm-evaluation-harness. You can run the following command to evaluate your model:

python -m lm_eval --model hf \
--model_args pretrained=<path_to_your_model> \
--tasks blimp \
--device cuda:0 \
--batch_size 8
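The same evaluation can also be run from Python. A minimal sketch, assuming a recent lm-evaluation-harness (v0.4+), which exposes lm_eval.simple_evaluate; the model path './babylm-model' is a placeholder for wherever you saved your model:

import json
import lm_eval

# Programmatic equivalent of the CLI call above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./babylm-model",
    tasks=["blimp"],
    device="cuda:0",
    batch_size=8,
)

# Per-paradigm accuracies for the BLiMP sub-tasks.
print(json.dumps(results["results"], indent=2))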

Hardware Constraints

  • One A100 40GB GPU
  • 128GB of RAM
  • 500GB of Disk

Time Constraints

  • 24 Hour Time Limit

Additional Resources

  • Starter code: https://github.com/babylm/evaluation-pipeline-2024
  • Hugging Face Transformers: https://github.com/huggingface/transformers
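
As a starting point, below is a minimal from-scratch pretraining sketch with Hugging Face Transformers. The tokenizer choice, model size, context length, and hyperparameters are illustrative assumptions picked to fit on a single A100 within the time limit, not prescribed values.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    GPT2Config,
    GPT2LMHeadModel,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

block_size = 512  # illustrative context length

# Load the corpus; 'content' is assumed to hold the raw text of each file.
raw = load_dataset('AlgorithmicResearchGroup/babylm')

# Illustrative tokenizer choice; a tokenizer trained on the provided
# corpus can be substituted here.
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

def tokenize_and_chunk(batch):
    # Tokenize each file and pack the token ids into fixed-size blocks.
    ids = []
    for text in batch['content']:
        ids.extend(tokenizer(text)['input_ids'] + [tokenizer.eos_token_id])
    blocks = [ids[i:i + block_size]
              for i in range(0, len(ids) - block_size + 1, block_size)]
    return {'input_ids': blocks}

lm_datasets = raw.map(
    tokenize_and_chunk,
    batched=True,
    remove_columns=['filename', 'content'],
)

# Small GPT-2-style model trained from scratch (illustrative size).
config = GPT2Config(
    vocab_size=len(tokenizer),
    n_positions=block_size,
    n_embd=512,
    n_layer=8,
    n_head=8,
)
model = GPT2LMHeadModel(config)

args = TrainingArguments(
    output_dir='./babylm-model',
    per_device_train_batch_size=32,
    num_train_epochs=10,
    learning_rate=3e-4,
    bf16=True,
    logging_steps=100,
    save_strategy='epoch',
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=lm_datasets['train'],
    eval_dataset=lm_datasets['dev'],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Save in standard Hugging Face format so that
# pretrained=./babylm-model works with the evaluation command above.
trainer.save_model('./babylm-model')
tokenizer.save_pretrained('./babylm-model')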