engawa
NOT YET FULLY TESTED
A simple implementation to pre-train BART from scratch with your own corpus.
Usage
I plan to make this pip-installable with CLI commands soon. For now, you need to clone the repository and run the scripts from it.
Installation
git clone git@github.com:sobamchan/engawa.git && cd engawa
poetry install
Build tokenizer
python engawa/tokenizer.py --data-path /path/to/train.txt --save-dir /path/to/save
# See the other options with
python engawa/tokenizer.py -h
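If you start from a single corpus file, you need to produce the train.txt used above and the val.txt used for pre-training yourself. The following is a minimal sketch of one way to do that with the Python standard library. It assumes a one-example-per-line plain-text corpus, which this README does not confirm; `split_corpus`, `val_ratio`, and `seed` are hypothetical names and are not part of engawa.

```python
import random
from pathlib import Path


def split_corpus(corpus_path, train_path, val_path, val_ratio=0.1, seed=0):
    """Shuffle the non-empty lines of a corpus and split them into two files.

    Assumption: one training example per line. This helper is illustrative
    and is not part of engawa.
    Returns (number of training lines, number of validation lines).
    """
    text = Path(corpus_path).read_text(encoding="utf-8")
    lines = [line.strip() for line in text.splitlines() if line.strip()]

    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(lines)

    # Hold out at least one line when there is more than one line to split.
    n_val = max(1, int(len(lines) * val_ratio)) if len(lines) > 1 else 0
    val_lines, train_lines = lines[:n_val], lines[n_val:]

    Path(train_path).write_text(
        "".join(line + "\n" for line in train_lines), encoding="utf-8"
    )
    Path(val_path).write_text(
        "".join(line + "\n" for line in val_lines), encoding="utf-8"
    )
    return len(train_lines), len(val_lines)
```

Calling `split_corpus("corpus.txt", "train.txt", "val.txt")` writes the two files and returns their line counts. Building the tokenizer from train.txt only keeps the validation text unseen until evaluation.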
Pre-train BART
python engawa/train.py --tokenizer-file /path/to/tokenizer.json --train-file /path/to/train.txt --val-file /path/to/val.txt --default-root-dir /path/to/save/things
# See the other options with
python engawa/train.py -h
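The two steps above can also be chained from a small Python driver run from the repository root. This is only a sketch: it uses just the flags shown in this README, the function names are hypothetical, and the tokenizer file path is passed explicitly because the README does not state what file name the tokenizer step writes into its save directory.

```python
import subprocess
import sys


def tokenizer_command(data_path, save_dir):
    """Build the documented tokenizer command as an argument list."""
    return [
        sys.executable, "engawa/tokenizer.py",
        "--data-path", str(data_path),
        "--save-dir", str(save_dir),
    ]


def train_command(tokenizer_file, train_file, val_file, root_dir):
    """Build the documented pre-training command as an argument list."""
    return [
        sys.executable, "engawa/train.py",
        "--tokenizer-file", str(tokenizer_file),
        "--train-file", str(train_file),
        "--val-file", str(val_file),
        "--default-root-dir", str(root_dir),
    ]


def run_pipeline(train_file, val_file, tokenizer_dir, tokenizer_file, root_dir):
    """Run both steps in order and stop at the first failure.

    Assumption: tokenizer_file is where the tokenizer step saved its output.
    Check the save directory after the first step to confirm the file name.
    """
    subprocess.run(tokenizer_command(train_file, tokenizer_dir), check=True)
    subprocess.run(
        train_command(tokenizer_file, train_file, val_file, root_dir),
        check=True,
    )
```

Passing arguments as a list avoids shell quoting problems with paths that contain spaces, and `check=True` keeps training from starting if the tokenizer step fails.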
Download files
Source Distribution: engawa-0.1.3.tar.gz (8.3 kB)
Built Distribution: engawa-0.1.3-py3-none-any.whl (9.6 kB)