OpusCleaner
OpusCleaner is a machine translation/language model data cleaner and training scheduler. The training scheduler has moved to empty-trainer.
Cleaner
The cleaner takes care of downloading and cleaning multiple datasets and preparing them for translation.
Dependencies
(Mainly listed as shortcuts to documentation)
- FastAPI as the base for the backend part.
- Pydantic for conversion of untyped JSON to typed objects, and because FastAPI supports it automatically and gives you useful error messages when something is wrong.
- Vue for the frontend.
Screenshots
List and categorize the datasets you are going to use for training.
Download more datasets right from the interface.
Filter each individual dataset, showing you the results immediately.
Compare the dataset at different stages of filtering to see what the impact is of each filter.
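The idea of comparing a dataset at different stages of filtering can be sketched in a few lines of Python. This is only an illustration and not OpusCleaner's implementation: the filter functions, the run_stages helper and the sample sentence pairs are all hypothetical names made up for this example.

```python
from typing import Callable, Iterable, List, Tuple

# A sentence pair: (source, target). All names here are hypothetical.
Pair = Tuple[str, str]
Filter = Callable[[Pair], bool]


def not_empty(pair: Pair) -> bool:
    """Keep pairs where both sides contain non-whitespace text."""
    return all(side.strip() for side in pair)


def max_length(limit: int) -> Filter:
    """Build a filter that keeps pairs whose sides have at most `limit` words."""
    def keep(pair: Pair) -> bool:
        return all(len(side.split()) <= limit for side in pair)
    return keep


def run_stages(pairs: Iterable[Pair], filters: List[Filter]) -> List[List[Pair]]:
    """Return the dataset after each stage; index 0 is the unfiltered input."""
    stages = [list(pairs)]
    for keep in filters:
        stages.append([pair for pair in stages[-1] if keep(pair)])
    return stages


sample = [
    ("Hello world", "Hallo wereld"),
    ("", "Lege bron"),
    ("one two three four five six", "een twee drie vier vijf zes"),
]
stages = run_stages(sample, [not_empty, max_length(5)])
for number, stage in enumerate(stages):
    print(f"stage {number}: {len(stage)} pairs")
```

Keeping the output of every stage, rather than only the final result, is what makes it possible to see how many pairs each individual filter removed.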
Paths
- data/train-parts is scanned for datasets.
- filters should contain filter JSON (but that's not implemented yet; right now it just has a hard-coded FILTERS dict in code).
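As a rough illustration of what scanning a directory such as data/train-parts for datasets could look like, here is a minimal sketch using only the standard library. The scan_datasets function, the .gz suffix and the example file names are assumptions made for this example; the actual file layout OpusCleaner expects may differ.

```python
import tempfile
from pathlib import Path
from typing import List


def scan_datasets(root: Path, suffix: str = ".gz") -> List[str]:
    """Return the sorted names of files directly under `root` ending in `suffix`.

    Hypothetical helper: the suffix convention is an assumption of this sketch.
    """
    if not root.is_dir():
        return []
    return sorted(
        path.name
        for path in root.iterdir()
        if path.is_file() and path.name.endswith(suffix)
    )


# Demo in a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp) / "data" / "train-parts"
    parts.mkdir(parents=True)
    for name in ("corpus-a.en.gz", "corpus-a.de.gz", "notes.txt"):
        (parts / name).touch()
    found = scan_datasets(parts)
    missing = scan_datasets(Path(tmp) / "does-not-exist")

print(found)
```

Returning an empty list for a missing directory keeps a fresh checkout, where no data has been downloaded yet, from raising an error.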
Installation for development
python3 -m venv .env
bash --init-file .env/bin/activate
pip install -e .
cd frontend
npm clean-install
npm run build
cd ..
Link the frontend build folder into opuscleaner. Normally this is done during packaging but when opuscleaner is installed as an editable package, this doesn't happen.
ln -s ../frontend/dist opuscleaner/frontend
Finally you can run opuscleaner-server as normal. The --reload option will cause it to restart when any of the Python files change.
opuscleaner-server --reload
Then go to http://127.0.0.1:8000/ for the "interface" or http://127.0.0.1:8000/docs for the API.
Frontend development
If you're doing frontend development, try also running:
cd frontend
npm run dev
Then go to http://127.0.0.1:5173/ for the "interface".
This will put vite in hot-reloading mode for easier JavaScript development. All API requests will be proxied to the Python server running on port 8000, which is why you need to run both at the same time.
Filters
If you want to use LASER, you will also need to download its assets:
python -m laserembeddings download-models
Packaging
To build and package OpusCleaner, run:
pip wheel .
Packaging is done through hatch; see the pyproject.toml and build_frontend.py files for details. You'll need to have a recent version of node and npm in your PATH for this to work.
Acknowledgements
This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].