OpusCleaner

OpusCleaner is a machine translation/language model data cleaner and training scheduler. The training scheduler has moved to empty-trainer.

Cleaner

The cleaner takes care of downloading and cleaning multiple datasets and preparing them for translation.

Dependencies

(Mainly listed as shortcuts to documentation)

  • FastAPI as the base for the backend.
  • Pydantic for converting untyped JSON to typed objects, and because FastAPI supports it automatically and gives you useful error messages when you get something wrong.
  • Vue for the frontend.
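
To illustrate what "conversion of untyped JSON to typed objects" looks like in practice, here is a minimal Pydantic sketch. The ExampleStep model and its fields are hypothetical and are not part of OpusCleaner's actual schema; they only show the conversion and the kind of error you get on bad input.

```python
import json

from pydantic import BaseModel, ValidationError


class ExampleStep(BaseModel):
    # Hypothetical model for illustration only, not OpusCleaner's schema.
    name: str
    max_length: int


def parse_step(raw_json: str) -> ExampleStep:
    """Decode a JSON string and validate it into a typed object."""
    return ExampleStep(**json.loads(raw_json))


# A numeric string is coerced to the declared int type.
step = parse_step('{"name": "example", "max_length": "150"}')

try:
    parse_step('{"name": "example", "max_length": "long"}')
except ValidationError as err:
    # The error names the offending field (max_length).
    print(err)
```

FastAPI performs the same validation on request bodies declared with such models, which is where the error messages mentioned above come from.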

Screenshots

List and categorize the datasets you are going to use for training.

Download more datasets right from the interface.

Filter each individual dataset, showing you the results immediately.

Compare the dataset at different stages of filtering to see what the impact is of each filter.

Paths

  • data/train-parts is scanned for datasets
  • filters should contain filter JSON definitions. This is not implemented yet; for now the code uses a hard-coded FILTERS dict.
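
If you want to check what is sitting in data/train-parts before starting the server, a small sketch like the one below lists the files there. The helper list_dataset_files is hypothetical and does not reproduce how OpusCleaner itself scans the folder or which file names it recognises; it only lists regular files.

```python
from pathlib import Path
from typing import List


def list_dataset_files(root: Path) -> List[str]:
    """Return the sorted names of regular files directly under root.

    Hypothetical helper: returns an empty list if root does not exist.
    """
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_file())


if __name__ == "__main__":
    for name in list_dataset_files(Path("data/train-parts")):
        print(name)
```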

Installation for development

python3 -m venv .env
bash --init-file .env/bin/activate
pip install -e .

cd frontend
npm clean-install
npm run build
cd ..

Link the frontend build folder into opuscleaner. Normally this is done during packaging, but it does not happen when opuscleaner is installed as an editable package.

ln -s ../frontend/dist opuscleaner/frontend
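
A relative symlink target is resolved relative to the directory containing the link, so ../frontend/dist is looked up from inside opuscleaner/. If the link ends up dangling, for example because the frontend was not built yet, a quick check along these lines can confirm it. The helper frontend_link_ok is hypothetical and only tests that the link exists and resolves to a directory.

```python
from pathlib import Path


def frontend_link_ok(link: Path) -> bool:
    """True if link is a symlink that resolves to an existing directory."""
    # is_dir() follows symlinks, so it is False for a dangling link.
    return link.is_symlink() and link.is_dir()


if __name__ == "__main__":
    print(frontend_link_ok(Path("opuscleaner/frontend")))
```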

Finally, run opuscleaner-server as normal. The --reload option makes it restart whenever any of the Python files change.

opuscleaner-server --reload

Then go to http://127.0.0.1:8000/ for the "interface" or http://127.0.0.1:8000/docs for the API.
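
To confirm from a script that the server is up, a sketch like the following requests both URLs using only the standard library. The helper is_reachable and its 2-second timeout are assumptions made for illustration; it simply reports whether an HTTP request succeeds.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to url succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    for url in ("http://127.0.0.1:8000/", "http://127.0.0.1:8000/docs"):
        print(url, is_reachable(url))
```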

Frontend development

If you're doing frontend development, try also running:

cd frontend
npm run dev

Then go to http://127.0.0.1:5173/ for the "interface".

This puts Vite in hot-reloading mode for easier JavaScript development. All API requests are proxied to the Python server running on port 8000, which is why you need to run both at the same time.

Filters

If you want to use LASER, you will also need to download its assets:

python -m laserembeddings download-models

Packaging

Run pip wheel . to build and package OpusCleaner. Packaging is done through hatch; see the pyproject.toml and build_frontend.py files for details. You'll need a recent version of node and npm in your PATH for this to work.

Acknowledgements

This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].
