Tools for downloading, processing, and normalizing Kabyle and Occitan language corpora from Tatoeba

Kabyle Corpus Toolkit

Python 3.8+ | License: MIT

Tools for downloading, processing, and normalizing Kabyle (kab) and Occitan (oci) language corpora from Tatoeba and other sources.

Features

  • Download Tatoeba Data: Automated download of sentences and links from Tatoeba.org
  • Parallel Corpus Creation: Build aligned English-Kabyle and English-Occitan sentence pairs
  • French Chain Translation: Expand coverage by routing Kabyle→French→English translations
  • Character Normalization: Fix encoding issues and normalize extended Latin characters
  • Language Validation: Validate corpus quality using GlotLID FastText models
  • Stopword Generation: Generate language-specific stopword lists from corpus statistics
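
The French chain translation feature can be pictured as a two-hop join over translation links. The sketch below illustrates that idea only and is not the toolkit's own code: the function name `chain_pairs` and the in-memory layout (a dict of sentence id to language and text, plus a list of id pairs) are assumptions, loosely modeled on the way Tatoeba publishes sentences and links as separate files.

```python
from collections import defaultdict


def chain_pairs(sentences, links, src="kab", pivot="fra", tgt="eng"):
    """Return sorted (src_text, tgt_text) pairs, direct or via one pivot sentence.

    sentences: dict mapping sentence id -> (lang, text)   (hypothetical layout)
    links:     iterable of (id_a, id_b) translation links
    """
    # Build an undirected adjacency map so link direction does not matter.
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    pairs = set()
    for sid, (lang, text) in sentences.items():
        if lang != src:
            continue
        for mid in neighbours[sid]:
            if mid not in sentences:
                continue
            mid_lang, mid_text = sentences[mid]
            if mid_lang == tgt:
                # Direct source -> target link.
                pairs.add((text, mid_text))
            elif mid_lang == pivot:
                # Follow the pivot sentence one more hop to the target language.
                for end in neighbours[mid]:
                    if end in sentences and sentences[end][0] == tgt:
                        pairs.add((text, sentences[end][1]))
    return sorted(pairs)


# Toy data: ids and links are made up for illustration.
sentences = {
    1: ("kab", "Azul"),
    2: ("eng", "Hello"),
    3: ("kab", "Tanemmirt"),
    4: ("fra", "Merci"),
    5: ("eng", "Thank you"),
}
links = [(1, 2), (3, 4), (4, 5)]
print(chain_pairs(sentences, links))
# [('Azul', 'Hello'), ('Tanemmirt', 'Thank you')]
```

In this toy data the second pair exists only through the French sentence. Pairs obtained through a pivot can drift in meaning, so it may be worth keeping them separate from directly linked pairs.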

Installation

Basic Installation

pip install kabyle-corpus-toolkit
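
One way to confirm that the install succeeded, without assuming anything about the package's import name or API, is to ask the standard library for the installed distribution's version. This is a minimal sketch; the helper name `installed_version` is hypothetical, and `importlib.metadata` is available from Python 3.8 onward.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("kabyle-corpus-toolkit"))
```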

Download files

Source Distribution

kabyle_corpus_toolkit-2.0.0.tar.gz (22.3 kB)

Uploaded Source

Built Distribution

kabyle_corpus_toolkit-2.0.0-py3-none-any.whl (27.2 kB)

Uploaded Python 3

File details

Details for the file kabyle_corpus_toolkit-2.0.0.tar.gz.

File metadata

  • Download URL: kabyle_corpus_toolkit-2.0.0.tar.gz
  • Size: 22.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for kabyle_corpus_toolkit-2.0.0.tar.gz
Algorithm    Hash digest
SHA256       6ad9b8b3b979b50e00179832871096dca5b160fc875ae7089c460b4b63f135f9
MD5          0ac888dd4a6c5402ff97b25a5c033fb3
BLAKE2b-256  2a9f2aa95bbce7c1ac1fa8a5c8f234ae9e16ae2362f9a7830faad5808b5b6eef
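
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. The following is a minimal sketch; the helper names `sha256_of` and `matches` are hypothetical, and the file is read in chunks so that large archives need not fit in memory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("kabyle_corpus_toolkit-2.0.0.tar.gz", "6ad9b8b3b979b50e00179832871096dca5b160fc875ae7089c460b4b63f135f9")` should return True for an intact copy of the source distribution, assuming the file sits in the current directory.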

File details

Details for the file kabyle_corpus_toolkit-2.0.0-py3-none-any.whl.

File hashes

Hashes for kabyle_corpus_toolkit-2.0.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       8f92b9a976b301cbfd01b054664fe3ae4adaa9a53c6207d5ccbb6e9ed3e95e34
MD5          7235cc7c1abc46852c9f8982bcd987c0
BLAKE2b-256  fbe479ba20e1afb00e5b5592e5868e2153db44c3e269d69533304a75ddb020ac
