Dalla Data Processing (dalla-dp)
A comprehensive Arabic data processing pipeline with deduplication, stemming, quality checking, and readability scoring, used for the DALLA Models.
Compatibility
- Linux: Fully supported
- macOS: Fully supported (Intel, or through Rosetta)
- Windows: Supported through WSL (Windows Subsystem for Linux) only. On native Windows, a manual build from source works for deduplication.
Installation
Quick Start (All Features)
For most users, install with all features enabled:
Using uv
uv pip install "dalla-data-processing[all]"
Using pip
pip install "dalla-data-processing[all]"
Modular Installation (Advanced)
Install only the components you need to keep dependencies minimal:
# Base installation (no processing features, only core dependencies)
pip install dalla-data-processing
# Install specific features
pip install "dalla-data-processing[dedup]" # Deduplication only
pip install "dalla-data-processing[stem]" # Stemming only
pip install "dalla-data-processing[quality]" # Quality checking only
pip install "dalla-data-processing[readability]" # Readability scoring only
pip install "dalla-data-processing[pack]" # Dataset packing only
# Combine multiple features
pip install "dalla-data-processing[dedup,stem,quality]"
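After installing, it can be useful to see which extras the installed distribution declares. The sketch below uses only the standard library's `importlib.metadata`; the helper name `declared_extras` is hypothetical and not part of dalla-data-processing. Note that it lists the extras a package declares, not which extras' dependencies are actually present in your environment.

```python
from importlib.metadata import PackageNotFoundError, metadata


def declared_extras(dist_name: str) -> list[str]:
    """Return the extras a distribution declares, or [] if it is not installed."""
    try:
        meta = metadata(dist_name)
    except PackageNotFoundError:
        return []
    # "Provides-Extra" appears once per declared extra in the package metadata.
    return sorted(meta.get_all("Provides-Extra") or [])


if __name__ == "__main__":
    # Prints an empty list if the package is not installed.
    print(declared_extras("dalla-data-processing"))
```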
Development Installation
From Source (with uv - recommended)
git clone https://github.com/U4RASD/dalla-data-processing.git
cd dalla-data-processing
# Install all features and dev dependencies
uv sync --all-extras
# Or install with specific extras only
uv sync --extra dedup --extra stem
From Source (with pip)
git clone https://github.com/U4RASD/dalla-data-processing.git
cd dalla-data-processing
# Install with all features for development
pip install -e ".[all,dev]"
Components
Note: Each component requires its corresponding extra to be installed. Install with [all] to enable all features, or see Modular Installation to install only what you need.
1. Deduplication
Detect and remove duplicate or near-duplicate documents from your datasets using the Onion algorithm.
- Requires: [dedup] extra
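To illustrate the general idea behind n-gram-based near-duplicate detection, here is a minimal sketch. It is a simplified illustration, not this package's implementation or API: the function names, the whitespace tokenization, and the values of `n` and `threshold` are all assumptions made for the example.

```python
def ngrams(tokens, n):
    """Return consecutive n-grams (as tuples) from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def deduplicate(docs, n=3, threshold=0.5):
    """Keep a document unless more than `threshold` of its n-grams were seen before.

    Only n-grams from kept documents are remembered. `n` and `threshold`
    are arbitrary choices for this sketch.
    """
    seen = set()
    kept = []
    for doc in docs:
        tokens = doc.split()
        # Documents shorter than n tokens are treated as a single n-gram.
        grams = ngrams(tokens, n) or [tuple(tokens)]
        dup_ratio = sum(g in seen for g in grams) / len(grams)
        if dup_ratio > threshold:
            continue  # mostly made of already-seen n-grams: drop it
        seen.update(grams)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy cat",
        "an entirely different sentence about data pipelines",
    ]
    for doc in deduplicate(corpus):
        print(doc)
```

Because `split()` works on any whitespace-separated text, the same sketch applies to Arabic input without changes.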
2. Stemming
Apply morphological analysis and stemming using CAMeL Tools.
- Requires: [stem] extra
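For readers unfamiliar with stemming, the sketch below shows a rule-based "light" stemmer that strips a few common Arabic prefixes and suffixes. This is a deliberately simpler technique than the morphological analysis described above, it does not use CAMeL Tools, and it is not how this package works; the affix lists and the `light_stem` name are assumptions chosen for illustration.

```python
# Illustrative affix lists only; longer prefixes are tried first.
PREFIXES = ("وال", "بال", "كال", "فال", "لل", "ال")
SUFFIXES = ("ات", "ون", "ين", "ان", "ها", "ية", "ة", "ه", "ي")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one listed prefix and one listed suffix from a word.

    An affix is removed only if at least `min_len` characters remain.
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ("الكتاب", "والمعلمون", "مدرسة"):
        print(w, light_stem(w))
```

A rule-based stemmer like this over-strips and under-strips in many cases, which is one reason morphological analysis is generally preferred for Arabic.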
3. Quality Checking
Check text quality using morphological analysis to detect errors and foreign words.
- Requires: [quality] extra
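As a rough illustration of foreign-word detection, the sketch below measures the share of tokens written in Latin script. It is a script-based heuristic standing in for the morphological analysis described above, not this package's method; `foreign_word_ratio` is a hypothetical helper.

```python
import re

# Basic Arabic letters (hamza through yeh) and ASCII Latin letters.
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
LATIN_LETTER = re.compile(r"[A-Za-z]")


def foreign_word_ratio(text: str) -> float:
    """Fraction of alphabetic tokens that contain no Arabic letters.

    Tokens with neither Arabic nor Latin letters (digits, punctuation)
    are ignored. Returns 0.0 when no alphabetic tokens are found.
    """
    tokens = [
        t for t in text.split()
        if ARABIC_LETTER.search(t) or LATIN_LETTER.search(t)
    ]
    if not tokens:
        return 0.0
    foreign = [t for t in tokens if not ARABIC_LETTER.search(t)]
    return len(foreign) / len(tokens)


if __name__ == "__main__":
    print(foreign_word_ratio("هذا test نص data"))
```

A threshold on this ratio could be used to flag documents for review, though a script check cannot catch transliterated foreign words or spelling errors.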
4. Readability Scoring
Calculate readability scores using the Flesch Reading Ease and Osman methods. Also includes ranking according to both scores.
- Requires: [readability] extra
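For reference, the standard Flesch Reading Ease formula can be computed from three counts, as sketched below. The constants are those of the original English formula; how this package counts syllables for Arabic or adapts the formula is not stated here, and the Osman method is not shown. Both function names are hypothetical.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease score from raw counts (higher is easier)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def rank_by_score(scores: dict[str, float]) -> list[str]:
    """Return document ids ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # 100 words, 5 sentences, 150 syllables
    print(round(flesch_reading_ease(100, 5, 150), 3))
    print(rank_by_score({"a": 59.6, "b": 80.1, "c": 30.0}))
```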
5. Dataset Packing
Pack and prepare datasets for training.
- Requires: [pack] extra
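The description above does not say what packing involves. Assuming it refers to the common practice of concatenating tokenized documents into fixed-length training sequences, a minimal sketch could look like this. The function name, parameters, and the choice to drop the final partial block are all assumptions, not this package's behavior.

```python
def pack_sequences(docs, seq_len, eos_id):
    """Concatenate token-id lists and cut the stream into seq_len-sized blocks.

    Each document is followed by `eos_id` so document boundaries survive
    packing. A trailing partial block is dropped.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    stream = []
    for tokens in docs:
        stream.extend(tokens)
        stream.append(eos_id)
    last_start = len(stream) - seq_len
    return [stream[i:i + seq_len] for i in range(0, last_start + 1, seq_len)]


if __name__ == "__main__":
    print(pack_sequences([[1, 2, 3], [4, 5], [6]], seq_len=4, eos_id=0))
```

Dropping the remainder keeps every block the same length; padding the last block instead is an equally common choice.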
File details
Details for the file dalla_data_processing-0.0.11.tar.gz.
File metadata
- Download URL: dalla_data_processing-0.0.11.tar.gz
- Upload date:
- Size: 426.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 82b40086efd5dffd0fe5e5f1c964807a044aee97669e754c8606ef52c3c9f59d |
| MD5 | 9e037b0c404766bba1bff083c9bff913 |
| BLAKE2b-256 | 3e3ab2064a47286c9921909e631b4c2cd0e2c589f4cd9386a931fa0d25b963f7 |
File details
Details for the file dalla_data_processing-0.0.11-py3-none-any.whl.
File metadata
- Download URL: dalla_data_processing-0.0.11-py3-none-any.whl
- Upload date:
- Size: 452.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1c6e496bfe2d55c1cb7bd30a187fef28c101480bafb290e567b694baf5ead3b5 |
| MD5 | 9ff3206d96b38c31431a8e3e424a303d |
| BLAKE2b-256 | 1f9a4c97b32326de3ef73f1d734a0022fa22f5532c2f11264b90b8705b63bd07 |