Dalla Data Processing (dalla-dp)

A comprehensive Arabic data processing pipeline with deduplication, stemming, quality checking, and readability scoring, used for the DALLA Models.

Compatibility

  • Linux: Fully supported
  • macOS: Fully supported (Intel, or Apple Silicon through Rosetta)
  • Windows: Supported only through WSL (Windows Subsystem for Linux). On native Windows, a manual build from source works for deduplication.

Installation

Quick Start (All Features)

For most users, install with all features enabled:

Using uv

uv pip install "dalla-data-processing[all]"

Using pip

pip install "dalla-data-processing[all]"

Modular Installation (Advanced)

Install only the components you need to keep dependencies minimal:

# Base installation (no processing features, only core dependencies)
pip install dalla-data-processing

# Install specific features
pip install "dalla-data-processing[dedup]"        # Deduplication only
pip install "dalla-data-processing[stem]"         # Stemming only
pip install "dalla-data-processing[quality]"      # Quality checking only
pip install "dalla-data-processing[readability]"  # Readability scoring only
pip install "dalla-data-processing[pack]"         # Dataset packing only

# Combine multiple features
pip install "dalla-data-processing[dedup,stem,quality]"
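To confirm which extras the installed release actually declares, a small standard-library check can help. This is a minimal sketch, not part of the package: the helper name `declared_extras` is hypothetical, and it reports the extras listed in the package metadata, not which of them you installed.

```python
from importlib import metadata


def declared_extras(dist_name: str = "dalla-data-processing") -> list[str]:
    """Return the extras declared in a distribution's metadata.

    Hypothetical helper: returns an empty list when the distribution
    is not installed in the current environment.
    """
    try:
        meta = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return []
    # "Provides-Extra" appears once per declared extra; get_all returns None if absent.
    return sorted(meta.get_all("Provides-Extra") or [])


if __name__ == "__main__":
    print(declared_extras())
```

An empty result means either that the package is not installed in the active environment or that its metadata declares no extras.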

Development Installation

From Source (with uv - recommended)

git clone https://github.com/U4RASD/dalla-data-processing.git
cd dalla-data-processing

# Install all features and dev dependencies
uv sync --all-extras

# Or install with specific extras only
uv sync --extra dedup --extra stem

From Source (with pip)

git clone https://github.com/U4RASD/dalla-data-processing.git
cd dalla-data-processing

# Install with all features for development
pip install -e ".[all,dev]"

Components

Note: Each component requires its corresponding extra to be installed. Install with [all] to enable all features, or see Modular Installation to install only what you need.

1. Deduplication

Detect and remove duplicate or near-duplicate documents from your datasets using the Onion algorithm.

  • Requires: [dedup] extra
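The general idea behind n-gram based near-duplicate detection can be sketched as follows. This is a simplified illustration only, not the Onion implementation and not this package's API: the function names, the whitespace tokenization, and the default values for `n` and `threshold` are all assumptions chosen for the example.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def keep_indices(docs: list[str], n: int = 5, threshold: float = 0.5) -> list[int]:
    """Return indices of documents that are not near-duplicates of earlier ones.

    A document is dropped when more than `threshold` of its n-grams
    were already seen in previously kept documents (illustrative rule).
    """
    seen: set[tuple[str, ...]] = set()
    kept: list[int] = []
    for idx, doc in enumerate(docs):
        grams = ngrams(doc.split(), n)
        # Documents shorter than n tokens have no n-grams and are always kept.
        dup_ratio = len(grams & seen) / len(grams) if grams else 0.0
        if dup_ratio > threshold:
            continue
        kept.append(idx)
        seen |= grams  # only kept documents contribute to the seen set
    return kept


if __name__ == "__main__":
    sample = ["a b c d e f", "a b c d e f", "x y z w v u"]
    print(keep_indices(sample, n=3))
```

Processing documents in order means the first occurrence is the one that survives; a real pipeline would also need proper Arabic tokenization rather than `str.split`.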

2. Stemming

Apply morphological analysis and stemming using CAMeL Tools.

  • Requires: [stem] extra

3. Quality Checking

Check text quality using morphological analysis to detect errors and foreign words.

  • Requires: [quality] extra

4. Readability Scoring

Calculate readability scores using the Flesch Reading Ease and Osman methods. Documents can also be ranked according to both scores.

  • Requires: [readability] extra
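For orientation, the standard Flesch Reading Ease formula can be written as a short function. This is a sketch of the published English formula only; how the package counts words, sentences, and syllables for Arabic text is not described here, and the function names below are hypothetical, not the package's API.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease: higher scores mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def rank_by_score(scores: dict[str, float]) -> list[str]:
    """Return document ids ordered from highest (easiest) to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # 100 words, 5 sentences, 150 syllables
    print(flesch_reading_ease(100, 5, 150))
    print(rank_by_score({"doc_a": 10.0, "doc_b": 80.0, "doc_c": 45.5}))
```

The same ranking helper would apply to any per-document score, including an Osman score computed separately.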

5. Dataset Packing

Pack and prepare datasets for training.

  • Requires: [pack] extra

