📚 Rdlab Dataset

Overview

Rdlab Dataset is a collection of Khmer datasets made for research & development (R&D) in Cambodia. It provides ready-to-use datasets and utilities for text and image processing.

Datasets and tools included:

  • Khmer Words
  • Khmer Addresses
  • Khmer Sentences
  • Khmer Combination Locations
  • Khmer-English Combined Dataset
  • Text-to-Image Generators (with noise)

📦 Dataset Statistics (v0.3.0)

Dataset Name                       | Number of Records
Khmer Word Dataset                 | 784,011
Khmer Address Dataset              | 20,817
Khmer Sentence Dataset             | 211,928
Combination Location in Cambodia   | 200,000
Khmer English Dataset              | 545,941
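
For a quick sense of scale, the record counts listed above can be totaled in a few lines. This is only a sketch over the published numbers; the dictionary keys are informal labels chosen for illustration, not identifiers from the package.

```python
# Record counts copied from the statistics listed above (v0.3.0).
record_counts = {
    "Khmer Word Dataset": 784_011,
    "Khmer Address Dataset": 20_817,
    "Khmer Sentence Dataset": 211_928,
    "Combination Location in Cambodia": 200_000,
    "Khmer English Dataset": 545_941,
}

total_records = sum(record_counts.values())
print(f"Total records: {total_records:,}")

# Share of each dataset in the overall collection, largest first.
for name, count in sorted(record_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {count / total_records:.1%}")
```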

📥 Installation

Install using pip:

pip install rdlab_dataset

Quick start example

from rdlab_dataset.module import KhmerDatasetLoader, TextArrayListImageGenerator

# Load datasets
word_loader = KhmerDatasetLoader("word")
address_loader = KhmerDatasetLoader("address")
sentence_loader = KhmerDatasetLoader("sentence")

# Get data
words = word_loader.get_all()
addresses = address_loader.get_all()
sentences = sentence_loader.get_all()

# Combine into one array
combined_texts = words + addresses + sentences
print(f"Total combined items: {len(combined_texts)}")

# Image generator
text_image_gen = TextArrayListImageGenerator(
    customize_font=True,
    folder_limit=10,
    output_count=4,
    num_threads=4
)

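# Generate images from the address list.
# font_folder should point to a local folder of .ttf fonts;
# replace the example path below with your own.
text_image_gen.generate_images(
    text_list=addresses,
    font_folder="/home/vitoupro/code/rdlab-dataset/test_font"
)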

🧠 Module Overview

  • KhmerDatasetLoader: Unified loader for all dataset types
  • ATextImageGenerator: Generates noisy images from a single text string
  • TextArrayListImageGenerator: Generates noisy images, with annotations, from a list of texts

🔍 Usage

📝 KhmerDatasetLoader

from rdlab_dataset.module import KhmerDatasetLoader

# Supported types: "word", "address", "sentence", "location", "khmer_english"
loader = KhmerDatasetLoader("word")

# Access methods
all_data = loader.get_all()
first = loader.get_first()
first_five = loader.get_n_first(5)
exists = loader.find("សេចក្ដី")
length = len(loader)

print(first, first_five, exists, length)
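
To make the access methods concrete, here is a minimal stand-in written from scratch. It is not the `rdlab_dataset` implementation. The class name `PickleListLoader` is hypothetical, and two assumptions are baked in: that a dataset file is a pickled list of strings, and that `find` reports whether an exact match exists.

```python
import pickle
import tempfile
from pathlib import Path


class PickleListLoader:
    """Hypothetical stand-in for illustration, NOT the rdlab_dataset class."""

    def __init__(self, pkl_path):
        # Assumption: the .pkl file stores a plain list of strings.
        with open(pkl_path, "rb") as fh:
            self._records = list(pickle.load(fh))

    def get_all(self):
        # Return a copy so callers cannot mutate the internal list.
        return list(self._records)

    def get_first(self):
        return self._records[0] if self._records else None

    def get_n_first(self, n):
        return self._records[:n]

    def find(self, text):
        # Assumption: find() answers "is this exact string in the dataset?"
        return text in self._records

    def __len__(self):
        return len(self._records)


# Build a tiny pickle file to exercise the class.
sample_words = ["សេចក្ដី", "ភាសាខ្មែរ", "កម្ពុជា"]
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "words.pkl"
    with open(pkl_path, "wb") as fh:
        pickle.dump(sample_words, fh)
    demo_loader = PickleListLoader(pkl_path)

print(demo_loader.get_first(), demo_loader.get_n_first(2),
      demo_loader.find("សេចក្ដី"), len(demo_loader))
```

Because the records are read into memory in the constructor, the demo loader keeps working after the temporary file is removed.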

🖼 TextArrayListImageGenerator (Text List with Annotations)

from rdlab_dataset.module import TextArrayListImageGenerator

texts = ["ខ្ញុំស្រឡាញ់វេយ្យាករណ៍", "ភាសាខ្មែរ"]
gen = TextArrayListImageGenerator(customize_font=False)
gen.generate_images(texts, save_as_pickle=False, output_count=3)
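
The annotation format written by the generator is not described here. As a rough mental model only, the sketch below pairs each planned image file with its source text, producing `output_count` variants per entry. The function `plan_annotations`, the file naming pattern, and the `.jpg` extension are all hypothetical and are not part of `rdlab_dataset`.

```python
def plan_annotations(texts, output_count=3):
    """Return (file_name, text) pairs: output_count image variants per text.

    Hypothetical helper for illustration; the naming scheme is an assumption.
    """
    pairs = []
    for text_index, text in enumerate(texts):
        for variant in range(output_count):
            file_name = f"text{text_index:05d}_v{variant}.jpg"
            pairs.append((file_name, text))
    return pairs


texts = ["ខ្ញុំស្រឡាញ់វេយ្យាករណ៍", "ភាសាខ្មែរ"]
annotations = plan_annotations(texts, output_count=3)

# One tab-separated line per generated image: file name, then label text.
for file_name, text in annotations:
    print(f"{file_name}\t{text}")
```

Keeping the label next to the file name is the usual shape for OCR training data, where each noisy image needs its ground-truth text.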

✅ Features

  • Easy dataset loading (.pkl format)
  • Supports .ttf fonts and .jpg background assets
  • Built-in noisy image generation (Gaussian, Speckle, Salt & Pepper, etc.)
  • Clean data APIs: get all, first, or n-first records
  • Lightweight, fast, and ready for machine learning pipelines
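
The three noise types named in the feature list can be illustrated on a tiny grayscale image using only the standard library. This is a sketch of the general techniques, not the package's own code: the function names, default parameters, and list-of-lists image representation are assumptions made for illustration.

```python
import random


def clamp(value):
    """Keep a pixel value inside the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def add_gaussian_noise(image, sigma=20.0, rng=random):
    # Additive noise: pixel + N(0, sigma).
    return [[clamp(p + rng.gauss(0.0, sigma)) for p in row] for row in image]


def add_speckle_noise(image, sigma=0.2, rng=random):
    # Multiplicative noise: pixel * (1 + N(0, sigma)).
    return [[clamp(p * (1.0 + rng.gauss(0.0, sigma))) for p in row] for row in image]


def add_salt_and_pepper(image, amount=0.05, rng=random):
    # With probability `amount`, replace a pixel by white (salt) or black (pepper).
    noisy = []
    for row in image:
        new_row = []
        for p in row:
            if rng.random() < amount:
                new_row.append(255 if rng.random() < 0.5 else 0)
            else:
                new_row.append(p)
        noisy.append(new_row)
    return noisy


rng = random.Random(0)  # seeded so repeated runs give the same result
image = [[128] * 8 for _ in range(8)]  # flat mid-gray 8x8 image

gaussian = add_gaussian_noise(image, rng=rng)
speckle = add_speckle_noise(image, rng=rng)
salt_pepper = add_salt_and_pepper(image, rng=rng)

print(gaussian[0])
print(speckle[0])
print(salt_pepper[0])
```

Gaussian noise shifts every pixel by a small random offset, speckle scales each pixel so brighter regions vary more, and salt and pepper leaves most pixels untouched while forcing a few to pure black or white.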

🤝 Contributing

  • Fork the repository, commit your changes, and submit a pull request. All contributions are welcome!

📜 License

Rdlab Community License

  • 📲 Telegram: 0964060587
