📚 Rdlab Dataset
Overview
Rdlab Dataset is a collection of Khmer datasets made for research & development (R&D) in Cambodia. It provides ready-to-use datasets and utilities for text and image processing.
Datasets and tools included:
- Khmer Words
- Khmer Addresses
- Khmer Sentences
- Khmer Combination Locations
- Khmer-English Combined Dataset
- Text-to-Image Generators (with noise)
📦 Dataset Statistics (v0.3.0)
| Dataset Name | Number of Records |
|---|---|
| Khmer Word Dataset | 784,011 |
| Khmer Address Dataset | 20,817 |
| Khmer Sentence Dataset | 211,928 |
| Combination Location in Cambodia | 200,000 |
| Khmer English Dataset | 545,941 |
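As a quick sanity check on the table above, the record counts can be totaled and each dataset's share computed. This is only a sketch: the numbers are copied from the v0.3.0 table, and the variable names are illustrative rather than part of the package.

```python
# Record counts copied from the v0.3.0 statistics table above.
RECORD_COUNTS = {
    "Khmer Word Dataset": 784_011,
    "Khmer Address Dataset": 20_817,
    "Khmer Sentence Dataset": 211_928,
    "Combination Location in Cambodia": 200_000,
    "Khmer English Dataset": 545_941,
}

total = sum(RECORD_COUNTS.values())
print(f"Total records: {total:,}")

# Print each dataset with its share of the total, largest first.
for name, count in sorted(RECORD_COUNTS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<34} {count:>9,}  {count / total:6.1%}")
```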
📥 Installation
Install using pip:
pip install rdlab_dataset
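To confirm that the install worked, one option is to ask Python for the installed version. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, and it assumes the distribution is registered under the same name used in the pip command.

```python
from importlib import metadata


def installed_version(package="rdlab_dataset"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("rdlab_dataset is not installed in this environment")
else:
    print(f"rdlab_dataset {version} is installed")
```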
Quick example
from rdlab_dataset.module import KhmerDatasetLoader, TextArrayListImageGenerator
# Load datasets
word_loader = KhmerDatasetLoader("word")
address_loader = KhmerDatasetLoader("address")
sentence_loader = KhmerDatasetLoader("sentence")
# Get data
words = word_loader.get_all()
addresses = address_loader.get_all()
sentences = sentence_loader.get_all()
# Combine into one array
combined_texts = words + addresses + sentences
print(f"Total combined items: {len(combined_texts)}")
# Image generator
text_image_gen = TextArrayListImageGenerator(
    customize_font=True,
    folder_limit=10,
    output_count=4,
    num_threads=4
)
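# Generate images from the address list.
# The font_folder below is the author's local path; replace it with your own folder of .ttf fonts.
text_image_gen.generate_images(
    text_list=addresses,
    font_folder="/home/vitoupro/code/rdlab-dataset/test_font"
)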
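Because the example above depends on a `font_folder` that exists on your machine, it can help to check the folder before starting a long generation run. The helper below is a hypothetical sketch (`list_ttf_fonts` is not part of rdlab_dataset) that assumes the generator expects a directory of `.ttf` files, as the feature list suggests. The demo uses a temporary folder with a placeholder font file name.

```python
import tempfile
from pathlib import Path


def list_ttf_fonts(font_folder):
    """Return the sorted .ttf files in font_folder, or raise if there are none."""
    folder = Path(font_folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Font folder not found: {folder}")
    fonts = sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".ttf"
    )
    if not fonts:
        raise ValueError(f"No .ttf files found in {folder}")
    return fonts


# Demo with a throwaway folder; the file names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example_font.ttf").touch()
    (Path(tmp) / "notes.txt").touch()
    demo_fonts = [p.name for p in list_ttf_fonts(tmp)]

print(demo_fonts)
```

Calling `list_ttf_fonts(your_folder)` before `generate_images(...)` turns a missing or empty font folder into an immediate, readable error.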
🧠 Module Overview
- KhmerDatasetLoader — Unified loader for all dataset types
- ATextImageGenerator — Generate noisy images from single text
- TextArrayListImageGenerator — Generate noisy images from a list with annotations
🔍 Usage
📝 KhmerDatasetLoader
from rdlab_dataset.module import KhmerDatasetLoader
# Supported types: "word", "address", "sentence", "location", "khmer_english"
loader = KhmerDatasetLoader("word")
# Access methods
all_data = loader.get_all()
first = loader.get_first()
first_five = loader.get_n_first(5)
exists = loader.find("សេចក្ដី")
length = len(loader)
print(first, first_five, exists, length)
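To make the access methods above concrete, here is a minimal stand-in that mimics them on top of a pickled list. It is not the rdlab_dataset implementation: `ListBackedLoader` is a hypothetical class, and it assumes each dataset is a pickled list of strings and that `find` returns a boolean (as the `exists` variable name above suggests).

```python
import pickle
import tempfile
from pathlib import Path


class ListBackedLoader:
    """Hypothetical stand-in for illustration; not the library's own class."""

    def __init__(self, pkl_path):
        # Only unpickle files you trust: pickle can execute arbitrary code.
        with open(pkl_path, "rb") as fh:
            self._records = list(pickle.load(fh))

    def get_all(self):
        return list(self._records)  # return a copy so callers cannot mutate state

    def get_first(self):
        return self._records[0] if self._records else None

    def get_n_first(self, n):
        return self._records[:n]

    def find(self, text):
        return text in self._records  # exact-match membership test

    def __len__(self):
        return len(self._records)


# Demo: write a tiny pickled word list, then load it back.
SAMPLE_WORDS = ["ភាសាខ្មែរ", "សេចក្ដី", "សាលា"]
with tempfile.TemporaryDirectory() as tmp:
    pkl_file = Path(tmp) / "words.pkl"
    pkl_file.write_bytes(pickle.dumps(SAMPLE_WORDS))
    demo_loader = ListBackedLoader(pkl_file)

print(demo_loader.get_first(), demo_loader.get_n_first(2))
print(demo_loader.find(SAMPLE_WORDS[1]), len(demo_loader))
```

The real loader may differ in return types or matching rules, so treat this only as a mental model of the API.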
🖼 TextArrayListImageGenerator (Text List with Annotations)
from rdlab_dataset.module import TextArrayListImageGenerator
texts = ["ខ្ញុំស្រឡាញ់វេយ្យាករណ៍", "ភាសាខ្មែរ"]
gen = TextArrayListImageGenerator(customize_font=False)
gen.generate_images(texts, save_as_pickle=False, output_count=3)
✅ Features
- Easy dataset loading (.pkl format)
- Support for .ttf fonts and .jpg background assets
- Built-in noisy image generation (Gaussian, Speckle, Salt & Pepper, etc.)
- Clean data APIs: get all, first, or n-first records
- Lightweight, fast, and ready for machine learning pipelines
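For readers unfamiliar with the noise types named in the feature list, the sketch below shows what Gaussian, speckle, and salt-and-pepper noise do to a grayscale image. It is a standard-library illustration on nested lists, not the package's own code; the function names and default parameters are assumptions, and a real pipeline would more likely operate on NumPy or Pillow images.

```python
import random


def clamp(value):
    """Round to an int and keep it inside the 0..255 grayscale range."""
    return max(0, min(255, int(round(value))))


def gaussian_noise(image, sigma=20.0, rng=random):
    """Additive noise: each pixel gets an independent N(0, sigma) offset."""
    return [[clamp(px + rng.gauss(0.0, sigma)) for px in row] for row in image]


def speckle_noise(image, sigma=0.2, rng=random):
    """Multiplicative noise: each pixel is scaled by (1 + N(0, sigma))."""
    return [[clamp(px * (1.0 + rng.gauss(0.0, sigma))) for px in row] for row in image]


def salt_and_pepper(image, amount=0.05, rng=random):
    """Set roughly `amount` of the pixels to pure black (0) or white (255)."""
    noisy = []
    for row in image:
        new_row = []
        for px in row:
            r = rng.random()
            if r < amount / 2:
                new_row.append(0)      # pepper
            elif r < amount:
                new_row.append(255)    # salt
            else:
                new_row.append(px)     # unchanged
        noisy.append(new_row)
    return noisy


# Demo on a flat mid-gray 8x8 image with a seeded generator for repeatability.
rng = random.Random(0)
image = [[128] * 8 for _ in range(8)]
noisy_gauss = gaussian_noise(image, rng=rng)
noisy_speckle = speckle_noise(image, rng=rng)
noisy_sp = salt_and_pepper(image, amount=0.2, rng=rng)
print(noisy_gauss[0])
print(noisy_speckle[0])
print(noisy_sp[0])
```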
🤝 Contributing
- Fork the repository, commit your changes, and submit a pull request. All contributions are welcome!
📜 License
Rdlab Community License
- 📲 Telegram: 0964060587