Advanced Data Crawling and Processing Framework
DataMax
Documentation Portal: https://hi-dolphin.github.io/datamax
A powerful multi-format file parsing, data cleaning, and AI annotation toolkit built for modern Python applications.
✨ Key Features
- 🔄 Multi-format Support: PDF, DOCX/DOC, PPT/PPTX, XLS/XLSX, HTML, EPUB, TXT, images, and more
- 🧹 Intelligent Cleaning: Advanced data cleaning with anomaly detection, privacy protection, and text filtering
- 🤖 AI Annotation: LLM-powered automatic annotation and QA generation
- ⚡ High Performance: Efficient batch processing with caching and parallel execution
- 🎯 Developer Friendly: Modern SDK design with type hints, configuration management, and comprehensive error handling
- ☁️ Cloud Ready: Built-in support for OSS, MinIO, and other cloud storage providers
🚀 Quick Start
Install
pip install pydatamax
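To confirm that the install succeeded, the following minimal sketch looks up the installed distribution version using only the standard library. The helper name `installed_version` is hypothetical and is not part of DataMax.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "pydatamax") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing.

    `installed_version` is a hypothetical helper for illustration only.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if pydatamax is installed, otherwise None.
print(installed_version())
```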
Examples
from datamax import DataMax

# Input files and LLM settings (replace the placeholders with your own values)
FILE_PATHS = ["/your/file/path/1.md", "/your/file/path/2.doc", "/your/file/path/3.xlsx"]
LABEL_LLM_API_KEY = "YOUR_API_KEY"
LABEL_LLM_BASE_URL = "YOUR_BASE_URL"
LABEL_LLM_MODEL_NAME = "YOUR_MODEL_NAME"
LLM_TRAIN_OUTPUT_FILE_NAME = "train"

# Initialize the client
client = DataMax(file_path=FILE_PATHS)

# Parse the files
data = client.get_data()

# Extract the parsed content
content = data.get("content")

# Generate pre-labeled data; returns a trainable QA list
qa = client.get_pre_label(
    content=content,
    api_key=LABEL_LLM_API_KEY,
    base_url=LABEL_LLM_BASE_URL,
    model_name=LABEL_LLM_MODEL_NAME,
    question_number=50,  # number of questions per chunk
    max_qps=100.0,
    debug=False,
    structured_data=True,  # enable structured output
    auto_self_review_mode=True,  # auto-review QA pairs: keep scores 4 and 5, drop scores 1, 2, and 3
    review_max_qps=100.0,
)

# Save the labeled data
client.save_label_data(qa, LLM_TRAIN_OUTPUT_FILE_NAME)
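The `auto_self_review_mode` comment above describes a score-based rule: QA pairs scored 4 or 5 pass, and pairs scored 1, 2, or 3 are dropped. The following is a minimal sketch of that rule only, not DataMax's actual implementation. The function name `filter_by_review_score` and the `score` field are assumptions made for illustration.

```python
from typing import Any, Dict, List


def filter_by_review_score(
    qa_items: List[Dict[str, Any]], min_score: int = 4
) -> List[Dict[str, Any]]:
    """Keep QA pairs whose review score is at least `min_score`.

    Assumes each item is a dict carrying an integer "score" key
    (a hypothetical field name); items without a score are dropped.
    """
    return [item for item in qa_items if item.get("score", 0) >= min_score]


# Example data with made-up questions, answers, and scores.
reviewed = [
    {"question": "q1", "answer": "a1", "score": 5},
    {"question": "q2", "answer": "a2", "score": 3},
    {"question": "q3", "answer": "a3", "score": 4},
]
kept = filter_by_review_score(reviewed)
print([item["question"] for item in kept])
```

With the default threshold of 4, only the items scored 4 and 5 remain in `kept`.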
📚 Documentation
- See docs/index.md. Sections: Getting Started, Parsing, Cleaning, Labeling, Crawling, Evaluation, CLI, API, Extending, FAQ
- For the complete text-modal QA generation pipeline, see examples/scripts/generate_qa.py
🤝 Contributing
Issues and Pull Requests are welcome!
📄 License
This project is licensed under the MIT License.
📞 Contact Us
- 📧 Email: cy.kron@foxmail.com, wang.xiangyuxy@outlook.com
- 🐛 Issues: GitHub Issues
- 📚 Documentation: Project Homepage
⭐ If this project helps you, please give us a star!
Download files
File details
Details for the file pydatamax-0.2.1.tar.gz.
File metadata
- Download URL: pydatamax-0.2.1.tar.gz
- Upload date:
- Size: 238.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9a12d130cb400a8d27abe042ddb5d6540fdb53d47bc50d5bf2929ef1c850caa1 |
| MD5 | be9d9c96be48fa28c13b182106b74a4c |
| BLAKE2b-256 | eb140483d5fae59648fcb5610cb6045a0e9172216af9dfd9155df046a557fdb4 |
File details
Details for the file pydatamax-0.2.1-py3-none-any.whl.
File metadata
- Download URL: pydatamax-0.2.1-py3-none-any.whl
- Upload date:
- Size: 17.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a94ea8b2bc6cbbdf2d00b1345c8f5f851743f07e2ebb146645b6e0821b390636 |
| MD5 | 311f3a2247e4519896d07f8fc08c578e |
| BLAKE2b-256 | 9a706c44c02a2ff39b7754c49811c226de9c6886d37a3747218e7ee5684c2601 |
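A downloaded file can be checked against its published SHA256 digest before installing it. The sketch below uses only the standard library `hashlib` module. The helper names `sha256_of_file` and `verify_download` are hypothetical, and the local file path is whatever location you saved the download to.

```python
import hashlib

# SHA256 digest published for pydatamax-0.2.1.tar.gz (copied from the file hashes listed here).
EXPECTED_SDIST_SHA256 = "9a12d130cb400a8d27abe042ddb5d6540fdb53d47bc50d5bf2929ef1c850caa1"


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest matches the expected digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `verify_download("pydatamax-0.2.1.tar.gz", EXPECTED_SDIST_SHA256)` should return `True` for an intact copy of the source distribution, assuming the file sits in the current directory.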