Advanced Data Crawling and Processing Framework

DataMax

Documentation Portal: https://hi-dolphin.github.io/datamax

A powerful multi-format file parsing, data cleaning, and AI annotation toolkit built for modern Python applications.

✨ Key Features

  • 🔄 Multi-format Support: PDF, DOCX/DOC, PPT/PPTX, XLS/XLSX, HTML, EPUB, TXT, images, and more
  • 🧹 Intelligent Cleaning: Advanced data cleaning with anomaly detection, privacy protection, and text filtering
  • 🤖 AI Annotation: LLM-powered automatic annotation and QA generation
  • ⚡ High Performance: Efficient batch processing with caching and parallel execution
  • 🎯 Developer Friendly: Modern SDK design with type hints, configuration management, and comprehensive error handling
  • ☁️ Cloud Ready: Built-in support for OSS, MinIO, and other cloud storage providers

🚀 Quick Start

Install

pip install pydatamax
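
Note that the distribution is installed as pydatamax but imported as datamax. The following is a minimal sketch, using only the Python standard library, for checking that the install succeeded. The helper name describe_install is hypothetical and not part of DataMax.

```python
from importlib import metadata, util


def describe_install(dist_name: str, module_name: str) -> str:
    """Report whether a module is importable and which distribution version provides it."""
    # find_spec returns None when a top-level module cannot be located.
    if util.find_spec(module_name) is None:
        return f"{module_name} is not importable; try: pip install {dist_name}"
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The module exists but no distribution with this name is installed.
        version = "unknown"
    return f"{module_name} is importable ({dist_name} {version})"


# The distribution name and the import name differ for this project.
print(describe_install("pydatamax", "datamax"))
```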

Examples

from datamax import DataMax

# prepare info
FILE_PATHS = ["/your/file/path/1.md", "/your/file/path/2.doc", "/your/file/path/3.xlsx"]
LABEL_LLM_API_KEY = "YOUR_API_KEY"
LABEL_LLM_BASE_URL = "YOUR_BASE_URL"
LABEL_LLM_MODEL_NAME = "YOUR_MODEL_NAME"
LLM_TRAIN_OUTPUT_FILE_NAME = "train"

# init client
client = DataMax(file_path=FILE_PATHS)

# parse the files
data = client.get_data()

# get the parsed content
content = data.get("content")

# generate pre-labeled data; returns a trainable QA list
qa_list = client.get_pre_label(
    content=content,
    api_key=LABEL_LLM_API_KEY,
    base_url=LABEL_LLM_BASE_URL,
    model_name=LABEL_LLM_MODEL_NAME,
    question_number=50,  # number of questions per chunk
    max_qps=100.0,
    debug=False,
    structured_data=True,  # enable structured output
    auto_self_review_mode=True,  # auto-review QA pairs: keep scores 4 and 5, drop scores 1, 2, and 3
    review_max_qps=100.0,
)

# save label data
client.save_label_data(qa_list, LLM_TRAIN_OUTPUT_FILE_NAME)
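
The auto_self_review_mode comment above describes a score-based filter: QA pairs scored 4 or 5 pass, and pairs scored 1, 2, or 3 are dropped. The sketch below illustrates that rule in isolation. It is not DataMax's implementation; the function name filter_reviewed and the dictionary keys ("question", "answer", "score") are assumptions made for illustration only.

```python
from typing import Any

# Scores that pass review, per the comment in the example above.
PASSING_SCORES = {4, 5}


def filter_reviewed(qa_pairs: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only QA pairs whose review score is 4 or 5."""
    # Pairs with a missing score are dropped as well (hypothetical choice).
    return [qa for qa in qa_pairs if qa.get("score") in PASSING_SCORES]


# Hypothetical reviewed pairs; the field names are illustrative.
sample = [
    {"question": "Q1", "answer": "A1", "score": 5},
    {"question": "Q2", "answer": "A2", "score": 3},
    {"question": "Q3", "answer": "A3", "score": 4},
    {"question": "Q4", "answer": "A4", "score": 1},
]

kept = filter_reviewed(sample)
print([qa["question"] for qa in kept])
```

Filtering on a fixed set of passing scores keeps the rule explicit and easy to adjust if a stricter threshold is wanted.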

📚 Documentation

  • See docs: docs/index.md
  • Sections: Getting Started, Parsing, Cleaning, Labeling, Crawling, Evaluation, CLI, API, Extending, FAQ
  • For the complete text-modal QA generation pipeline, see examples/scripts/generate_qa.py

🤝 Contributing

Issues and Pull Requests are welcome!

📄 License

This project is licensed under the MIT License.

⭐ If this project helps you, please give us a star!
