A library for parsing and converting various file formats.

DataMax

A powerful multi-format file parsing, data cleaning, and AI annotation toolkit built for modern Python applications.

✨ Key Features

  • 🔄 Multi-format Support: PDF, DOCX/DOC, PPT/PPTX, XLS/XLSX, HTML, EPUB, TXT, images, and more
  • 🧹 Intelligent Cleaning: Advanced data cleaning with anomaly detection, privacy protection, and text filtering
  • 🤖 AI Annotation: LLM-powered automatic annotation and QA generation
  • ⚡ High Performance: Efficient batch processing with caching and parallel execution
  • 🎯 Developer Friendly: Modern SDK design with type hints, configuration management, and comprehensive error handling
  • ☁️ Cloud Ready: Built-in support for OSS, MinIO, and other cloud storage providers

🚀 Quick Start

Install

pip install pydatamax
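
Note that the distribution is installed as `pydatamax` but imported as `datamax`. As a minimal sketch for confirming the install from Python, the standard library's `importlib.metadata` can report the installed version; the helper name `installed_version` is hypothetical and not part of DataMax.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "pydatamax" is the distribution name used by pip, not the import name.
    found = installed_version("pydatamax")
    print(found if found else "pydatamax is not installed")
```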

Examples

from datamax import DataMax

# Input files and LLM settings (replace the placeholders with your own values)
FILE_PATHS = ["/your/file/path/1.md", "/your/file/path/2.doc", "/your/file/path/3.xlsx"]
LABEL_LLM_API_KEY = "YOUR_API_KEY"
LABEL_LLM_BASE_URL = "YOUR_BASE_URL"
LABEL_LLM_MODEL_NAME = "YOUR_MODEL_NAME"
LLM_TRAIN_OUTPUT_FILE_NAME = "train"

# Initialize the client with the files to parse
client = DataMax(file_path=FILE_PATHS)

# Generate pre-labeled data; returns a list of QA pairs usable for training
qa_list = client.get_pre_label(
    api_key=LABEL_LLM_API_KEY,
    base_url=LABEL_LLM_BASE_URL,
    model_name=LABEL_LLM_MODEL_NAME,
    question_number=10,
    max_workers=5)

# Save the labeled data under the given output file name
client.save_label_data(qa_list, LLM_TRAIN_OUTPUT_FILE_NAME)
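
The example above hard-codes the API key as a string. One possible alternative, sketched here under assumptions, is to read the three LLM settings from environment variables and fail early if any is missing. The helper `load_label_llm_config` and the environment variable names are hypothetical and not part of DataMax; only the keyword names `api_key`, `base_url`, and `model_name` come from the example above.

```python
import os
from typing import Mapping, Optional

# Hypothetical environment variable names (an assumption, not a DataMax
# convention), keyed by the keyword arguments used in the example above.
ENV_KEYS = {
    "api_key": "LABEL_LLM_API_KEY",
    "base_url": "LABEL_LLM_BASE_URL",
    "model_name": "LABEL_LLM_MODEL_NAME",
}


def load_label_llm_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the LLM settings from a mapping (default: os.environ).

    Raises RuntimeError naming every variable that is missing or empty.
    """
    source = os.environ if env is None else env
    missing = [name for name in ENV_KEYS.values() if not source.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {arg: source[name] for arg, name in ENV_KEYS.items()}


# Possible use with the client from the example above (not executed here):
# qa_list = client.get_pre_label(
#     **load_label_llm_config(), question_number=10, max_workers=5)
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to test and avoids touching real credentials.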

🤝 Contributing

Issues and Pull Requests are welcome!

📄 License

This project is licensed under the MIT License.

⭐ If this project helps you, please give us a star!
