Doc Master 📚
A powerful, lightweight Python library for automated file reading and content extraction. Doc Master simplifies the process of reading various file formats into string representations, making it perfect for data processing, content analysis, and document management systems.
🚀 Features
- Universal File Reading: Seamlessly handle multiple file formats including:
  - PDF documents
  - Microsoft Word documents (.docx)
  - Excel spreadsheets
  - Text files
  - XML documents
  - Images (with base64 encoding)
  - Binary files
- Smart Format Detection: Automatic file type detection and appropriate processing
- Flexible Output: Choose between string or dictionary output formats
- Batch Processing: Process entire folders of documents efficiently
- Encoding Detection: Smart encoding detection for text files
- Enterprise-Ready: Built with stability and performance in mind
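To give a rough idea of what format detection, encoding detection, and base64 output for images can look like, here is a minimal standard-library sketch. It is not Doc Master's actual implementation: the names detect_kind, decode_text, read_as_string, and STRICT_ENCODINGS are hypothetical, and the list of encodings to try is an assumption made for illustration.

```python
import base64
import mimetypes
from pathlib import Path

# Hypothetical list of strict encodings to try in order; not taken from Doc Master.
STRICT_ENCODINGS = ("utf-8", "cp1252")


def detect_kind(path):
    """Classify a file as 'image', 'text', or 'binary' from its extension."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None:
        return "binary"
    if mime.startswith("image/"):
        return "image"
    if mime.startswith("text/") or mime in ("application/xml", "application/json"):
        return "text"
    return "binary"


def decode_text(data):
    """Decode bytes with the first strict encoding that succeeds."""
    for encoding in STRICT_ENCODINGS:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort cannot fail.
    return data.decode("latin-1")


def read_as_string(path):
    """Return text files as decoded text and everything else as base64."""
    path = Path(path)
    data = path.read_bytes()
    if detect_kind(path) == "text":
        return decode_text(data)
    return base64.b64encode(data).decode("ascii")
```

Guessing the type from the extension is cheap but can be fooled by misnamed files. A production reader would likely also inspect file contents, and would use format-specific parsers for PDF, Word, and Excel files rather than falling back to base64.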
📦 Installation
pip install -U doc-master
🔧 Quick Start
from doc_master import doc_master
# Read all files in a folder
results = doc_master(folder_path="path/to/folder", output_type="dict")
# Or read a single file
content = doc_master(file_path="path/to/file.docx")
📋 Requirements
- Python 3.8+
- pandas
- pypdf
- python-docx
- Pillow
🤝 Contributing
We love your input! We want to make contributing to Doc Master as easy and transparent as possible. Here's how you can help:
- Fork the repo
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Check out our Contributing Guidelines for more details.
🌟 Support the Project
If you find Doc Master useful, please consider:
- Starring the repository ⭐
- Following us on GitHub
- Joining our Discord community
- Sharing the project with others
📖 Documentation
For detailed documentation, visit our Wiki.
Basic Usage Examples
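# Assumption: these names are importable from the top-level package,
# like doc_master in the Quick Start; adjust the import if they live elsewhere.
from doc_master import AutoFileReader, doc_master, read_single_file

# Read a PDF file
content = read_single_file("document.pdf")
# Read an Excel file with specific sheet
reader = AutoFileReader()
content = reader.read_file("spreadsheet.xlsx", sheet_name="Data")
# Process a folder of documents
results = doc_master(
    folder_path="documents/",
    output_type="dict"
)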
🔍 Error Handling
The library includes comprehensive error handling:
try:
    content = read_single_file("file.pdf")
except Exception as e:
    print(f"Error processing file: {e}")
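When processing a whole folder, it can be useful to keep going after one file fails and to report the failures at the end. The sketch below shows one way to do that around any single-file reader. The helper read_folder_safely and its read_one parameter are hypothetical names for illustration and are not part of Doc Master's documented API.

```python
from pathlib import Path


def read_folder_safely(folder_path, read_one):
    """Read every file in a folder, collecting failures instead of stopping.

    read_one is any callable that takes a path string and returns the
    file's content, for example a single-file reader function.
    """
    contents = {}
    errors = {}
    for path in sorted(Path(folder_path).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories
        try:
            contents[path.name] = read_one(str(path))
        except Exception as exc:  # broad on purpose: one bad file must not stop the batch
            errors[path.name] = f"{type(exc).__name__}: {exc}"
    return contents, errors
```

A call such as read_folder_safely("documents/", read_single_file) would then return two dictionaries keyed by file name, one with the extracted content and one with the error messages, assuming read_single_file accepts a path string as in the examples above.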
🛣️ Roadmap
- Add OCR capabilities for image processing
- Support for additional file formats
- Performance optimizations for large files
- Async file processing
- Command-line interface (CLI)
💬 Community and Support
- Join our Discord server for discussions and support
- Check out our GitHub Issues for bug reports and feature requests
- Follow our GitHub Discussions for general questions
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- All our amazing contributors
- The open-source community
- The Swarm Corporation team
Made with ❤️ by The Swarm Corporation