A template for a private Python package
Path2Dream Processors
A powerful and extensible file processing library that converts various file types into clean, structured text using cutting-edge AI APIs. Perfect for building AI applications that need to process diverse content sources.
🚀 Features
Supported File Types
- 📄 Documents: PDF files using LlamaParse API (balanced mode)
- 🎵 Audio: MP3, WAV, M4A, FLAC using OpenAI Whisper API
- 🌐 URLs: Web pages using Jina Reader API (clean content extraction)
- 📹 Video: Coming soon
- 🖼️ Images: Coming soon
Key Benefits
- Async Processing: Full async/await support for concurrent file processing
- Clean Content Extraction: Automatically removes ads, cookie banners, and navigation elements from web pages
- High Accuracy: Uses industry-leading APIs (OpenAI Whisper, LlamaParse, Jina Reader)
- Robust Error Handling: Comprehensive error handling with descriptive messages
- Easy Integration: Simple interface with full type hints
- Comprehensive Testing: 95%+ test coverage with both unit and integration tests
📦 Installation
pip install path2dream-processors
🔧 Quick Start
Basic Usage
import asyncio
from path2dream_processors.file_parser import APIBasedFileParser

async def main():
    # Initialize the parser
    parser = APIBasedFileParser()
    # Process multiple files of different types concurrently
    files = [
        "presentation.pdf",  # Document
        "meeting_recording.mp3",  # Audio
        "https://example.com/article",  # Web page
    ]
    # Get clean text content (all files processed in parallel)
    result = await parser.parse_files(files)
    print(result)

# Run the async function
asyncio.run(main())
Environment Setup
Create a .env file with your API keys:
OPENAI_API_KEY=your_openai_api_key_here
LLAMA_CLOUD_API_KEY=your_llama_cloud_api_key_here
JINA_API_KEY=your_jina_api_key_here
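The parser presumably reads these keys from the process environment. How the .env file gets loaded (for example through a third-party loader such as python-dotenv) is not stated here. A minimal sketch of a startup check, assuming only the three variable names above; `missing_api_keys` is a hypothetical helper and not part of the package:

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "LLAMA_CLOUD_API_KEY", "JINA_API_KEY")


def missing_api_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required key names that are unset or blank."""
    source = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not source.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys: " + ", ".join(missing))
```

Running a check like this before creating the parser turns a missing key into one clear message instead of a failed API call later on.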
🎯 API Reference
FileParser Interface
The main interface for file processing:
from abc import ABC, abstractmethod
from typing import List

class FileParser(ABC):
    """Interface for parsing files to text representation."""

    @abstractmethod
    async def parse_files(self, file_paths: List[str]) -> str:
        """
        Convert files to text representation asynchronously.

        Args:
            file_paths: List of local file paths or URLs

        Returns:
            Text representation of file contents
        """
        pass
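Because callers depend only on this interface, a test double can stand in for the API-backed parser in unit tests. The sketch below redefines the interface locally so that it runs on its own; `StubFileParser` and `summarize` are hypothetical names used for illustration, not part of the package:

```python
import asyncio
from abc import ABC, abstractmethod
from typing import List


class FileParser(ABC):
    """Local copy of the interface above so the sketch is self-contained."""

    @abstractmethod
    async def parse_files(self, file_paths: List[str]) -> str:
        ...


class StubFileParser(FileParser):
    """Hypothetical test double: records calls and returns canned text."""

    def __init__(self, canned_text: str = "stub content") -> None:
        self.canned_text = canned_text
        self.calls: List[List[str]] = []

    async def parse_files(self, file_paths: List[str]) -> str:
        self.calls.append(list(file_paths))
        return self.canned_text


async def summarize(parser: FileParser, paths: List[str]) -> int:
    """Example consumer that depends only on the interface."""
    text = await parser.parse_files(paths)
    return len(text.split())
```

A test can then call `asyncio.run(summarize(StubFileParser("one two three"), ["a.pdf"]))` without any API keys or network access.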
APIBasedFileParser Implementation
Production-ready async file parser with full API integration:
class APIBasedFileParser:
    """Real file parser using APIs with async support."""

    async def parse_files(self, file_paths: List[str]) -> str:
        """Parse multiple files concurrently and return combined text representation."""

    async def _parse_audio(self, file_path: str) -> str:
        """Parse audio file using OpenAI Whisper API asynchronously."""

    async def _parse_document(self, file_path: str) -> str:
        """Parse PDF document using LlamaParse API."""

    async def _parse_url(self, file_path: str) -> str:
        """Parse URL content using Jina Reader API asynchronously."""

    def _get_file_type(self, file_path: str) -> FileType:
        """Determine file type by extension or URL pattern."""
File Type Detection
The parser automatically detects file types:
from path2dream_processors.file_parser import FileType
# Supported extensions
AUDIO_EXTENSIONS = {'.mp3', '.wav', '.m4a', '.flac', '.ogg', '.aac', '.wma'}
DOCUMENT_EXTENSIONS = {'.pdf', '.docx', '.txt', '.xlsx', ...}
# URLs starting with http:// or https://
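The detection rule described above (URL prefix first, then file extension) could look roughly like the sketch below. It is not the package's `_get_file_type`: the enum member names, the `UNKNOWN` fallback and the `detect_file_type` function are assumptions, and the document set contains only the extensions spelled out above.

```python
import os
from enum import Enum


class FileType(Enum):
    """Hypothetical stand-in for the package's FileType enum."""
    AUDIO = "audio"
    DOCUMENT = "document"
    URL = "url"
    UNKNOWN = "unknown"


AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".aac", ".wma"}
# Only the extensions listed explicitly above; the original set is longer.
DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".txt", ".xlsx"}


def detect_file_type(file_path: str) -> FileType:
    """Classify a path or URL by prefix first, then by extension."""
    if file_path.lower().startswith(("http://", "https://")):
        return FileType.URL
    extension = os.path.splitext(file_path)[1].lower()
    if extension in AUDIO_EXTENSIONS:
        return FileType.AUDIO
    if extension in DOCUMENT_EXTENSIONS:
        return FileType.DOCUMENT
    return FileType.UNKNOWN
```

Checking the URL prefix before the extension means a link such as `https://example.com/report.pdf` is treated as a web page rather than a local document. Whether the package makes the same choice is not stated here.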
📝 Detailed Examples
Processing Audio Files
import asyncio
from path2dream_processors.file_parser import APIBasedFileParser

async def process_audio():
    parser = APIBasedFileParser()
    # Transcribe audio to text
    result = await parser.parse_files(["interview.mp3"])
    print(result)
    # Output:
    # === PARSED FILE CONTENT ===
    #
    # File: interview.mp3
    # Audio transcription: This is the transcribed content from the audio file...
    # ----------------------------------------

asyncio.run(process_audio())
Processing Documents
import asyncio
from path2dream_processors.file_parser import APIBasedFileParser

async def process_document():
    parser = APIBasedFileParser()
    # Extract text from PDF with structure preservation
    result = await parser.parse_files(["report.pdf"])
    print(result)
    # Output:
    # === PARSED FILE CONTENT ===
    #
    # File: report.pdf
    # Document content: # Executive Summary
    #
    # This report analyzes...
    # ----------------------------------------

asyncio.run(process_document())
Processing Web Pages
import asyncio
from path2dream_processors.file_parser import APIBasedFileParser

async def process_webpage():
    parser = APIBasedFileParser()
    # Extract clean content from web pages
    result = await parser.parse_files(["https://example.com/article"])
    print(result)
    # Output:
    # === PARSED FILE CONTENT ===
    #
    # File: example.com
    # Web content from: Article Title
    #
    # Clean article content without ads or navigation...
    # ----------------------------------------

asyncio.run(process_webpage())
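The sample outputs above suggest a simple layout: one header line, then a `File: <name>` label per source, each section closed by a line of dashes. Assuming that layout holds for real output (this README does not specify it formally), the combined string can be split back into per-file sections. `split_parsed_output` is a hypothetical helper, not part of the package:

```python
import re
from typing import Dict

HEADER = "=== PARSED FILE CONTENT ==="
# A separator is assumed to be a line made only of ten or more dashes.
SEPARATOR = re.compile(r"^-{10,}[ \t]*$", re.MULTILINE)


def split_parsed_output(text: str) -> Dict[str, str]:
    """Map each 'File: <name>' label to the text that follows it."""
    sections: Dict[str, str] = {}
    body = text.replace(HEADER, "", 1)
    for chunk in SEPARATOR.split(body):
        chunk = chunk.strip()
        if not chunk.startswith("File: "):
            continue
        first_line, _, rest = chunk.partition("\n")
        sections[first_line[len("File: "):].strip()] = rest.strip()
    return sections
```

Two limitations of this sketch: sources that share a label overwrite each other, and a parsed document that itself contains a long line of dashes would be split in the wrong place.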
Mixed File Processing (Concurrent)
import asyncio
from path2dream_processors.file_parser import APIBasedFileParser

async def process_mixed_files():
    parser = APIBasedFileParser()
    # Process multiple file types concurrently
    files = [
        "audio.mp3",
        "document.pdf",
        "https://news.site/article",
    ]
    # All files are processed in parallel for faster execution
    result = await parser.parse_files(files)
    print(result)
    # Returns combined clean text from all sources

asyncio.run(process_mixed_files())
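To illustrate what processing in parallel means here, the sketch below uses `asyncio.gather` with a stand-in coroutine that sleeps instead of calling an API. Whether `parse_files` is built on `asyncio.gather` is an assumption; `fake_parse` and `parse_all` are hypothetical names:

```python
import asyncio
import time
from typing import List


async def fake_parse(path: str, delay: float = 0.1) -> str:
    """Stand-in for one API call; sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)
    return f"File: {path}"


async def parse_all(paths: List[str]) -> List[str]:
    """Start every parse at once; gather returns results in input order."""
    return await asyncio.gather(*(fake_parse(p) for p in paths))


async def main() -> float:
    """Return the wall-clock time taken to parse three inputs together."""
    started = time.perf_counter()
    await parse_all(["audio.mp3", "document.pdf", "https://news.site/article"])
    return time.perf_counter() - started
```

Because the three waits overlap, the elapsed time returned by `main()` should be close to one delay rather than the sum of all three, and the results keep the order of the input list.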
🔑 API Keys Setup
OpenAI API Key
- Visit OpenAI API Keys
- Create a new API key
- Add to your .env file as OPENAI_API_KEY
LlamaParse API Key
- Visit LlamaIndex Cloud
- Generate an API key
- Add to your .env file as LLAMA_CLOUD_API_KEY
Jina API Key
- Visit Jina AI
- Sign up and get your API key
- Add to your .env file as JINA_API_KEY
🧪 Testing
The package includes comprehensive testing with async support:
# Run all tests
pytest
# Run with coverage
pytest --cov=path2dream_processors
# Run only unit tests (fast)
pytest tests/ -k "not Integration"
# Run integration tests (requires API keys)
pytest tests/ -k "Integration"
🏗️ Architecture
Design Principles
- Async-First: Full async/await support for concurrent processing
- Interface Segregation: Clean abstract interface for easy testing and mocking
- Single Responsibility: Each parser method handles one file type
- Dependency Inversion: Relies on external APIs through well-defined interfaces
- Error Handling: Graceful degradation with informative error messages
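One way to get graceful degradation in a concurrent parser is to catch failures per file, so that one bad input becomes an error message in the output instead of aborting the whole batch. The sketch below shows that pattern under assumed names (`flaky_parse`, `parse_safely`) and is not the package's actual implementation:

```python
import asyncio
from typing import List


async def flaky_parse(path: str) -> str:
    """Hypothetical parser that fails for one kind of input."""
    if path.endswith(".bad"):
        raise ValueError("unsupported file")
    return f"parsed {path}"


async def parse_safely(path: str) -> str:
    """Turn any failure into an error string so other files still succeed."""
    try:
        return await flaky_parse(path)
    except Exception as exc:
        return f"Error processing {path}: {exc}"


async def parse_files(paths: List[str]) -> str:
    """Parse all paths concurrently and join the per-file results."""
    results = await asyncio.gather(*(parse_safely(p) for p in paths))
    return "\n".join(results)
```

With this shape, `asyncio.run(parse_files(["a.pdf", "b.bad"]))` still returns the text for `a.pdf`, followed by a descriptive error line for `b.bad`.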
Extension Points
To add support for new file types:
- Add file extensions to the appropriate constant
- Implement an async _parse_<type> method in APIBasedFileParser
- Add the corresponding FileType enum value
- Update the _get_file_type method
- Add comprehensive tests with async support
Example: Adding Image Support
import aiohttp

async def _parse_image(self, file_path: str) -> str:
    """Parse image file using a vision API."""
    try:
        async with aiohttp.ClientSession() as session:
            # Call your vision API with `session` here and extract the
            # description from its response.
            description = ""  # placeholder until the API call is implemented
        return f"Image description: {description}"
    except Exception as e:
        return f"Error processing image: {e}"
🤝 Contributing
We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch
- Add tests for your changes (including async tests)
- Ensure all tests pass
- Submit a pull request
Development Setup
# Clone the repository
git clone https://github.com/your-username/path2dream_processors.git
cd path2dream_processors
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
ruff check .
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🆘 Support
- 📚 Documentation: Check this README and inline code documentation
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
🎯 Roadmap
- Image processing with vision APIs
- Video processing with audio extraction
- Batch processing optimization
- Custom output format support
- Streaming API support
- Plugin architecture for custom processors