A template for a private Python package

These details have not been verified by PyPI

Project links

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Path2Dream Processors

An extensible file processing library that converts PDFs, audio files, and web pages into clean, structured text using third-party AI APIs (LlamaParse, OpenAI Whisper, and Jina Reader). It is intended for AI applications that need to ingest content from mixed sources.

🚀 Features

Supported File Types

📄 Documents: PDF files using LlamaParse API (balanced mode)
🎵 Audio: MP3, WAV, M4A, FLAC using OpenAI Whisper API
🌐 URLs: Web pages using Jina Reader API (clean content extraction)
📹 Video: Coming soon
🖼️ Images: Coming soon

Key Benefits

Async Processing: Full async/await support for concurrent file processing
Clean Content Extraction: Automatically removes ads, cookie banners, and navigation elements from web pages
High Accuracy: Uses industry-leading APIs (OpenAI Whisper, LlamaParse, Jina Reader)
Robust Error Handling: Comprehensive error handling with descriptive messages
Easy Integration: Simple interface with full type hints
Comprehensive Testing: 95%+ test coverage with both unit and integration tests

📦 Installation

pip install path2dream-processors

🔧 Quick Start

Basic Usage

import asyncio
from path2dream_processors.file_parser import APIBasedFileParser

async def main():
    # Initialize the parser
    parser = APIBasedFileParser()

    # Process multiple files of different types concurrently
    files = [
        "presentation.pdf",           # Document
        "meeting_recording.mp3",      # Audio
        "https://example.com/article" # Web page
    ]

    # Get clean text content (all files processed in parallel)
    result = await parser.parse_files(files)
    print(result)

# Run the async function
asyncio.run(main())

Environment Setup

Create a .env file with your API keys:

OPENAI_API_KEY=your_openai_api_key_here
LLAMA_CLOUD_API_KEY=your_llama_cloud_api_key_here  
JINA_API_KEY=your_jina_api_key_here
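
A missing key tends to surface only when the first API call fails, so it can help to check the environment up front. The sketch below uses only the standard library and assumes the three variable names listed above; `missing_api_keys` is a hypothetical helper, not part of the package. This README does not say how the package reads the `.env` file, so the check looks at the process environment (or any mapping you pass in).

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "LLAMA_CLOUD_API_KEY", "JINA_API_KEY")


def missing_api_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required keys that are unset or blank in `env`.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

Calling `missing_api_keys()` before constructing the parser gives an early, readable list of what still needs to be configured.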

🎯 API Reference

FileParser Interface

The main interface for file processing:

from abc import ABC, abstractmethod
from typing import List

class FileParser(ABC):
    """Interface for parsing files to text representation."""
    
    @abstractmethod
    async def parse_files(self, file_paths: List[str]) -> str:
        """
        Convert files to text representation asynchronously.
        
        Args:
            file_paths: List of local file paths or URLs
            
        Returns:
            Text representation of file contents
        """
        pass
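
Any class that implements `parse_files` satisfies this interface. As an illustration only, the sketch below redefines the interface locally and adds a hypothetical `PlainTextFileParser` that reads local text files without calling any API. It is not part of the package, and it assumes Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List


class FileParser(ABC):
    """Local copy of the interface shown above, so the sketch is self-contained."""

    @abstractmethod
    async def parse_files(self, file_paths: List[str]) -> str:
        ...


class PlainTextFileParser(FileParser):
    """Hypothetical parser: returns the contents of local UTF-8 text files."""

    async def parse_files(self, file_paths: List[str]) -> str:
        # Read each file in a worker thread so the event loop is not blocked.
        texts = await asyncio.gather(
            *(asyncio.to_thread(Path(p).read_text, encoding="utf-8") for p in file_paths)
        )
        return "\n".join(texts)
```

Code written against `FileParser` can then accept either this class or `APIBasedFileParser`, provided the latter exposes the same `parse_files` signature.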

APIBasedFileParser Implementation

Production-ready async file parser with full API integration:

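from typing import List

from path2dream_processors.file_parser import FileType

# Signature outline only; method bodies are omitted.
class APIBasedFileParser: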
    """Real file parser using APIs with async support."""
    
    async def parse_files(self, file_paths: List[str]) -> str:
        """Parse multiple files concurrently and return combined text representation."""
        
    async def _parse_audio(self, file_path: str) -> str:
        """Parse audio file using OpenAI Whisper API asynchronously."""
        
    async def _parse_document(self, file_path: str) -> str:
        """Parse PDF document using LlamaParse API."""
        
    async def _parse_url(self, file_path: str) -> str:
        """Parse URL content using Jina Reader API asynchronously."""
        
    def _get_file_type(self, file_path: str) -> FileType:
        """Determine file type by extension or URL pattern."""
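
The outline says `parse_files` handles its inputs concurrently but does not show how. One common way to get that behavior is `asyncio.gather`, sketched below with a stand-in coroutine. `fake_parse` and `parse_all` are hypothetical names, and the package's actual implementation may differ.

```python
import asyncio
from typing import List


async def fake_parse(file_path: str, delay: float = 0.01) -> str:
    """Stand-in for an API call: waits briefly, then returns a labelled result."""
    await asyncio.sleep(delay)
    return f"File: {file_path}"


async def parse_all(file_paths: List[str]) -> str:
    """Run one coroutine per input concurrently and join the results."""
    # gather() returns results in input order, regardless of completion order.
    results = await asyncio.gather(*(fake_parse(p) for p in file_paths))
    return "\n".join(results)
```

Because the waits overlap, the total time is close to the slowest single call rather than the sum of all calls.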

File Type Detection

The parser automatically detects file types:

from path2dream_processors.file_parser import FileType

# Supported extensions
AUDIO_EXTENSIONS = {'.mp3', '.wav', '.m4a', '.flac', '.ogg', '.aac', '.wma'}
DOCUMENT_EXTENSIONS = {'.pdf', '.docx', '.txt', '.xlsx', ...}
# URLs starting with http:// or https://
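
To make the detection rule concrete, here is a minimal sketch of classifying an input by URL prefix and then by extension. The `FileType` members and the `get_file_type` function are assumptions for illustration (the package's real enum and `_get_file_type` method are not reproduced in this README), and the elided document extensions are left out rather than guessed.

```python
from enum import Enum
from pathlib import Path

# Extension sets copied from the listing above (elided entries omitted).
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".aac", ".wma"}
DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".txt", ".xlsx"}


class FileType(Enum):
    """Hypothetical stand-in for the package's FileType enum."""
    AUDIO = "audio"
    DOCUMENT = "document"
    URL = "url"
    UNKNOWN = "unknown"


def get_file_type(file_path: str) -> FileType:
    """Classify a path or URL by prefix first, then by file extension."""
    if file_path.lower().startswith(("http://", "https://")):
        return FileType.URL
    suffix = Path(file_path).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return FileType.AUDIO
    if suffix in DOCUMENT_EXTENSIONS:
        return FileType.DOCUMENT
    return FileType.UNKNOWN
```

Checking the URL prefix first means a link such as `https://example.com/a.pdf` is treated as a web page rather than as a local document.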

📝 Detailed Examples

Processing Audio Files

import asyncio
from path2dream_processors.file_parser import APIBasedFileParser

async def process_audio():
    parser = APIBasedFileParser()

    # Transcribe audio to text
    result = await parser.parse_files(["interview.mp3"])
    print(result)
    # Output:
    # === PARSED FILE CONTENT ===
    # 
    # File: interview.mp3
    # Audio transcription: This is the transcribed content from the audio file...
    # ----------------------------------------

asyncio.run(process_audio())

Processing Documents

import asyncio
from path2dream_processors.file_parser import APIBasedFileParser

async def process_document():
    parser = APIBasedFileParser()
    
    # Extract text from PDF with structure preservation
    result = await parser.parse_files(["report.pdf"])
    print(result)
    # Output:
    # === PARSED FILE CONTENT ===
    # 
    # File: report.pdf  
    # Document content: # Executive Summary
    # 
    # This report analyzes...
    # ----------------------------------------

asyncio.run(process_document())

Processing Web Pages

import asyncio
from path2dream_processors.file_parser import APIBasedFileParser

async def process_webpage():
    parser = APIBasedFileParser()
    
    # Extract clean content from web pages
    result = await parser.parse_files(["https://example.com/article"])
    print(result)
    # Output:
    # === PARSED FILE CONTENT ===
    # 
    # File: example.com
    # Web content from: Article Title
    # 
    # Clean article content without ads or navigation...
    # ----------------------------------------

asyncio.run(process_webpage())

Mixed File Processing (Concurrent)

import asyncio
from path2dream_processors.file_parser import APIBasedFileParser

async def process_mixed_files():
    parser = APIBasedFileParser()
    
    # Process multiple file types concurrently
    files = [
        "audio.mp3",
        "document.pdf", 
        "https://news.site/article"
    ]

    # All files are processed in parallel for faster execution
    result = await parser.parse_files(files)
    print(result)
    # Returns combined clean text from all sources

asyncio.run(process_mixed_files())
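
Since `parse_files` returns one combined string, downstream code may want the text for each input separately. The sketch below splits the result back into per-file sections. It assumes the layout shown in the example output comments (a `File: <name>` line per input and a line of dashes after each section); `split_sections` is a hypothetical helper, and the real output format may differ in detail.

```python
import re
from typing import Dict


def split_sections(combined: str) -> Dict[str, str]:
    """Map each 'File: <name>' header to the text that follows it."""
    sections: Dict[str, str] = {}
    # Split on lines made only of dashes (the separator shown in the examples).
    for chunk in re.split(r"^-{10,}[ \t]*$", combined, flags=re.MULTILINE):
        match = re.search(r"^File: (.+)$", chunk, flags=re.MULTILINE)
        if match is None:
            continue  # Header-only or empty chunk.
        name = match.group(1).strip()
        sections[name] = chunk[match.end():].strip()
    return sections
```

If two inputs share the same name, the later section overwrites the earlier one in this sketch.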

🔑 API Keys Setup

OpenAI API Key

Visit OpenAI API Keys
Create a new API key
Add to your .env file as OPENAI_API_KEY

LlamaParse API Key

Visit LlamaIndex Cloud
Generate an API key
Add to your .env file as LLAMA_CLOUD_API_KEY

Jina API Key

Visit Jina AI
Sign up and get your API key
Add to your .env file as JINA_API_KEY

🧪 Testing

The package includes comprehensive testing with async support:

# Run all tests
pytest

# Run with coverage
pytest --cov=path2dream_processors

# Run only unit tests (fast)
pytest tests/ -k "not Integration"

# Run integration tests (requires API keys)
pytest tests/ -k "Integration"
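
Unit tests can avoid real API calls by replacing the parser with a mock. The sketch below uses `unittest.mock.AsyncMock` from the standard library; `count_words` is a hypothetical function under test, and the package's own test suite may be organized differently.

```python
import asyncio
from typing import List
from unittest.mock import AsyncMock


async def count_words(parser, file_paths: List[str]) -> int:
    """Hypothetical code under test: counts words in the parsed text."""
    text = await parser.parse_files(file_paths)
    return len(text.split())


def test_count_words_uses_parser() -> None:
    # The mock stands in for a real parser, so no API keys are needed.
    parser = AsyncMock()
    parser.parse_files.return_value = "three short words"

    assert asyncio.run(count_words(parser, ["notes.pdf"])) == 3
    parser.parse_files.assert_awaited_once_with(["notes.pdf"])
```

Because the test name does not contain `Integration`, the `-k "not Integration"` filter above would select it as a unit test.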

🏗️ Architecture

Design Principles

Async-First: Full async/await support for concurrent processing
Interface Segregation: Clean abstract interface for easy testing and mocking
Single Responsibility: Each parser method handles one file type
Dependency Inversion: Relies on external APIs through well-defined interfaces
Error Handling: Graceful degradation with informative error messages
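
As a rough illustration of graceful degradation, the sketch below converts a failure on one input into a descriptive message so the rest of the batch still completes. `parse_safely` and `parse_batch` are hypothetical names, and the message format is an assumption, not the package's actual wording.

```python
import asyncio
from typing import Awaitable, Callable, List

ParseFn = Callable[[str], Awaitable[str]]


async def parse_safely(parse: ParseFn, file_path: str) -> str:
    """Run one parse call and convert any failure into a descriptive message."""
    try:
        return await parse(file_path)
    except Exception as exc:  # Broad on purpose: one bad input should not abort the batch.
        return f"Error processing {file_path}: {exc}"


async def parse_batch(parse: ParseFn, file_paths: List[str]) -> List[str]:
    """Parse all inputs concurrently; failures appear as messages in the result."""
    return list(await asyncio.gather(*(parse_safely(parse, p) for p in file_paths)))
```

Returning the error as text keeps the output shape uniform, at the cost of callers having to inspect the string to detect a failure.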

Extension Points

To add support for new file types:

Add file extensions to the appropriate constant
Implement async _parse_<type> method in APIBasedFileParser
Add corresponding FileType enum value
Update _get_file_type method
Add comprehensive tests with async support

Example: Adding Image Support

import aiohttp

async def _parse_image(self, file_path: str) -> str:
    """Parse image file using vision API."""
    try:
        # Your async image processing logic here
        async with aiohttp.ClientSession() as session:
            # Placeholder: call your vision API through `session`
            # and store the returned text in `description`.
            description = ""
        return f"Image description: {description}"
    except Exception as e:
        return f"Error processing image: {e}"

🤝 Contributing

We welcome contributions! Please see our contributing guidelines:

Fork the repository
Create a feature branch
Add tests for your changes (including async tests)
Ensure all tests pass
Submit a pull request

Development Setup

# Clone the repository
git clone https://github.com/your-username/path2dream_processors.git
cd path2dream_processors

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check .

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

📚 Documentation: Check this README and inline code documentation
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions

🎯 Roadmap

Image processing with vision APIs
Video processing with audio extraction
Batch processing optimization
Custom output format support
Streaming API support
Plugin architecture for custom processors

Release history

- 0.1.19: Jun 2, 2025 (this version)
- 0.1.18: Jun 1, 2025
- 0.1.17: Jun 1, 2025
- 0.1.16: Jun 1, 2025
- 0.1.15: Jun 1, 2025

Download files

Source Distribution

path2dream_processors-0.1.18.tar.gz (12.3 kB)

Uploaded Jun 1, 2025 Source

Built Distribution

path2dream_processors-0.1.18-py3-none-any.whl (8.9 kB)

Uploaded Jun 1, 2025 Python 3

File details

Details for the file path2dream_processors-0.1.18.tar.gz.

File metadata

Download URL: path2dream_processors-0.1.18.tar.gz
Upload date: Jun 1, 2025
Size: 12.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.12

File hashes

Hashes for path2dream_processors-0.1.18.tar.gz
Algorithm	Hash digest
SHA256	`361389ee93114481a87aa99b53c125012ee8d822e868bb186252b22d870837d1`
MD5	`3d2e3fd4319f444e5256d1de5e145b2f`
BLAKE2b-256	`94b18b418e1a182fa8e656b88fffb03b239207efdf434e9ff75fe222c3af8f37`

File details

Details for the file path2dream_processors-0.1.18-py3-none-any.whl.

File metadata

Download URL: path2dream_processors-0.1.18-py3-none-any.whl
Upload date: Jun 1, 2025
Size: 8.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.12

File hashes

Hashes for path2dream_processors-0.1.18-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3e66afe319ea8dec9fac59674db5d2b0a5cf909a42466a14520330ae48ca71ea`
MD5	`3a52c067bcb5eec38a26f1b4497240a1`
BLAKE2b-256	`2c55fd2c0623e5d487e595372c7514cfd69d98fa30e9e1c01b9bc2c4629fcf6e`
