Universal, extensible parser for many file types and websites with a clean plugin architecture.

panparsex

Pan-parse anything. A universal, extensible parser that normalizes content from files and websites into a single, clean schema.

Requires Python 3.9+. License: Apache 2.0.

Features

  • 🧩 Plugin Architecture: Add new parsers without touching core code
  • 📄 Comprehensive Support: Text, JSON, YAML, XML, HTML, PDF, CSV, DOCX, Markdown, RTF, Excel, PowerPoint, and more
  • 🌐 Web Scraping: Intelligent website crawling that respects robots.txt and can extract JavaScript
  • 🧠 Smart Detection: Auto-detection by MIME type, file extension, and content analysis
  • 🔁 Recursive Processing: Folder traversal and website crawling with configurable depth
  • 🧪 Clean Schema: Unified Pydantic-based output format for all content types
  • 🤖 AI-Powered Processing: Use OpenAI GPT to analyze, restructure, and filter parsed content
  • 🛠️ Zero Configuration: Works out of the box with sensible defaults
  • 🚀 High Performance: Optimized for speed and memory efficiency
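
The Smart Detection bullet names three signals: MIME type, file extension, and content analysis. The following is a minimal sketch of how such a cascade could be ordered, using only the standard library. The function `guess_content_type` and its sniffing rules are hypothetical illustrations, not panparsex's internal implementation.

```python
import mimetypes
from typing import Optional


def guess_content_type(name: str, head: bytes = b"", declared: Optional[str] = None) -> str:
    """Hypothetical cascade: declared MIME type, then extension, then content."""
    # 1. Trust an explicitly declared MIME type (e.g. an HTTP Content-Type header),
    #    dropping parameters such as "; charset=utf-8".
    if declared:
        return declared.split(";")[0].strip().lower()

    # 2. Fall back to the file extension.
    by_extension, _encoding = mimetypes.guess_type(name)
    if by_extension:
        return by_extension

    # 3. Last resort: crude sniffing of the first bytes of the content.
    sample = head.lstrip()
    if sample.startswith(b"%PDF-"):
        return "application/pdf"
    if sample.startswith((b"{", b"[")):
        return "application/json"
    if sample[:15].lower().startswith((b"<!doctype html", b"<html")):
        return "text/html"
    if sample.startswith(b"<?xml"):
        return "application/xml"
    return "text/plain"
```

The ordering is a design choice of this sketch: cheap, explicit signals come first, and byte sniffing is used only when nothing else is available.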

Installation

pip install panparsex

Quick Start

Command Line Interface

# Parse a single file
panparsex parse document.pdf

# Parse a website with recursive crawling
panparsex parse https://example.com --recursive --max-links 50 --max-depth 2

# Parse a directory recursively
panparsex parse ./documents --recursive --glob '**/*'

# Pretty-print output
panparsex parse document.html --pretty

# Parse with AI processing
panparsex parse document.pdf --ai-process --ai-output analysis.json

# Parse website with AI analysis
panparsex parse https://example.com --ai-process --ai-format markdown --ai-task "Extract key information and create summary"

Python API

from panparsex import parse

# Parse a file
doc = parse("document.pdf")
print(doc.meta.title)
print(doc.sections[0].chunks[0].text)

# Parse with AI processing
from panparsex.ai_processor import AIProcessor

processor = AIProcessor(api_key="your-openai-key")
result = processor.process_and_save(
    doc,
    "analysis.json",
    task="Analyze and restructure the content",
    output_format="structured_json"
)

# Parse a website
doc = parse("https://example.com", recursive=True, max_links=10)
for section in doc.sections:
    print(f"Section: {section.heading}")
    for chunk in section.chunks:
        print(f"  {chunk.text[:100]}...")

# Parse with an explicit content type
doc = parse("data.csv", content_type="text/csv")
print(doc.meta.extra["csv_data"]["headers"])

Supported File Types

Type        Extensions            Description
Text        .txt                  Plain text files
JSON        .json                 JSON documents with structured data
YAML        .yml, .yaml           YAML configuration files
XML         .xml                  XML documents
HTML        .html, .htm, .xhtml   HTML web pages with metadata extraction
PDF         .pdf                  PDF documents with page-by-page extraction
CSV         .csv                  Comma-separated values with header detection
Markdown    .md, .markdown        Markdown documents with structure preservation
Word        .docx                 Microsoft Word documents
Excel       .xlsx, .xls           Excel spreadsheets with sheet extraction
PowerPoint  .pptx                 PowerPoint presentations with slide extraction
RTF         .rtf                  Rich Text Format documents
Web         http://, https://     Websites with intelligent content extraction
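
When pre-filtering inputs in your own scripts, the table above can be expressed as a lookup. This is a sketch only: `SUPPORTED_EXTENSIONS` and `classify_target` are hypothetical names, not something panparsex exports, and the library's own detection may differ.

```python
from pathlib import Path

# Extensions copied from the table above; the mapping itself is an
# illustrative stand-in, not a constant exported by panparsex.
SUPPORTED_EXTENSIONS = {
    ".txt": "Text",
    ".json": "JSON",
    ".yml": "YAML", ".yaml": "YAML",
    ".xml": "XML",
    ".html": "HTML", ".htm": "HTML", ".xhtml": "HTML",
    ".pdf": "PDF",
    ".csv": "CSV",
    ".md": "Markdown", ".markdown": "Markdown",
    ".docx": "Word",
    ".xlsx": "Excel", ".xls": "Excel",
    ".pptx": "PowerPoint",
    ".rtf": "RTF",
}


def classify_target(target: str) -> str:
    """Return the table's type name for a path or URL, or "Unknown"."""
    # URLs are treated as "Web" regardless of any extension in the path.
    if target.lower().startswith(("http://", "https://")):
        return "Web"
    return SUPPORTED_EXTENSIONS.get(Path(target).suffix.lower(), "Unknown")
```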

Output Schema

All parsed content follows a unified schema:

class UnifiedDocument(BaseModel):
    schema_id: str = "panparsex/v1"
    meta: Metadata
    sections: List[Section]

class Metadata(BaseModel):
    source: str
    content_type: str
    title: Optional[str]
    url: Optional[str]
    path: Optional[str]
    extra: Dict[str, Any]

class Section(BaseModel):
    heading: Optional[str]
    chunks: List[Chunk]
    meta: Dict[str, Any]

class Chunk(BaseModel):
    text: str
    order: int
    meta: Dict[str, Any]
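
To show how the nesting of sections and chunks is typically consumed, here is a small sketch that flattens a document into plain text. It uses stand-in dataclasses that mirror the field names above so that it runs without panparsex or Pydantic installed. It also assumes that `order` gives a chunk's position within its section, which the schema does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


# Stand-in classes mirroring the schema's field names (not the library's models).
@dataclass
class Chunk:
    text: str
    order: int
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Section:
    heading: Optional[str] = None
    chunks: List[Chunk] = field(default_factory=list)
    meta: Dict[str, Any] = field(default_factory=dict)


def iter_text(sections: List[Section]) -> Iterator[str]:
    """Yield each heading followed by its chunk texts."""
    for section in sections:
        if section.heading:
            yield section.heading
        # Assumption: `order` is the position within the section, so sort on it
        # rather than relying on list position.
        for chunk in sorted(section.chunks, key=lambda c: c.order):
            yield chunk.text


sections = [
    Section(heading="Intro", chunks=[Chunk("second", 1), Chunk("first", 0)]),
    Section(chunks=[Chunk("untitled body", 0)]),
]
flat_text = "\n".join(iter_text(sections))
```

The same loop should work on a real parsed document by passing `doc.sections`, since it relies only on the `heading`, `chunks`, `text`, and `order` fields.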

Advanced Usage

Web Scraping with JavaScript

# Extract JavaScript content from websites
doc = parse("https://spa-website.com", extract_js=True)

# Find JavaScript sections
for section in doc.sections:
    if section.meta.get("type") == "javascript":
        print(f"JS from {section.meta['url']}: {section.chunks[0].text[:200]}...")

Custom Parser Registration

from panparsex import register_parser, ParserProtocol
from panparsex.types import UnifiedDocument, Metadata

class CustomParser(ParserProtocol):
    name = "custom"
    content_types = ("application/custom",)
    extensions = (".custom",)
    
    def can_parse(self, meta: Metadata) -> bool:
        return meta.content_type == "application/custom"
    
    def parse(self, target, meta: Metadata, recursive: bool = False, **kwargs) -> UnifiedDocument:
        # Your parsing logic here
        return UnifiedDocument(meta=meta, sections=[])

# Register the parser
register_parser(CustomParser())
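
For intuition about what a plugin registry does with a registered parser, the following standalone sketch shows one possible dispatch scheme. Everything in it is hypothetical: `SketchParser`, `ExtensionParser`, `register`, and `find_parser` are illustrative names, and `can_parse` takes a simplified `(content_type, path)` pair rather than the `Metadata` object used above. It is not panparsex's actual registry.

```python
from typing import List, Optional, Protocol, Tuple


class SketchParser(Protocol):
    """Simplified stand-in for a parser interface (hypothetical)."""

    name: str
    content_types: Tuple[str, ...]
    extensions: Tuple[str, ...]

    def can_parse(self, content_type: str, path: str) -> bool: ...


class ExtensionParser:
    """Matches on either the declared content type or the file extension."""

    def __init__(self, name: str, content_types: Tuple[str, ...], extensions: Tuple[str, ...]):
        self.name = name
        self.content_types = content_types
        self.extensions = extensions

    def can_parse(self, content_type: str, path: str) -> bool:
        return content_type in self.content_types or path.lower().endswith(self.extensions)


_registry: List[SketchParser] = []


def register(parser: SketchParser) -> None:
    # Design choice of this sketch: the most recently registered parser wins,
    # so a plugin can take precedence over an earlier, more general parser.
    _registry.insert(0, parser)


def find_parser(content_type: str, path: str) -> Optional[SketchParser]:
    """Return the first registered parser that claims the input, if any."""
    for parser in _registry:
        if parser.can_parse(content_type, path):
            return parser
    return None


register(ExtensionParser("text", ("text/plain",), (".txt",)))
register(ExtensionParser("custom", ("application/custom",), (".custom",)))
```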

Batch Processing

from pathlib import Path
from panparsex import parse

def process_directory(directory: str):
    """Process all files in a directory."""
    results = []
    
    for file_path in Path(directory).rglob("*"):
        if file_path.is_file():
            try:
                doc = parse(str(file_path))
                results.append({
                    "file": str(file_path),
                    "title": doc.meta.title,
                    "content_length": sum(len(chunk.text) for section in doc.sections for chunk in section.chunks),
                    "sections": len(doc.sections)
                })
            except Exception as e:
                print(f"Error processing {file_path}: {e}")
    
    return results

# Process a directory
results = process_directory("./documents")
for result in results:
    print(f"{result['file']}: {result['sections']} sections, {result['content_length']} chars")
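
For larger batches it can be convenient to persist the summaries instead of printing them. A minimal sketch, assuming records shaped like the dictionaries built by `process_directory`; `write_jsonl` is a hypothetical helper, not part of panparsex:

```python
import io
import json
from typing import Dict, Iterable, TextIO


def write_jsonl(results: Iterable[Dict], stream: TextIO) -> int:
    """Write one JSON object per line; return the number of records written."""
    count = 0
    for record in results:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# Sample records shaped like the dictionaries built by process_directory.
sample = [
    {"file": "docs/a.txt", "title": None, "content_length": 120, "sections": 1},
    {"file": "docs/b.pdf", "title": "Report", "content_length": 4800, "sections": 3},
]
buffer = io.StringIO()
written = write_jsonl(sample, buffer)
```

In a real script, replace the `StringIO` buffer with a file opened for writing, for example `open("results.jsonl", "w", encoding="utf-8")`.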

Configuration

Environment Variables

  • PANPARSEX_USER_AGENT: Custom user agent for web scraping
  • PANPARSEX_TIMEOUT: Request timeout in seconds (default: 15)
  • PANPARSEX_DELAY: Delay between requests in seconds (default: 0)
  • OPENAI_API_KEY: OpenAI API key for AI processing features
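
If you wrap panparsex in your own scripts, the crawl-related variables can be read with the documented defaults as shown below. This is a sketch of reading the variables, not of how panparsex consumes them internally. `CrawlSettings` and `load_settings` are hypothetical names, and treating an unset `PANPARSEX_USER_AGENT` as `None` is an assumption, since no default is documented.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class CrawlSettings:
    user_agent: Optional[str]
    timeout: float
    delay: float


def load_settings(env: Mapping[str, str] = os.environ) -> CrawlSettings:
    """Read the documented variables, falling back to the documented defaults."""
    return CrawlSettings(
        user_agent=env.get("PANPARSEX_USER_AGENT"),        # no documented default
        timeout=float(env.get("PANPARSEX_TIMEOUT", "15")),  # seconds, default 15
        delay=float(env.get("PANPARSEX_DELAY", "0")),       # seconds, default 0
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to test.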

CLI Options

panparsex parse [OPTIONS] TARGET

Options:
  --recursive              Enable recursive processing
  --glob TEXT              Glob pattern for directory processing
  --max-links INTEGER      Maximum links to follow (web scraping)
  --max-depth INTEGER      Maximum crawl depth (web scraping)
  --same-origin            Restrict crawling to same origin
  --pretty                 Pretty-print JSON output
  --ai-process             Process with AI after parsing
  --ai-task TEXT           AI task description
  --ai-format TEXT         AI output format (structured_json, markdown, summary)
  --ai-output TEXT         Output file for AI-processed result
  --openai-key TEXT        OpenAI API key
  --ai-model TEXT          OpenAI model to use (default: gpt-4o-mini)
  --ai-tokens INTEGER      Max tokens for AI response (default: 4000)
  --ai-temperature FLOAT   AI temperature 0.0-1.0 (default: 0.3)
  --help                   Show help message
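
When driving the CLI from another program, it is safer to assemble the arguments as a list than as a shell string. The sketch below covers a subset of the options listed above. `build_parse_command` is a hypothetical helper, and the target is placed right after `parse`, as in the Quick Start examples.

```python
import shlex
from typing import List, Optional


def build_parse_command(
    target: str,
    recursive: bool = False,
    max_links: Optional[int] = None,
    max_depth: Optional[int] = None,
    same_origin: bool = False,
    pretty: bool = False,
) -> List[str]:
    """Assemble an argument list using only flags from the option list above."""
    cmd = ["panparsex", "parse", target]
    if recursive:
        cmd.append("--recursive")
    if max_links is not None:
        cmd += ["--max-links", str(max_links)]
    if max_depth is not None:
        cmd += ["--max-depth", str(max_depth)]
    if same_origin:
        cmd.append("--same-origin")
    if pretty:
        cmd.append("--pretty")
    return cmd


cmd = build_parse_command("https://example.com", recursive=True, max_links=50, max_depth=2)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` without going through a shell.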

Examples

Extract Text from PDF

from panparsex import parse

doc = parse("report.pdf")
for section in doc.sections:
    print(f"Page {section.meta.get('page_number', 'Unknown')}:")
    # Guard against sections without chunks (e.g. a page with no extractable text)
    if section.chunks:
        print(section.chunks[0].text[:200] + "...")

Parse Excel Spreadsheet

from panparsex import parse

doc = parse("data.xlsx")
for section in doc.sections:
    if section.meta.get("type") == "sheet":
        print(f"Sheet: {section.meta['sheet_name']}")
        print(f"Rows: {section.meta['rows']}, Cols: {section.meta['cols']}")
        print(section.chunks[0].text[:300] + "...")

Scrape Website Content

from panparsex import parse

doc = parse("https://news-website.com", recursive=True, max_links=20, max_depth=2)

print(f"Crawled {doc.meta.extra['pages_parsed']} pages")
print(f"Unique domains: {doc.meta.extra['crawl_stats']['unique_domains']}")

for section in doc.sections:
    if section.meta.get("url"):
        print(f"\nFrom {section.meta['url']}:")
        print(f"Title: {section.heading}")
        print(f"Content: {section.chunks[0].text[:200]}...")
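
After a crawl it is often useful to group sections by the host they came from. The sketch below relies only on the `meta["url"]` key used in the example above. It runs on `SimpleNamespace` stand-ins so that it works without network access, and `group_by_host` is a hypothetical helper.

```python
from collections import defaultdict
from types import SimpleNamespace
from urllib.parse import urlparse


def group_by_host(sections):
    """Group sections by the host part of section.meta["url"]; skip sections without a URL."""
    grouped = defaultdict(list)
    for section in sections:
        url = section.meta.get("url")
        if url:
            grouped[urlparse(url).netloc].append(section)
    return dict(grouped)


# Stand-in sections exposing only the attributes the helper reads.
sample_sections = [
    SimpleNamespace(heading="Home", meta={"url": "https://example.com/"}),
    SimpleNamespace(heading="Blog", meta={"url": "https://blog.example.com/post"}),
    SimpleNamespace(heading="About", meta={"url": "https://example.com/about"}),
    SimpleNamespace(heading="No URL", meta={}),
]
by_host = group_by_host(sample_sections)
```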

AI-Powered Content Analysis

from panparsex import parse
from panparsex.ai_processor import AIProcessor

# Parse a document
doc = parse("business_report.pdf")

# Process with AI
processor = AIProcessor(api_key="your-openai-key")
result = processor.process_and_save(
    doc,
    "analysis.json",
    task="Extract key metrics, identify challenges, and provide recommendations",
    output_format="structured_json"
)

# The result will contain structured analysis
print("Summary:", result.get("summary"))
print("Key Topics:", result.get("key_topics"))
print("Recommendations:", result.get("recommendations"))

AI Processing with Custom Task

from panparsex import parse
from panparsex.ai_processor import AIProcessor

# Parse a website
doc = parse("https://company.com", recursive=True, max_links=10)

# Custom AI analysis
processor = AIProcessor(api_key="your-openai-key")
result = processor.process_and_save(
    doc,
    "company_analysis.md",
    task="Analyze the company's services, extract contact information, and identify key features",
    output_format="markdown"
)

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Adding New Parsers

  1. Create a new parser class implementing ParserProtocol
  2. Add it to the parsers/ directory
  3. Register it in the core module
  4. Add tests and documentation

Development Setup

git clone https://github.com/dhruvildarji/panparsex.git
cd panparsex
pip install -e ".[dev]"
pytest

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Changelog

v0.1.0 (2024-01-XX)

  • Initial release
  • Support for 13+ file types
  • Web scraping capabilities
  • Plugin architecture
  • Comprehensive test suite

Roadmap

  • OCR support for scanned documents
  • Audio/video transcription
  • Database connection parsing
  • Cloud storage integration
  • Advanced web scraping (Selenium support)
  • Content deduplication
  • Language detection
  • Sentiment analysis integration
