Universal, extensible parser for many file types and websites with a clean plugin architecture.

panparsex

Pan-parse anything. A universal, extensible parser that normalizes content from files and websites into a single, clean schema.

Requires Python 3.9+. License: Apache 2.0.

Features

  • 🧩 Plugin Architecture: Add new parsers without touching core code
  • 📄 Comprehensive Support: Text, JSON, YAML, XML, HTML, PDF, CSV, DOCX, Markdown, RTF, Excel, PowerPoint, and more
  • 🌐 Web Scraping: Website crawling that respects robots.txt and can extract JavaScript
  • 🧠 Smart Detection: Auto-detection by MIME type, file extension, and content analysis
  • 🔁 Recursive Processing: Folder traversal and website crawling with configurable depth
  • 🧪 Clean Schema: Unified Pydantic-based output format for all content types
  • 🛠️ Zero Configuration: Works out of the box with sensible defaults
  • 🚀 High Performance: Optimized for speed and memory efficiency

Installation

pip install panparsex

Quick Start

Command Line Interface

# Parse a single file
panparsex parse document.pdf

# Parse a website with recursive crawling
panparsex parse https://example.com --recursive --max-links 50 --max-depth 2

# Parse a directory recursively
panparsex parse ./documents --recursive --glob '**/*'

# Pretty-print output
panparsex parse document.html --pretty

Python API

from panparsex import parse

# Parse a file
doc = parse("document.pdf")
print(doc.meta.title)
print(doc.sections[0].chunks[0].text)

# Parse a website
doc = parse("https://example.com", recursive=True, max_links=10)
for section in doc.sections:
    print(f"Section: {section.heading}")
    for chunk in section.chunks:
        print(f"  {chunk.text[:100]}...")

# Parse with an explicit content type
doc = parse("data.csv", content_type="text/csv")
print(doc.meta.extra["csv_data"]["headers"])

Supported File Types

Type        Extensions             Description
Text        .txt                   Plain text files
JSON        .json                  JSON documents with structured data
YAML        .yml, .yaml            YAML configuration files
XML         .xml                   XML documents
HTML        .html, .htm, .xhtml    HTML web pages with metadata extraction
PDF         .pdf                   PDF documents with page-by-page extraction
CSV         .csv                   Comma-separated values with header detection
Markdown    .md, .markdown         Markdown documents with structure preservation
Word        .docx                  Microsoft Word documents
Excel       .xlsx, .xls            Excel spreadsheets with sheet extraction
PowerPoint  .pptx                  PowerPoint presentations with slide extraction
RTF         .rtf                   Rich Text Format documents
Web         http://, https://      Websites with intelligent content extraction
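
The feature list mentions auto-detection by MIME type, file extension, and content analysis. The sketch below shows one way such a fallback chain can work using only the standard library. It is not panparsex's actual detection code: the function name guess_content_type and the specific byte signatures checked are assumptions chosen for illustration.

```python
import mimetypes


def guess_content_type(name: str, head: bytes = b"") -> str:
    """Guess a MIME type from the file name, then from the first bytes."""
    # Step 1: extension lookup via the standard library's MIME table.
    guessed, _ = mimetypes.guess_type(name)
    if guessed:
        return guessed

    # Step 2: content sniffing for names with no usable extension.
    if head.startswith(b"%PDF-"):
        return "application/pdf"
    if head.startswith(b"PK\x03\x04"):
        # DOCX, XLSX and PPTX are ZIP containers; telling them apart
        # would require looking inside the archive.
        return "application/zip"
    lowered = head.lstrip()[:64].lower()
    if lowered[:1] in (b"{", b"["):
        return "application/json"
    if lowered.startswith(b"<?xml"):
        return "application/xml"
    if lowered.startswith(b"<!doctype html") or lowered.startswith(b"<html"):
        return "text/html"

    # Step 3: nothing matched, so treat the input as plain text.
    return "text/plain"


print(guess_content_type("report.pdf"))
print(guess_content_type("download", b"%PDF-1.7 ..."))
print(guess_content_type("payload", b'  {"key": "value"}'))
```

The result of a guess like this could be passed to parse through the content_type argument shown in the Quick Start when auto-detection picks the wrong parser.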

Output Schema

All parsed content follows a unified schema:

from typing import Any, Dict, List, Optional

from pydantic import BaseModel

class UnifiedDocument(BaseModel):
    schema_id: str = "panparsex/v1"
    meta: Metadata
    sections: List[Section]

class Metadata(BaseModel):
    source: str
    content_type: str
    title: Optional[str]
    url: Optional[str]
    path: Optional[str]
    extra: Dict[str, Any]

class Section(BaseModel):
    heading: Optional[str]
    chunks: List[Chunk]
    meta: Dict[str, Any]

class Chunk(BaseModel):
    text: str
    order: int
    meta: Dict[str, Any]
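
Because every parser returns the same shape, downstream code can be written once against sections and chunks. The sketch below flattens a document into plain text. To stay runnable without panparsex installed, it uses small dataclass stand-ins that mirror the field names above; they are not the library's classes, and treating order as the reading order within a section is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


# Stand-ins mirroring the schema's field names (not the real Pydantic models).
@dataclass
class Chunk:
    text: str
    order: int
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Section:
    heading: Optional[str] = None
    chunks: List[Chunk] = field(default_factory=list)
    meta: Dict[str, Any] = field(default_factory=dict)


def flatten_text(sections: List[Section]) -> str:
    """Join all chunk text into one string, headings on their own lines."""
    lines: List[str] = []
    for section in sections:
        if section.heading:
            lines.append(section.heading)
        # Assumption: `order` gives the reading order within a section.
        for chunk in sorted(section.chunks, key=lambda c: c.order):
            lines.append(chunk.text)
    return "\n".join(lines)


sections = [
    Section(heading="Intro", chunks=[Chunk("second", 1), Chunk("first", 0)]),
    Section(chunks=[Chunk("untitled body", 0)]),
]
print(flatten_text(sections))
```

Since the function only touches heading, chunks, order, and text, it should work unchanged on doc.sections from a real parse result, provided the returned objects expose those attributes as the schema describes.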

Advanced Usage

Web Scraping with JavaScript

from panparsex import parse

# Extract JavaScript content from websites
doc = parse("https://spa-website.com", extract_js=True)

# Find JavaScript sections
for section in doc.sections:
    if section.meta.get("type") == "javascript":
        print(f"JS from {section.meta['url']}: {section.chunks[0].text[:200]}...")

Custom Parser Registration

from panparsex import register_parser, ParserProtocol
from panparsex.types import UnifiedDocument, Metadata

class CustomParser(ParserProtocol):
    name = "custom"
    content_types = ("application/custom",)
    extensions = (".custom",)
    
    def can_parse(self, meta: Metadata) -> bool:
        return meta.content_type == "application/custom"
    
    def parse(self, target, meta: Metadata, recursive: bool = False, **kwargs) -> UnifiedDocument:
        # Your parsing logic here
        return UnifiedDocument(meta=meta, sections=[])

# Register the parser
register_parser(CustomParser())
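
This README does not describe how the core chooses among registered parsers. A common pattern for plugin systems like this is a registry that asks each parser's can_parse in turn. The sketch below models that pattern with stand-in types; Meta, register, and pick_parser are hypothetical names, and the "most recently registered wins" ordering is an assumption, not panparsex's documented behavior.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass
class Meta:  # hypothetical stand-in for panparsex.types.Metadata
    source: str
    content_type: str


class Parser(Protocol):  # hypothetical stand-in for ParserProtocol
    name: str

    def can_parse(self, meta: Meta) -> bool: ...


_REGISTRY: List[Parser] = []


def register(parser: Parser) -> None:
    """Add a parser; later registrations are consulted first (an assumption)."""
    _REGISTRY.insert(0, parser)


def pick_parser(meta: Meta) -> Optional[Parser]:
    """Return the first registered parser that accepts this metadata."""
    for parser in _REGISTRY:
        if parser.can_parse(meta):
            return parser
    return None


class TextFallback:
    name = "text"

    def can_parse(self, meta: Meta) -> bool:
        return True  # accepts anything, so it acts as a catch-all


class CustomParser:
    name = "custom"

    def can_parse(self, meta: Meta) -> bool:
        return meta.content_type == "application/custom"


register(TextFallback())
register(CustomParser())

print(pick_parser(Meta("a.custom", "application/custom")).name)
print(pick_parser(Meta("a.txt", "text/plain")).name)
```

With this ordering, a specific parser registered after a catch-all takes precedence, which is why keeping can_parse narrow (as in the CustomParser example above) matters.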

Batch Processing

from pathlib import Path
from panparsex import parse

def process_directory(directory: str):
    """Process all files in a directory."""
    results = []
    
    for file_path in Path(directory).rglob("*"):
        if file_path.is_file():
            try:
                doc = parse(str(file_path))
                results.append({
                    "file": str(file_path),
                    "title": doc.meta.title,
                    "content_length": sum(len(chunk.text) for section in doc.sections for chunk in section.chunks),
                    "sections": len(doc.sections)
                })
            except Exception as e:
                print(f"Error processing {file_path}: {e}")
    
    return results

# Process a directory
results = process_directory("./documents")
for result in results:
    print(f"{result['file']}: {result['sections']} sections, {result['content_length']} chars")
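
When a directory contains many unrelated files, it can help to filter by extension before calling parse. The helper below is a sketch, assuming the extensions in the Supported File Types table are the ones worth attempting; iter_supported_files and SUPPORTED_EXTENSIONS are names invented here. Note that panparsex may also handle other files through content detection, in which case this filter would skip them.

```python
import tempfile
from pathlib import Path
from typing import Iterator

# Extensions copied from the Supported File Types table.
SUPPORTED_EXTENSIONS = {
    ".txt", ".json", ".yml", ".yaml", ".xml", ".html", ".htm", ".xhtml",
    ".pdf", ".csv", ".md", ".markdown", ".docx", ".xlsx", ".xls", ".pptx",
    ".rtf",
}


def iter_supported_files(directory: str) -> Iterator[Path]:
    """Yield files under `directory` whose extension appears in the table."""
    for path in sorted(Path(directory).rglob("*")):
        # Compare case-insensitively so REPORT.PDF is treated like report.pdf.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
            yield path


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.pdf", "b.bin", "sub/c.MD"):
        target = Path(root, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("x")
    found = [p.name for p in iter_supported_files(root)]

print(found)
```

In process_directory, the loop over Path(directory).rglob("*") could be swapped for iter_supported_files(directory) to cut down on parse errors from unsupported files.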

Configuration

Environment Variables

  • PANPARSEX_USER_AGENT: Custom user agent for web scraping
  • PANPARSEX_TIMEOUT: Request timeout in seconds (default: 15)
  • PANPARSEX_DELAY: Delay between requests in seconds (default: 0)

CLI Options

panparsex parse [OPTIONS] TARGET

Options:
  --recursive              Enable recursive processing
  --glob TEXT              Glob pattern for directory processing
  --max-links INTEGER      Maximum links to follow (web scraping)
  --max-depth INTEGER      Maximum crawl depth (web scraping)
  --same-origin            Restrict crawling to same origin
  --pretty                 Pretty-print JSON output
  --help                   Show help message
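
When driving the CLI from a script, building the argument list programmatically avoids shell-quoting mistakes. The helper below is a sketch that uses only the flags listed above; build_parse_command is a name invented here, and it places options before TARGET to match the usage line.

```python
import shlex
from typing import List, Optional


def build_parse_command(
    target: str,
    recursive: bool = False,
    glob: Optional[str] = None,
    max_links: Optional[int] = None,
    max_depth: Optional[int] = None,
    same_origin: bool = False,
    pretty: bool = False,
) -> List[str]:
    """Build an argv list for `panparsex parse` from the documented options."""
    argv = ["panparsex", "parse"]
    if recursive:
        argv.append("--recursive")
    if glob is not None:
        argv += ["--glob", glob]
    if max_links is not None:
        argv += ["--max-links", str(max_links)]
    if max_depth is not None:
        argv += ["--max-depth", str(max_depth)]
    if same_origin:
        argv.append("--same-origin")
    if pretty:
        argv.append("--pretty")
    argv.append(target)
    return argv


argv = build_parse_command(
    "https://example.com", recursive=True, max_links=50, max_depth=2
)
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(argv))
```

The list can be handed to subprocess.run(argv, capture_output=True, text=True). Since --pretty is described as pretty-printing JSON output, the captured stdout can presumably be decoded with json.loads, though this README does not spell out the output stream or format.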

Examples

Extract Text from PDF

from panparsex import parse

doc = parse("report.pdf")
for section in doc.sections:
    print(f"Page {section.meta.get('page_number', 'Unknown')}:")
    print(section.chunks[0].text[:200] + "...")

Parse Excel Spreadsheet

from panparsex import parse

doc = parse("data.xlsx")
for section in doc.sections:
    if section.meta.get("type") == "sheet":
        print(f"Sheet: {section.meta['sheet_name']}")
        print(f"Rows: {section.meta['rows']}, Cols: {section.meta['cols']}")
        print(section.chunks[0].text[:300] + "...")

Scrape Website Content

from panparsex import parse

doc = parse("https://news-website.com", recursive=True, max_links=20, max_depth=2)

print(f"Crawled {doc.meta.extra['pages_parsed']} pages")
print(f"Unique domains: {doc.meta.extra['crawl_stats']['unique_domains']}")

for section in doc.sections:
    if section.meta.get("url"):
        print(f"\nFrom {section.meta['url']}:")
        print(f"Title: {section.heading}")
        print(f"Content: {section.chunks[0].text[:200]}...")
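
The feature list says crawling respects robots.txt. For readers unfamiliar with what that check looks like, here is a sketch using the standard library's urllib.robotparser on an in-memory robots.txt. It illustrates the general technique only; make_checker and the sample rules are invented here, and panparsex's internal handling may differ.

```python
from typing import Callable
from urllib.robotparser import RobotFileParser

# Sample rules for illustration; a crawler would fetch /robots.txt from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_txt: str, user_agent: str) -> Callable[[str], bool]:
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


allowed = make_checker(ROBOTS_TXT, "my-crawler")
print(allowed("https://example.com/articles/1"))
print(allowed("https://example.com/private/report"))
```

A crawler would call a check like this before each request and skip any URL for which it returns False.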

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Adding New Parsers

  1. Create a new parser class implementing ParserProtocol
  2. Add it to the parsers/ directory
  3. Register it in the core module
  4. Add tests and documentation

Development Setup

git clone https://github.com/dhruvildarji/panparsex.git
cd panparsex
pip install -e ".[dev]"
pytest

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Changelog

v0.1.0 (2024-01-XX)

  • Initial release
  • Support for 13+ file types
  • Web scraping capabilities
  • Plugin architecture
  • Comprehensive test suite

Roadmap

  • OCR support for scanned documents
  • Audio/video transcription
  • Database connection parsing
  • Cloud storage integration
  • Advanced web scraping (Selenium support)
  • Content deduplication
  • Language detection
  • Sentiment analysis integration
