Universal, extensible parser for many file types and websites with a clean plugin architecture.

panparsex

Pan-parse anything. A universal, extensible parser that normalizes content from files and websites into a single, clean schema.

Features

🧩 Plugin Architecture: Add new parsers without touching core code
📄 Comprehensive Support: Text, JSON, YAML, XML, HTML, PDF, CSV, DOCX, Markdown, RTF, Excel, PowerPoint, and more
🌐 Web Scraping: Intelligent website crawling that respects robots.txt and can extract JavaScript content
🧠 Smart Detection: Auto-detection by MIME type, file extension, and content analysis
🔁 Recursive Processing: Folder traversal and website crawling with configurable depth
🧪 Clean Schema: Unified Pydantic-based output format for all content types
🛠️ Zero Configuration: Works out of the box with sensible defaults
🚀 High Performance: Optimized for speed and memory efficiency

Installation

pip install panparsex

Quick Start

Command Line Interface

# Parse a single file
panparsex parse document.pdf

# Parse a website with recursive crawling
panparsex parse https://example.com --recursive --max-links 50 --max-depth 2

# Parse a directory recursively
panparsex parse ./documents --recursive --glob '**/*'

# Pretty-print output
panparsex parse document.html --pretty

Python API

from panparsex import parse

# Parse a file
doc = parse("document.pdf")
print(doc.meta.title)
print(doc.sections[0].chunks[0].text)

# Parse a website
doc = parse("https://example.com", recursive=True, max_links=10)
for section in doc.sections:
    print(f"Section: {section.heading}")
    for chunk in section.chunks:
        print(f"  {chunk.text[:100]}...")

# Parse with an explicit content type instead of auto-detection
doc = parse("data.csv", content_type="text/csv")
print(doc.meta.extra["csv_data"]["headers"])

Supported File Types

Type	Extensions	Description
Text	`.txt`	Plain text files
JSON	`.json`	JSON documents with structured data
YAML	`.yml`, `.yaml`	YAML configuration files
XML	`.xml`	XML documents
HTML	`.html`, `.htm`, `.xhtml`	HTML web pages with metadata extraction
PDF	`.pdf`	PDF documents with page-by-page extraction
CSV	`.csv`	Comma-separated values with header detection
Markdown	`.md`, `.markdown`	Markdown documents with structure preservation
Word	`.docx`	Microsoft Word documents
Excel	`.xlsx`, `.xls`	Excel spreadsheets with sheet extraction
PowerPoint	`.pptx`	PowerPoint presentations with slide extraction
RTF	`.rtf`	Rich Text Format documents
Web	`http://`, `https://`	Websites with intelligent content extraction
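
As a rough illustration of the extension column above, the sketch below maps a path or URL to the type name used in the table. It is not panparsex's detection code: the `EXTENSION_TYPES` dictionary and the `guess_type` helper are hypothetical names introduced here, and the real library also considers MIME type and content analysis, which this sketch ignores.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extension-to-type mapping copied from the table above (illustrative only).
EXTENSION_TYPES = {
    ".txt": "Text",
    ".json": "JSON",
    ".yml": "YAML",
    ".yaml": "YAML",
    ".xml": "XML",
    ".html": "HTML",
    ".htm": "HTML",
    ".xhtml": "HTML",
    ".pdf": "PDF",
    ".csv": "CSV",
    ".md": "Markdown",
    ".markdown": "Markdown",
    ".docx": "Word",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".pptx": "PowerPoint",
    ".rtf": "RTF",
}


def guess_type(target: str) -> Optional[str]:
    """Return the table's type name for a path or URL, or None if unknown."""
    # URLs are treated as websites regardless of any extension in the path.
    if urlparse(target).scheme in ("http", "https"):
        return "Web"
    # Compare extensions case-insensitively, so "report.PDF" matches ".pdf".
    suffix = PurePosixPath(target).suffix.lower()
    return EXTENSION_TYPES.get(suffix)


for target in ("report.PDF", "https://example.com", "notes.unknown"):
    print(target, "->", guess_type(target))
```

A helper like this can be handy for filtering a directory listing before calling `parse`, so that files with unrecognized extensions are skipped up front.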

Output Schema

All parsed content follows a unified schema:

class UnifiedDocument(BaseModel):
    schema_id: str = "panparsex/v1"
    meta: Metadata
    sections: List[Section]

class Metadata(BaseModel):
    source: str
    content_type: str
    title: Optional[str]
    url: Optional[str]
    path: Optional[str]
    extra: Dict[str, Any]

class Section(BaseModel):
    heading: Optional[str]
    chunks: List[Chunk]
    meta: Dict[str, Any]

class Chunk(BaseModel):
    text: str
    order: int
    meta: Dict[str, Any]
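
To show how the nesting of sections and chunks can be consumed, here is a minimal sketch that flattens a list of sections into plain text. So that it runs without panparsex installed, it uses dataclass stand-ins that mirror only the `Section` and `Chunk` fields listed above; they are not the library's Pydantic models. The `flatten_sections` helper is a hypothetical name, and sorting by `order` within each section is an assumption about what that field means.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


# Stand-ins mirroring the documented fields; NOT the real panparsex classes.
@dataclass
class Chunk:
    text: str
    order: int
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Section:
    heading: Optional[str] = None
    chunks: List[Chunk] = field(default_factory=list)
    meta: Dict[str, Any] = field(default_factory=dict)


def flatten_sections(sections: List[Section]) -> str:
    """Join headings and chunk text into one newline-separated string."""
    lines: List[str] = []
    for section in sections:
        if section.heading:
            lines.append(section.heading)
        # Assumption: `order` gives the chunk's position within its section.
        for chunk in sorted(section.chunks, key=lambda c: c.order):
            lines.append(chunk.text)
    return "\n".join(lines)


sections = [
    Section(heading="Intro", chunks=[Chunk("second", 1), Chunk("first", 0)]),
    Section(chunks=[Chunk("untitled body", 0)]),
]
print(flatten_sections(sections))
```

Because the helper only reads `heading`, `chunks`, `text`, and `order`, it should also accept `doc.sections` from a parsed document, provided the real objects expose those attributes as the schema above suggests.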

Advanced Usage

Web Scraping with JavaScript

# Extract JavaScript content from websites
doc = parse("https://spa-website.com", extract_js=True)

# Find JavaScript sections
for section in doc.sections:
    if section.meta.get("type") == "javascript":
        print(f"JS from {section.meta['url']}: {section.chunks[0].text[:200]}...")

Custom Parser Registration

from panparsex import register_parser, ParserProtocol
from panparsex.types import UnifiedDocument, Metadata

class CustomParser(ParserProtocol):
    name = "custom"
    content_types = ("application/custom",)
    extensions = (".custom",)
    
    def can_parse(self, meta: Metadata) -> bool:
        return meta.content_type == "application/custom"
    
    def parse(self, target, meta: Metadata, recursive: bool = False, **kwargs) -> UnifiedDocument:
        # Your parsing logic here
        return UnifiedDocument(meta=meta, sections=[])

# Register the parser
register_parser(CustomParser())

Batch Processing

from pathlib import Path
from panparsex import parse

def process_directory(directory: str):
    """Process all files in a directory."""
    results = []
    
    for file_path in Path(directory).rglob("*"):
        if file_path.is_file():
            try:
                doc = parse(str(file_path))
                results.append({
                    "file": str(file_path),
                    "title": doc.meta.title,
                    "content_length": sum(len(chunk.text) for section in doc.sections for chunk in section.chunks),
                    "sections": len(doc.sections)
                })
            except Exception as e:
                print(f"Error processing {file_path}: {e}")
    
    return results

# Process a directory
results = process_directory("./documents")
for result in results:
    print(f"{result['file']}: {result['sections']} sections, {result['content_length']} chars")

Configuration

Environment Variables

PANPARSEX_USER_AGENT: Custom user agent for web scraping
PANPARSEX_TIMEOUT: Request timeout in seconds (default: 15)
PANPARSEX_DELAY: Delay between requests in seconds (default: 0)
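
The sketch below illustrates how these three variables and their documented defaults could be read from the environment. It is an assumption about the general pattern, not panparsex's actual implementation: `read_settings` and `CrawlSettings` are hypothetical names, and the default user agent is shown as `None` because the README does not state one.

```python
import os
from typing import Mapping, Optional, TypedDict


class CrawlSettings(TypedDict):
    user_agent: Optional[str]
    timeout: float
    delay: float


def read_settings(env: Mapping[str, str] = os.environ) -> CrawlSettings:
    """Read the documented variables, falling back to the documented defaults."""
    return {
        # No default user agent is documented, so None means "not overridden".
        "user_agent": env.get("PANPARSEX_USER_AGENT"),
        # Documented default: 15 seconds.
        "timeout": float(env.get("PANPARSEX_TIMEOUT", "15")),
        # Documented default: no delay between requests.
        "delay": float(env.get("PANPARSEX_DELAY", "0")),
    }


# Passing a plain dict makes the behavior easy to check without touching
# the real process environment.
print(read_settings({"PANPARSEX_TIMEOUT": "30", "PANPARSEX_DELAY": "0.5"}))
```

In practice the variables would be exported in the shell, or set through `os.environ`, before the crawl starts; when exactly the library reads them is not specified here.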

CLI Options

panparsex parse [OPTIONS] TARGET

Options:
  --recursive              Enable recursive processing
  --glob TEXT              Glob pattern for directory processing
  --max-links INTEGER      Maximum links to follow (web scraping)
  --max-depth INTEGER      Maximum crawl depth (web scraping)
  --same-origin            Restrict crawling to same origin
  --pretty                 Pretty-print JSON output
  --help                   Show help message
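
When the CLI is driven from a script, it can help to assemble the argument list programmatically. The following sketch builds an argument vector from the options listed above; `build_parse_command` is a hypothetical helper, not part of panparsex, and it uses only the flags shown in this section.

```python
import shlex
from typing import List, Optional


def build_parse_command(
    target: str,
    recursive: bool = False,
    glob: Optional[str] = None,
    max_links: Optional[int] = None,
    max_depth: Optional[int] = None,
    same_origin: bool = False,
    pretty: bool = False,
) -> List[str]:
    """Build an argv list for `panparsex parse` from the documented options."""
    cmd = ["panparsex", "parse"]
    if recursive:
        cmd.append("--recursive")
    if glob is not None:
        cmd += ["--glob", glob]
    if max_links is not None:
        cmd += ["--max-links", str(max_links)]
    if max_depth is not None:
        cmd += ["--max-depth", str(max_depth)]
    if same_origin:
        cmd.append("--same-origin")
    if pretty:
        cmd.append("--pretty")
    # Options first, then the positional TARGET, matching the usage line.
    cmd.append(target)
    return cmd


cmd = build_parse_command(
    "https://example.com", recursive=True, max_links=50, max_depth=2
)
# shlex.join quotes arguments where needed, for logging or copy-pasting.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`. That the command writes JSON to standard output is an assumption based on the `--pretty` description, so check the actual output before parsing it with `json.loads`.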

Examples

Extract Text from PDF

from panparsex import parse

doc = parse("report.pdf")
for section in doc.sections:
    print(f"Page {section.meta.get('page_number', 'Unknown')}:")
    print(section.chunks[0].text[:200] + "...")

Parse Excel Spreadsheet

from panparsex import parse

doc = parse("data.xlsx")
for section in doc.sections:
    if section.meta.get("type") == "sheet":
        print(f"Sheet: {section.meta['sheet_name']}")
        print(f"Rows: {section.meta['rows']}, Cols: {section.meta['cols']}")
        print(section.chunks[0].text[:300] + "...")

Scrape Website Content

from panparsex import parse

doc = parse("https://news-website.com", recursive=True, max_links=20, max_depth=2)

print(f"Crawled {doc.meta.extra['pages_parsed']} pages")
print(f"Unique domains: {doc.meta.extra['crawl_stats']['unique_domains']}")

for section in doc.sections:
    if section.meta.get("url"):
        print(f"\nFrom {section.meta['url']}:")
        print(f"Title: {section.heading}")
        print(f"Content: {section.chunks[0].text[:200]}...")

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Adding New Parsers

Create a new parser class implementing ParserProtocol
Add it to the parsers/ directory
Register it in the core module
Add tests and documentation

Development Setup

git clone https://github.com/dhruvildarji/panparsex.git
cd panparsex
pip install -e ".[dev]"
pytest

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Changelog

v0.1.0 (2024-01-XX)

Initial release
Support for 13+ file types
Web scraping capabilities
Plugin architecture
Comprehensive test suite

Support

📧 Email: dhruvil.darji@gmail.com
🐛 Issues: GitHub Issues
📖 Documentation: GitHub Wiki

Roadmap

OCR support for scanned documents
Audio/video transcription
Database connection parsing
Cloud storage integration
Advanced web scraping (Selenium support)
Content deduplication
Language detection
Sentiment analysis integration

Project details

Release history

0.5.2

Oct 5, 2025

0.5.1

Oct 5, 2025

0.5.0

Oct 4, 2025

0.3.0

Oct 4, 2025

0.2.2

Oct 1, 2025

0.2.1

Oct 1, 2025

0.2.0

Oct 1, 2025

0.1.10

Oct 1, 2025

0.1.7

Oct 1, 2025

0.1.0 (this version)

Oct 1, 2025

Download files

Source Distribution

panparsex-0.1.0.tar.gz (42.2 kB)

Uploaded Oct 1, 2025 Source

Built Distribution

panparsex-0.1.0-py3-none-any.whl (24.5 kB)

Uploaded Oct 1, 2025 Python 3

File details

Details for the file panparsex-0.1.0.tar.gz.

File metadata

Download URL: panparsex-0.1.0.tar.gz
Upload date: Oct 1, 2025
Size: 42.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for panparsex-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c21e1ae660901ca4a382eed6660cb52ee591bcadcc82f31eded5c9795ec5156f`
MD5	`dfb1c861d081761ecc0112f949345586`
BLAKE2b-256	`e541e5cc8b905835bdcdc484b03b3b38b5f43b30b4e6703e0836ca36382545c6`

File details

Details for the file panparsex-0.1.0-py3-none-any.whl.

File metadata

Download URL: panparsex-0.1.0-py3-none-any.whl
Upload date: Oct 1, 2025
Size: 24.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for panparsex-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`972d7f85079140b8830be47b67e64932d825091243ed7640c254cb27d042b785`
MD5	`121f6e827ee9abd3027f3272a605d174`
BLAKE2b-256	`bc1c1ef9f3be2aba54f74fb59300fe766fd64cc2b9badbedc826cac62dc494e3`
