Universal, extensible parser for many file types and websites with a clean plugin architecture.
panparsex
Pan-parse anything. A universal, extensible parser that normalizes content from files and websites into a single, clean schema.
Features
- 🧩 Plugin Architecture: Add new parsers without touching core code
- 📄 Comprehensive Support: Text, JSON, YAML, XML, HTML, PDF, CSV, DOCX, Markdown, RTF, Excel, PowerPoint, and more
- 🌐 Web Scraping: Intelligent website crawling with robots.txt respect and JavaScript extraction
- 🧠 Smart Detection: Auto-detection by MIME type, file extension, and content analysis
- 🔁 Recursive Processing: Folder traversal and website crawling with configurable depth
- 🧪 Clean Schema: Unified Pydantic-based output format for all content types
- 🛠️ Zero Configuration: Works out of the box with sensible defaults
- 🚀 High Performance: Optimized for speed and memory efficiency
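The "Smart Detection" feature above can be pictured as a small cascade: URL scheme first, then file extension, then a look at the leading bytes. The sketch below is only an illustration built on the standard library; `guess_content_type` and its sniffing rules are hypothetical and are not panparsex's actual detection code.

```python
import mimetypes


def guess_content_type(target: str, head: bytes = b"") -> str:
    """Guess a MIME type from the URL scheme, the extension, then leading bytes.

    Hypothetical helper for illustration; not part of the panparsex API.
    """
    # URLs are treated as HTML pages in this sketch.
    if target.startswith(("http://", "https://")):
        return "text/html"

    # 1) Extension lookup via the standard library.
    guessed, _encoding = mimetypes.guess_type(target)
    if guessed:
        return guessed

    # 2) Very rough content sniffing on the first bytes.
    sample = head.lstrip()
    if sample.startswith(b"%PDF-"):
        return "application/pdf"
    if sample.startswith((b"{", b"[")):
        return "application/json"
    if sample.lower().startswith((b"<!doctype html", b"<html")):
        return "text/html"
    if sample.startswith(b"<?xml"):
        return "application/xml"

    # 3) Fall back to plain text when nothing else matches.
    return "text/plain"
```

A guessed type like this could be passed explicitly through the `content_type` argument shown in the Quick Start when auto-detection picks the wrong parser.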
Installation
pip install panparsex
Quick Start
Command Line Interface
# Parse a single file
panparsex parse document.pdf
# Parse a website with recursive crawling
panparsex parse https://example.com --recursive --max-links 50 --max-depth 2
# Parse a directory recursively
panparsex parse ./documents --recursive --glob '**/*'
# Pretty-print output
panparsex parse document.html --pretty
Python API
from panparsex import parse
# Parse a file
doc = parse("document.pdf")
print(doc.meta.title)
print(doc.sections[0].chunks[0].text)
# Parse a website
doc = parse("https://example.com", recursive=True, max_links=10)
for section in doc.sections:
    print(f"Section: {section.heading}")
    for chunk in section.chunks:
        print(f"  {chunk.text[:100]}...")
# Parse with custom options
doc = parse("data.csv", content_type="text/csv")
print(doc.meta.extra["csv_data"]["headers"])
Supported File Types
| Type | Extensions | Description |
|---|---|---|
| Text | .txt | Plain text files |
| JSON | .json | JSON documents with structured data |
| YAML | .yml, .yaml | YAML configuration files |
| XML | .xml | XML documents |
| HTML | .html, .htm, .xhtml | HTML web pages with metadata extraction |
| PDF | .pdf | PDF documents with page-by-page extraction |
| CSV | .csv | Comma-separated values with header detection |
| Markdown | .md, .markdown | Markdown documents with structure preservation |
| Word | .docx | Microsoft Word documents |
| Excel | .xlsx, .xls | Excel spreadsheets with sheet extraction |
| PowerPoint | .pptx | PowerPoint presentations with slide extraction |
| RTF | .rtf | Rich Text Format documents |
| Web | http://, https:// | Websites with intelligent content extraction |
Output Schema
All parsed content follows a unified schema:
class UnifiedDocument(BaseModel):
    schema_id: str = "panparsex/v1"
    meta: Metadata
    sections: List[Section]

class Metadata(BaseModel):
    source: str
    content_type: str
    title: Optional[str]
    url: Optional[str]
    path: Optional[str]
    extra: Dict[str, Any]

class Section(BaseModel):
    heading: Optional[str]
    chunks: List[Chunk]
    meta: Dict[str, Any]

class Chunk(BaseModel):
    text: str
    order: int
    meta: Dict[str, Any]
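Because every parser returns the same shape, downstream code can walk any document the same way. The sketch below flattens a document into plain text using an ordinary dict that mirrors the schema above. The sample data is made up, `iter_chunks` and `full_text` are hypothetical helpers, and treating `order` as the chunk's position within its section is an assumption. If the models are Pydantic v2, `doc.model_dump()` would be one way to obtain such a dict.

```python
from typing import Any, Dict, Iterator, Optional, Tuple

# Hypothetical sample shaped like the schema above (not real parser output).
sample: Dict[str, Any] = {
    "schema_id": "panparsex/v1",
    "meta": {
        "source": "notes.md",
        "content_type": "text/markdown",
        "title": "Notes",
        "url": None,
        "path": "notes.md",
        "extra": {},
    },
    "sections": [
        {
            "heading": "Intro",
            "meta": {},
            "chunks": [
                {"text": "second", "order": 1, "meta": {}},
                {"text": "first", "order": 0, "meta": {}},
            ],
        },
        {
            "heading": None,
            "meta": {},
            "chunks": [{"text": "tail", "order": 0, "meta": {}}],
        },
    ],
}


def iter_chunks(doc: Dict[str, Any]) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (section heading, chunk text) pairs, sorted by `order` per section."""
    for section in doc["sections"]:
        # Assumption: `order` is the chunk's position inside its section.
        for chunk in sorted(section["chunks"], key=lambda c: c["order"]):
            yield section["heading"], chunk["text"]


def full_text(doc: Dict[str, Any], sep: str = "\n") -> str:
    """Join every chunk's text into a single string."""
    return sep.join(text for _heading, text in iter_chunks(doc))
```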
Advanced Usage
Web Scraping with JavaScript
# Extract JavaScript content from websites
doc = parse("https://spa-website.com", extract_js=True)
# Find JavaScript sections
for section in doc.sections:
    if section.meta.get("type") == "javascript":
        print(f"JS from {section.meta['url']}: {section.chunks[0].text[:200]}...")
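To give a feel for what extracting JavaScript from a page can involve, here is a minimal standard-library sketch that collects the bodies of inline `<script>` elements. It is not how `extract_js=True` is implemented in panparsex; `ScriptCollector` and `extract_inline_scripts` are hypothetical names, and external scripts (those with a `src` attribute) are simply skipped.

```python
from html.parser import HTMLParser
from typing import List


class ScriptCollector(HTMLParser):
    """Collect the text of inline <script> elements (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self._in_script = False
        self._buffer: List[str] = []
        self.scripts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # External scripts carry a src attribute and have no inline body.
        if tag == "script" and not dict(attrs).get("src"):
            self._in_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so buffer it.
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self._in_script = False
            code = "".join(self._buffer).strip()
            if code:
                self.scripts.append(code)


def extract_inline_scripts(html: str) -> List[str]:
    """Return the inline script bodies found in an HTML string."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return collector.scripts
```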
Custom Parser Registration
from panparsex import register_parser, ParserProtocol
from panparsex.types import UnifiedDocument, Metadata
class CustomParser(ParserProtocol):
    name = "custom"
    content_types = ("application/custom",)
    extensions = (".custom",)

    def can_parse(self, meta: Metadata) -> bool:
        return meta.content_type == "application/custom"

    def parse(self, target, meta: Metadata, recursive: bool = False, **kwargs) -> UnifiedDocument:
        # Your parsing logic here
        return UnifiedDocument(meta=meta, sections=[])
# Register the parser
register_parser(CustomParser())
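The registration pattern above follows a common plugin design: parsers announce what they can handle through `can_parse`, and a registry picks the first match. The toy registry below illustrates that idea only. `ToyMeta`, `ToyRegistry`, and both parser classes are hypothetical, and the choice to try the most recently registered parser first is an assumption, not a description of panparsex internals.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class ToyMeta:
    """Stand-in for the Metadata model, reduced to two fields."""
    source: str
    content_type: str


class ToyParser(Protocol):
    name: str

    def can_parse(self, meta: ToyMeta) -> bool: ...
    def parse(self, target: str, meta: ToyMeta) -> Dict[str, str]: ...


class UpperParser:
    """Handles only the made-up 'application/custom' type."""
    name = "upper"

    def can_parse(self, meta: ToyMeta) -> bool:
        return meta.content_type == "application/custom"

    def parse(self, target: str, meta: ToyMeta) -> Dict[str, str]:
        return {"parser": self.name, "text": target.upper()}


class FallbackParser:
    """Accepts anything, so it acts as the default."""
    name = "text"

    def can_parse(self, meta: ToyMeta) -> bool:
        return True

    def parse(self, target: str, meta: ToyMeta) -> Dict[str, str]:
        return {"parser": self.name, "text": target}


class ToyRegistry:
    def __init__(self) -> None:
        self._parsers: List[ToyParser] = []

    def register(self, parser: ToyParser) -> None:
        # Assumption: newer registrations take priority over older ones.
        self._parsers.insert(0, parser)

    def dispatch(self, target: str, meta: ToyMeta) -> Dict[str, str]:
        for parser in self._parsers:
            if parser.can_parse(meta):
                return parser.parse(target, meta)
        raise LookupError(f"no parser for {meta.content_type}")


registry = ToyRegistry()
registry.register(FallbackParser())
registry.register(UpperParser())
```

With this ordering, a specialised parser registered later wins over the catch-all fallback, which is why core code never needs to change when a plugin is added.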
Batch Processing
from pathlib import Path
from panparsex import parse

def process_directory(directory: str):
    """Process all files in a directory."""
    results = []
    for file_path in Path(directory).rglob("*"):
        if file_path.is_file():
            try:
                doc = parse(str(file_path))
                results.append({
                    "file": str(file_path),
                    "title": doc.meta.title,
                    "content_length": sum(len(chunk.text) for section in doc.sections for chunk in section.chunks),
                    "sections": len(doc.sections)
                })
            except Exception as e:
                # Report the failure and continue with the remaining files
                print(f"Error processing {file_path}: {e}")
    return results

# Process a directory
results = process_directory("./documents")
for result in results:
    print(f"{result['file']}: {result['sections']} sections, {result['content_length']} chars")
Configuration
Environment Variables
- PANPARSEX_USER_AGENT: Custom user agent for web scraping
- PANPARSEX_TIMEOUT: Request timeout in seconds (default: 15)
- PANPARSEX_DELAY: Delay between requests in seconds (default: 0)
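As an illustration of how these variables and their documented defaults fit together, the sketch below reads them into a small settings object. `ScrapeSettings` and `load_settings` are hypothetical and not part of panparsex. Parsing the numeric values as floats and using `None` when no user agent is set are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScrapeSettings:
    user_agent: Optional[str]  # None means "use the library's own default"
    timeout: float             # seconds
    delay: float               # seconds between requests


def load_settings(env: Mapping[str, str] = os.environ) -> ScrapeSettings:
    """Read the PANPARSEX_* variables, falling back to the documented defaults."""
    return ScrapeSettings(
        user_agent=env.get("PANPARSEX_USER_AGENT"),
        timeout=float(env.get("PANPARSEX_TIMEOUT", "15")),
        delay=float(env.get("PANPARSEX_DELAY", "0")),
    )
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.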
CLI Options
panparsex parse [OPTIONS] TARGET
Options:
--recursive Enable recursive processing
--glob TEXT Glob pattern for directory processing
--max-links INTEGER Maximum links to follow (web scraping)
--max-depth INTEGER Maximum crawl depth (web scraping)
--same-origin Restrict crawling to same origin
--pretty Pretty-print JSON output
--help Show help message
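When the CLI is driven from another Python program, the options above map naturally onto an argument list. The sketch below builds that list and, as a second step, runs it with `subprocess`. `build_command` and `run_parse` are hypothetical helpers. `run_parse` assumes the `panparsex` executable is on `PATH` and that it prints a single JSON document to stdout, which the `--pretty` option suggests but this page does not state outright.

```python
import json
import subprocess
from typing import Any, Dict, List, Optional


def build_command(
    target: str,
    *,
    recursive: bool = False,
    glob: Optional[str] = None,
    max_links: Optional[int] = None,
    max_depth: Optional[int] = None,
    same_origin: bool = False,
    pretty: bool = False,
) -> List[str]:
    """Translate keyword options into `panparsex parse [OPTIONS] TARGET`."""
    cmd = ["panparsex", "parse"]
    if recursive:
        cmd.append("--recursive")
    if glob is not None:
        cmd += ["--glob", glob]
    if max_links is not None:
        cmd += ["--max-links", str(max_links)]
    if max_depth is not None:
        cmd += ["--max-depth", str(max_depth)]
    if same_origin:
        cmd.append("--same-origin")
    if pretty:
        cmd.append("--pretty")
    cmd.append(target)
    return cmd


def run_parse(target: str, **options: Any) -> Dict[str, Any]:
    """Run the CLI and decode stdout (assumes one JSON document is printed)."""
    completed = subprocess.run(
        build_command(target, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```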
Examples
Extract Text from PDF
from panparsex import parse
doc = parse("report.pdf")
for section in doc.sections:
    print(f"Page {section.meta.get('page_number', 'Unknown')}:")
    print(section.chunks[0].text[:200] + "...")
Parse Excel Spreadsheet
from panparsex import parse
doc = parse("data.xlsx")
for section in doc.sections:
    if section.meta.get("type") == "sheet":
        print(f"Sheet: {section.meta['sheet_name']}")
        print(f"Rows: {section.meta['rows']}, Cols: {section.meta['cols']}")
        print(section.chunks[0].text[:300] + "...")
Scrape Website Content
from panparsex import parse
doc = parse("https://news-website.com", recursive=True, max_links=20, max_depth=2)
print(f"Crawled {doc.meta.extra['pages_parsed']} pages")
print(f"Unique domains: {doc.meta.extra['crawl_stats']['unique_domains']}")
for section in doc.sections:
    if section.meta.get("url"):
        print(f"\nFrom {section.meta['url']}:")
        print(f"Title: {section.heading}")
        print(f"Content: {section.chunks[0].text[:200]}...")
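To show how `max_links`, `max_depth`, and same-origin restriction can interact during a crawl, here is a breadth-first sketch over an in-memory link graph. It is not panparsex's crawler: `crawl_order` is a hypothetical function, `get_links` stands in for fetching a page and extracting its links, and treating `max_links` as a cap on pages visited is an assumption about semantics this page does not spell out.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urlsplit


def crawl_order(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_links: int = 50,
    max_depth: int = 2,
    same_origin: bool = True,
) -> List[str]:
    """Return URLs in the order a breadth-first crawl would visit them."""
    origin = urlsplit(start)[:2]  # (scheme, netloc) of the start URL
    seen = {start}
    queue = deque([(start, 0)])
    visited: List[str] = []

    while queue and len(visited) < max_links:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link in seen:
                continue  # already queued or visited
            if same_origin and urlsplit(link)[:2] != origin:
                continue  # skip other hosts or schemes
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# Hypothetical link graph used in place of real HTTP requests.
graph = {
    "https://a.test/": ["https://a.test/x", "https://b.test/", "https://a.test/y"],
    "https://a.test/x": ["https://a.test/z", "https://a.test/"],
    "https://a.test/y": [],
    "https://a.test/z": ["https://a.test/deep"],
}


def links_from_graph(url: str) -> List[str]:
    return graph.get(url, [])
```

A real crawler would also consult robots.txt and apply the request delay before each fetch; both are left out here to keep the traversal logic visible.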
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Adding New Parsers
- Create a new parser class implementing ParserProtocol
- Add it to the parsers/ directory
- Register it in the core module
- Add tests and documentation
Development Setup
git clone https://github.com/dhruvildarji/panparsex.git
cd panparsex
pip install -e ".[dev]"
pytest
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Changelog
v0.1.0 (2024-01-XX)
- Initial release
- Support for 13+ file types
- Web scraping capabilities
- Plugin architecture
- Comprehensive test suite
Support
- 📧 Email: dhruvil.darji@gmail.com
- 🐛 Issues: GitHub Issues
- 📖 Documentation: GitHub Wiki
Roadmap
- OCR support for scanned documents
- Audio/video transcription
- Database connection parsing
- Cloud storage integration
- Advanced web scraping (Selenium support)
- Content deduplication
- Language detection
- Sentiment analysis integration