A powerful web content fetcher and processor

FastParser 🚀

License: MIT | Python 3.7+ | Code style: black

A high-performance, asynchronous content parser that supports both HTML and PDF extraction with special handling for arXiv papers.

✨ Features

  • 🚄 Asynchronous content fetching
  • 📄 PDF extraction support
  • 🌐 HTML parsing
  • 📚 Special handling for arXiv URLs
  • 📦 Batch processing capability
  • 🔄 Progress tracking with tqdm

🛠️ Installation

pip install parselite  # distribution name on PyPI

# Dependencies
pip install aiohttp PyPDF2 tqdm

🚀 Quick Start

from fastparser import parse

# Single URL parsing
text = parse("https://example.com")

# Batch processing
urls = [
    "https://example.com",
    "https://arxiv.org/abs/2301.01234",
    "https://example.com/document.pdf"
]
texts = parse(urls)

📖 Detailed Usage

Basic Parser Configuration

from fastparser import FastParser

# Initialize with PDF extraction (default: True)
parser = FastParser(extract_pdf=True)

# Single URL
content = parser.fetch("https://example.com")

# Multiple URLs
contents = parser.fetch_batch([
    "https://example.com",
    "https://arxiv.org/abs/2301.01234"
])

Working with arXiv Papers

The parser automatically handles different arXiv URL formats:

parser = FastParser()

# These will be automatically converted to appropriate formats
urls = [
    "https://arxiv.org/abs/2301.01234",  # Will fetch PDF if extract_pdf=True
    "http://arxiv.org/html/2301.01234",  # Will fetch HTML or PDF based on settings
]
contents = parser.fetch_batch(urls)
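
The kind of URL rewriting described above can be sketched as follows. This is a minimal illustration, not the library's own `_arxiv_url_fix`: the helper name `normalize_arxiv_url`, the regular expression, and the choice to map every arXiv form to `/pdf/` when `extract_pdf` is true (and to `/html/` otherwise) are all assumptions made for the example.

```python
import re

# Matches arxiv.org/abs/<id>, arxiv.org/html/<id> and arxiv.org/pdf/<id>[.pdf]
_ARXIV_RE = re.compile(
    r"^https?://arxiv\.org/(?:abs|html|pdf)/(?P<id>[^?#]+?)(?:\.pdf)?/?$"
)

def normalize_arxiv_url(url: str, extract_pdf: bool = True) -> str:
    """Hypothetical helper: map any arXiv URL form to one canonical target."""
    match = _ARXIV_RE.match(url)
    if match is None:
        return url  # not an arXiv URL; leave it untouched
    kind = "pdf" if extract_pdf else "html"  # assumed mapping, see lead-in
    return f"https://arxiv.org/{kind}/{match.group('id')}"

print(normalize_arxiv_url("https://arxiv.org/abs/2301.01234"))
print(normalize_arxiv_url("http://arxiv.org/html/2301.01234", extract_pdf=False))
```

Normalizing up front means the rest of a pipeline only has to deal with one URL shape per paper, whichever form the caller supplied.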

PDF-Only Processing

parser = FastParser(extract_pdf=True)

pdf_urls = [
    "https://example.com/document.pdf",
    "https://arxiv.org/pdf/2301.01234.pdf"
]
pdf_contents = parser.fetch_batch(pdf_urls)

🔧 API Reference

FastParser Class

from typing import Union

class FastParser:
    def __init__(self, extract_pdf: bool = True) -> None: ...
    def fetch(self, url: str) -> str: ...
    def fetch_batch(self, urls: list) -> list: ...
    def __call__(self, urls: Union[str, list]) -> Union[str, list]: ...

Main Functions

  • parse(urls: str|list) -> str|list: Convenience function for quick parsing
  • _async_html_parser(urls: list): Internal async processing method
  • _fetch_pdf_content(pdf_urls: list): Internal PDF processing method
  • _arxiv_url_fix(url: str): Internal arXiv URL formatting method
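
The `str|list -> str|list` signatures above suggest that the return type mirrors the input type. One way such a dispatch could be written is sketched below. The names `dispatch` and `fake_fetch_many` are hypothetical, and the stand-in fetcher exists only so the example runs offline; none of this is taken from the library's source.

```python
from typing import Callable, List, Union

def dispatch(
    urls: Union[str, List[str]],
    fetch_many: Callable[[List[str]], List[str]],
) -> Union[str, List[str]]:
    """Hypothetical wrapper: return a str for a str input, a list for a list."""
    if isinstance(urls, str):
        # Wrap the single URL, then unwrap the single result
        return fetch_many([urls])[0]
    return fetch_many(list(urls))

def fake_fetch_many(urls: List[str]) -> List[str]:
    """Stand-in for real fetching so the sketch runs without a network."""
    return [f"content of {u}" for u in urls]

single = dispatch("https://example.com", fake_fetch_many)
many = dispatch(["https://example.com", "https://example.org"], fake_fetch_many)
```

Routing both cases through one batch function keeps a single code path for fetching, at the cost of a slightly less precise return type.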

⚡ Performance

The parser uses asynchronous I/O so that network waits overlap instead of running one after another:

  • Concurrent URL fetching
  • Batch processing capabilities
  • Progress tracking with tqdm
  • Memory-efficient PDF processing
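
To make "concurrent URL fetching" concrete, here is a small standard-library sketch of the general pattern. It is not the library's implementation: `fake_fetch` sleeps instead of issuing a request, and `fetch_all` and its `limit` parameter are hypothetical names chosen for the example.

```python
import asyncio
from typing import List

async def fake_fetch(url: str, delay: float = 0.05) -> str:
    """Stand-in for a network request; sleeps instead of doing I/O."""
    await asyncio.sleep(delay)
    return f"content of {url}"

async def fetch_all(urls: List[str], limit: int = 10) -> List[str]:
    """Fetch all URLs concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))

urls = [f"https://example.com/{n}" for n in range(20)]
contents = asyncio.run(fetch_all(urls))
```

With 20 URLs and a limit of 10, the simulated waits run in two overlapping waves rather than 20 sequential sleeps, which is where the speedup of asynchronous fetching comes from.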

🔍 Example: Advanced Usage

from fastparser import FastParser

def process_large_dataset(all_urls, batch_size=50):
    parser = FastParser(extract_pdf=True)

    results = []
    # Process URLs in fixed-size batches
    for i in range(0, len(all_urls), batch_size):
        batch = all_urls[i:i + batch_size]
        # fetch_batch is called synchronously (no await), so a plain
        # function is enough; no asyncio wrapper is needed here
        results.extend(parser.fetch_batch(batch))

    return results

# Placeholder URLs "url1" ... "url1000"; substitute real ones
all_urls = [f"url{n}" for n in range(1, 1001)]
results = process_large_dataset(all_urls)

⚠️ Error Handling

The parser handles common failure cases without raising:

  • Failed URL fetches return empty strings
  • PDF processing errors are caught
  • HTTP status codes are checked
  • Invalid URL formats are handled
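
Because a failed fetch comes back as an empty string, callers may want to separate failures from successes after a batch. The sketch below assumes that results are returned in the same order as the input URLs, one per URL, which the source does not state explicitly. `split_results` is a hypothetical helper, and the sample data stands in for a real `parser.fetch_batch(urls)` call.

```python
from typing import Dict, List, Tuple

def split_results(
    urls: List[str], contents: List[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Pair each URL with its content; collect URLs whose content is empty."""
    succeeded: Dict[str, str] = {}
    failed: List[str] = []
    for url, text in zip(urls, contents):
        if text:
            succeeded[url] = text
        else:
            failed.append(url)  # empty string is treated as a failed fetch
    return succeeded, failed

urls = ["https://example.com", "https://example.com/missing"]
contents = ["Example Domain", ""]  # sample data in place of parser.fetch_batch(urls)
succeeded, failed = split_results(urls, contents)
print(failed)
```

Note that a page with genuinely empty content is indistinguishable from a failure under this convention, so the `failed` list is best treated as "worth retrying or inspecting" rather than as definite errors.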

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 Dependencies

  • aiohttp: Async HTTP client/server framework
  • PyPDF2: PDF processing library
  • tqdm: Progress bar library
  • Custom FastHTMLParserV3 module

📋 TODO

  • Add support for more document types
  • Implement caching mechanism
  • Add timeout configurations
  • Improve error reporting
  • Add proxy support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❤️ by [Your Name]
