
A powerful web content fetcher and processor

Project description

FastParser 🚀

License: MIT · Python 3.7+ · Code style: black

A high-performance, asynchronous content parser that supports both HTML and PDF extraction with special handling for arXiv papers.

✨ Features

  • 🚄 Asynchronous content fetching
  • 📄 PDF extraction support
  • 🌐 HTML parsing
  • 📚 Special handling for arXiv URLs
  • 📦 Batch processing capability
  • 🔄 Progress tracking with tqdm

🛠️ Installation

The project is published on PyPI as parselite (see the distribution files below), so that is the name to install:

pip install parselite

# Dependencies
pip install aiohttp PyPDF2 tqdm

🚀 Quick Start

from parselite import parse

# Single URL parsing
text = parse("https://example.com")

# Batch processing
urls = [
    "https://example.com",
    "https://arxiv.org/abs/2301.01234",
    "https://example.com/document.pdf"
]
texts = parse(urls)

📖 Detailed Usage

Basic Parser Configuration

from parselite import FastParser

# Initialize with PDF extraction (default: True)
parser = FastParser(extract_pdf=True)

# Single URL
content = parser.fetch("https://example.com")

# Multiple URLs
contents = parser.fetch_batch([
    "https://example.com",
    "https://arxiv.org/abs/2301.01234"
])

Working with arXiv Papers

The parser automatically handles different arXiv URL formats:

parser = FastParser()

# These will be automatically converted to appropriate formats
urls = [
    "https://arxiv.org/abs/2301.01234",  # Will fetch PDF if extract_pdf=True
    "http://arxiv.org/html/2301.01234",  # Will fetch HTML or PDF based on settings
]
contents = parser.fetch_batch(urls)

PDF-Only Processing

parser = FastParser(extract_pdf=True)

pdf_urls = [
    "https://example.com/document.pdf",
    "https://arxiv.org/pdf/2301.01234.pdf"
]
pdf_contents = parser.fetch_batch(pdf_urls)
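
The dependency list below names PyPDF2 as the PDF library, so the extraction step presumably resembles the following minimal sketch (a hypothetical reimplementation, not the library's actual internals):

import io
from PyPDF2 import PdfReader

def extract_pdf_text(pdf_bytes: bytes) -> str:
    # Read the downloaded PDF from memory and join the text of every page.
    reader = PdfReader(io.BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() or "" for page in reader.pages)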

🔧 API Reference

FastParser Class

class FastParser:
    def __init__(self, extract_pdf: bool = True)
    def fetch(self, url: str) -> str
    def fetch_batch(self, urls: list) -> list
    def __call__(self, urls: str|list) -> str|list
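
Because the class defines __call__, a FastParser instance can be invoked directly like parse, returning a string for a single URL and a list for a list of URLs:

parser = FastParser()

text = parser("https://example.com")                  # str
texts = parser(["https://example.com",
                "https://arxiv.org/abs/2301.01234"])  # list of str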

Main Functions

  • parse(urls: str|list) -> str|list: Convenience function for quick parsing
  • _async_html_parser(urls: list): Internal async processing method
  • _fetch_pdf_content(pdf_urls: list): Internal PDF processing method
  • _arxiv_url_fix(url: str): Internal arXiv URL formatting method
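
For illustration, the kind of normalization _arxiv_url_fix performs might look like the sketch below. This is a hypothetical reimplementation based on the URL formats shown above, not the library's actual code, and it only handles new-style arXiv IDs:

import re

def arxiv_url_fix(url: str, extract_pdf: bool = True) -> str:
    # Rewrite abs/html/pdf arXiv URLs to the endpoint matching the settings.
    m = re.match(r"https?://arxiv\.org/(?:abs|html|pdf)/([\d.]+?)(?:v\d+)?(?:\.pdf)?$", url)
    if not m:
        return url  # not an arXiv URL; leave it untouched
    paper_id = m.group(1)
    target = "pdf" if extract_pdf else "html"
    return f"https://arxiv.org/{target}/{paper_id}"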

⚡ Performance

The parser fetches URLs asynchronously, so large batches are processed concurrently rather than one at a time:

  • Concurrent URL fetching
  • Batch processing capabilities
  • Progress tracking with tqdm
  • Memory-efficient PDF processing
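
As a rough sketch of that concurrency pattern (not the library's internal code), fetching a batch with aiohttp and a tqdm progress bar could look like this:

import asyncio
import aiohttp
from tqdm.asyncio import tqdm

async def fetch_one(session: aiohttp.ClientSession, url: str) -> str:
    # A failed fetch yields "" so one bad URL cannot sink the whole batch.
    try:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.text()
    except Exception:
        return ""

async def fetch_all(urls: list) -> list:
    async with aiohttp.ClientSession() as session:
        # Run all requests concurrently, with a progress bar over the tasks.
        return await tqdm.gather(*(fetch_one(session, u) for u in urls))

texts = asyncio.run(fetch_all(["https://example.com"]))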

🔍 Example: Advanced Usage

from parselite import FastParser

def process_large_dataset(all_urls: list) -> list:
    # fetch_batch already fetches each batch concurrently, so the caller
    # can stay synchronous; chunking the input bounds memory use.
    parser = FastParser(extract_pdf=True)
    batch_size = 50

    results = []
    for i in range(0, len(all_urls), batch_size):
        batch = all_urls[i:i + batch_size]
        results.extend(parser.fetch_batch(batch))
    return results

all_urls = [...]  # e.g. a list of 1,000 URLs
results = process_large_dataset(all_urls)

⚠️ Error Handling

The parser handles errors so that a single bad URL never aborts a batch:

  • Failed URL fetches return empty strings
  • PDF processing errors are caught gracefully
  • HTTP status checks
  • Invalid URL format handling
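
Since failed fetches come back as empty strings rather than exceptions, callers can separate successes from failures afterwards (assuming, as the examples above suggest, that results are returned in input order):

from parselite import parse

urls = ["https://example.com", "https://bad.invalid/missing"]
texts = parse(urls)

# Empty strings mark failures.
succeeded = {u: t for u, t in zip(urls, texts) if t}
failed = [u for u, t in zip(urls, texts) if not t]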

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 Dependencies

  • aiohttp: Async HTTP client/server framework
  • PyPDF2: PDF processing library
  • tqdm: Progress bar library
  • FastHTMLParserV3: custom HTML parsing module

📋 TODO

  • Add support for more document types
  • Implement caching mechanism
  • Add timeout configurations
  • Improve error reporting
  • Add proxy support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❤️ by [Your Name]



Download files

Download the file for your platform.

Source Distribution

parselite-0.3.5.tar.gz (5.4 kB)


Built Distribution


parselite-0.3.5-py3-none-any.whl (5.9 kB)


File details

Details for the file parselite-0.3.5.tar.gz.

File metadata

  • Download URL: parselite-0.3.5.tar.gz
  • Size: 5.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.0rc1

File hashes

Hashes for parselite-0.3.5.tar.gz:

  • SHA256: e887069bd4e9879bee74dc7504df96367721fc857288279fb7b76a60426b9606
  • MD5: ef1b3b4b289694b1cf88c777fb84a5fb
  • BLAKE2b-256: d9d5c19a1974398492067a846c20b2823f061caf71d5e6fa9f2febf257b30a31


File details

Details for the file parselite-0.3.5-py3-none-any.whl.

File metadata

  • Download URL: parselite-0.3.5-py3-none-any.whl
  • Size: 5.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.0rc1

File hashes

Hashes for parselite-0.3.5-py3-none-any.whl:

  • SHA256: 2cc49a3fb888b8537d286e120d23f59e2c12b8979a2a82f3125e7fbd1e14bd62
  • MD5: 220ee0080f923aed95966627b5207d25
  • BLAKE2b-256: bf331e8454810d69d6b80d2d98061808013e0a702361a0d37e72b84ff5c4dd25

