A powerful web content fetcher and processor

FastParser 🚀

License: MIT | Python 3.7+ | Code style: black

A high-performance, asynchronous content parser that supports both HTML and PDF extraction with special handling for arXiv papers.

✨ Features

  • 🚄 Asynchronous content fetching
  • 📄 PDF extraction support
  • 🌐 HTML parsing
  • 📚 Special handling for arXiv URLs
  • 📦 Batch processing capability
  • 🔄 Progress tracking with tqdm

🛠️ Installation

pip install fastparser

# Dependencies
pip install aiohttp PyPDF2 tqdm

🚀 Quick Start

from fastparser import parse

# Single URL parsing
text = parse("https://example.com")

# Batch processing
urls = [
    "https://example.com",
    "https://arxiv.org/abs/2301.01234",
    "https://example.com/document.pdf"
]
texts = parse(urls)

📖 Detailed Usage

Basic Parser Configuration

from fastparser import FastParser

# Initialize with PDF extraction (default: True)
parser = FastParser(extract_pdf=True)

# Single URL
content = parser.fetch("https://example.com")

# Multiple URLs
contents = parser.fetch_batch([
    "https://example.com",
    "https://arxiv.org/abs/2301.01234"
])

Working with arXiv Papers

The parser automatically handles different arXiv URL formats:

parser = FastParser()

# These will be automatically converted to appropriate formats
urls = [
    "https://arxiv.org/abs/2301.01234",  # Will fetch PDF if extract_pdf=True
    "http://arxiv.org/html/2301.01234",  # Will fetch HTML or PDF based on settings
]
contents = parser.fetch_batch(urls)
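The rewriting done by the internal `_arxiv_url_fix` is not documented; the sketch below shows one plausible normalization (an assumption for illustration, not the library's actual code): `abs/` and `html/` links are mapped to the `pdf/` endpoint when PDF extraction is enabled.

```python
def arxiv_url_fix(url: str, extract_pdf: bool = True) -> str:
    """Hypothetical arXiv URL normalizer: map abs/html links to the
    PDF endpoint when PDF extraction is enabled (sketch only)."""
    if "arxiv.org" not in url or not extract_pdf:
        return url
    for prefix in ("/abs/", "/html/"):
        if prefix in url:
            paper_id = url.split(prefix, 1)[1]
            return f"https://arxiv.org/pdf/{paper_id}"
    return url
```

With this sketch, `arxiv_url_fix("https://arxiv.org/abs/2301.01234")` yields the corresponding `pdf/` URL, while non-arXiv URLs pass through unchanged.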

PDF-Only Processing

parser = FastParser(extract_pdf=True)

pdf_urls = [
    "https://example.com/document.pdf",
    "https://arxiv.org/pdf/2301.01234.pdf"
]
pdf_contents = parser.fetch_batch(pdf_urls)

🔧 API Reference

FastParser Class

class FastParser:
    def __init__(self, extract_pdf: bool = True)
    def fetch(self, url: str) -> str
    def fetch_batch(self, urls: list) -> list
    def __call__(self, urls: str|list) -> str|list

Main Functions

  • parse(urls: str|list) -> str|list: Convenience function for quick parsing
  • _async_html_parser(urls: list): Internal async processing method
  • _fetch_pdf_content(pdf_urls: list): Internal PDF processing method
  • _arxiv_url_fix(url: str): Internal arXiv URL formatting method
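The `str|list` signatures on `parse` and `__call__` suggest simple type dispatch: a single URL in, a single string out; a list in, a list out. A minimal sketch of that pattern (assumed, not the library's source), with `fetch` standing in for the real fetcher:

```python
from typing import Callable, Union

def parse_dispatch(urls: Union[str, list], fetch: Callable[[str], str]) -> Union[str, list]:
    # Single URL in -> single string out; list in -> list of strings out
    if isinstance(urls, str):
        return fetch(urls)
    return [fetch(u) for u in urls]

# Usage with a stand-in fetcher
single = parse_dispatch("https://example.com", lambda u: f"content of {u}")
many = parse_dispatch(["a", "b"], lambda u: u.upper())
```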

⚡ Performance

The parser uses asynchronous operations for optimal performance:

  • Concurrent URL fetching
  • Batch processing capabilities
  • Progress tracking with tqdm
  • Memory-efficient PDF processing
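With aiohttp as a dependency, the concurrency pattern is most likely `asyncio.gather` over a shared session. The stdlib-only sketch below illustrates that pattern (the fetch itself is simulated with a sleep, so no network or aiohttp is needed):

```python
import asyncio

async def fetch_one(url: str) -> str:
    # Stand-in for an aiohttp GET; the sleep simulates network latency
    await asyncio.sleep(0.01)
    return f"content of {url}"

async def fetch_all(urls: list) -> list:
    # All requests run concurrently; gather preserves input order
    return await asyncio.gather(*(fetch_one(u) for u in urls))

texts = asyncio.run(fetch_all([
    "https://example.com/a",
    "https://example.com/b",
]))
```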

🔍 Example: Advanced Usage

Per the API reference above, fetch_batch is a synchronous method (the async work happens internally), so large datasets can be processed from plain code; batching simply bounds memory use.

from fastparser import FastParser

def process_large_dataset(all_urls):
    parser = FastParser(extract_pdf=True)

    # Process URLs in batches to keep memory use bounded
    batch_size = 50
    results = []
    for i in range(0, len(all_urls), batch_size):
        batch = all_urls[i:i + batch_size]
        results.extend(parser.fetch_batch(batch))
    return results

all_urls = [...]  # your list of URLs
results = process_large_dataset(all_urls)

⚠️ Error Handling

The parser includes robust error handling:

  • Failed URL fetches return empty strings
  • PDF processing errors are caught gracefully
  • HTTP status checks
  • Invalid URL format handling
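Because failed fetches come back as empty strings rather than exceptions, callers can separate successes from failures with a simple filter. A sketch of that pattern (the texts list here is mocked parser output, not a real fetch):

```python
def split_results(urls: list, texts: list) -> tuple:
    """Pair each URL with its fetched text; empty text marks a failure."""
    ok = {u: t for u, t in zip(urls, texts) if t}
    failed = [u for u, t in zip(urls, texts) if not t]
    return ok, failed

# Mocked parser output: the second fetch failed
urls = ["https://example.com", "https://bad.invalid/missing.pdf"]
texts = ["<html>example content</html>", ""]
ok, failed = split_results(urls, texts)
```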

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 Dependencies

  • aiohttp: Async HTTP client/server framework
  • PyPDF2: PDF processing library
  • tqdm: Progress bar library
  • Custom FastHTMLParserV3 module

📋 TODO

  • Add support for more document types
  • Implement caching mechanism
  • Add timeout configurations
  • Improve error reporting
  • Add proxy support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❤️ by [Your Name]

Project details

Source distribution: parselite-0.3.10.tar.gz (5.6 kB)
Built distribution: parselite-0.3.10-py3-none-any.whl (6.1 kB)