Skip to main content

A powerful web content fetcher and processor

Project description

FastParser 🚀

License: MIT Python 3.7+ Code style: black

A high-performance, asynchronous content parser that supports both HTML and PDF extraction with special handling for arXiv papers.

✨ Features

  • 🚄 Asynchronous content fetching
  • 📄 PDF extraction support
  • 🌐 HTML parsing
  • 📚 Special handling for arXiv URLs
  • 📦 Batch processing capability
  • 🔄 Progress tracking with tqdm

🛠️ Installation

pip install fastparser

# Dependencies
pip install aiohttp PyPDF2 tqdm

🚀 Quick Start

from fastparser import parse

# Single URL parsing
text = parse("https://example.com")

# Batch processing
urls = [
    "https://example.com",
    "https://arxiv.org/abs/2301.01234",
    "https://example.com/document.pdf"
]
texts = parse(urls)

📖 Detailed Usage

Basic Parser Configuration

from fastparser import FastParser

# Initialize with PDF extraction (default: True)
parser = FastParser(extract_pdf=True)

# Single URL
content = parser.fetch("https://example.com")

# Multiple URLs
contents = parser.fetch_batch([
    "https://example.com",
    "https://arxiv.org/abs/2301.01234"
])

Working with arXiv Papers

The parser automatically handles different arXiv URL formats:

parser = FastParser()

# These will be automatically converted to appropriate formats
urls = [
    "https://arxiv.org/abs/2301.01234",  # Will fetch PDF if extract_pdf=True
    "http://arxiv.org/html/2301.01234",  # Will fetch HTML or PDF based on settings
]
contents = parser.fetch_batch(urls)

PDF-Only Processing

parser = FastParser(extract_pdf=True)

pdf_urls = [
    "https://example.com/document.pdf",
    "https://arxiv.org/pdf/2301.01234.pdf"
]
pdf_contents = parser.fetch_batch(pdf_urls)

🔧 API Reference

FastParser Class

class FastParser:
    def __init__(self, extract_pdf: bool = True)
    def fetch(self, url: str) -> str
    def fetch_batch(self, urls: list) -> list
    def __call__(self, urls: str|list) -> str|list

Main Functions

  • parse(urls: str|list) -> str|list: Convenience function for quick parsing
  • _async_html_parser(urls: list): Internal async processing method
  • _fetch_pdf_content(pdf_urls: list): Internal PDF processing method
  • _arxiv_url_fix(url: str): Internal arXiv URL formatting method

⚡ Performance

The parser uses asynchronous operations for optimal performance:

  • Concurrent URL fetching
  • Batch processing capabilities
  • Progress tracking with tqdm
  • Memory-efficient PDF processing

🔍 Example: Advanced Usage

import asyncio
from fastparser import FastParser

async def process_large_dataset():
    parser = FastParser(extract_pdf=True)
    
    # Process URLs in batches
    all_urls = ["url1", "url2", ..., "url1000"]
    batch_size = 50
    
    results = []
    for i in range(0, len(all_urls), batch_size):
        batch = all_urls[i:i + batch_size]
        batch_results = parser.fetch_batch(batch)
        results.extend(batch_results)
        
    return results

# Run with asyncio
results = asyncio.run(process_large_dataset())

⚠️ Error Handling

The parser includes robust error handling:

  • Failed URL fetches return empty strings
  • PDF processing errors are caught gracefully
  • HTTP status checks
  • Invalid URL format handling

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 Dependencies

  • aiohttp: Async HTTP client/server framework
  • PyPDF2: PDF processing library
  • tqdm: Progress bar library
  • Custom FastHTMLParserV3 module

📋 TODO

  • Add support for more document types
  • Implement caching mechanism
  • Add timeout configurations
  • Improve error reporting
  • Add proxy support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❤️ by [Your Name]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

parselite-0.3.4.tar.gz (5.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

parselite-0.3.4-py3-none-any.whl (5.8 kB view details)

Uploaded Python 3

File details

Details for the file parselite-0.3.4.tar.gz.

File metadata

  • Download URL: parselite-0.3.4.tar.gz
  • Upload date:
  • Size: 5.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.0rc1

File hashes

Hashes for parselite-0.3.4.tar.gz
Algorithm Hash digest
SHA256 bec6a81d81d152a38d0609c30a138fae6a244d1f9e97c972f0eaf75fb5489f5f
MD5 6173978f080d98eb871c08e08e8a0854
BLAKE2b-256 f53f83e79bc312da4ccd4e3115a7a39379586cf97adeb41abd7f46c5c4b2972f

See more details on using hashes here.

File details

Details for the file parselite-0.3.4-py3-none-any.whl.

File metadata

  • Download URL: parselite-0.3.4-py3-none-any.whl
  • Upload date:
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.0rc1

File hashes

Hashes for parselite-0.3.4-py3-none-any.whl
Algorithm Hash digest
SHA256 b9031fddd9cc85bf53eb025d7e8cd0c3d6aad8344db7c5b254749cc87ce99053
MD5 453aac47ab637c4390192af0a7c8a0f4
BLAKE2b-256 7070c3c333f92faa9e9cf9db80836293ae7be35e99d1155e2202c2ced890155b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page