
FastParser 🚀

License: MIT · Python 3.7+ · Code style: black

A high-performance, asynchronous content parser that supports both HTML and PDF extraction with special handling for arXiv papers.

✨ Features

  • 🚄 Asynchronous content fetching
  • 📄 PDF extraction support
  • 🌐 HTML parsing
  • 📚 Special handling for arXiv URLs
  • 📦 Batch processing capability
  • 🔄 Progress tracking with tqdm

🛠️ Installation

# The distribution on PyPI is published as parselite
pip install parselite

# Direct dependencies
pip install aiohttp PyPDF2 tqdm

🚀 Quick Start

from fastparser import parse

# Single URL parsing
text = parse("https://example.com")

# Batch processing
urls = [
    "https://example.com",
    "https://arxiv.org/abs/2301.01234",
    "https://example.com/document.pdf"
]
texts = parse(urls)

📖 Detailed Usage

Basic Parser Configuration

from fastparser import FastParser

# Initialize with PDF extraction (default: True)
parser = FastParser(extract_pdf=True)

# Single URL
content = parser.fetch("https://example.com")

# Multiple URLs
contents = parser.fetch_batch([
    "https://example.com",
    "https://arxiv.org/abs/2301.01234"
])

Working with arXiv Papers

The parser automatically handles different arXiv URL formats:

parser = FastParser()

# These will be automatically converted to appropriate formats
urls = [
    "https://arxiv.org/abs/2301.01234",  # Will fetch PDF if extract_pdf=True
    "http://arxiv.org/html/2301.01234",  # Will fetch HTML or PDF based on settings
]
contents = parser.fetch_batch(urls)
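Under the hood this rewriting is done by the internal _arxiv_url_fix method. As a mental model only (a hypothetical sketch, not the library's actual code), an abs link maps to the matching PDF endpoint roughly like this:

def arxiv_abs_to_pdf(url: str) -> str:
    # Hypothetical illustration of the abs -to-pdf rewrite;
    # the real rules live in the internal _arxiv_url_fix method
    return url.replace("/abs/", "/pdf/") + ".pdf"

print(arxiv_abs_to_pdf("https://arxiv.org/abs/2301.01234"))
# https://arxiv.org/pdf/2301.01234.pdf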

PDF-Only Processing

parser = FastParser(extract_pdf=True)

pdf_urls = [
    "https://example.com/document.pdf",
    "https://arxiv.org/pdf/2301.01234.pdf"
]
pdf_contents = parser.fetch_batch(pdf_urls)

🔧 API Reference

FastParser Class

class FastParser:
    def __init__(self, extract_pdf: bool = True)
    def fetch(self, url: str) -> str
    def fetch_batch(self, urls: list) -> list
    def __call__(self, urls: str|list) -> str|list

Main Functions

  • parse(urls: str|list) -> str|list: Convenience function for quick parsing
  • _async_html_parser(urls: list): Internal async processing method
  • _fetch_pdf_content(pdf_urls: list): Internal PDF processing method
  • _arxiv_url_fix(url: str): Internal arXiv URL formatting method
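Because FastParser defines __call__ with the same signature as parse, an instance can also be used like a function. A minimal sketch based on the signatures above:

from fastparser import FastParser

parser = FastParser(extract_pdf=False)

# A single URL returns a single string
text = parser("https://example.com")

# A list of URLs returns a list of strings
texts = parser(["https://example.com", "https://arxiv.org/abs/2301.01234"])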

⚡ Performance

The parser uses asynchronous operations for performance; a timing sketch follows this list:

  • Concurrent URL fetching
  • Batch processing capabilities
  • Progress tracking with tqdm
  • Memory-efficient PDF processing
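A rough way to see the benefit of concurrent fetching is to time one batch call against a sequential loop. This sketch uses only the documented API; absolute numbers depend on your network:

import time
from fastparser import FastParser

parser = FastParser(extract_pdf=False)
urls = ["https://example.com"] * 20

# Concurrent: all URLs fetched in a single batch
start = time.perf_counter()
parser.fetch_batch(urls)
print(f"batch:      {time.perf_counter() - start:.2f}s")

# Sequential: one request at a time, for comparison
start = time.perf_counter()
for url in urls:
    parser.fetch(url)
print(f"sequential: {time.perf_counter() - start:.2f}s")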

🔍 Example: Advanced Usage

from fastparser import FastParser

def process_large_dataset(all_urls):
    parser = FastParser(extract_pdf=True)
    batch_size = 50

    # Process URLs in fixed-size batches to keep memory use bounded
    results = []
    for i in range(0, len(all_urls), batch_size):
        batch = all_urls[i:i + batch_size]
        # fetch_batch is called synchronously, as in Quick Start;
        # the concurrent fetching happens inside the parser
        batch_results = parser.fetch_batch(batch)
        results.extend(batch_results)

    return results

all_urls = ["url1", "url2", ..., "url1000"]  # placeholder for your full URL list
results = process_large_dataset(all_urls)

⚠️ Error Handling

The parser is built to fail soft rather than raise; a sketch for filtering failed results follows this list:

  • Failed URL fetches return empty strings
  • PDF processing errors are caught gracefully
  • HTTP status checks
  • Invalid URL format handling
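Because failures come back as empty strings rather than exceptions, a batch result can be filtered in one pass. A minimal sketch, assuming results are returned in input order:

from fastparser import parse

urls = ["https://example.com", "https://bad.invalid/missing"]
texts = parse(urls)

# Failed fetches return an empty string; keep only URLs that produced content
ok = {url: text for url, text in zip(urls, texts) if text}
print(f"fetched {len(ok)} of {len(urls)} URLs")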

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 Dependencies

  • aiohttp: Async HTTP client/server framework
  • PyPDF2: PDF processing library
  • tqdm: Progress bar library
  • FastHTMLParserV3: Custom internal HTML parsing module

📋 TODO

  • Add support for more document types
  • Implement caching mechanism
  • Add timeout configurations
  • Improve error reporting
  • Add proxy support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❤️ by [Your Name]

