
A powerful web content fetcher and processor

Project description

FastParser 🚀

License: MIT · Python 3.7+ · Code style: black

A high-performance, asynchronous content parser that supports both HTML and PDF extraction with special handling for arXiv papers.

✨ Features

  • 🚄 Asynchronous content fetching
  • 📄 PDF extraction support
  • 🌐 HTML parsing
  • 📚 Special handling for arXiv URLs
  • 📦 Batch processing capability
  • 🔄 Progress tracking with tqdm

🛠️ Installation

The project is published on PyPI as parselite (see the distribution files below), so that is the name to install:

pip install parselite

# Dependencies
pip install aiohttp PyPDF2 tqdm

🚀 Quick Start

from parselite import parse

# Single URL parsing
text = parse("https://example.com")

# Batch processing
urls = [
    "https://example.com",
    "https://arxiv.org/abs/2301.01234",
    "https://example.com/document.pdf"
]
texts = parse(urls)

📖 Detailed Usage

Basic Parser Configuration

from parselite import FastParser

# Initialize with PDF extraction (default: True)
parser = FastParser(extract_pdf=True)

# Single URL
content = parser.fetch("https://example.com")

# Multiple URLs
contents = parser.fetch_batch([
    "https://example.com",
    "https://arxiv.org/abs/2301.01234"
])

Working with arXiv Papers

The parser automatically handles different arXiv URL formats:

parser = FastParser()

# These will be automatically converted to appropriate formats
urls = [
    "https://arxiv.org/abs/2301.01234",  # Will fetch PDF if extract_pdf=True
    "http://arxiv.org/html/2301.01234",  # Will fetch HTML or PDF based on settings
]
contents = parser.fetch_batch(urls)

PDF-Only Processing

parser = FastParser(extract_pdf=True)

pdf_urls = [
    "https://example.com/document.pdf",
    "https://arxiv.org/pdf/2301.01234.pdf"
]
pdf_contents = parser.fetch_batch(pdf_urls)
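
The dependency list below names PyPDF2 as the PDF library, so the extraction step presumably resembles the following minimal sketch (a hypothetical reimplementation, not the library's actual internals):

import io
from PyPDF2 import PdfReader

def extract_pdf_text(pdf_bytes: bytes) -> str:
    # Read the downloaded PDF from memory and join the text of every page.
    reader = PdfReader(io.BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() or "" for page in reader.pages)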

🔧 API Reference

FastParser Class

class FastParser:
    def __init__(self, extract_pdf: bool = True)
    def fetch(self, url: str) -> str
    def fetch_batch(self, urls: list) -> list
    def __call__(self, urls: str|list) -> str|list
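
Because the class defines __call__, a FastParser instance can be invoked directly like parse, returning a string for a single URL and a list for a list of URLs:

parser = FastParser()

text = parser("https://example.com")                  # str
texts = parser(["https://example.com",
                "https://arxiv.org/abs/2301.01234"])  # list of str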

Main Functions

  • parse(urls: str|list) -> str|list: Convenience function for quick parsing
  • _async_html_parser(urls: list): Internal async processing method
  • _fetch_pdf_content(pdf_urls: list): Internal PDF processing method
  • _arxiv_url_fix(url: str): Internal arXiv URL formatting method
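
For illustration, the kind of normalization _arxiv_url_fix performs might look like the sketch below. This is a hypothetical reimplementation based on the URL formats shown above, not the library's actual code, and it only handles new-style arXiv IDs:

import re

def arxiv_url_fix(url: str, extract_pdf: bool = True) -> str:
    # Rewrite abs/html/pdf arXiv URLs to the endpoint matching the settings.
    m = re.match(r"https?://arxiv\.org/(?:abs|html|pdf)/([\d.]+?)(?:v\d+)?(?:\.pdf)?$", url)
    if not m:
        return url  # not an arXiv URL; leave it untouched
    paper_id = m.group(1)
    target = "pdf" if extract_pdf else "html"
    return f"https://arxiv.org/{target}/{paper_id}"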

⚡ Performance

The parser fetches URLs asynchronously, so large batches are processed concurrently rather than one at a time:

  • Concurrent URL fetching
  • Batch processing capabilities
  • Progress tracking with tqdm
  • Memory-efficient PDF processing
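
As a rough sketch of that concurrency pattern (not the library's internal code), fetching a batch with aiohttp and a tqdm progress bar could look like this:

import asyncio
import aiohttp
from tqdm.asyncio import tqdm

async def fetch_one(session: aiohttp.ClientSession, url: str) -> str:
    # A failed fetch yields "" so one bad URL cannot sink the whole batch.
    try:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.text()
    except Exception:
        return ""

async def fetch_all(urls: list) -> list:
    async with aiohttp.ClientSession() as session:
        # Run all requests concurrently, with a progress bar over the tasks.
        return await tqdm.gather(*(fetch_one(session, u) for u in urls))

texts = asyncio.run(fetch_all(["https://example.com"]))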

🔍 Example: Advanced Usage

from parselite import FastParser

def process_large_dataset(all_urls: list) -> list:
    # fetch_batch already fetches each batch concurrently, so the caller
    # can stay synchronous; chunking the input bounds memory use.
    parser = FastParser(extract_pdf=True)
    batch_size = 50

    results = []
    for i in range(0, len(all_urls), batch_size):
        batch = all_urls[i:i + batch_size]
        results.extend(parser.fetch_batch(batch))
    return results

all_urls = [...]  # e.g. a list of 1,000 URLs
results = process_large_dataset(all_urls)

⚠️ Error Handling

The parser handles errors so that a single bad URL never aborts a batch:

  • Failed URL fetches return empty strings
  • PDF processing errors are caught gracefully
  • HTTP status checks
  • Invalid URL format handling
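
Since failed fetches come back as empty strings rather than exceptions, callers can separate successes from failures afterwards (assuming, as the examples above suggest, that results are returned in input order):

from parselite import parse

urls = ["https://example.com", "https://bad.invalid/missing"]
texts = parse(urls)

# Empty strings mark failures.
succeeded = {u: t for u, t in zip(urls, texts) if t}
failed = [u for u, t in zip(urls, texts) if not t]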

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 Dependencies

  • aiohttp: Async HTTP client/server framework
  • PyPDF2: PDF processing library
  • tqdm: Progress bar library
  • FastHTMLParserV3: custom HTML parsing module

📋 TODO

  • Add support for more document types
  • Implement caching mechanism
  • Add timeout configurations
  • Improve error reporting
  • Add proxy support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❤️ by [Your Name]



Download files

Download the file for your platform.

Source Distribution

parselite-0.3.5.tar.gz (5.4 kB)


Built Distribution


parselite-0.3.5-py3-none-any.whl (5.9 kB)


File details

Details for the file parselite-0.3.5.tar.gz.

File metadata

  • Download URL: parselite-0.3.5.tar.gz
  • Size: 5.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.0rc1

File hashes

Hashes for parselite-0.3.5.tar.gz:

  • SHA256: e887069bd4e9879bee74dc7504df96367721fc857288279fb7b76a60426b9606
  • MD5: ef1b3b4b289694b1cf88c777fb84a5fb
  • BLAKE2b-256: d9d5c19a1974398492067a846c20b2823f061caf71d5e6fa9f2febf257b30a31


File details

Details for the file parselite-0.3.5-py3-none-any.whl.

File metadata

  • Download URL: parselite-0.3.5-py3-none-any.whl
  • Size: 5.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.0rc1

File hashes

Hashes for parselite-0.3.5-py3-none-any.whl:

  • SHA256: 2cc49a3fb888b8537d286e120d23f59e2c12b8979a2a82f3125e7fbd1e14bd62
  • MD5: 220ee0080f923aed95966627b5207d25
  • BLAKE2b-256: bf331e8454810d69d6b80d2d98061808013e0a702361a0d37e72b84ff5c4dd25

