# FastParser 🚀
A high-performance, asynchronous content parser that supports both HTML and PDF extraction with special handling for arXiv papers.
## ✨ Features
- 🚄 Asynchronous content fetching
- 📄 PDF extraction support
- 🌐 HTML parsing
- 📚 Special handling for arXiv URLs
- 📦 Batch processing capability
- 🔄 Progress tracking with tqdm
## 🛠️ Installation

```bash
pip install fastparser

# Dependencies
pip install aiohttp PyPDF2 tqdm
```
## 🚀 Quick Start

```python
from fastparser import parse

# Single URL parsing
text = parse("https://example.com")

# Batch processing
urls = [
    "https://example.com",
    "https://arxiv.org/abs/2301.01234",
    "https://example.com/document.pdf",
]
texts = parse(urls)
```
## 📖 Detailed Usage

### Basic Parser Configuration

```python
from fastparser import FastParser

# Initialize with PDF extraction (default: True)
parser = FastParser(extract_pdf=True)

# Single URL
content = parser.fetch("https://example.com")

# Multiple URLs
contents = parser.fetch_batch([
    "https://example.com",
    "https://arxiv.org/abs/2301.01234",
])
```
### Working with arXiv Papers

The parser automatically handles different arXiv URL formats:

```python
parser = FastParser()

# These are automatically converted to the appropriate format
urls = [
    "https://arxiv.org/abs/2301.01234",  # fetched as PDF if extract_pdf=True
    "http://arxiv.org/html/2301.01234",  # fetched as HTML or PDF based on settings
]
contents = parser.fetch_batch(urls)
```
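The actual normalization lives in the internal `_arxiv_url_fix` method, whose implementation is not shown here. As a rough sketch of the idea, a hypothetical `to_arxiv_pdf_url` helper that rewrites `abs`/`html` URLs to their PDF counterparts might look like:

```python
import re

def to_arxiv_pdf_url(url: str) -> str:
    """Rewrite an arXiv abs/html URL to its PDF counterpart.

    Hypothetical helper illustrating the kind of rewrite the
    library's internal _arxiv_url_fix performs; NOT the actual
    implementation.
    """
    match = re.match(r"https?://arxiv\.org/(abs|html|pdf)/([\w.]+?)(?:\.pdf)?$", url)
    if not match:
        return url  # not an arXiv URL; leave untouched
    paper_id = match.group(2)
    return f"https://arxiv.org/pdf/{paper_id}.pdf"
```

Non-arXiv URLs pass through unchanged, so the helper can be applied uniformly to a mixed batch.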
### PDF-Only Processing

```python
parser = FastParser(extract_pdf=True)

pdf_urls = [
    "https://example.com/document.pdf",
    "https://arxiv.org/pdf/2301.01234.pdf",
]
pdf_contents = parser.fetch_batch(pdf_urls)
```
## 🔧 API Reference

### FastParser Class

```python
class FastParser:
    def __init__(self, extract_pdf: bool = True)
    def fetch(self, url: str) -> str
    def fetch_batch(self, urls: list) -> list
    def __call__(self, urls: str | list) -> str | list
```
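The `__call__` signature suggests the parser dispatches on input type: a single URL string returns a single string, and a list returns a list. A self-contained sketch of that dispatch pattern (with a stubbed `fetch`, not the library's real code) could be:

```python
class Dispatcher:
    """Illustrates the str-vs-list dispatch implied by FastParser.__call__.

    fetch() is stubbed; this sketches the calling convention only.
    """

    def fetch(self, url: str) -> str:
        return f"<content of {url}>"  # stub standing in for a real fetch

    def fetch_batch(self, urls: list) -> list:
        return [self.fetch(u) for u in urls]

    def __call__(self, urls):
        # Single URL in, single string out; list in, list out.
        if isinstance(urls, str):
            return self.fetch(urls)
        return self.fetch_batch(urls)
```

This keeps the convenience API ergonomic without forcing callers to wrap single URLs in a list.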
### Main Functions

- `parse(urls: str | list) -> str | list`: Convenience function for quick parsing
- `_async_html_parser(urls: list)`: Internal async processing method
- `_fetch_pdf_content(pdf_urls: list)`: Internal PDF processing method
- `_arxiv_url_fix(url: str)`: Internal arXiv URL formatting method
## ⚡ Performance
The parser uses asynchronous operations for optimal performance:
- Concurrent URL fetching
- Batch processing capabilities
- Progress tracking with tqdm
- Memory-efficient PDF processing
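The concurrency model above can be illustrated with plain `asyncio`: all fetches are started at once and awaited together, so total wall time tracks the slowest request rather than the sum. The sketch below simulates network latency with `asyncio.sleep` and is an assumption about the design, not the library's code:

```python
import asyncio

async def fake_fetch(url: str, delay: float = 0.05) -> str:
    # Stand-in for an aiohttp request; sleep simulates network latency.
    await asyncio.sleep(delay)
    return f"body of {url}"

async def fetch_all(urls: list) -> list:
    # All requests start immediately and run concurrently, so total
    # time is roughly max(delay) rather than sum(delay).
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

results = asyncio.run(fetch_all([f"https://example.com/{i}" for i in range(10)]))
```

With 10 URLs at 0.05 s each, a sequential loop would take about 0.5 s; the gathered version finishes in roughly the time of one request.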
## 🔍 Example: Advanced Usage

```python
import asyncio

from fastparser import FastParser

async def process_large_dataset():
    parser = FastParser(extract_pdf=True)

    # Process URLs in batches
    all_urls = ["url1", "url2", ..., "url1000"]
    batch_size = 50
    results = []
    for i in range(0, len(all_urls), batch_size):
        batch = all_urls[i:i + batch_size]
        batch_results = parser.fetch_batch(batch)
        results.extend(batch_results)
    return results

# Run with asyncio
results = asyncio.run(process_large_dataset())
```
## ⚠️ Error Handling
The parser includes robust error handling:
- Failed URL fetches return empty strings
- PDF processing errors are caught gracefully
- HTTP status checks
- Invalid URL format handling
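The "failed fetches return empty strings" behavior can be modeled with a small wrapper. This is a sketch of the pattern, assuming the library swallows exceptions per URL so one bad URL cannot abort a batch; it is not the library's actual code:

```python
def safe_fetch(fetch_fn, url: str) -> str:
    """Call fetch_fn(url), returning "" on any failure.

    Models the documented behavior of returning empty strings for
    failed URLs instead of raising mid-batch.
    """
    try:
        return fetch_fn(url)
    except Exception:
        # Swallow the error so one bad URL cannot abort the batch.
        return ""

def fetch_batch(fetch_fn, urls: list) -> list:
    # Failures show up as "" at the corresponding index, preserving
    # the one-result-per-input-URL alignment.
    return [safe_fetch(fetch_fn, u) for u in urls]
```

Keeping the output list aligned with the input list makes it easy to pair each result back to its URL after a batch run.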
## 🤝 Contributing

Contributions are welcome! Here's how you can help:

- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
## 📝 Dependencies

- `aiohttp`: Async HTTP client/server framework
- `PyPDF2`: PDF processing library
- `tqdm`: Progress bar library
- Custom `FastHTMLParserV3` module
## 📋 TODO
- Add support for more document types
- Implement caching mechanism
- Add timeout configurations
- Improve error reporting
- Add proxy support
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ by [Your Name]
## File details

Details for the file `parselite-0.3.3.tar.gz`.

### File metadata

- Download URL: parselite-0.3.3.tar.gz
- Upload date:
- Size: 5.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.0rc1

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `189984a2ff097f5441bda474f527723bfe421c5fad870f9d27b4c8d3aeed6262` |
| MD5 | `fb30d937e55ec53f0cd9cd93febc3737` |
| BLAKE2b-256 | `9f4e735b1758acd4c36e6e1eb4ae6ccffadb1dd5f065434d92ae878b2efe988d` |
## File details

Details for the file `parselite-0.3.3-py3-none-any.whl`.

### File metadata

- Download URL: parselite-0.3.3-py3-none-any.whl
- Upload date:
- Size: 5.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.0rc1

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `a3bad75961ed201984230783e765a06db49e47a06d003432cbf798156aa91890` |
| MD5 | `1498a8ba4a87c60dea4e5de67fabf8f7` |
| BLAKE2b-256 | `d783184d4542740126639d8e3bad3bee7278d83758e83dee5f521cf661406506` |