A web scraper for LLMs.
Project description
scraper4ai
scraper4ai is a powerful and easy-to-use Python library for web scraping, specifically designed to prepare web content for AI and Large Language Model (LLM) applications. It fetches web pages, cleans the HTML, and converts the main content into clean, structured Markdown. It also extracts valuable data like links, images, and videos. The library is built with asynchronous support from the ground up, allowing for efficient scraping of multiple URLs concurrently.
Features
- AI-Ready Content: Converts messy HTML into clean Markdown, perfect for LLM processing.
- Asynchronous Support: Scrape multiple URLs concurrently with invoke_all for high performance.
- Rich Data Extraction: Extracts not just the main content, but also hyperlinks, images, and video sources.
- JA3/TLS Fingerprint Spoofing: Uses curl_cffi to impersonate real browser profiles (such as Chrome 136), helping to bypass many anti-bot measures.
- Optimized Performance: Session reuse and connection pooling for improved efficiency and reduced overhead.
- Customizable Cleaning: Easily specify which HTML tags or CSS selectors to remove before Markdown conversion.
- Resource Management: Automatic session handling with proper cleanup methods.
- Simple API: Get started in just a few lines of code with an intuitive API.
Installation
pip install scraper4ai
Usage
Basic Usage
Here's a simple example of how to scrape a single URL and get the clean Markdown content.
from scraper4ai import WebScraper
# Initialize the scraper
scraper = WebScraper()
# Scrape a single URL
url = "https://example.com"
result = scraper.invoke(url)
if result.status_code == 200:
    print(result.markdown)
else:
    print(f"Failed to scrape {url}. Status code: {result.status_code}")
Batch Scraping
Use invoke_all to efficiently process a list of URLs concurrently.
from scraper4ai import WebScraper
# Initialize the scraper
scraper = WebScraper()
urls = ["https://www.python.org/", "https://github.com/"]
# Scrape all URLs concurrently
results = scraper.invoke_all(urls)
for result in results:
    if result.status_code == 200:
        print(f"--- Content from {result.url} ---")
        print(result.markdown)
        print("-" * 20)
    else:
        print(f"Failed to scrape {result.url}. Status code: {result.status_code}")
Customizing HTML Cleaning
You can easily remove unwanted HTML tags or elements matching CSS selectors before the content is converted to Markdown.
from scraper4ai import WebScraper
scraper = WebScraper()
# Add custom rules to remove navigation and footer elements
scraper.ignore_these_tags_in_markdown(["nav", "footer"])
# Add custom rule to remove any element with class="cookie-banner"
scraper.ignore_these_css_in_markdown([".cookie-banner"])
# These rules will be applied to all subsequent .invoke() or .invoke_all() calls
result = scraper.invoke("https://example.com")
print(result.markdown)
# Don't forget to close the session when done to free resources
scraper.close()
The ScrapedResult Object
The invoke() method returns a single ScrapedResult object, and invoke_all() returns a list of them. This object holds all the data scraped from a page.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LinkData:
    url: str
    text: Optional[str] = None

@dataclass
class ImageData:
    url: str
    alt_text: Optional[str] = None

@dataclass
class VideoData:
    url: str
    title: Optional[str] = None

@dataclass
class ScrapedResult:
    url: str
    status_code: int
    raw_html: Optional[str]
    markdown: Optional[str]
    links: Optional[List[LinkData]] = field(default_factory=list)
    image_links: Optional[List[ImageData]] = field(default_factory=list)
    video_links: Optional[List[VideoData]] = field(default_factory=list)
- url (str): The original URL that was scraped.
- status_code (int): The HTTP status code of the response. On failure, this will be -1 or the actual error code.
- raw_html (Optional[str]): The original, unmodified HTML content of the page. None on failure.
- markdown (Optional[str]): The cleaned, converted Markdown content. None on failure.
- links (Optional[List[LinkData]]): A list of all hyperlinks found on the page. None on failure.
- image_links (Optional[List[ImageData]]): A list of all images found on the page. None on failure.
- video_links (Optional[List[VideoData]]): A list of all videos found on the page. None on failure.
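For instance, a minimal sketch of walking the extracted link and image data on a successful result, using only the fields defined above:

from scraper4ai import WebScraper

scraper = WebScraper()
result = scraper.invoke("https://example.com")

if result.status_code == 200:
    # Each LinkData pairs a URL with its (optional) anchor text
    for link in result.links:
        print(f"{link.text or '(no text)'} -> {link.url}")
    # Each ImageData pairs an image URL with its (optional) alt text
    for image in result.image_links:
        print(f"image: {image.url} (alt: {image.alt_text})")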
Advanced Features
Browser Impersonation
The library uses Chrome 136 browser fingerprints (via curl_cffi) for broad compatibility and to avoid common anti-bot detection. The impersonation automatically adapts for mobile devices when needed.
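This happens inside the library, but for context, here is a minimal sketch of what curl_cffi-style impersonation looks like on its own. The impersonate value below is illustrative; consult curl_cffi's documentation for the exact browser profiles your version supports.

from curl_cffi import requests

# Send the request with a Chrome TLS/JA3 fingerprint instead of
# Python's default one, which many anti-bot systems flag.
response = requests.get("https://example.com", impersonate="chrome")
print(response.status_code)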
Retry Logic
An intelligent retry mechanism with exponential backoff handles temporary network issues gracefully without overwhelming servers.
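The library's exact retry parameters aren't documented here, but the general pattern is worth seeing. A generic sketch of exponential backoff, where the delays and attempt count are illustrative:

import time

def fetch_with_backoff(fetch, url, max_attempts=3, base_delay=1.0):
    """Call fetch(url), retrying on failure with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay * 2**attempt seconds: 1s, 2s, 4s, ...
            time.sleep(base_delay * (2 ** attempt))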
Error Handling
If the scraper fails to fetch a URL after several retries, it will not raise an exception. Instead, it returns a ScrapedResult object where:
- status_code is set to -1 (or the actual HTTP error status code if one was received).
- raw_html, markdown, and the link lists are set to None.
This design allows you to handle failures gracefully without crashing, especially during batch processing.
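For example, a short sketch of separating successes from failures after a batch run, using only the fields documented above (the second URL is a deliberately unreachable placeholder):

from scraper4ai import WebScraper

scraper = WebScraper()
results = scraper.invoke_all(["https://example.com", "https://does-not-resolve.invalid"])

# Failed fetches come back as ScrapedResult objects, not exceptions,
# so a batch run never crashes partway through.
succeeded = [r for r in results if r.status_code == 200]
failed = [r for r in results if r.status_code != 200]

print(f"{len(succeeded)} succeeded, {len(failed)} failed")
for r in failed:
    print(f"  {r.url}: status {r.status_code}")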
Performance Tips
- Session Reuse: The WebScraper automatically reuses HTTP sessions for better performance when making multiple sequential requests.
- Batch Processing: Use invoke_all() for concurrent processing of multiple URLs with optimized connection pooling.
- Resource Cleanup: Call scraper.close() when finished to properly release session resources.
- Connection Limits: The async session limits concurrent connections to prevent overwhelming target servers.
from scraper4ai import WebScraper
# Create scraper instance
scraper = WebScraper()
# Process multiple URLs efficiently
results = scraper.invoke_all([
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
])
# Clean up resources
scraper.close()
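Since close() should run even when scraping raises, contextlib.closing from the standard library is a convenient guard. A sketch assuming WebScraper exposes close() but not necessarily the context-manager protocol:

from contextlib import closing

from scraper4ai import WebScraper

# closing() calls scraper.close() when the block exits, even on error.
with closing(WebScraper()) as scraper:
    result = scraper.invoke("https://example.com")
    if result.status_code == 200:
        print(result.markdown)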
Project details
Download files
Download the file for your platform: the source distribution (scraper4ai-1.0.1.tar.gz) or the built distribution (scraper4ai-1.0.1-py3-none-any.whl).
File details
Details for the file scraper4ai-1.0.1.tar.gz.
File metadata
- Download URL: scraper4ai-1.0.1.tar.gz
- Size: 6.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c18d5daf446b485150223496e9c7d9d7c6280d9db9de2f3ac0cc8e2e1133e777 |
| MD5 | b3e92d59cab8f27e02a2425275231026 |
| BLAKE2b-256 | 6a958ddabcdf7d428389fbbe84d145bb8a20299389e2418fe13d737562351c76 |
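To check a downloaded archive against the published digests, a short sketch using the standard library's hashlib; the local path is an assumption about where you saved the file:

import hashlib

# Hypothetical local path to the downloaded sdist
path = "scraper4ai-1.0.1.tar.gz"
expected_sha256 = "c18d5daf446b485150223496e9c7d9d7c6280d9db9de2f3ac0cc8e2e1133e777"

with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == expected_sha256 else "MISMATCH")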
File details
Details for the file scraper4ai-1.0.1-py3-none-any.whl.
File metadata
- Download URL: scraper4ai-1.0.1-py3-none-any.whl
- Size: 7.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d99ff621ed744a57c939637d44de0f06115957ae4da13678ccb93608deb7959f |
| MD5 | 4aab46919d30f2325393c10ea3c741b6 |
| BLAKE2b-256 | a26e73398ac640b19f5fd5fd24bf7b6bc5cbd44638f8f03b88ca3cb2fd12386a |