This is a web scraper for LLMs.

Project description

scraper4ai

scraper4ai is a powerful and easy-to-use Python library for web scraping, specifically designed to prepare web content for AI and Large Language Model (LLM) applications. It fetches web pages, cleans the HTML, and converts the main content into clean, structured Markdown. It also extracts valuable data like links, images, and videos. The library is built with asynchronous support from the ground up, allowing for efficient scraping of multiple URLs concurrently.

Features

  • AI-Ready Content: Converts messy HTML into clean Markdown, perfect for LLM processing.
  • Asynchronous Support: Scrape multiple URLs concurrently with invoke_all for high performance.
  • Rich Data Extraction: Extracts not just the main content, but also hyperlinks, images, and video sources.
  • JA3/TLS Fingerprint Spoofing: Uses curl_cffi to impersonate real browser profiles (like Chrome 136), helping to bypass many anti-bot measures.
  • Optimized Performance: Session reuse and connection pooling for improved efficiency and reduced overhead.
  • Customizable Cleaning: Easily specify which HTML tags or CSS selectors to remove before Markdown conversion.
  • Resource Management: Automatic session handling with proper cleanup methods.
  • Simple API: Get started in just a few lines of code with an intuitive API.

Installation

pip install scraper4ai

Usage

Basic Usage

Here's a simple example of how to scrape a single URL and get the clean Markdown content.

from scraper4ai import WebScraper

# Initialize the scraper
scraper = WebScraper()

# Scrape a single URL
url = "https://example.com"
result = scraper.invoke(url)

if result.status_code == 200:
    print(result.markdown)
else:
    print(f"Failed to scrape {url}. Status code: {result.status_code}")

Batch Scraping

Use invoke_all to efficiently process a list of URLs concurrently.

from scraper4ai import WebScraper

# Initialize the scraper
scraper = WebScraper()

urls = ["https://www.python.org/", "https://github.com/"]

# Scrape all URLs concurrently
results = scraper.invoke_all(urls)

for result in results:
    if result.status_code == 200:
        print(f"--- Content from {result.url} ---")
        print(result.markdown)
        print("-" * 20)
    else:
        print(f"Failed to scrape {result.url}. Status code: {result.status_code}")

Customizing HTML Cleaning

You can easily remove unwanted HTML tags or elements matching CSS selectors before the content is converted to Markdown.

from scraper4ai import WebScraper

scraper = WebScraper()

# Add custom rules to remove navigation and footer elements
scraper.ignore_these_tags_in_markdown(["nav", "footer"])
# Add custom rule to remove any element with class="cookie-banner"
scraper.ignore_these_css_in_markdown([".cookie-banner"])

# These rules will be applied to all subsequent .invoke() or .invoke_all() calls
result = scraper.invoke("https://example.com")
print(result.markdown)

# Don't forget to close the session when done to free resources
scraper.close()

The ScrapedResult Object

The invoke() method returns a single ScrapedResult object, and invoke_all() returns a list of them. This object contains all the data scraped from a page:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LinkData:
    url: str
    text: Optional[str] = None

@dataclass
class ImageData:
    url: str
    alt_text: Optional[str] = None

@dataclass
class VideoData:
    url: str
    title: Optional[str] = None

@dataclass
class ScrapedResult:
    url: str
    status_code: int
    raw_html: Optional[str]
    markdown: Optional[str]
    links: Optional[List[LinkData]] = field(default_factory=list)
    image_links: Optional[List[ImageData]] = field(default_factory=list)
    video_links: Optional[List[VideoData]] = field(default_factory=list)

  • url (str): The original URL that was scraped.
  • status_code (int): The HTTP status code of the response. On failure, this will be -1 or the actual error code.
  • raw_html (Optional[str]): The original, unmodified HTML content of the page. None on failure.
  • markdown (Optional[str]): The cleaned, converted Markdown content. None on failure.
  • links (Optional[List[LinkData]]): A list of all hyperlinks found on the page. None on failure.
  • image_links (Optional[List[ImageData]]): A list of all images found on the page. None on failure.
  • video_links (Optional[List[VideoData]]): A list of all videos found on the page. None on failure.
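
For example, the extracted link and image lists can be consumed directly. The snippet below uses only the fields defined above:

from scraper4ai import WebScraper

scraper = WebScraper()
result = scraper.invoke("https://example.com")

if result.status_code == 200:
    # Each LinkData has .url and an optional .text
    for link in result.links:
        print(f"{link.text or '(no text)'} -> {link.url}")
    # Each ImageData has .url and an optional .alt_text
    for image in result.image_links:
        print(f"Image: {image.url} (alt: {image.alt_text})")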

Advanced Features

Browser Impersonation

The library uses up-to-date Chrome 136 browser fingerprints for maximum compatibility and to avoid common anti-bot detection. Impersonation automatically adapts for mobile devices when needed.
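
Under the hood this builds on curl_cffi's impersonation feature. For illustration, here is roughly what a raw curl_cffi request with a Chrome fingerprint looks like (this is the underlying library, not scraper4ai's public API):

from curl_cffi import requests

# Fetch a page while presenting Chrome's JA3/TLS fingerprint;
# "chrome" targets the latest Chrome profile curl_cffi supports
response = requests.get("https://example.com", impersonate="chrome")
print(response.status_code)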

Retry Logic

An intelligent retry mechanism with exponential backoff handles temporary network issues gracefully without overwhelming servers.
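
The library does not document its exact retry parameters, but exponential backoff generally works like the following sketch (the delays, attempt count, and fetch callable here are illustrative assumptions, not scraper4ai internals):

import time

def fetch_with_backoff(fetch, url, max_attempts=3, base_delay=1.0):
    """Retry fetch(url) with exponentially growing delays: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # back off before retrying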

Error Handling

If the scraper fails to fetch a URL after several retries, it will not raise an exception. Instead, it returns a ScrapedResult object where:

  • status_code is set to -1 (or the actual HTTP error status code if one was received).
  • raw_html, markdown, and the link lists are set to None.

This design allows you to handle failures gracefully without crashing, especially during batch processing.
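
In practice, this means batch results can be partitioned without try/except blocks, for example:

from scraper4ai import WebScraper

scraper = WebScraper()
results = scraper.invoke_all(["https://example.com", "https://does-not-exist.invalid"])

succeeded = [r for r in results if r.status_code == 200]
failed = [r for r in results if r.status_code != 200]  # includes -1 for network errors

for r in failed:
    print(f"Could not scrape {r.url} (status {r.status_code})")
scraper.close()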

Performance Tips

  • Session Reuse: The WebScraper automatically reuses HTTP sessions for better performance when making multiple sequential requests.
  • Batch Processing: Use invoke_all() for concurrent processing of multiple URLs with optimized connection pooling.
  • Resource Cleanup: Call scraper.close() when finished to properly release session resources.
  • Connection Limits: The async session caps concurrent connections to avoid overwhelming target servers. A generic sketch of this technique follows the example below.

Putting these tips together:

from scraper4ai import WebScraper

# Create scraper instance
scraper = WebScraper()

# Process multiple URLs efficiently
results = scraper.invoke_all([
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
])

# Clean up resources
scraper.close()
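
To illustrate the connection-limiting idea, here is a generic asyncio sketch of capping concurrent fetches with a semaphore. This shows the technique in general, not scraper4ai's internal code; fetch_one is a hypothetical coroutine standing in for a real HTTP call:

import asyncio

MAX_CONCURRENT = 10  # illustrative cap, not the library's actual limit

async def fetch_one(url: str) -> str:
    # Hypothetical fetch coroutine standing in for a real HTTP request
    await asyncio.sleep(0.1)
    return f"content of {url}"

async def fetch_all(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def bounded(url):
        async with semaphore:  # at most MAX_CONCURRENT fetches run at once
            return await fetch_one(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(fetch_all([f"https://example.com/{i}" for i in range(50)]))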

Download files

Download the file for your platform.

Source Distribution

scraper4ai-1.0.1.tar.gz (6.5 kB)

Built Distribution

scraper4ai-1.0.1-py3-none-any.whl (7.0 kB)

File details

Details for the file scraper4ai-1.0.1.tar.gz.

File metadata

  • Download URL: scraper4ai-1.0.1.tar.gz
  • Upload date:
  • Size: 6.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for scraper4ai-1.0.1.tar.gz:

  • SHA256: c18d5daf446b485150223496e9c7d9d7c6280d9db9de2f3ac0cc8e2e1133e777
  • MD5: b3e92d59cab8f27e02a2425275231026
  • BLAKE2b-256: 6a958ddabcdf7d428389fbbe84d145bb8a20299389e2418fe13d737562351c76

File details

Details for the file scraper4ai-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: scraper4ai-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 7.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for scraper4ai-1.0.1-py3-none-any.whl:

  • SHA256: d99ff621ed744a57c939637d44de0f06115957ae4da13678ccb93608deb7959f
  • MD5: 4aab46919d30f2325393c10ea3c741b6
  • BLAKE2b-256: a26e73398ac640b19f5fd5fd24bf7b6bc5cbd44638f8f03b88ca3cb2fd12386a
