🏷️ DejavuScraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
DejavuScraper is an intelligent web scraping library that automatically learns scraping rules from examples. Provide a sample of what you want to extract, and it figures out how to get similar data from any page, even pages with completely different HTML structures.
No LLM needed. No per-page training. Learn once, extract everywhere.
📋 Table of Contents
- Features
- Installation
- Quick Start
- Core API
- Smart Extractors
- Save & Load
- Rule Management
- Production Features
- How It Works
- Testing
- Project Structure
- License
✨ Features
| Feature | Description |
|---|---|
| 🎯 Smart Learning | Automatically learns scraping rules from examples |
| 🌐 Cross-Site Extraction | get_result_flexible() extracts from pages with completely different HTML using content fingerprinting |
| 🔄 Adaptive Extraction | Relocates elements even after website structure changes using 8-dimension similarity matching |
| 📦 Grouped Extraction | Extract multiple related fields per item (name + price + rating) |
| 🔍 Smart Extractors | Built-in extraction for emails, phones, prices, dates, tables, JSON-LD, pagination, and more |
| 💾 Save/Load Models | JSON and SQLite with atomic writes: learn once, deploy anywhere |
| 🎭 Fuzzy Matching | Approximate text matching with configurable ratio |
| 🔗 URL Extraction | Automatically extracts href/src attributes |
| ⚡ Production Ready | Rate limiting, retry with backoff, connection pooling, streaming responses, 10MB size limit |
| 🛡️ Robust | Context manager support, graceful error handling, SHA256 fingerprints |
📦 Installation
# Clone the repository
git clone https://github.com/Yukendiran2002/dejavu_scraper.git
cd dejavu_scraper
# Install dependencies
pip install requests
BeautifulSoup4 is bundled; no additional installation is needed.
🚀 Quick Start
Basic Example
from dejavu_scraper import DejavuScraper
html = """
<div class="products">
<div class="product"><h2>iPhone 15</h2><span>$999</span></div>
<div class="product"><h2>Samsung Galaxy</h2><span>$899</span></div>
<div class="product"><h2>Google Pixel</h2><span>$699</span></div>
</div>
"""
scraper = DejavuScraper()
result = scraper.build(html=html, wanted_list=['iPhone 15'])
print(result)
# ['iPhone 15', 'Samsung Galaxy', 'Google Pixel']
From URL
scraper = DejavuScraper()
result = scraper.build(
url='https://example.com/products',
wanted_list=['Product Name Example']
)
Cross-Site Extraction (No LLM Needed)
# Learn product names from Site A
scraper = DejavuScraper()
scraper.build(html=site_a_html, wanted_list=['iPhone 15'])
# Extract from Site B with COMPLETELY different HTML tags
results = scraper.get_result_flexible(html=site_b_html)
# Works! Finds product names even though tags are different
📚 Core API
Constructor
DejavuScraper(
stack_list=None, # Pre-existing rules to use
adaptive=False, # Enable adaptive extraction
min_similarity=0.5, # Minimum similarity for adaptive matching (0-1)
rate_limit=0, # Minimum seconds between requests per domain (0=no limit)
max_retries=3, # Retries on transient HTTP errors (5xx, timeouts)
retry_backoff=1.0 # Backoff factor for retries (seconds)
)
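For example, a production-leaning setup using only the parameters above:

scraper = DejavuScraper(
    adaptive=True,       # fall back to similarity matching if tag rules break
    min_similarity=0.6,  # stricter than the 0.5 default
    rate_limit=1,        # at most one request per second per domain
    max_retries=5,
    retry_backoff=2.0
)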
build()
Learn extraction rules from examples.
scraper.build(
url=None, # URL to scrape
wanted_list=None, # List of example strings to find
wanted_dict=None, # Dict with aliases: {'title': 'iPhone 15'}
html=None, # HTML string (alternative to URL)
request_args=None, # Additional request parameters
update=False, # True = add to existing rules, False = replace
text_fuzz_ratio=1.0 # Fuzzy matching ratio (0-1, 1=exact)
)
# Returns: list of all matching results
With aliases:
scraper.build(
html=html,
wanted_dict={'product_name': 'iPhone 15', 'price': '$999'}
)
get_result_similar()
Extract data from new pages using learned tag-based rules. Works on pages with the same HTML structure.
scraper.get_result_similar(
url=None,
html=None,
request_args=None,
grouped=False, # Group results by rule
group_by_alias=False, # Group by alias name
unique=True # Remove duplicates
)
# Returns: list of extracted values
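A typical learn-once, extract-many flow (the URLs here are placeholders):

from dejavu_scraper import DejavuScraper

scraper = DejavuScraper()
scraper.build(url='https://example.com/products?page=1', wanted_list=['Example Product'])

# Replay the learned rules on structurally identical pages
page2 = scraper.get_result_similar(url='https://example.com/products?page=2')
page3 = scraper.get_result_similar(url='https://example.com/products?page=3')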
get_result_flexible(): Cross-Site Extraction
Extract data from pages with completely different HTML structures. Uses content fingerprinting instead of tag-based rules.
scraper.get_result_flexible(
url=None,
html=None,
request_args=None,
min_score=0.4, # Minimum fingerprint match score (0-1)
unique=True
)
# Returns: list of extracted values
How it works:
- During build(), alongside tag-based rules, a content fingerprint is created capturing the data shape: text length, word count, digit ratio, alpha ratio, currency presence, and content type (text/price/date/url/code)
- get_result_flexible() scans ALL text nodes on the new page and scores them against the fingerprint
- Matches by what the data looks like, not what HTML tag wraps it
# Site A: <h2 class="title">Sony Headphones</h2>
# Site B: <span data-name>Sony Headphones</span>
# Site C: <p id="prod">Sony Headphones</p>
# Learn from Site A, extract from B and C: all work!
scraper = DejavuScraper()
scraper.build(html=site_a, wanted_list=['Sony Headphones'])
scraper.get_result_flexible(html=site_b)  # ✅ Found
scraper.get_result_flexible(html=site_c)  # ✅ Found
get_result_exact()
Get results grouped by rule, with optional alias grouping.
scraper.get_result_exact(
html=html,
group_by_alias=True
)
# Returns: {'title': [...], 'price': [...]}
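For example, building with aliases first (using the Quick Start HTML):

scraper = DejavuScraper()
scraper.build(html=html, wanted_dict={'title': 'iPhone 15', 'price': '$999'})

result = scraper.get_result_exact(html=html, group_by_alias=True)
# Expected shape: {'title': ['iPhone 15', ...], 'price': ['$999', ...]}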
get_result()
Generic extraction method that combines all result types.
scraper.get_result(
url=None, html=None,
grouped=False,
group_by_alias=False,
unique=True
)
Grouped Extraction
Extract multiple related fields per item (e.g., name + price + rating):
build_grouped()
scraper.build_grouped(
html=html,
wanted_list=[
['iPhone 15', '$999'], # Group 1: [field1, field2]
['Samsung Galaxy', '$899'] # Group 2
]
)
get_result_grouped()
results = scraper.get_result_grouped(html=new_html)
# [['iPhone 15', '$999'], ['Samsung Galaxy', '$899'], ['Google Pixel', '$699']]
Handles missing fields gracefully, returning None for fields not found, as in the sketch below.
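A minimal illustration (page_missing_price is a hypothetical page where the second product has no price):

results = scraper.get_result_grouped(html=page_missing_price)
# e.g. [['iPhone 15', '$999'], ['Mystery Phone', None]]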
Adaptive Extraction
Survives website structure changes using 8-dimension weighted similarity matching.
scraper = DejavuScraper(adaptive=True, min_similarity=0.5)
scraper.build(html=original_html, wanted_list=['Breaking News'])
# Website changes its HTML structure...
results = scraper.get_result_similar(html=changed_html)
# Still finds the element!
Similarity dimensions:
| Dimension | Weight |
|---|---|
| Tag name | 15% |
| Attributes | 20% |
| Text content | 15% |
| DOM path | 10% |
| Parent element | 15% |
| Grandparent element | 10% |
| Children structure | 10% |
| Special attributes | 5% |
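To make the weighting concrete, here is a conceptual sketch of the scoring, not the library's internal code; the per-dimension scores are assumed to be precomputed values in [0, 1]:

# Conceptual sketch of 8-dimension weighted similarity (illustrative only).
WEIGHTS = {
    'tag': 0.15, 'attributes': 0.20, 'text': 0.15, 'dom_path': 0.10,
    'parent': 0.15, 'grandparent': 0.10, 'children': 0.10, 'special': 0.05,
}

def weighted_similarity(scores):
    """Combine per-dimension scores (0-1 each) into one similarity value."""
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)

# A candidate that matches everywhere except its attributes:
print(weighted_similarity({
    'tag': 1.0, 'attributes': 0.2, 'text': 0.9, 'dom_path': 0.6,
    'parent': 1.0, 'grandparent': 1.0, 'children': 0.8, 'special': 0.0,
}))  # ~0.715, above the default min_similarity of 0.5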
Direct Adaptive API
# Save element fingerprint
scraper.adaptive_save(element, identifier='product_title', url='https://...')
# Find element by fingerprint on changed page
element = scraper.adaptive_find(soup, identifier='product_title')
# Build with adaptive tracking
scraper.adaptive_build(html=html, wanted_list=['...'])
# Get results with adaptive fallback
results = scraper.get_result_adaptive(html=html)
# Find similar elements
similar = scraper.find_similar_elements(html=html, element=target)
🔍 Smart Extractors
Built-in extraction for common data types; no rules needed.
Data Type Extraction
scraper = DejavuScraper()
# Extract all data types at once
data = scraper.extract_data_types(html=html)
# {'email': [...], 'phone': [...], 'price': [...], 'date': [...], ...}
# Or extract specific types
emails = scraper.extract_emails(html=html)
phones = scraper.extract_phones(html=html)
prices = scraper.extract_prices(html=html, parse=True) # parse=True returns floats
Supported data types:
| Type | Example |
|---|---|
| email | user@example.com |
| phone | +1-555-123-4567, (800) 555-9999 |
| price | $999.99, €49.00, £29.99 |
| date | 2025-01-15, January 15, 2025, 03/15/2025 |
| url | https://www.example.com |
| number | 42, 3.14 |
| rating | 4.5 stars |
| percentage | 25%, 99.9% |
| mention | @username |
| hashtag | #trending |
| ip_address | 192.168.1.1 |
| time | 14:30, 2:30 PM |
Table Parsing
# Extract all tables
tables = scraper.extract_tables(html=html, as_dicts=True)
# [{'headers': ['Name', 'Price'], 'rows': [{'Name': 'iPhone', 'Price': '$999'}, ...]}]
# Export table to CSV
csv_string = scraper.extract_table_to_csv(html=html, table_index=0)
Pattern Detection
Automatically detect repeated patterns in HTML:
# Detect lists (ul/ol, repeated siblings)
lists = scraper.detect_lists(html=html, min_items=3)
# Detect card-like layouts
cards = scraper.detect_cards(html=html, min_items=3)
# Auto-detect all patterns
patterns = scraper.auto_detect_patterns(html=html, min_occurrences=3)
# {'lists': [...], 'tables': [...], 'cards': [...]}
Structured Data (JSON-LD, Meta Tags, Microdata)
# Extract all structured data
data = scraper.extract_structured_data(html=html)
# {'json_ld': [...], 'meta_tags': {...}, 'microdata': [...]}
# Or individually
json_ld = scraper.extract_json_ld(html=html)
meta = scraper.extract_meta_tags(html=html)
# {'standard': {...}, 'og': {...}, 'twitter': {...}, 'other': {...}}
Pagination Detection
pagination = scraper.detect_pagination(html=html)
# {
# 'next': '/page/3',
# 'prev': '/page/1',
# 'current': 2,
# 'pages': [{'number': 1, 'url': '/page/1'}, ...]
# }
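A common pattern is to follow the next link until it runs out. A minimal sketch, assuming a scraper that has already been built, using requests directly for fetching (the start URL is a placeholder):

import requests
from urllib.parse import urljoin

url = 'https://example.com/products'  # placeholder start page
all_results = []
while url:
    html = requests.get(url, timeout=10).text
    all_results.extend(scraper.get_result_similar(html=html))
    pagination = scraper.detect_pagination(html=html)
    next_path = pagination.get('next') if pagination else None
    url = urljoin(url, next_path) if next_path else None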
Regex Extraction
# Simple regex
prices = scraper.extract_with_regex(r'\$[\d,]+\.?\d*', html=html)
# Named groups
matches = scraper.extract_with_named_groups(
r'(?P<name>\w+)@(?P<domain>[\w.]+)',
html=html
)
# [{'name': 'john', 'domain': 'example.com'}, ...]
Text Cleaning
# Clean text
clean = scraper.clean_text(" Hello World & entities ", lowercase=True)
# Extract all visible text from HTML, cleaned
text = scraper.extract_clean_text(html=html)
smart_extract(): All-in-One
Run all extractors at once:
everything = scraper.smart_extract(html=html)
# {
# 'data_types': {'email': [...], 'price': [...], ...},
# 'patterns': {'lists': [...], 'tables': [...], 'cards': [...]},
# 'structured_data': {'json_ld': [...], 'meta_tags': {...}, 'microdata': [...]},
# 'pagination': {'next': ..., 'prev': ..., 'pages': [...]}
# }
💾 Save & Load
Save learned rules and fingerprints for reuse. Both formats support content fingerprints and adaptive data.
JSON
scraper.save('model.json')
new_scraper = DejavuScraper()
new_scraper.load('model.json')
SQLite
scraper.save('model.db')
new_scraper = DejavuScraper()
new_scraper.load('model.db')
Both formats use atomic writes (temp file + rename), so a crash cannot corrupt saved data.
The format is auto-detected from the file extension, or can be specified explicitly:
scraper.save('model', format='json') # or 'db', 'sqlite'
🔧 Rule Management
# View all rules with their IDs
rules = scraper.stack_list
for rule in rules:
print(rule['stack_id'], rule.get('alias'))
# Keep only specific rules
scraper.keep_rules(['rule_abc123', 'rule_def456'])
# Remove specific rules
scraper.remove_rules(['rule_abc123'])
# Set friendly aliases
scraper.set_rule_aliases({
'rule_abc123': 'product_title',
'rule_def456': 'product_price'
})
# Generate reusable Python code from current rules
code = scraper.generate_python_code()
print(code)
🏭 Production Features
Rate Limiting
scraper = DejavuScraper(rate_limit=2) # 2 seconds between requests per domain
Retry with Exponential Backoff
scraper = DejavuScraper(max_retries=3, retry_backoff=1.0)
# Retries: 1s, 2s, 4s on 5xx errors and timeouts
Connection Pooling
Sessions are reused with requests.Session (pool_maxsize=10) for better performance across multiple requests.
Streaming Responses
HTML is downloaded in chunks (64KB) with a 10MB hard limit, preventing memory bombs from huge pages.
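This is the standard requests streaming pattern; a simplified sketch of the general idea (not the library's exact internals):

import requests

CHUNK = 64 * 1024              # 64KB chunks
MAX_BYTES = 10 * 1024 * 1024   # 10MB hard limit

def fetch_bounded(url):
    """Download a page in chunks, aborting once the size limit is exceeded."""
    buf = bytearray()
    with requests.get(url, stream=True, timeout=10) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=CHUNK):
            buf.extend(chunk)
            if len(buf) > MAX_BYTES:
                raise ValueError('Response exceeded 10MB limit')
        return buf.decode(resp.encoding or 'utf-8', errors='replace')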
Custom Headers
scraper = DejavuScraper()
scraper.request_headers = {
'User-Agent': 'Custom Agent',
'Accept-Language': 'en-US'
}
result = scraper.build(
url='https://example.com',
request_args={'timeout': 10, 'verify': False}
)
Context Manager
with DejavuScraper(rate_limit=1) as scraper:
scraper.build(url='https://example.com', wanted_list=['data'])
results = scraper.get_result_similar(url='https://example.com/page2')
# Session automatically closed
Or manually:
scraper = DejavuScraper()
# ... use scraper ...
scraper.close() # Closes the requests session
🧠 How It Works
1. Tag-Based Rules (build → get_result_similar)
BUILD: Parse HTML → Find elements matching wanted text → Store tag/attribute rules
EXTRACT: Replay rules on new page → Find elements with same tags/attributes
Each rule stores: tag name, attributes (class, id, etc.), parent chain, and what to extract (text or href/src).
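As an illustration, a stored rule might conceptually look like this (field names are illustrative, not the exact internal schema):

# Illustrative shape of a learned rule -- not the exact internal schema.
rule = {
    'stack_id': 'rule_abc123',                   # ID used by keep_rules()/remove_rules()
    'alias': 'product_name',                     # optional friendly name
    'tag': 'h2',                                 # tag of the matched element
    'attrs': {'class': ['title']},               # attributes to match
    'parents': ['div.product', 'div.products'],  # parent chain
    'extract': 'text',                           # text, or href/src for URLs
}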
2. Content Fingerprinting (build → get_result_flexible)
BUILD: Analyze matched text → Capture data shape (length, word count, digit ratio,
alpha ratio, currency, content type) → Store fingerprint
EXTRACT: Scan ALL text nodes on new page → Score each against fingerprint → Return matches
This is what enables cross-site extraction: no LLM, no model, just pattern matching on data shape.
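As a conceptual sketch, a fingerprint and a toy scorer might look like this (illustrative only; the real fields and scoring live inside the library):

# Illustrative fingerprint for the learned example 'iPhone 15'.
fingerprint = {
    'text_length': 9,
    'word_count': 2,
    'digit_ratio': 2 / 9,    # '1' and '5' out of 9 characters
    'alpha_ratio': 6 / 9,    # 'iPhone' out of 9 characters
    'has_currency': False,
    'content_type': 'text',  # text / price / date / url / code
}

def shape_score(candidate, fp):
    """Toy scorer: how closely the candidate's shape matches the fingerprint."""
    n = max(len(candidate), 1)
    digit_ratio = sum(c.isdigit() for c in candidate) / n
    alpha_ratio = sum(c.isalpha() for c in candidate) / n
    length = min(n, fp['text_length']) / max(n, fp['text_length'])
    return (length
            + (1 - abs(digit_ratio - fp['digit_ratio']))
            + (1 - abs(alpha_ratio - fp['alpha_ratio']))) / 3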
3. Adaptive Matching (build with adaptive=True)
BUILD: Fingerprint each element across 8 dimensions (tag, attributes, text, DOM path,
parent, grandparent, children, special attrs) → Store signature
EXTRACT: When tag rules fail → Compare all candidates using weighted similarity
→ Return best match above min_similarity threshold
🧪 Testing
# Run all test suites
python test_all.py # 16 core tests
python test_comprehensive.py # 61 tests across 15 categories
python test_extractors.py # 40 tests for smart extractors
python test_cross_site.py # 25 cross-site flexible extraction tests
# Total: 142 tests, 100% pass rate
Test coverage includes:
- Core build & extraction
- Same-site and cross-site extraction
- Grouped extraction with missing fields
- Save/Load JSON and SQLite (with fingerprints)
- Adaptive extraction with structure changes
- Fuzzy matching
- URL extraction
- Rule management (keep, remove, aliases)
- All 7 smart extractor classes
- Edge cases and large data handling
- Content fingerprint quality validation
📁 Project Structure
dejavu_scraper/
├── dejavu_scraper/
│   ├── __init__.py              # Package exports (v2.0.0)
│   ├── dejavu_scraper.py        # Main DejavuScraper class (~3000 lines)
│   ├── extractors.py            # 7 smart extractor classes (~1100 lines)
│   ├── adaptive_storage.py      # Thread-safe element fingerprint storage
│   ├── adaptive_matcher.py      # 8-dimension weighted similarity matching
│   ├── utils.py                 # ResultItem, FuzzyText helpers
│   └── beautifulsoup4/          # Bundled BeautifulSoup4
├── test_all.py                  # 16 core tests
├── test_comprehensive.py        # 61 tests (15 categories)
├── test_extractors.py           # 40 extractor tests
├── test_cross_site.py           # 25 cross-site tests
├── DOCUMENTATION.md             # Detailed API documentation
├── ADAPTIVE_GUIDE.md            # Adaptive extraction deep-dive
├── TEST_DOCUMENTATION.md        # Test suite documentation
├── pyproject.toml               # Package configuration
├── requirements.txt             # Dependencies (requests>=2.25.0)
├── LICENSE                      # MIT License
└── README.md                    # This file
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📄 License
This project is licensed under the MIT License. See LICENSE for details.
🙏 Acknowledgments
- Original AutoScraper by Alireza Mika
- Adaptive extraction inspired by Scrapling
- BeautifulSoup4 for HTML parsing