
A Smart, Automatic, Fast and Lightweight Web Scraper for Python with adaptive extraction and cross-site flexible matching.

Project description

🕷️ DejavuScraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

Python 3.8+ | License: MIT | Version 2.0.0

DejavuScraper is an intelligent web scraping library that automatically learns scraping rules from examples. Provide a sample of what you want to extract, and it figures out how to get similar data from any page, even pages with completely different HTML structures.

No LLM needed. No per-page training. Learn once, extract everywhere.


✨ Features

  • 🎯 Smart Learning: Automatically learns scraping rules from examples
  • 🌐 Cross-Site Extraction: get_result_flexible() extracts from pages with completely different HTML using content fingerprinting
  • 🔄 Adaptive Extraction: Relocates elements even after website structure changes using 8-dimension similarity matching
  • 📦 Grouped Extraction: Extract multiple related fields per item (name + price + rating)
  • 🔍 Smart Extractors: Built-in extraction for emails, phones, prices, dates, tables, JSON-LD, pagination, and more
  • 💾 Save/Load Models: JSON and SQLite with atomic writes; learn once, deploy anywhere
  • 🎭 Fuzzy Matching: Approximate text matching with a configurable ratio
  • 🔗 URL Extraction: Automatically extracts href/src attributes
  • ⚡ Production Ready: Rate limiting, retry with backoff, connection pooling, streaming responses, 10MB size limit
  • 🛡️ Robust: Context manager support, graceful error handling, SHA256 fingerprints

📦 Installation

# Clone the repository
git clone https://github.com/Yukendiran2002/dejavu_scraper.git
cd dejavu_scraper

# Install dependencies
pip install requests

BeautifulSoup4 is bundled; no additional installation needed.


🚀 Quick Start

Basic Example

from dejavu_scraper import DejavuScraper

html = """
<div class="products">
  <div class="product"><h2>iPhone 15</h2><span>$999</span></div>
  <div class="product"><h2>Samsung Galaxy</h2><span>$899</span></div>
  <div class="product"><h2>Google Pixel</h2><span>$699</span></div>
</div>
"""

scraper = DejavuScraper()
result = scraper.build(html=html, wanted_list=['iPhone 15'])
print(result)
# ['iPhone 15', 'Samsung Galaxy', 'Google Pixel']

From URL

scraper = DejavuScraper()
result = scraper.build(
    url='https://example.com/products',
    wanted_list=['Product Name Example']
)

Cross-Site Extraction (No LLM Needed)

# Learn product names from Site A
scraper = DejavuScraper()
scraper.build(html=site_a_html, wanted_list=['iPhone 15'])

# Extract from Site B with COMPLETELY different HTML tags
results = scraper.get_result_flexible(html=site_b_html)
# Works! Finds product names even though tags are different

📚 Core API

Constructor

DejavuScraper(
    stack_list=None,      # Pre-existing rules to use
    adaptive=False,       # Enable adaptive extraction
    min_similarity=0.5,   # Minimum similarity for adaptive matching (0-1)
    rate_limit=0,         # Minimum seconds between requests per domain (0=no limit)
    max_retries=3,        # Retries on transient HTTP errors (5xx, timeouts)
    retry_backoff=1.0     # Backoff factor for retries (seconds)
)

build()

Learn extraction rules from examples.

scraper.build(
    url=None,              # URL to scrape
    wanted_list=None,      # List of example strings to find
    wanted_dict=None,      # Dict with aliases: {'title': 'iPhone 15'}
    html=None,             # HTML string (alternative to URL)
    request_args=None,     # Additional request parameters
    update=False,          # True = add to existing rules, False = replace
    text_fuzz_ratio=1.0    # Fuzzy matching ratio (0-1, 1=exact)
)
# Returns: list of all matching results

With aliases:

scraper.build(
    html=html,
    wanted_dict={'product_name': 'iPhone 15', 'price': '$999'}
)

get_result_similar()

Extract data from new pages using learned tag-based rules. Works on pages with the same HTML structure.

scraper.get_result_similar(
    url=None,
    html=None,
    request_args=None,
    grouped=False,          # Group results by rule
    group_by_alias=False,   # Group by alias name
    unique=True             # Remove duplicates
)
# Returns: list of extracted values

get_result_flexible() – Cross-Site Extraction

Extract data from pages with completely different HTML structures. Uses content fingerprinting instead of tag-based rules.

scraper.get_result_flexible(
    url=None,
    html=None,
    request_args=None,
    min_score=0.4,          # Minimum fingerprint match score (0-1)
    unique=True
)
# Returns: list of extracted values

How it works:

  1. During build(), alongside tag-based rules, a content fingerprint is created capturing the data shape: text length, word count, digit ratio, alpha ratio, currency presence, content type (text/price/date/url/code)
  2. get_result_flexible() scans ALL text nodes on the new page and scores them against the fingerprint
  3. Matches by what the data looks like, not what HTML tag wraps it

# Site A: <h2 class="title">Sony Headphones</h2>
# Site B: <span data-name>Sony Headphones</span>
# Site C: <p id="prod">Sony Headphones</p>

# Learn from Site A, extract from B and C โ€” all work!
scraper = DejavuScraper()
scraper.build(html=site_a, wanted_list=['Sony Headphones'])
scraper.get_result_flexible(html=site_b)  # ✅ Found
scraper.get_result_flexible(html=site_c)  # ✅ Found

get_result_exact()

Get results grouped by rule, with optional alias grouping.

scraper.get_result_exact(
    html=html,
    group_by_alias=True
)
# Returns: {'title': [...], 'price': [...]}

get_result()

Generic extraction method that combines all result types.

scraper.get_result(
    url=None, html=None,
    grouped=False,
    group_by_alias=False,
    unique=True
)

Grouped Extraction

Extract multiple related fields per item (e.g., name + price + rating):

build_grouped()

scraper.build_grouped(
    html=html,
    wanted_list=[
        ['iPhone 15', '$999'],      # Group 1: [field1, field2]
        ['Samsung Galaxy', '$899']   # Group 2
    ]
)

get_result_grouped()

results = scraper.get_result_grouped(html=new_html)
# [['iPhone 15', '$999'], ['Samsung Galaxy', '$899'], ['Google Pixel', '$699']]

Handles missing fields gracefully: returns None for fields not found.


Adaptive Extraction

Survives website structure changes using 8-dimension weighted similarity matching.

scraper = DejavuScraper(adaptive=True, min_similarity=0.5)
scraper.build(html=original_html, wanted_list=['Breaking News'])

# Website changes its HTML structure...
results = scraper.get_result_similar(html=changed_html)
# Still finds the element!

Similarity dimensions:

  • Tag name: 15%
  • Attributes: 20%
  • Text content: 15%
  • DOM path: 10%
  • Parent element: 15%
  • Grandparent element: 10%
  • Children structure: 10%
  • Special attributes: 5%
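
Combined, these weights produce a single 0-1 score. A minimal sketch of that combination (illustrative only: the dimension keys and example scores below are assumptions, not DejavuScraper internals):

# Sketch: combine per-dimension similarity scores (each 0-1) using the weights above.
WEIGHTS = {
    'tag': 0.15, 'attributes': 0.20, 'text': 0.15, 'dom_path': 0.10,
    'parent': 0.15, 'grandparent': 0.10, 'children': 0.10, 'special_attrs': 0.05,
}

def weighted_similarity(scores):
    """scores maps each dimension name to a 0-1 similarity value."""
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)

# A candidate whose tag and ancestors still match but whose DOM path changed:
candidate = {'tag': 1.0, 'attributes': 0.8, 'text': 0.9, 'dom_path': 0.0,
             'parent': 1.0, 'grandparent': 1.0, 'children': 0.5, 'special_attrs': 0.0}
print(round(weighted_similarity(candidate), 3))  # 0.745 -> accepted with min_similarity=0.5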

Direct Adaptive API

# Save element fingerprint
scraper.adaptive_save(element, identifier='product_title', url='https://...')

# Find element by fingerprint on changed page
element = scraper.adaptive_find(soup, identifier='product_title')

# Build with adaptive tracking
scraper.adaptive_build(html=html, wanted_list=['...'])

# Get results with adaptive fallback
results = scraper.get_result_adaptive(html=html)

# Find similar elements
similar = scraper.find_similar_elements(html=html, element=target)

🔍 Smart Extractors

Built-in extraction for common data types; no rules needed.

Data Type Extraction

scraper = DejavuScraper()

# Extract all data types at once
data = scraper.extract_data_types(html=html)
# {'email': [...], 'phone': [...], 'price': [...], 'date': [...], ...}

# Or extract specific types
emails = scraper.extract_emails(html=html)
phones = scraper.extract_phones(html=html)
prices = scraper.extract_prices(html=html, parse=True)  # parse=True returns floats

Supported data types:

  • email: user@example.com
  • phone: +1-555-123-4567, (800) 555-9999
  • price: $999.99, €49.00, £29.99
  • date: 2025-01-15, January 15, 2025, 03/15/2025
  • url: https://www.example.com
  • number: 42, 3.14
  • rating: 4.5 stars
  • percentage: 25%, 99.9%
  • mention: @username
  • hashtag: #trending
  • ip_address: 192.168.1.1
  • time: 14:30, 2:30 PM

Table Parsing

# Extract all tables
tables = scraper.extract_tables(html=html, as_dicts=True)
# [{'headers': ['Name', 'Price'], 'rows': [{'Name': 'iPhone', 'Price': '$999'}, ...]}]

# Export table to CSV
csv_string = scraper.extract_table_to_csv(html=html, table_index=0)

Pattern Detection

Automatically detect repeated patterns in HTML:

# Detect lists (ul/ol, repeated siblings)
lists = scraper.detect_lists(html=html, min_items=3)

# Detect card-like layouts
cards = scraper.detect_cards(html=html, min_items=3)

# Auto-detect all patterns
patterns = scraper.auto_detect_patterns(html=html, min_occurrences=3)
# {'lists': [...], 'tables': [...], 'cards': [...]}

Structured Data (JSON-LD, Meta Tags, Microdata)

# Extract all structured data
data = scraper.extract_structured_data(html=html)
# {'json_ld': [...], 'meta_tags': {...}, 'microdata': [...]}

# Or individually
json_ld = scraper.extract_json_ld(html=html)
meta = scraper.extract_meta_tags(html=html)
# {'standard': {...}, 'og': {...}, 'twitter': {...}, 'other': {...}}

Pagination Detection

pagination = scraper.detect_pagination(html=html)
# {
#   'next': '/page/3',
#   'prev': '/page/1',
#   'current': 2,
#   'pages': [{'number': 1, 'url': '/page/1'}, ...]
# }

Regex Extraction

# Simple regex
prices = scraper.extract_with_regex(r'\$[\d,]+\.?\d*', html=html)

# Named groups
matches = scraper.extract_with_named_groups(
    r'(?P<name>\w+)@(?P<domain>[\w.]+)',
    html=html
)
# [{'name': 'john', 'domain': 'example.com'}, ...]

Text Cleaning

# Clean text
clean = scraper.clean_text("  Hello   World  &amp; entities  ", lowercase=True)

# Extract all visible text from HTML, cleaned
text = scraper.extract_clean_text(html=html)

smart_extract() – All-in-One

Run all extractors at once:

everything = scraper.smart_extract(html=html)
# {
#   'data_types': {'email': [...], 'price': [...], ...},
#   'patterns': {'lists': [...], 'tables': [...], 'cards': [...]},
#   'structured_data': {'json_ld': [...], 'meta_tags': {...}, 'microdata': [...]},
#   'pagination': {'next': ..., 'prev': ..., 'pages': [...]}
# }

💾 Save & Load

Save learned rules and fingerprints for reuse. Both formats support content fingerprints and adaptive data.

JSON

scraper.save('model.json')

new_scraper = DejavuScraper()
new_scraper.load('model.json')

SQLite

scraper.save('model.db')

new_scraper = DejavuScraper()
new_scraper.load('model.db')

Both formats use atomic writes (temp file + rename): no data corruption on crash.
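
The idea is the standard write-then-rename pattern; a minimal sketch under that assumption (not the library's exact code):

import json
import os
import tempfile

def atomic_save_json(data, path):
    # Write to a temp file in the same directory, then atomically swap it into place.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.', suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(data, f)
        os.replace(tmp_path, path)  # atomic rename; readers never see a half-written file
    except BaseException:
        os.remove(tmp_path)
        raise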

Format is auto-detected from file extension, or specify explicitly:

scraper.save('model', format='json')   # or 'db', 'sqlite'

🔧 Rule Management

# View all rules with their IDs
rules = scraper.stack_list
for rule in rules:
    print(rule['stack_id'], rule.get('alias'))

# Keep only specific rules
scraper.keep_rules(['rule_abc123', 'rule_def456'])

# Remove specific rules
scraper.remove_rules(['rule_abc123'])

# Set friendly aliases
scraper.set_rule_aliases({
    'rule_abc123': 'product_title',
    'rule_def456': 'product_price'
})

# Generate reusable Python code from current rules
code = scraper.generate_python_code()
print(code)

🏭 Production Features

Rate Limiting

scraper = DejavuScraper(rate_limit=2)  # 2 seconds between requests per domain
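
Conceptually, a per-domain limiter only needs to remember the last request time for each domain and sleep if the next request arrives too soon. An illustrative sketch of that idea (not DejavuScraper's internal implementation):

import time
from urllib.parse import urlparse

class DomainRateLimiter:
    def __init__(self, min_interval):
        self.min_interval = min_interval   # minimum seconds between requests per domain
        self.last_request = {}             # domain -> timestamp of the last request

    def wait(self, url):
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_request.get(domain, float('-inf'))
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request[domain] = time.monotonic()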

Retry with Exponential Backoff

scraper = DejavuScraper(max_retries=3, retry_backoff=1.0)
# Retries: 1s, 2s, 4s on 5xx errors and timeouts
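
That schedule corresponds to the usual retry_backoff * 2 ** attempt pattern. A simplified sketch of such a retry loop around requests (illustrative, not the library's internal code):

import time
import requests

def get_with_retries(session, url, max_retries=3, retry_backoff=1.0, **kwargs):
    # Retry on timeouts and 5xx responses, sleeping 1s, 2s, 4s with the defaults above.
    for attempt in range(max_retries):
        try:
            resp = session.get(url, **kwargs)
            if resp.status_code < 500:
                return resp
        except requests.Timeout:
            pass
        time.sleep(retry_backoff * (2 ** attempt))
    return session.get(url, **kwargs)  # final attempt; let any error propagate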

Connection Pooling

Sessions are reused with requests.Session (pool_maxsize=10) for better performance across multiple requests.
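
With plain requests, that setup is roughly equivalent to mounting an HTTPAdapter on a shared Session (shown for illustration, not necessarily the library's exact configuration):

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(pool_maxsize=10)   # keep up to 10 connections per host alive
session.mount('https://', adapter)
session.mount('http://', adapter)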

Streaming Responses

HTML is downloaded in chunks (64KB) with a 10MB hard limit, which prevents memory bombs from huge pages.
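
A capped streaming download with requests looks roughly like this (a sketch of the technique, not the library's exact code):

import requests

CHUNK_SIZE = 64 * 1024          # 64KB chunks
MAX_BYTES = 10 * 1024 * 1024    # 10MB hard limit

def fetch_capped(url, **kwargs):
    with requests.get(url, stream=True, **kwargs) as resp:
        resp.raise_for_status()
        chunks, total = [], 0
        for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
            total += len(chunk)
            if total > MAX_BYTES:
                raise ValueError(f"response larger than {MAX_BYTES} bytes: {url}")
            chunks.append(chunk)
        return b''.join(chunks).decode(resp.encoding or 'utf-8', errors='replace')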

Custom Headers

scraper = DejavuScraper()
scraper.request_headers = {
    'User-Agent': 'Custom Agent',
    'Accept-Language': 'en-US'
}

result = scraper.build(
    url='https://example.com',
    request_args={'timeout': 10, 'verify': False}
)

Context Manager

with DejavuScraper(rate_limit=1) as scraper:
    scraper.build(url='https://example.com', wanted_list=['data'])
    results = scraper.get_result_similar(url='https://example.com/page2')
# Session automatically closed

Or manually:

scraper = DejavuScraper()
# ... use scraper ...
scraper.close()  # Closes the requests session

🧠 How It Works

1. Tag-Based Rules (build โ†’ get_result_similar)

BUILD:  Parse HTML → Find elements matching wanted text → Store tag/attribute rules
EXTRACT: Replay rules on new page → Find elements with same tags/attributes

Each rule stores: tag name, attributes (class, id, etc.), parent chain, and what to extract (text or href/src).
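
Pictured as data, a rule might look like the dict below. Only stack_id and alias are keys documented in the Rule Management section; the remaining fields are an illustrative guess at what a tag-based rule has to capture, and the replay is shown with plain BeautifulSoup for demonstration:

from bs4 import BeautifulSoup

# Hypothetical rule layout (only 'stack_id' and 'alias' appear elsewhere in this README).
rule = {
    'stack_id': 'rule_abc123',
    'alias': 'product_title',
    'tag': 'h2',
    'attrs': {'class': 'title'},
    'extract': 'text',            # or 'href' / 'src' for URL extraction
}

html = '<div class="product"><h2 class="title">iPhone 15</h2></div>'
soup = BeautifulSoup(html, 'html.parser')
elements = soup.find_all(rule['tag'], attrs=rule['attrs'])
values = [el.get_text(strip=True) if rule['extract'] == 'text' else el.get(rule['extract'])
          for el in elements]
print(values)  # ['iPhone 15']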

2. Content Fingerprinting (build โ†’ get_result_flexible)

BUILD:  Analyze matched text → Capture data shape (length, word count, digit ratio,
        alpha ratio, currency, content type) → Store fingerprint
EXTRACT: Scan ALL text nodes on new page → Score each against fingerprint → Return matches

This is what enables cross-site extraction: no LLM, no model, just pattern matching on data shape.
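
A stripped-down version of the idea (the real fingerprint captures more signals; everything below, including the scoring penalties, is an illustrative sketch rather than the library's code):

import re

def content_fingerprint(text):
    # Capture the "shape" of a piece of text: length, word count, digit/alpha ratios, currency.
    text = text.strip()
    digits = sum(c.isdigit() for c in text)
    alphas = sum(c.isalpha() for c in text)
    return {
        'length': len(text),
        'word_count': len(text.split()),
        'digit_ratio': digits / max(len(text), 1),
        'alpha_ratio': alphas / max(len(text), 1),
        'has_currency': bool(re.search(r'[$€£]', text)),
    }

def score(candidate, reference):
    # 0-1: how closely the candidate text's shape matches the reference fingerprint.
    cand = content_fingerprint(candidate)
    penalties = [
        abs(cand['length'] - reference['length']) / max(reference['length'], 1),
        abs(cand['word_count'] - reference['word_count']) / max(reference['word_count'], 1),
        abs(cand['digit_ratio'] - reference['digit_ratio']),
        abs(cand['alpha_ratio'] - reference['alpha_ratio']),
        0.0 if cand['has_currency'] == reference['has_currency'] else 1.0,
    ]
    return max(0.0, 1.0 - sum(penalties) / len(penalties))

ref = content_fingerprint('Sony Headphones')
print(round(score('Bose QuietComfort 45', ref), 2))  # 0.79: similar text shape -> clears min_score=0.4
print(round(score('$1,299.00', ref), 2))             # 0.3: a price has a different shape -> filtered out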

3. Adaptive Matching (build with adaptive=True)

BUILD:  Fingerprint each element across 8 dimensions (tag, attributes, text, DOM path,
        parent, grandparent, children, special attrs) → Store signature
EXTRACT: When tag rules fail → Compare all candidates using weighted similarity
         → Return best match above min_similarity threshold

🧪 Testing

# Run all test suites
python test_all.py              # 16 core tests
python test_comprehensive.py    # 61 tests across 15 categories
python test_extractors.py       # 40 tests for smart extractors
python test_cross_site.py       # 25 cross-site flexible extraction tests

# Total: 142 tests, 100% pass rate

Test coverage includes:

  • Core build & extraction
  • Same-site and cross-site extraction
  • Grouped extraction with missing fields
  • Save/Load JSON and SQLite (with fingerprints)
  • Adaptive extraction with structure changes
  • Fuzzy matching
  • URL extraction
  • Rule management (keep, remove, aliases)
  • All 7 smart extractor classes
  • Edge cases and large data handling
  • Content fingerprint quality validation

📁 Project Structure

dejavu_scraper/
├── dejavu_scraper/
│   ├── __init__.py             # Package exports (v2.0.0)
│   ├── dejavu_scraper.py       # Main DejavuScraper class (~3000 lines)
│   ├── extractors.py           # 7 smart extractor classes (~1100 lines)
│   ├── adaptive_storage.py     # Thread-safe element fingerprint storage
│   ├── adaptive_matcher.py     # 8-dimension weighted similarity matching
│   ├── utils.py                # ResultItem, FuzzyText helpers
│   └── beautifulsoup4/         # Bundled BeautifulSoup4
├── test_all.py                 # 16 core tests
├── test_comprehensive.py       # 61 tests (15 categories)
├── test_extractors.py          # 40 extractor tests
├── test_cross_site.py          # 25 cross-site tests
├── DOCUMENTATION.md            # Detailed API documentation
├── ADAPTIVE_GUIDE.md           # Adaptive extraction deep-dive
├── TEST_DOCUMENTATION.md       # Test suite documentation
├── pyproject.toml              # Package configuration
├── requirements.txt            # Dependencies (requests>=2.25.0)
├── LICENSE                     # MIT License
└── README.md                   # This file

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


📄 License

This project is licensed under the MIT License; see LICENSE for details.


🙏 Acknowledgments



Download files

Download the file for your platform.

Source Distribution

dejavu_scraper-2.0.0.tar.gz (737.8 kB)

Uploaded Source

Built Distribution


dejavu_scraper-2.0.0-py3-none-any.whl (762.6 kB)

Uploaded Python 3

File details

Details for the file dejavu_scraper-2.0.0.tar.gz.

File metadata

  • Download URL: dejavu_scraper-2.0.0.tar.gz
  • Upload date:
  • Size: 737.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for dejavu_scraper-2.0.0.tar.gz
Hashes for dejavu_scraper-2.0.0.tar.gz:
  • SHA256: 5b9988251fdeeca515233c683274dad0305cae39da539c8b1610ba6853ecb6db
  • MD5: fe42c36733bb6bc17d1e0939fcc8edc8
  • BLAKE2b-256: a2fe8ab158df3d3013ed4edaf4109188a6b4b644a583c0cff1de10ca1e5ab098


File details

Details for the file dejavu_scraper-2.0.0-py3-none-any.whl.

File metadata

  • Download URL: dejavu_scraper-2.0.0-py3-none-any.whl
  • Upload date:
  • Size: 762.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for dejavu_scraper-2.0.0-py3-none-any.whl:
  • SHA256: 6be423b27dab58e31df3223c52a5c7e98de354553ac3c481147e626f5d3c0bd8
  • MD5: 8ac85c6ab7d473a4c49a8e2a69da7b8c
  • BLAKE2b-256: f19d939e75ebce3ceb0d4fb79b7aa0ffa05bb228dc4749bc4deb003b45ba8d30

