Scrapy middleware for the Ujeebu Scrape API

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

Ujeebu Scrapy - Scrapy Middleware for Ujeebu APIs

ujeebu_scrapy is a Scrapy middleware that connects your crawlers to the Ujeebu APIs for JavaScript-rendered web scraping, article content extraction, and search engine results.

Features

🌐 Scrape API: Render JavaScript-heavy pages with a headless browser
📰 Extract API: Automatically extract article content from news and blogs
🔍 SERP API: Get structured Google search results
📸 Screenshots & PDFs: Capture visual snapshots of web pages
🔄 Infinite Scroll: Handle dynamically loaded content
🌍 Geo-Targeting: Use proxies from 100+ countries
🔐 Anti-Bot Bypass: Automatically handles CAPTCHAs and bot detection
📊 Extract Rules: Define CSS-based rules for structured data extraction

Installation

Using pip

pip install ujeebu-scrapy

From source

git clone https://github.com/ujeebu/ujeebu-scrapy.git
cd ujeebu-scrapy
pip install .

Quick Start

1. Configure your Scrapy project

Add the middleware to your settings.py:

# Enable Ujeebu middleware
DOWNLOADER_MIDDLEWARES = {
    'ujeebu_scrapy.UjeebuMiddleware': 543,
}

# Ujeebu configuration
UJEEBU_ENABLED = True
UJEEBU_API_KEY = 'your-api-key'  # Get yours at https://ujeebu.com/signup

# Optional: Default settings
UJEEBU_DEFAULT_PROXY_TYPE = 'rotating'
UJEEBU_DEFAULT_TIMEOUT = 60
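
Hardcoding the key in settings.py makes it easy to leak through version control. A minimal sketch of reading it from the environment instead; the environment variable name is an assumption chosen for illustration, not something the package is known to read on its own:

```python
import os

# settings.py (sketch): take the key from the environment instead of the source tree.
# The environment variable name is an assumption; use whatever your deployment provides.
UJEEBU_API_KEY = os.environ.get('UJEEBU_API_KEY', '')

# Switch the middleware off when no key is configured.
UJEEBU_ENABLED = bool(UJEEBU_API_KEY)
```

With this in place, exporting the variable before running `scrapy crawl` is enough to enable the middleware.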

2. Use Ujeebu Requests in your spiders

import scrapy
from ujeebu_scrapy import UjeebuScrapeRequest

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        yield UjeebuScrapeRequest(
            url='https://example.com',
            js=True,               # Enable JavaScript rendering
            wait_for=2000,         # Wait 2 seconds for JS
            proxy_type='rotating', # Use rotating proxies
            callback=self.parse
        )

    def parse(self, response):
        # Parse the response as usual
        yield {'title': response.css('title::text').get()}

Request Classes

UjeebuScrapeRequest

For scraping web pages with full browser rendering support.

from ujeebu_scrapy import UjeebuScrapeRequest

yield UjeebuScrapeRequest(
    url='https://example.com',
    # Named parameters (recommended)
    js=True,                    # Enable JavaScript
    wait_for=2000,              # Wait time/selector/JS callable
    custom_js='...',            # Custom JS to execute
    proxy_type='premium',       # Proxy type
    proxy_country='US',         # Geo-targeting
    device='mobile',            # Device emulation
    scroll_down=True,           # Enable scrolling
    screenshot_fullpage=True,   # Full page screenshot
    block_ads=True,             # Block advertisements
    timeout=90,                 # Request timeout

    # Or use params dict
    params={
        'response_type': 'html',  # 'html', 'raw', 'pdf', 'screenshot'
        'extract_rules': {...},   # Extraction rules
    },

    # Custom headers (auto-prefixed with Ujb-)
    headers={'Authorization': 'Bearer token'},

    callback=self.parse
)

Scrape API Parameters

Parameter	Type	Default	Description
`url`	string	required	URL to scrape
`js`	boolean	true	Enable JavaScript execution
`response_type`	string	'html'	Response type: 'html', 'raw', 'pdf', 'screenshot'
`wait_for`	string/int	0	Wait condition: ms, CSS selector, or JS callable
`custom_js`	string	null	Custom JavaScript to execute
`proxy_type`	string	'rotating'	Proxy type (see Proxy Types)
`proxy_country`	string	'US'	Country code for geo-targeting
`device`	string	'desktop'	Device: 'desktop' or 'mobile'
`scroll_down`	boolean	false	Enable page scrolling
`progressive_scroll`	boolean	false	Keep scrolling until content stops loading
`screenshot_fullpage`	boolean	false	Capture full page screenshot
`block_ads`	boolean	false	Block advertisements
`block_resources`	boolean	false	Block images, CSS, fonts
`extract_rules`	object	null	Rules for structured data extraction
`timeout`	number	60	Request timeout in seconds

UjeebuExtractRequest

For extracting article content from news and blog pages.

from ujeebu_scrapy import UjeebuExtractRequest

yield UjeebuExtractRequest(
    url='https://example.com/article',
    text=True,         # Extract text content
    html=True,         # Extract HTML content
    images=True,       # Extract images
    author=True,       # Extract author
    pub_date=True,     # Extract publish date
    media=True,        # Extract embedded media
    feeds=True,        # Extract RSS feeds
    is_article=True,   # Get article probability score
    quick_mode=False,  # Use quick analysis
    js=False,          # Enable JS rendering
    callback=self.parse_article
)

Extract API Parameters

Parameter	Type	Default	Description
`url`	string	required	Article URL to extract
`text`	boolean	true	Extract article text
`html`	boolean	true	Extract article HTML
`images`	boolean	true	Extract images
`author`	boolean	true	Extract author name
`pub_date`	boolean	true	Extract publish date
`media`	boolean	false	Extract embedded media
`feeds`	boolean	false	Extract RSS feeds
`is_article`	boolean	true	Return article probability
`quick_mode`	boolean	false	Use faster analysis
`js`	boolean	false	Enable JavaScript rendering
`raw_html`	string	null	Extract from provided HTML
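
The `parse_article` callback used above is not shown. A minimal sketch of one way to write its parsing step, assuming the response body is JSON with a top-level `article` object whose keys mirror the parameter names; that shape is an assumption, so check it against a real response before relying on it. The helper name `article_item` is hypothetical.

```python
import json


def article_item(body_text):
    """Build a flat item from an Extract API response body.

    Assumes the body is JSON with a top-level 'article' object; missing
    fields come back as None (or an empty list) rather than raising.
    """
    article = json.loads(body_text).get('article') or {}
    return {
        'text': article.get('text'),
        'author': article.get('author'),
        'pub_date': article.get('pub_date'),
        'images': article.get('images') or [],
    }


# Made-up payload, for illustration only.
sample = json.dumps({'article': {'text': 'Hello', 'author': 'A. Writer'}})
print(article_item(sample))
```

Inside a spider, `parse_article` could simply `yield article_item(response.text)`.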

UjeebuSerpRequest

For getting Google search results.

from ujeebu_scrapy import UjeebuSerpRequest

yield UjeebuSerpRequest(
    search='python web scraping',
    search_type='search',   # 'search', 'images', 'news', 'videos', 'maps'
    lang='en',              # Language code
    location='us',          # Country code
    device='desktop',       # 'desktop', 'mobile', 'tablet'
    results_count=10,       # Results per page
    page=1,                 # Page number
    callback=self.parse_results
)

SERP API Parameters

Parameter	Type	Default	Description
`search`	string	required unless `url` is given	Search query
`url`	string	required unless `search` is given	Google search URL (alternative to search)
`search_type`	string	'search'	Type: 'search', 'images', 'news', 'videos', 'maps'
`lang`	string	'en'	Result language (ISO 639-1)
`location`	string	'us'	Country (ISO 3166-1 alpha-2)
`device`	string	'desktop'	Device type
`results_count`	number	10	Results per page
`page`	number	1	Page number
`extra_params`	string	null	Additional Google parameters
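
Fetching several result pages means issuing one request per page. A small sketch that generates the keyword arguments for each page using only the `search`, `results_count`, and `page` parameters listed above; the helper name `serp_page_kwargs` is hypothetical.

```python
def serp_page_kwargs(query, pages, results_count=10):
    """Yield one keyword-argument dict per results page, numbered from 1."""
    for page in range(1, pages + 1):
        yield {
            'search': query,
            'results_count': results_count,
            'page': page,
        }


for kwargs in serp_page_kwargs('python web scraping', pages=3):
    print(kwargs)
```

In a spider, each dict can be unpacked into a request, for example `yield UjeebuSerpRequest(callback=self.parse_results, **kwargs)`. Each page is a separate SERP request, so it is billed separately.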

Extract Rules

Use extract rules to have the API return structured data, so your spider does not need its own selector-based parsing code:

yield UjeebuScrapeRequest(
    url='https://quotes.toscrape.com/',
    extract_rules={
        'quotes': {
            'selector': '.quote',
            'type': 'obj',
            'multiple': True,
            'children': {
                'text': {
                    'selector': '.text',
                    'type': 'text'
                },
                'author': {
                    'selector': '.author',
                    'type': 'text'
                },
                'tags': {
                    'selector': '.tag',
                    'type': 'text',
                    'multiple': True
                }
            }
        }
    },
    callback=self.parse_quotes
)

def parse_quotes(self, response):
    import json
    data = json.loads(response.text)

    for quote in data['result']['quotes']:
        yield {
            'text': quote['text'],
            'author': quote['author'],
            'tags': quote['tags']
        }

Rule Types

Type	Description
`text`	Extract text content of element
`link`	Extract href attribute from links
`image`	Extract src attribute from images
`attr`	Extract specific attribute (use `attribute` param)
`obj`	Extract nested object (use `children` param)
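
The quotes example only uses the `text` and `obj` types. A sketch of a rule set that also uses `link`, `image`, and `attr` from the table above; the selectors, field names, and the `title` attribute are hypothetical and need adapting to the target page.

```python
import json

# Sketch of rules covering the `link`, `image`, and `attr` types.
# Selectors and field names are made up for illustration.
extract_rules = {
    'products': {
        'selector': '.product',
        'type': 'obj',
        'multiple': True,
        'children': {
            'name': {'selector': 'h2', 'type': 'text'},
            'url': {'selector': 'a', 'type': 'link'},
            'thumbnail': {'selector': 'img', 'type': 'image'},
            'link_title': {
                'selector': 'a',
                'type': 'attr',
                'attribute': 'title',
            },
        },
    },
}

# Round-tripping through JSON is a cheap check that the rules contain
# only plain dicts, lists, strings, and booleans.
print(json.dumps(extract_rules, indent=2))
```

The resulting dict can be passed as `extract_rules=extract_rules` in the same way as the quotes example.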

Proxy Types

Type	Description	Use Case
`rotating`	Basic rotating proxies	General scraping
`advanced`	Enhanced rotating proxies	More reliable scraping
`premium`	Premium with geo-targeting	Location-specific content
`residential`	Real residential IPs	Anti-bot bypass
`mobile`	Mobile network IPs	Mobile-specific content
`custom`	Your own proxy	Custom infrastructure

Geo-Targeting Example

yield UjeebuScrapeRequest(
    url='https://google.com',
    proxy_type='premium',
    proxy_country='DE',  # Germany
    callback=self.parse
)

Sticky Sessions

Use the same IP across multiple requests:

yield UjeebuScrapeRequest(
    url='https://example.com/page1',
    params={
        'proxy_type': 'premium',
        'session_id': 'my_session_123',  # Reuse for 30 minutes
    },
    callback=self.parse
)
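
To keep several requests on one IP, each of them needs the same `session_id`. A minimal sketch that generates an ID once and reuses it; the helper name `new_session_params` is hypothetical, and because any length or character limits on session IDs are not stated here, the sketch keeps the ID short and alphanumeric as a precaution.

```python
import uuid


def new_session_params(proxy_type='premium'):
    """Return a params dict carrying a freshly generated session ID."""
    return {
        'proxy_type': proxy_type,
        'session_id': f's{uuid.uuid4().hex[:12]}',
    }


# Generate the session once, then attach a copy to every request that should share it.
session = new_session_params()
page_urls = ['https://example.com/page1', 'https://example.com/page2']
request_kwargs = [{'url': url, 'params': dict(session)} for url in page_urls]
print(request_kwargs)
```

Each entry can then be unpacked into a request, for example `yield UjeebuScrapeRequest(callback=self.parse, **kwargs)`.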

Screenshots & PDFs

Take a Screenshot

yield UjeebuScrapeRequest(
    url='https://example.com',
    params={
        'response_type': 'screenshot',
        'screenshot_fullpage': True,
        'json': True,
    },
    callback=self.save_screenshot
)

def save_screenshot(self, response):
    import base64
    import json
    data = json.loads(response.text)
    screenshot = base64.b64decode(data['screenshot'])
    with open('screenshot.png', 'wb') as f:
        f.write(screenshot)

Generate a PDF

yield UjeebuScrapeRequest(
    url='https://example.com',
    params={
        'response_type': 'pdf',
        'json': True,
    },
    callback=self.save_pdf
)
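
The `save_pdf` callback referenced above is not shown. A sketch of one possible implementation, assuming the JSON response carries the document base64-encoded under a `pdf` key in the same way the screenshot arrives under `screenshot`; that key name is an assumption and should be checked against a real response. The helper name `save_base64_field` is hypothetical.

```python
import base64
import json
import os
import tempfile


def save_base64_field(body_text, field, path):
    """Decode a base64 field from a JSON body, write it to `path`, return the byte count."""
    data = json.loads(body_text)
    payload = base64.b64decode(data[field])
    with open(path, 'wb') as f:
        f.write(payload)
    return len(payload)


# Made-up payload standing in for an API response body.
sample = json.dumps({'pdf': base64.b64encode(b'%PDF-1.4 demo').decode('ascii')})
out_path = os.path.join(tempfile.gettempdir(), 'ujeebu_demo.pdf')
print(save_base64_field(sample, 'pdf', out_path))
```

In a spider, `save_pdf` could then call `save_base64_field(response.text, 'pdf', 'page.pdf')`.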

Infinite Scroll Handling

yield UjeebuScrapeRequest(
    url='https://example.com/infinite-scroll',
    params={
        'js': True,
        'scroll_down': True,
        'progressive_scroll': True,  # Keep scrolling until no new content
        'scroll_wait': 500,          # Wait 500ms between scrolls
    },
    callback=self.parse
)

Custom JavaScript

Execute custom JavaScript on the page before its content is returned:

# Click a button to load more content
custom_js = '''
document.querySelector('.load-more-btn').click();
'''

yield UjeebuScrapeRequest(
    url='https://example.com',
    params={
        'js': True,
        'custom_js': custom_js,
        'wait_for': 2000,
    },
    callback=self.parse
)

Settings Reference

Setting	Type	Default	Description
`UJEEBU_ENABLED`	boolean	True	Enable/disable middleware
`UJEEBU_API_KEY`	string	required	Your API key
`UJEEBU_BASE_URL`	string	https://api.ujeebu.com	API base URL
`UJEEBU_DEFAULT_PROXY_TYPE`	string	None	Default proxy type
`UJEEBU_DEFAULT_TIMEOUT`	int	60	Default timeout

Examples

The examples/ directory contains complete spider examples:

basic_scrape_spider.py: Basic scraping with JavaScript rendering
extract_rules_spider.py: Structured data extraction with extract_rules
article_extractor_spider.py: Article content extraction
serp_spider.py: Google search results
screenshot_spider.py: Screenshots and PDFs
infinite_scroll_spider.py: Handling infinite scroll pages
proxy_spider.py: Proxy types and geo-targeting

Error Handling

Handle Ujeebu API errors in your spider:

import json

from scrapy.spidermiddlewares.httperror import HttpError

def errback_handler(self, failure):
    if failure.check(HttpError):
        response = failure.value.response
        self.logger.error(f'HttpError on {response.url}')

        # Check for Ujeebu-specific error
        try:
            error_data = json.loads(response.text)
            self.logger.error(f'Error: {error_data.get("message")}')
        except ValueError:
            # Body was not valid JSON; nothing more to report
            pass

yield UjeebuScrapeRequest(
    url='https://example.com',
    callback=self.parse,
    errback=self.errback_handler
)
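
The JSON handling in the errback can be pulled into a small function that is easy to test without running a crawl. A sketch, assuming only what the handler above already assumes, namely that error bodies are JSON objects with a `message` field; the helper name and the sample bodies are made up.

```python
import json


def ujeebu_error_message(body_text):
    """Return the 'message' field of a JSON error body, or None if absent."""
    try:
        data = json.loads(body_text)
    except ValueError:
        return None  # body was not JSON, e.g. an HTML error page
    if isinstance(data, dict):
        return data.get('message')
    return None


print(ujeebu_error_message('{"message": "Invalid API key"}'))
print(ujeebu_error_message('<html>Bad Gateway</html>'))
```

The errback can then log `ujeebu_error_message(response.text)` and skip the line when it returns None.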

API Credits

Different operations consume different amounts of credits:

Operation	Credits
Scrape (rotating proxy)	1
Scrape (premium proxy)	5
Scrape (residential proxy)	10
Screenshot/PDF	+1
Extract	1-5
SERP	25

Check your usage at ujeebu.com/dashboard
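
The Scrape rows of the table can be turned into a rough cost estimate before launching a crawl. A sketch that covers only the proxy types with a listed cost and treats the screenshot/PDF surcharge as a flat +1; the function name is hypothetical and actual billing may differ.

```python
# Base cost per Scrape request by proxy type, taken from the table above.
SCRAPE_CREDITS = {'rotating': 1, 'premium': 5, 'residential': 10}


def estimate_scrape_credits(proxy_type='rotating', response_type='html'):
    """Estimate credits for one Scrape request; screenshots and PDFs add 1."""
    if proxy_type not in SCRAPE_CREDITS:
        raise ValueError(f'no credit cost listed for proxy type {proxy_type!r}')
    surcharge = 1 if response_type in ('screenshot', 'pdf') else 0
    return SCRAPE_CREDITS[proxy_type] + surcharge


print(estimate_scrape_credits('premium', 'pdf'))  # 5 + 1 = 6
```

Multiplying the result by the number of planned requests gives an upper-level budget for a crawl.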

Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

📧 Email: support@ujeebu.com
📖 Documentation: ujeebu.com/docs
🐛 Issues: GitHub Issues


Release history

0.2.2: Dec 30, 2025
0.2.1: Dec 27, 2025
0.2.0 (this version): Dec 26, 2025

Download files

Source Distribution

ujeebu_scrapy-0.2.0.tar.gz (18.8 kB)

Uploaded Dec 26, 2025 Source

Built Distribution

ujeebu_scrapy-0.2.0-py3-none-any.whl (13.2 kB)

Uploaded Dec 26, 2025 Python 3

File details

Details for the file ujeebu_scrapy-0.2.0.tar.gz.

File metadata

Download URL: ujeebu_scrapy-0.2.0.tar.gz
Upload date: Dec 26, 2025
Size: 18.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ujeebu_scrapy-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`7cf5a951b584fccb1eeedabe09c06047ef9b042404c3ccfeb6b89de4ceab3b98`
MD5	`fa76aa1bb813e785a5c6cca61640ea87`
BLAKE2b-256	`fe37484588732944fce23fa5dc45291ede2f7c6f1b609d7eb3e4e01deb7edfc5`

Provenance

The following attestation bundles were made for ujeebu_scrapy-0.2.0.tar.gz:

Publisher: publish.yml on ujeebu/ujeebu-scrapy

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ujeebu_scrapy-0.2.0.tar.gz
- Subject digest: 7cf5a951b584fccb1eeedabe09c06047ef9b042404c3ccfeb6b89de4ceab3b98
- Sigstore transparency entry: 779865856
- Sigstore integration time: Dec 26, 2025
Source repository:
- Permalink: ujeebu/ujeebu-scrapy@232eb8ba8fd57e0352f56518df1f774c0446d536
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ujeebu
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@232eb8ba8fd57e0352f56518df1f774c0446d536
- Trigger Event: release

File details

Details for the file ujeebu_scrapy-0.2.0-py3-none-any.whl.

File metadata

Download URL: ujeebu_scrapy-0.2.0-py3-none-any.whl
Upload date: Dec 26, 2025
Size: 13.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ujeebu_scrapy-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`53e8a01ed0c9ff2a261d0ebbe49b72b00820c16c5da5edac1348833c87029da1`
MD5	`bcb1d69fbbb4cd0e201f9f4c8f1dfee5`
BLAKE2b-256	`47514bd45ea1443d9597699de249210de3ca2a7123ba70021b384a5192981f9a`

Provenance

The following attestation bundles were made for ujeebu_scrapy-0.2.0-py3-none-any.whl:

Publisher: publish.yml on ujeebu/ujeebu-scrapy

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ujeebu_scrapy-0.2.0-py3-none-any.whl
- Subject digest: 53e8a01ed0c9ff2a261d0ebbe49b72b00820c16c5da5edac1348833c87029da1
- Sigstore transparency entry: 779865857
- Sigstore integration time: Dec 26, 2025
Source repository:
- Permalink: ujeebu/ujeebu-scrapy@232eb8ba8fd57e0352f56518df1f774c0446d536
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ujeebu
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@232eb8ba8fd57e0352f56518df1f774c0446d536
- Trigger Event: release
