Smart web scraper that abstracts away complexity - from simple sites to highly protected ones.
IntelliScraper
A Playwright-based web scraping library with anti-bot-detection measures, designed for protected sites such as Himalayas Jobs and other platforms that require authentication. It provides session management, proxy support, and HTML parsing with Markdown conversion.
✨ Features
- 🔐 Session Management: Capture and reuse authentication sessions with cookies, local storage, and browser fingerprints
- 🛡️ Anti-Detection: Advanced techniques to prevent bot detection
- 🌐 Proxy Support: Integrated support for Bright Data and custom proxy solutions
- 📝 HTML Parsing: Extract text, links, and convert to Markdown format (including LLM-optimized output)
- 🎯 CLI Tool: Easy-to-use command-line interface for session generation
- ⚡ Playwright-Powered: Built on the Playwright browser automation framework
🚀 Quick Start
Installation
# Install the package
pip install intelliscraper-core
# Install Playwright browser (Chromium)
playwright install chromium
Note: Playwright requires browser binaries to be installed separately.
The command above installs Chromium, which this library needs in order to run.
For more information, see https://pypi.org/project/intelliscraper-core/
Basic Scraping (No Authentication)
from intelliscraper import Scraper, ScrapStatus
# Simple scraping without authentication
scraper = Scraper()
response = scraper.scrape("https://example.com")
if response.status == ScrapStatus.COMPLETED:
    print(response.scrap_html_content)
Creating Session Data
Use the CLI tool to create session data for authenticated scraping. The tool will open a browser where you can manually log in:
intelliscraper-session --url "https://himalayas.app" --site "himalayas" --output "./himalayas_session.json"
How it works:
- 🌐 Opens browser with the specified URL
- 🔐 You manually log in with your credentials
- ⏎ Press Enter after successful login
- 💾 Session data (cookies, storage, fingerprints) saved to JSON file
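The steps above can be sketched directly with Playwright's sync API. This is a minimal illustration under stated assumptions, not the CLI's actual implementation: the function names and the layout of the saved record are hypothetical, and the sketch captures only cookies and per-origin local storage, not browser fingerprints.

```python
import json
from pathlib import Path


def build_session_record(site, url, storage_state):
    """Wrap Playwright's storage state in a small record.

    The record layout is hypothetical; the real session file may differ.
    """
    return {
        "site": site,
        "base_url": url,
        "cookies": storage_state.get("cookies", []),
        "origins": storage_state.get("origins", []),
    }


def capture_session(url, site, output_path):
    """Open a visible browser, wait for a manual login, then save the session."""
    # Imported lazily so build_session_record works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # visible window for manual login
        context = browser.new_context()
        page = context.new_page()
        page.goto(url)
        input("Log in in the browser window, then press Enter here... ")
        state = context.storage_state()  # cookies plus per-origin localStorage
        browser.close()

    record = build_session_record(site, url, state)
    Path(output_path).write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record
```

Calling `capture_session("https://himalayas.app", "himalayas", "./himalayas_session.json")` would mirror the CLI invocation shown earlier, although the file it writes is not guaranteed to be compatible with the library's `Session` model.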
Authenticated Scraping with Session
import json
from intelliscraper import Scraper, Session, ScrapStatus
# Load session data
with open("himalayas_session.json") as f:
    session = Session(**json.load(f))
# Scrape with authentication
scraper = Scraper(session_data=session)
response = scraper.scrape("https://himalayas.app/jobs/python?experience=entry-level%2Cmid-level")
if response.status == ScrapStatus.COMPLETED:
    print("Successfully scraped authenticated page!")
    print(response.scrap_html_content)
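The URL in the example carries a percent-encoded comma (`%2C`) in its `experience` filter. A small sketch of building such a URL with the standard library, so the encoding does not have to be written by hand; the helper name `build_jobs_url` is hypothetical and not part of IntelliScraper:

```python
from urllib.parse import urlencode


def build_jobs_url(base_url, category, experience_levels):
    """Build a filtered jobs URL; the comma-separated filter is percent-encoded."""
    query = urlencode({"experience": ",".join(experience_levels)})
    return f"{base_url}/jobs/{category}?{query}"


url = build_jobs_url("https://himalayas.app", "python", ["entry-level", "mid-level"])
print(url)  # https://himalayas.app/jobs/python?experience=entry-level%2Cmid-level
```

The resulting string can be passed to `scraper.scrape(...)` exactly as in the example above.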
📝 HTML Parsing
Parse scraped content to extract text, links, and markdown:
from intelliscraper import Scraper, ScrapStatus, HTMLParser
scraper = Scraper()
response = scraper.scrape("https://example.com")
if response.status == ScrapStatus.COMPLETED:
    # Initialize parser
    parser = HTMLParser(
        url=response.scrape_request.url,
        html=response.scrap_html_content
    )
    # Extract different formats
    print(parser.text)              # Plain text
    print(parser.links)             # All links (normalized URLs)
    print(parser.markdown)          # Full markdown
    print(parser.markdown_for_llm)  # Clean markdown for AI (removes nav, footer, ads)
The markdown_for_llm property is optimized for AI processing - it removes navigation, footers, advertisements, and forms, keeping only useful content.
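To illustrate the general idea of dropping page chrome before handing content to a model, here is a minimal sketch using only the standard library. It is not how `markdown_for_llm` is implemented; the set of skipped tags and every name below are assumptions chosen for the example, and it outputs plain text rather than Markdown.

```python
from html.parser import HTMLParser as StdHTMLParser

# Tags whose contents are treated as page chrome (an assumed list).
SKIPPED_TAGS = {"nav", "footer", "form", "aside", "script", "style"}


class ContentTextExtractor(StdHTMLParser):
    """Collect text nodes that are not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only while outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_main_text(html):
    extractor = ContentTextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()


page = (
    "<nav><a href='/'>Home</a></nav>"
    "<main><h1>Title</h1><p>Body text.</p></main>"
    "<footer>Copyright</footer>"
)
print(extract_main_text(page))
```

A depth counter rather than a boolean flag keeps the filter correct when skipped tags are nested, for example a form inside a footer.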
🌐 Proxy Support
IntelliScraper supports proxy configurations including Bright Data and custom solutions:
from intelliscraper import Scraper, ProxyConfig
proxy = ProxyConfig(
url="http://brd.superproxy.io:22225",
username="your-username",
password="your-password"
)
scraper = Scraper(proxy=proxy)
response = scraper.scrape("https://example.com")
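Hard-coding proxy credentials is fine for a quick test, but in practice they are usually read from the environment. A minimal sketch follows, assuming hypothetical variable names (`SCRAPER_PROXY_URL`, `SCRAPER_PROXY_USERNAME`, `SCRAPER_PROXY_PASSWORD`) and a hypothetical helper that returns keyword arguments matching the `ProxyConfig` fields shown above:

```python
import os

# Hypothetical environment variable names, mapped to the ProxyConfig fields above.
ENV_TO_FIELD = {
    "SCRAPER_PROXY_URL": "url",
    "SCRAPER_PROXY_USERNAME": "username",
    "SCRAPER_PROXY_PASSWORD": "password",
}


def proxy_settings_from_env(env=None):
    """Return proxy keyword arguments read from the environment."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_FIELD if not env.get(name)]
    if missing:
        raise KeyError(f"Missing proxy settings: {', '.join(missing)}")
    return {field: env[name] for name, field in ENV_TO_FIELD.items()}


# Usage with the library (requires the variables above to be set):
# proxy = ProxyConfig(**proxy_settings_from_env())
# scraper = Scraper(proxy=proxy)
```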
📁 More examples, including proxy configurations and advanced usage, can be found in the examples/ folder.
📋 Requirements
- Python 3.12+
- Playwright
- Compatible with Windows, macOS, and Linux
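A quick way to check these requirements before running a scraper is sketched below. The helper names are hypothetical, and the check only confirms that the `playwright` package is importable, not that the Chromium binary has been installed.

```python
import importlib.util
import sys

MINIMUM_PYTHON = (3, 12)


def python_is_supported(version=None):
    """Return True if the (major, minor) version meets the stated minimum."""
    version = sys.version_info[:2] if version is None else tuple(version[:2])
    return version >= MINIMUM_PYTHON


def playwright_is_installed():
    """Return True if the playwright package can be found on the import path."""
    return importlib.util.find_spec("playwright") is not None


print("Python 3.12+:", python_is_supported())
print("Playwright package found:", playwright_is_installed())
```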
🗺️ Roadmap
- ✅ Session management with CLI tool
- ✅ Proxy support (Bright Data)
- ✅ HTML parsing and Markdown conversion
- ✅ Anti-detection features
- ✅ PyPI package (intelliscraper-core)
- 🔄 Async scraping support
- 🔄 Web crawler
- 🔄 AI integration
📄 License
This project is licensed under the MIT License.
📧 Support
For issues, questions, or contributions, please visit our GitHub repository's issues page.
Note: This project is under active development.
File details
Details for the file intelliscraper_core-0.1.1.tar.gz.
File metadata
- Download URL: intelliscraper_core-0.1.1.tar.gz
- Upload date:
- Size: 43.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f6709e7cd17610820f4dc3db47dd733ac819658e622feaa94944e36b95ace1da |
| MD5 | 16319bd3fd8ca7f12185f0410bb5ad41 |
| BLAKE2b-256 | 80bb027fc6d214fb0c5032024c0b4f89f92eafcd27c75a785921933875d85ef2 |
File details
Details for the file intelliscraper_core-0.1.1-py3-none-any.whl.
File metadata
- Download URL: intelliscraper_core-0.1.1-py3-none-any.whl
- Upload date:
- Size: 5.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7ec8f305d80e1de700fe6ffd606a9180f12a410df79e5a92c0d3e4bc603d5542 |
| MD5 | d7d6c02521e08eb2254b547b71768992 |
| BLAKE2b-256 | 7e61346cf386e1b40661365a48eb79de0fb5e13e6c1df4728d0a6902570649d1 |
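The SHA256 digests listed for each file can be checked against a manually downloaded copy. A minimal sketch using `hashlib`; the helper names are hypothetical, and the file path in the usage comment assumes the archive sits in the current directory:

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, with the source distribution digest from the table above:
# expected = "f6709e7cd17610820f4dc3db47dd733ac819658e622feaa94944e36b95ace1da"
# print(matches_published_digest("intelliscraper_core-0.1.1.tar.gz", expected))
```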