# PyScrappy: a robust, all-in-one Python web scraping toolkit
PyScrappy is a Python toolkit for web scraping that works out of the box. Point it at any URL and get structured data back — or use built-in scrapers for Wikipedia, IMDB, Yahoo Finance, news feeds, and more.
## Key features
- Generic scraper — give it any URL, get back structured text, links, images, tables, and metadata
- Auto-pagination — automatically follows "next page" links
- JS rendering — optional Playwright backend for JavaScript-heavy sites
- Custom selectors — pass CSS selectors to extract exactly what you need
- Built-in scrapers — Wikipedia, IMDB, Yahoo Finance, news (RSS), image search, Amazon, LinkedIn
- Clean API — every scraper returns a `ScrapeResult` with `.to_dataframe()` and `.to_json()`
- Retry & rate-limiting — built-in exponential backoff and per-domain rate limiting
- Type-safe — full type hints, `py.typed` marker
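The retry and rate-limiting behaviour described above can be pictured with a small stdlib-only sketch. This is illustrative logic, not PyScrappy's internals; the delay formula, cap, and class names here are assumptions:

```python
import time

def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff schedule: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]

class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, interval: float):
        self.interval = interval
        self._last: dict[str, float] = {}

    def wait(self, domain: str) -> float:
        """Return how long to sleep before hitting `domain` again."""
        now = time.monotonic()
        elapsed = now - self._last.get(domain, float("-inf"))
        delay = max(0.0, self.interval - elapsed)
        self._last[domain] = now + delay
        return delay
```

With `max_retries=3` the sketch yields delays of 1, 2, and 4 seconds, and the limiter returns a positive wait only when the same domain is hit again inside the configured interval.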
## Installation

```shell
pip install pyscrappy
```

Optional extras:

```shell
# Browser support (for JS-rendered pages)
pip install 'pyscrappy[browser]'
playwright install chromium

# DataFrame support
pip install 'pyscrappy[dataframe]'

# Everything
pip install 'pyscrappy[all]'
```
## Quick start

### Scrape any URL (one-liner)

```python
from pyscrappy import scrape

result = scrape("https://en.wikipedia.org/wiki/Web_scraping")
print(result.data[0]["metadata"]["title"])
print(result.data[0]["text"]["word_count"])
```
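Auto-extraction amounts to walking the HTML tree and collecting structure as you go. A minimal stdlib-only sketch of the idea (not PyScrappy's actual parser, which builds on BeautifulSoup and lxml):

```python
from html.parser import HTMLParser

class QuickExtractor(HTMLParser):
    """Collect the <title> text and all link hrefs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links: list[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

html = "<html><head><title>Demo</title></head><body><a href='/a'>A</a></body></html>"
parser = QuickExtractor()
parser.feed(html)
```

Running this on the sample document above leaves `parser.title` as `"Demo"` and `parser.links` as `["/a"]`.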
### Custom CSS selectors

```python
from pyscrappy import GenericScraper

with GenericScraper() as gs:
    result = gs.scrape(
        url="https://news.ycombinator.com",
        selectors={"title": ".titleline a", "score": ".score"},
    )

for item in result.data:
    print(item["title"], item.get("score", ""))
```
### Wikipedia

```python
from pyscrappy import WikipediaScraper

with WikipediaScraper() as ws:
    result = ws.scrape(query="Python (programming language)", mode="summary")

print(result.data[0]["text"])
```
### Stock data

```python
from pyscrappy import StockScraper

with StockScraper() as ss:
    result = ss.scrape(symbol="AAPL", mode="history", period="1mo")

df = result.to_dataframe()
print(df.head())
```
### IMDB

```python
from pyscrappy import IMDBScraper

with IMDBScraper() as scraper:
    result = scraper.scrape(genre="sci-fi", max_pages=2)

df = result.to_dataframe()
print(df[["title", "year", "rating"]])
```
### News (RSS feeds)

```python
from pyscrappy import NewsScraper

with NewsScraper() as ns:
    result = ns.scrape(feed_url="https://rss.nytimes.com/services/xml/rss/nyt/World.xml")

for article in result.data[:5]:
    print(article["title"])
```
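RSS itself is plain XML, so the feed-parsing step can be sketched with the standard library alone. This is an illustrative sketch on an inline sample feed, not NewsScraper's implementation:

```python
import xml.etree.ElementTree as ET

RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item><title>First story</title><link>https://example.com/1</link></item>
  <item><title>Second story</title><link>https://example.com/2</link></item>
</channel></rss>"""

def parse_feed(xml_text: str) -> list[dict]:
    """Extract title/link pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]

articles = parse_feed(RSS)
```

For the sample feed above, `parse_feed` returns two dicts, the first being `{"title": "First story", "link": "https://example.com/1"}`.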
### Image search

```python
from pyscrappy import ImageSearchScraper

with ImageSearchScraper() as iss:
    result = iss.scrape(query="golden retriever", max_images=10, download_to="./dogs")
```
## Configuration

```python
from pyscrappy import ScraperConfig, GenericScraper

config = ScraperConfig(
    timeout=20.0,        # request timeout in seconds
    max_retries=3,       # retry failed requests
    rate_limit=2.0,      # seconds between requests per domain
    proxy="http://...",  # HTTP/SOCKS proxy
    headless=True,       # browser runs headless
    render_js="auto",    # auto-detect if JS rendering is needed
)

with GenericScraper(config) as gs:
    result = gs.scrape(url="https://example.com")
```
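One plausible heuristic behind a `render_js="auto"` mode (hypothetical; the library's real detection logic may differ) is to fall back to a browser when the static HTML carries many scripts but little visible text:

```python
import re

def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: several <script> tags plus a near-empty body suggests
    the page builds its content client-side."""
    scripts = len(re.findall(r"<script\b", html, re.IGNORECASE))
    # Drop script bodies, then strip tags crudely to estimate visible text.
    text = re.sub(r"<script.*?</script>", "", html, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    return scripts >= 3 and len(text.strip()) < min_text_chars
```

A single-page-app shell (five script tags, empty root div) trips the heuristic; a static page with a paragraph of text does not.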
### YouTube

```python
from pyscrappy import YouTubeScraper

with YouTubeScraper() as scraper:
    result = scraper.scrape(query="python tutorial", max_results=10)

for video in result.data:
    print(video["title"], video.get("views", ""))
```
### SoundCloud

```python
from pyscrappy import SoundCloudScraper

with SoundCloudScraper() as scraper:
    result = scraper.scrape(query="lo-fi beats", max_results=10)
```
### E-Commerce (Alibaba, Flipkart, Snapdeal)

```python
from pyscrappy import AlibabaScraper, FlipkartScraper, SnapdealScraper

with FlipkartScraper() as scraper:
    result = scraper.scrape(query="laptop", max_pages=2)

df = result.to_dataframe()
```
### Food Delivery (Swiggy, Zomato)

```python
from pyscrappy import SwiggyScraper, ZomatoScraper

# These are JS-heavy — use render_js=True for best results
with SwiggyScraper() as scraper:
    result = scraper.scrape(city="bangalore", render_js=True)
```
## Built-in scrapers

| Scraper | What it does | Needs browser? |
|---|---|---|
| `GenericScraper` | Scrape any URL with auto-extraction | Optional |
| **Data / Research** | | |
| `WikipediaScraper` | Articles, sections, infoboxes | No |
| `IMDBScraper` | Movies by genre, search, charts | No |
| `StockScraper` | Quotes, history, profiles (Yahoo Finance) | No |
| `NewsScraper` | RSS/Atom feeds, article extraction | No |
| `ImageSearchScraper` | Image search + download | No |
| `LinkedInJobsScraper` | Public job listings | No |
| **E-Commerce** | | |
| `AmazonScraper` | Product search | No |
| `AlibabaScraper` | Product search | No |
| `FlipkartScraper` | Product search | No |
| `SnapdealScraper` | Product search | No |
| **Social Media** | | |
| `YouTubeScraper` | Video search, channel scraping | Optional |
| `InstagramScraper` | Profiles, hashtag posts | Recommended |
| `TwitterScraper` | Tweet search | Recommended |
| **Music** | | |
| `SpotifyScraper` | Track/playlist search | Recommended |
| `SoundCloudScraper` | Track search | Optional |
| **Food Delivery** | | |
| `SwiggyScraper` | Restaurant listings | Recommended |
| `ZomatoScraper` | Restaurant listings | Recommended |
## Dependencies

Required: `httpx`, `beautifulsoup4`, `lxml`

Optional: `playwright` (JS rendering), `pandas` (DataFrames)
## License

## Contributing

All contributions welcome. See Issues.

This package is for educational and research purposes.