Async web crawling framework for everyone.
Project description
gain
Built on asyncio, aiohttp, and lxml/pyquery. Declare items and
parsers; gain handles the concurrency, retries, and persistence.
Install
pip install gain
Linux users can opt into uvloop for an extra speed bump:
pip install "gain[uvloop]"
Requires Python 3.10+.
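Whether gain switches to uvloop on its own once the extra is installed is not stated here. As a generic sketch of how an optional uvloop dependency is usually wired in, the snippet below prefers uvloop's event loop and falls back to the standard asyncio loop when uvloop is missing. The names pick_loop_factory and main are hypothetical and are not part of gain.

```python
import asyncio


def pick_loop_factory():
    """Return uvloop's loop constructor if uvloop is installed, else asyncio's."""
    try:
        import uvloop  # optional extra; may be absent (for example on Windows)
    except ImportError:
        return asyncio.new_event_loop
    return uvloop.new_event_loop


async def main():
    # Placeholder coroutine standing in for real crawling work.
    await asyncio.sleep(0)
    return "done"


# Build a loop from whichever factory is available and run the coroutine on it.
loop = pick_loop_factory()()
try:
    result = loop.run_until_complete(main())
finally:
    loop.close()

print(result)
```

The same code runs with or without uvloop installed; only the loop implementation changes.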
Quickstart
import aiofiles
from gain import Css, Item, Parser, Spider


class Post(Item):
    title = Css(".entry-title")
    content = Css(".entry-content")

    async def save(self):
        # Append each extracted post title to a local file.
        async with aiofiles.open("scrapinghub.txt", "a+") as f:
            await f.write(self.results["title"] + "\n")


class MySpider(Spider):
    concurrency = 5
    headers = {"User-Agent": "Google Spider"}
    start_url = "https://blog.scrapinghub.com/"
    parsers = [
        # One argument: follow pagination links.
        Parser(r"https://blog.scrapinghub.com/page/\d+/"),
        # Two arguments: build a Post from each matching page.
        Parser(r"https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/", Post),
    ]


MySpider.run()
Run it:
python spider.py
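To see which URLs the two Quickstart rules would select, the sketch below applies the same patterns with the standard re module. It assumes the rules behave like ordinary regular expressions matched against whole URLs; how gain applies them internally is not shown here. The sample URLs and the classify helper are hypothetical.

```python
import re

# The two patterns from the Quickstart, compiled as plain regular expressions.
FOLLOW = re.compile(r"https://blog.scrapinghub.com/page/\d+/")
EXTRACT = re.compile(r"https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/")


def classify(url):
    """Label a URL by the rule it matches in full (hypothetical helper)."""
    if EXTRACT.fullmatch(url):
        return "extract"
    if FOLLOW.fullmatch(url):
        return "follow"
    return "ignore"


# Hypothetical sample URLs, used only to exercise the patterns.
samples = [
    "https://blog.scrapinghub.com/page/2/",
    "https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath/",
    "https://blog.scrapinghub.com/about/",
]

labels = {url: classify(url) for url in samples}
for url, label in labels.items():
    print(f"{label:8} {url}")
```

The dots in both patterns are unescaped, so they match any character; escaping them as \. would make the rules stricter.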
XPath parsers
from gain import Css, Item, Parser, Spider, XPathParser


class Post(Item):
    title = Css(".breadcrumb_last")

    async def save(self):
        # Extracted fields live in self.results, as in the Quickstart.
        print(self.results["title"])


class MySpider(Spider):
    start_url = "https://mydramatime.com/europe-and-us-drama/"
    concurrency = 5
    headers = {"User-Agent": "Google Spider"}
    parsers = [
        # One argument: follow category and pagination links.
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        # Two arguments: build a Post from each linked detail page.
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post),
    ]
    proxy = "https://localhost:1234"


MySpider.run()
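To illustrate what the first XPath rule selects, the sketch below runs an equivalent query on a small hand-written snippet. It swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so it selects the a elements and reads href from each instead of using the /@href step that lxml accepts. The markup and paths are hypothetical, and this is not how gain evaluates the rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a fetched page.
HTML = """
<div>
  <span class="category-name"><a href="/drama/one/">One</a></span>
  <span class="category-name"><a href="/drama/two/">Two</a></span>
  <span class="other"><a href="/about/">About</a></span>
</div>
"""

root = ET.fromstring(HTML.strip())

# Rough ElementTree equivalent of //span[@class="category-name"]/a/@href:
# find the matching <a> elements, then read each href attribute.
links = [a.get("href") for a in root.findall(".//span[@class='category-name']/a")]

print(links)  # ['/drama/one/', '/drama/two/']
```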
How it works
┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐
│ start_url  │ ─▶ │   Parser   │ ─▶ │    Item    │ ─▶ │   save()   │
│            │    │  (follow)  │    │ (extract)  │    │ (persist)  │
└────────────┘    └────────────┘    └────────────┘    └────────────┘
      ▲                                      │
      └──────────── new urls ────────────────┘
- Spider kicks off from start_url under a concurrency budget.
- Parsers either follow (one argument), discovering more URLs to queue, or extract (two arguments), instantiating an Item from each matching page.
- Items use Css/Xpath/Regex selectors to pull fields out of HTML.
- save() is your async hook to persist results: write a file, push to a queue, insert into a database.
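The flow above can be sketched in plain asyncio. This is a simplified illustration of the follow/extract/save pattern, not gain's implementation: the in-memory PAGES dict replaces real HTTP requests, and every name here (PAGES, FOLLOW_RULES, EXTRACT_RULES, fetch, save, crawl) is hypothetical.

```python
import asyncio
import re

# Hypothetical in-memory "site" (URL -> HTML) standing in for HTTP fetches.
PAGES = {
    "https://example.test/": (
        '<a href="https://example.test/page/2/">next</a>'
        '<a href="https://example.test/post/a/">a</a>'
    ),
    "https://example.test/page/2/": '<a href="https://example.test/post/b/">b</a>',
    "https://example.test/post/a/": "<h1>Post A</h1>",
    "https://example.test/post/b/": "<h1>Post B</h1>",
}

FOLLOW_RULES = [re.compile(r"https://example\.test/page/\d+/")]      # one-argument parsers
EXTRACT_RULES = [re.compile(r"https://example\.test/post/[a-z]+/")]  # two-argument parsers

saved = []


async def fetch(url):
    await asyncio.sleep(0)  # yield control, as a real network call would
    return PAGES.get(url, "")


async def save(item):
    saved.append(item)  # persistence hook: file, queue, or database in practice


async def crawl(start_url, concurrency=2):
    seen = {start_url}
    budget = asyncio.Semaphore(concurrency)  # caps concurrent fetches

    async def visit(url, extract):
        async with budget:
            html = await fetch(url)
        if extract:
            # Extract step: pull fields out of the page, then persist them.
            match = re.search(r"<h1>(.*?)</h1>", html)
            await save({"url": url, "title": match.group(1) if match else None})
            return
        # Follow step: queue every unseen link that matches a rule.
        tasks = []
        for link in re.findall(r'href="([^"]+)"', html):
            if link in seen:
                continue
            is_extract = any(rule.fullmatch(link) for rule in EXTRACT_RULES)
            is_follow = any(rule.fullmatch(link) for rule in FOLLOW_RULES)
            if is_extract or is_follow:
                seen.add(link)
                tasks.append(asyncio.create_task(visit(link, is_extract)))
        await asyncio.gather(*tasks)

    await visit(start_url, extract=False)


asyncio.run(crawl("https://example.test/"))
print(sorted(item["title"] for item in saved))  # ['Post A', 'Post B']
```

The semaphore is held only while fetching, so a page that spawns child visits never blocks them by keeping its own slot.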
Examples
See the example/ directory for runnable scripts against
Scrapinghub, V2EX, and Sciencenet.
Development
git clone https://github.com/elliotgao2/gain.git
cd gain
uv sync # install deps into .venv
uv run pytest # run tests
uv run ruff check . # lint
We use uv for packaging and ruff for lint + format. Install the pre-commit hooks:
uv run pre-commit install
Contributing
Pull requests are welcome. For non-trivial changes, please open an issue
first to discuss. Make sure pytest and ruff check pass before
submitting.
License
MIT © Elliot Gao
File details
Details for the file gain-1.0.1.tar.gz.
File metadata
- Download URL: gain-1.0.1.tar.gz
- Upload date:
- Size: 287.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3baa63a10456a4b0d9da49aa7289f0cee3eb938fdb0bfe4bf7f4360fbc49144a |
| MD5 | c3fb60dd33700c847c9565aa0bdbf3f5 |
| BLAKE2b-256 | 4216248383926b477cf50b363dfd8625525e838f851e7d2aa171c35d237da0f5 |
File details
Details for the file gain-1.0.1-py3-none-any.whl.
File metadata
- Download URL: gain-1.0.1-py3-none-any.whl
- Upload date:
- Size: 9.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7d3b555071b7a33d9b4267d154b3a761cefb87bec4a6c25d7a1bdb482ec4c214 |
| MD5 | 2b627182f66de081d9e6816db704ebf6 |
| BLAKE2b-256 | 8e68b1cf6c93d8d6ed0e30adb0cd035f2eb923be81ea15b79639eecf5f0687ef |