Async web crawling framework for everyone.

gain

Built on asyncio, aiohttp, and lxml/pyquery. Declare items and parsers; gain handles the concurrency, retries, and persistence.

Install

pip install gain

Linux users can opt into uvloop for an extra speed boost:

pip install "gain[uvloop]"

Requires Python 3.10+.

Quickstart

import aiofiles
from gain import Css, Item, Parser, Spider


class Post(Item):
    title = Css(".entry-title")
    content = Css(".entry-content")

    async def save(self):
        async with aiofiles.open("scrapinghub.txt", "a+") as f:
            await f.write(self.results["title"] + "\n")


class MySpider(Spider):
    concurrency = 5
    headers = {"User-Agent": "Google Spider"}
    start_url = "https://blog.scrapinghub.com/"
    parsers = [
        Parser(r"https://blog.scrapinghub.com/page/\d+/"),
        Parser(r"https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/", Post),
    ]


MySpider.run()

Save the code above as spider.py and run it:

python spider.py
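Before starting a crawl, it can help to check which URLs each Parser pattern would accept. The sketch below uses only the standard re module and the two patterns from the Quickstart. It assumes gain treats the pattern as a regular expression applied to whole URLs, which the Quickstart suggests but does not state. The classify helper and the sample URLs are hypothetical and not part of gain.

```python
import re

# Patterns copied from the Quickstart parsers.
FOLLOW = re.compile(r"https://blog.scrapinghub.com/page/\d+/")
EXTRACT = re.compile(r"https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/")


def classify(url: str) -> str:
    """Return which kind of parser (if any) would accept this URL.

    Assumption: patterns must match the whole URL (re.fullmatch).
    """
    if EXTRACT.fullmatch(url):
        return "extract"
    if FOLLOW.fullmatch(url):
        return "follow"
    return "skip"


for sample in (
    "https://blog.scrapinghub.com/page/2/",
    "https://blog.scrapinghub.com/2016/10/27/an-introduction/",
    "https://blog.scrapinghub.com/about/",
):
    print(sample, "->", classify(sample))
```

Note that the dots in these patterns are unescaped, so they match any character. Escaping them (for example `blog\.scrapinghub\.com`) makes the match stricter.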

XPath parsers

from gain import Css, Item, Spider, XPathParser


class Post(Item):
    title = Css(".breadcrumb_last")

    async def save(self):
        print(self.results["title"])


class MySpider(Spider):
    start_url = "https://mydramatime.com/europe-and-us-drama/"
    concurrency = 5
    headers = {"User-Agent": "Google Spider"}
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post),
    ]
    proxy = "https://localhost:1234"


MySpider.run()
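To see what a follow-style XPath expression selects, you can try it on a small sample outside of gain. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no `/@href` step and no `contains()`), so it selects the `a` elements and reads `href` in Python. The sample markup is hypothetical and well-formed. Real pages usually need lxml's lenient HTML parser, which gain is built on.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """
<html><body>
  <span class="category-name"><a href="/drama/one/">One</a></span>
  <span class="category-name"><a href="/drama/two/">Two</a></span>
  <span class="other"><a href="/ignored/">Ignored</a></span>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# ElementTree-subset equivalent of: //span[@class="category-name"]/a/@href
links = [a.get("href") for a in root.findall(".//span[@class='category-name']/a")]
print(links)
```

Only the links inside `span` elements with the class `category-name` are collected. The third link is ignored because its parent has a different class.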

How it works

   ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐
   │  start_url │ ─▶ │  Parser    │ ─▶ │  Item      │ ─▶ │ save()     │
   │            │    │  (follow)  │    │  (extract) │    │  (persist) │
   └────────────┘    └────────────┘    └────────────┘    └────────────┘
                          ▲                                      │
                          └──────────── new urls ────────────────┘

The Spider starts from start_url and fetches pages under a concurrency budget (the concurrency attribute).
Parsers either follow (one argument), discovering more URLs to queue, or extract (two arguments), instantiating an Item from each matching page.
Items use Css / Xpath / Regex selectors to pull fields out of HTML.
save() is your async hook to persist results: write a file, push to a queue, or insert into a database.
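The flow above can be sketched with plain asyncio. This is not gain's implementation, only a minimal model of the same idea under stated assumptions: an in-memory SITE dict stands in for HTTP, two regexes stand in for a follow parser and an extract parser, and appending to a list stands in for building an Item and awaiting save(). All names here (SITE, fetch, crawl, the example.test URLs) are hypothetical.

```python
import asyncio
import re

# Hypothetical in-memory "site": URL -> links found on that page.
SITE = {
    "https://example.test/": [
        "https://example.test/page/2/",
        "https://example.test/post/a/",
    ],
    "https://example.test/page/2/": [
        "https://example.test/post/b/",
        "https://example.test/post/a/",
    ],
    "https://example.test/post/a/": [],
    "https://example.test/post/b/": [],
}

FOLLOW = re.compile(r"https://example\.test/page/\d+/")      # follow-only pattern
EXTRACT = re.compile(r"https://example\.test/post/[a-z]+/")  # extract pattern


async def fetch(url):
    """Stand-in for an HTTP request; returns the links on the page."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return SITE.get(url, [])


async def crawl(start_url, concurrency=2):
    budget = asyncio.Semaphore(concurrency)  # caps in-flight fetches
    seen = {start_url}
    saved = []

    async def visit(url):
        async with budget:
            links = await fetch(url)
        if EXTRACT.fullmatch(url):
            saved.append(url)  # stand-in for Item extraction + save()
        # Queue every unseen link that one of the patterns accepts.
        new = []
        for link in links:
            if link not in seen and (FOLLOW.fullmatch(link) or EXTRACT.fullmatch(link)):
                seen.add(link)
                new.append(link)
        await asyncio.gather(*(visit(link) for link in new))

    await visit(start_url)
    return sorted(saved)


result = asyncio.run(crawl("https://example.test/"))
print(result)
```

The semaphore is released before newly discovered links are visited, so a small concurrency value limits parallel fetches without deadlocking the recursion. The seen set keeps a URL linked from several pages from being fetched twice.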

Examples

See the example/ directory for runnable scripts against Scrapinghub, V2EX, and Sciencenet.

Development

git clone https://github.com/elliotgao2/gain.git
cd gain
uv sync                 # install deps into .venv
uv run pytest           # run tests
uv run ruff check .     # lint

We use uv for packaging and ruff for linting and formatting. Install the pre-commit hooks:

uv run pre-commit install

Contributing

Pull requests are welcome. For non-trivial changes, please open an issue first to discuss. Make sure pytest and ruff check pass before submitting.

License

MIT © Elliot Gao

Release history

1.0.1	May 22, 2026
1.0.0	May 22, 2026 (this version)
0.1.4	Jun 19, 2017
0.1.3	Jun 6, 2017
0.1.2	Jun 5, 2017
0.1.1	Jun 2, 2017
0.1.0	May 31, 2017

Download files

Source Distribution

gain-1.0.0.tar.gz (287.3 kB)

Uploaded May 22, 2026 Source

Built Distribution

gain-1.0.0-py3-none-any.whl (9.7 kB)

Uploaded May 22, 2026 Python 3

File details

Details for the file gain-1.0.0.tar.gz.

File metadata

Download URL: gain-1.0.0.tar.gz
Upload date: May 22, 2026
Size: 287.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.9

File hashes

Hashes for gain-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`a463d43f0629a9bcfbf313837091612a0cdef54f9daaad9c8aeb842bb1dafd4a`
MD5	`4fe3fe71e7ec764e41b0dfd1c7412464`
BLAKE2b-256	`6dd00c8e9a9a965a290c609edfd8448a9795aeb65eaf4c72dc2dfb370d6212ba`

File details

Details for the file gain-1.0.0-py3-none-any.whl.

File metadata

Download URL: gain-1.0.0-py3-none-any.whl
Upload date: May 22, 2026
Size: 9.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.9

File hashes

Hashes for gain-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`76739f258ce74151d088951e4b05a8c14928371eabad0443d2996321787f8168`
MD5	`17868e215f4250dfba874cf34fb540fb`
BLAKE2b-256	`7d677e711ff3653b3987ca73ec00b6dba95c03b30c0630b6f5bc47442e9c05fa`
