gain

Async web crawling framework for everyone.

Built on asyncio, aiohttp, and lxml/pyquery. Declare items and parsers; gain handles the concurrency, retries, and persistence.
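To make "concurrency" and "retries" concrete, here is a minimal standard-library sketch of the general technique: a semaphore caps the number of in-flight requests, and each request is retried a fixed number of times. It is not gain's actual implementation. The names `fetch_with_retry`, `crawl`, and `make_flaky_fetch` are hypothetical, and the "network" is a fake fetcher that fails once per URL.

```python
import asyncio


async def fetch_with_retry(fetch, url, retries=3, delay=0.0):
    """Await fetch(url), retrying on ConnectionError (retries must be >= 1)."""
    last_error = None
    for _attempt in range(retries):
        try:
            return await fetch(url)
        except ConnectionError as exc:
            last_error = exc
            await asyncio.sleep(delay)  # pause before the next attempt
    raise last_error


async def crawl(fetch, urls, concurrency=5):
    """Fetch all urls with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(url):
        async with semaphore:
            return await fetch_with_retry(fetch, url)

    # gather() returns results in the same order as the input urls
    return await asyncio.gather(*(worker(u) for u in urls))


def make_flaky_fetch():
    """Hypothetical fetcher: the first request for each URL fails."""
    seen = set()

    async def fetch(url):
        if url not in seen:
            seen.add(url)
            raise ConnectionError(f"first attempt failed for {url}")
        return f"<html>{url}</html>"

    return fetch


pages = asyncio.run(crawl(make_flaky_fetch(), ["/a", "/b", "/c"], concurrency=2))
print(pages)
```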

Install

pip install gain

Linux users can opt into uvloop, a faster drop-in replacement for the default asyncio event loop:

pip install "gain[uvloop]"

Requires Python 3.10+.
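The source does not say whether gain switches to uvloop automatically once the extra is installed. As a sketch of how an optional uvloop dependency is commonly wired up in asyncio code, the snippet below uses uvloop's event loop when the package is importable and falls back to the standard loop otherwise. The `main` coroutine is a hypothetical placeholder.

```python
import asyncio

try:
    import uvloop  # optional: present only if the uvloop extra is installed

    new_event_loop = uvloop.new_event_loop
except ImportError:
    # Fall back to the standard asyncio event loop.
    new_event_loop = asyncio.new_event_loop


async def main():
    # Placeholder workload; a real program would start its crawl here.
    await asyncio.sleep(0)
    return "done"


loop = new_event_loop()
try:
    result = loop.run_until_complete(main())
finally:
    loop.close()
```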

Quickstart

import aiofiles
from gain import Css, Item, Parser, Spider


class Post(Item):
    title = Css(".entry-title")
    content = Css(".entry-content")

    async def save(self):
        async with aiofiles.open("scrapinghub.txt", "a+") as f:
            await f.write(self.results["title"] + "\n")


class MySpider(Spider):
    concurrency = 5
    headers = {"User-Agent": "Google Spider"}
    start_url = "https://blog.scrapinghub.com/"
    parsers = [
        Parser(r"https://blog.scrapinghub.com/page/\d+/"),
        Parser(r"https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/", Post),
    ]


MySpider.run()

Save the script as spider.py and run it:

python spider.py
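To see which URLs the two Quickstart patterns would pick up, you can test them with the standard `re` module. This is only a sketch: it assumes each pattern must match the whole URL, which may differ from how gain applies the rule internally, and the candidate URLs are made up for illustration.

```python
import re

# The two patterns from the Quickstart spider.
FOLLOW = r"https://blog.scrapinghub.com/page/\d+/"
EXTRACT = r"https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/"

# Hypothetical URLs, not taken from the real site.
candidates = [
    "https://blog.scrapinghub.com/page/2/",
    "https://blog.scrapinghub.com/2017/01/01/hello-world/",
    "https://blog.scrapinghub.com/about/",
]


def classify(url):
    """Return which kind of parser would claim this URL, if any."""
    if re.fullmatch(EXTRACT, url):
        return "extract"
    if re.fullmatch(FOLLOW, url):
        return "follow"
    return "ignore"


labels = {url: classify(url) for url in candidates}
for url, label in labels.items():
    print(f"{label:8} {url}")
```

Note that the unescaped `.` in the host name matches any character. Writing `\.` makes the pattern stricter.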

XPath parsers

from gain import Css, Item, Spider, XPathParser


class Post(Item):
    title = Css(".breadcrumb_last")

    async def save(self):
        print(self.results["title"])


class MySpider(Spider):
    start_url = "https://mydramatime.com/europe-and-us-drama/"
    concurrency = 5
    headers = {"User-Agent": "Google Spider"}
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post),
    ]
    proxy = "https://localhost:1234"


MySpider.run()
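To illustrate what the first XPath rule selects, here is a sketch using the standard library's `xml.etree.ElementTree` on a small, made-up, well-formed snippet. ElementTree supports only a subset of XPath and cannot return attributes through a trailing `/@href`, so the sketch selects the `<a>` elements and reads `href` in Python. lxml, which gain builds on, can evaluate the original expression directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, shaped to match the first XPathParser rule above.
snippet = """
<div>
  <span class="category-name"><a href="/europe-and-us-drama/">Europe and US</a></span>
  <span class="other"><a href="/ignored/">Ignored</a></span>
</div>
"""

root = ET.fromstring(snippet)

# Rough equivalent of //span[@class="category-name"]/a/@href
links = [
    a.get("href")
    for a in root.findall(".//span[@class='category-name']/a")
]
print(links)
```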

How it works

   ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐
   │  start_url │ ─▶ │  Parser    │ ─▶ │  Item      │ ─▶ │ save()     │
   │            │    │  (follow)  │    │  (extract) │    │  (persist) │
   └────────────┘    └────────────┘    └────────────┘    └────────────┘
                          ▲                                      │
                          └──────────── new urls ────────────────┘

  1. The Spider starts from start_url and fetches pages within its concurrency budget.
  2. A Parser with one argument follows links, discovering more URLs to queue. A Parser with two arguments extracts, instantiating an Item from each matching page.
  3. Items use Css / Xpath / Regex selectors to pull fields out of the HTML.
  4. save() is your async hook to persist results: write a file, push to a queue, or insert into a database.
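The flow above can be sketched with the standard library alone. This is an illustration of the follow/extract idea, not gain's code. The in-memory `SITE`, the rule lists, and the `fetch`, `visit`, and `save` helpers are all hypothetical, and it assumes links are collected from every fetched page.

```python
import asyncio
import re

# Hypothetical in-memory "site" mapping URL to HTML, so no network is needed.
SITE = {
    "/": '<a href="/page/2/">next</a> <a href="/post/a/">A</a>',
    "/page/2/": '<a href="/post/b/">B</a> <a href="/">home</a>',
    "/post/a/": "<h1>First post</h1>",
    "/post/b/": "<h1>Second post</h1>",
}

HREF = re.compile(r'href="([^"]+)"')
FOLLOW_RULES = [r"/page/\d+/"]      # like one-argument parsers: follow only
EXTRACT_RULES = [r"/post/[a-z]+/"]  # like two-argument parsers: build an item

saved = []


async def save(item):
    """Stand-in for Item.save(): persist the extracted fields."""
    saved.append(item)


async def fetch(url):
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return SITE[url]


async def crawl(start_url, concurrency=2):
    semaphore = asyncio.Semaphore(concurrency)
    seen = {start_url}

    async def visit(url):
        async with semaphore:  # cap the number of concurrent fetches
            html = await fetch(url)
        if any(re.fullmatch(rule, url) for rule in EXTRACT_RULES):
            title = re.search(r"<h1>(.*?)</h1>", html).group(1)
            await save({"url": url, "title": title})
        # Queue every unseen link that matches a follow or extract rule.
        tasks = []
        for link in HREF.findall(html):
            matches = any(
                re.fullmatch(rule, link) for rule in FOLLOW_RULES + EXTRACT_RULES
            )
            if link not in seen and matches:
                seen.add(link)
                tasks.append(visit(link))
        await asyncio.gather(*tasks)

    await visit(start_url)


asyncio.run(crawl("/"))
titles = sorted(item["title"] for item in saved)
print(titles)
```

The semaphore is released before the links on a page are visited, so nested visits cannot deadlock on the concurrency budget.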

Examples

See the example/ directory for runnable scripts against Scrapinghub, V2EX, and Sciencenet.

Development

git clone https://github.com/elliotgao2/gain.git
cd gain
uv sync                 # install deps into .venv
uv run pytest           # run tests
uv run ruff check .     # lint

We use uv for packaging and ruff for lint + format. Install the pre-commit hooks:

uv run pre-commit install

Contributing

Pull requests are welcome. For non-trivial changes, please open an issue first to discuss. Make sure pytest and ruff check pass before submitting.

License

MIT © Elliot Gao
