gain

Async web crawling framework for everyone.

Built on asyncio, aiohttp, and lxml/pyquery. Declare items and parsers; gain handles the concurrency and retries, and calls your save() hook to persist results.

Install

pip install gain

Linux users can opt into uvloop for an extra speed boost:

pip install "gain[uvloop]"

Requires Python 3.10+.
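
To confirm an environment meets the requirements above, a minimal sketch using only the standard library is shown below. The function name `environment_report` is hypothetical and not part of gain, and the check only reports whether the optional uvloop package is importable; how gain itself detects or enables uvloop is not covered here.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # taken from "Requires Python 3.10+"


def environment_report() -> dict[str, bool]:
    """Report whether this interpreter meets the minimum and can import uvloop."""
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        # find_spec returns None when the package cannot be imported
        "uvloop_available": importlib.util.find_spec("uvloop") is not None,
    }


if __name__ == "__main__":
    print(environment_report())
```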

Quickstart

import aiofiles
from gain import Css, Item, Parser, Spider


class Post(Item):
    title = Css(".entry-title")
    content = Css(".entry-content")

    async def save(self):
        async with aiofiles.open("scrapinghub.txt", "a+") as f:
            await f.write(self.results["title"] + "\n")


class MySpider(Spider):
    concurrency = 5
    headers = {"User-Agent": "Google Spider"}
    start_url = "https://blog.scrapinghub.com/"
    parsers = [
        Parser(r"https://blog.scrapinghub.com/page/\d+/"),
        Parser(r"https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/", Post),
    ]


MySpider.run()

Save the script as spider.py, then run it:

python spider.py
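
Before pointing a spider at a live site, it can help to check which URLs each parser rule would accept. The sketch below uses only the standard `re` module with the two patterns from the quickstart. The sample URLs are made up for illustration, and whole-URL matching with `re.fullmatch` is an assumption chosen for the demo, not a statement about how gain applies its rules internally.

```python
import re

# The two patterns from the quickstart's `parsers` list.
FOLLOW_RULE = r"https://blog.scrapinghub.com/page/\d+/"
EXTRACT_RULE = r"https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/"


def classify(url: str) -> str:
    """Return which quickstart rule, if any, matches the whole URL."""
    # Assumption: whole-URL matching, used here purely for illustration.
    if re.fullmatch(FOLLOW_RULE, url):
        return "follow"
    if re.fullmatch(EXTRACT_RULE, url):
        return "extract"
    return "ignored"


# Hypothetical URLs, invented to exercise the patterns.
samples = [
    "https://blog.scrapinghub.com/page/2/",
    "https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath/",
    "https://blog.scrapinghub.com/about/",
]
for url in samples:
    print(f"{classify(url):8} {url}")
```

Note that the unescaped dots in both patterns match any character, not only a literal dot. That is harmless for these hosts, but writing `\.` is stricter.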

XPath parsers

from gain import Css, Item, Spider, XPathParser


class Post(Item):
    title = Css(".breadcrumb_last")

    async def save(self):
        print(self.results["title"])


class MySpider(Spider):
    start_url = "https://mydramatime.com/europe-and-us-drama/"
    concurrency = 5
    headers = {"User-Agent": "Google Spider"}
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post),
    ]
    proxy = "https://localhost:1234"


MySpider.run()
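
The XPath expressions above select `@href` attribute values, and on many sites those values are relative paths. How gain resolves them is not shown in this README. As a hedged illustration, the standard library's `urljoin` can turn such values into absolute URLs. The href values and the helper name `absolutize` below are hypothetical.

```python
from urllib.parse import urljoin

# The start_url from the example above.
PAGE_URL = "https://mydramatime.com/europe-and-us-drama/"

# Hypothetical href values, standing in for what an `@href` XPath might select.
hrefs = ["/some-show/", "page/2/", "https://mydramatime.com/other/"]


def absolutize(page_url: str, links: list[str]) -> list[str]:
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, link) for link in links]


for url in absolutize(PAGE_URL, hrefs):
    print(url)
```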

How it works

   ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐
   │  start_url │ ─▶ │  Parser    │ ─▶ │  Item      │ ─▶ │ save()     │
   │            │    │  (follow)  │    │  (extract) │    │  (persist) │
   └────────────┘    └────────────┘    └────────────┘    └────────────┘
                          ▲                                      │
                          └──────────── new urls ────────────────┘
  1. Spider kicks off from start_url under a concurrency budget.
  2. Parsers do one of two jobs. A follow parser (one argument) discovers more URLs to queue; an extract parser (two arguments) instantiates an Item from each matching page.
  3. Items use Css / Xpath / Regex selectors to pull fields out of HTML.
  4. save() is your async hook to persist results: write a file, push to a queue, or insert into a database.
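
The four steps can be sketched with nothing but asyncio and `re`. This is a toy model under stated assumptions, not gain's implementation: the in-memory `SITE`, the `fetch` stand-in, the rules, and every function name are hypothetical, and no network access takes place. A fixed pool of workers plays the role of the concurrency budget.

```python
import asyncio
import re

# Hypothetical in-memory "site": URL -> HTML. Nothing is fetched over the network.
SITE = {
    "/": '<a href="/page/2/">next</a> <a href="/post/a/">A</a>',
    "/page/2/": '<a href="/post/b/">B</a> <a href="/post/a/">A</a>',
    "/post/a/": "<h1>Post A</h1>",
    "/post/b/": "<h1>Post B</h1>",
}
FOLLOW_RULE = r"/page/\d+/"      # step 2, follow: pages that only yield more URLs
EXTRACT_RULE = r"/post/[a-z]+/"  # step 2, extract: pages that become items
HREF = re.compile(r'href="([^"]+)"')
TITLE = re.compile(r"<h1>(.*?)</h1>")  # step 3: a one-field "selector"


async def fetch(url: str) -> str:
    await asyncio.sleep(0)  # stand-in for an HTTP request
    return SITE[url]


async def crawl(start_url: str, concurrency: int = 2) -> list[str]:
    saved = []
    seen = {start_url}
    queue = asyncio.Queue()
    queue.put_nowait(start_url)  # step 1: start from start_url

    async def save(title: str) -> None:
        saved.append(title)  # step 4: the persistence hook

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                html = await fetch(url)
                if re.fullmatch(EXTRACT_RULE, url):
                    match = TITLE.search(html)
                    if match:
                        await save(match.group(1))
                # Queue every unseen link that matches either rule.
                for link in HREF.findall(html):
                    wanted = re.fullmatch(FOLLOW_RULE, link) or re.fullmatch(EXTRACT_RULE, link)
                    if wanted and link not in seen:
                        seen.add(link)
                        queue.put_nowait(link)
            finally:
                queue.task_done()

    # The number of workers caps how many pages are in flight at once.
    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return sorted(saved)


if __name__ == "__main__":
    print(asyncio.run(crawl("/")))
```

The `seen` set keeps a page that is linked from several places from being fetched twice, and `queue.join()` is what lets the crawl end once no new URLs are being discovered.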

Examples

See the example/ directory for runnable scripts against Scrapinghub, V2EX, and Sciencenet.

Development

git clone https://github.com/elliotgao2/gain.git
cd gain
uv sync                 # install deps into .venv
uv run pytest           # run tests
uv run ruff check .     # lint

We use uv for packaging and ruff for linting and formatting. To install the pre-commit hooks:

uv run pre-commit install

Contributing

Pull requests are welcome. For non-trivial changes, please open an issue first to discuss. Make sure pytest and ruff check pass before submitting.

License

MIT © Elliot Gao
