Async scraping library

An async scraping library built on top of aiohttp and parsechain. Note that this is alpha software.

Installation

pip install aioscrape
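
A minimal sketch to try it out, using only calls that appear in the Usage example below (the URL is a placeholder):

from aioscrape import run, fetch

async def list_links(url):
    resp = await fetch(url)
    # Responses support parsechain-style extraction, e.g. .css()
    return resp.css('a').attrs('href')

print(run(list_links('https://example.com')))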

Usage

from aioscrape import run, fetch, settings, wait_all
from aioscrape.middleware import last_fetch, make_filecache
from aioscrape.utils import SOME_HEADERS  # Common headers, to not look like a bot

from urllib.parse import urljoin
from parsechain import C
from funcy import lcat, lconcat


def main():
    # Settings are scoped and can be redefined later with another "with"
    cache = make_filecache('.fcache')
    with settings(headers=SOME_HEADERS, middleware=[cache, last_fetch]):
        print(run(scrape_all()))


async def scrape_all():
    # All the settings in scope like headers and middleware are applied to fetch()
    start_page = await fetch(START_URL)  # START_URL is a placeholder for your entry page

    # AioScrape integrates with parsechain to make extracting a breeze
    urls = start_page.css('.pagingLinks a').attrs('href')
    list_urls = [urljoin(start_page.url, page_url) for page_url in urls]

    # wait_all() uses asyncio.wait() and friends to run requests in parallel
    list_pages = [start_page] + await wait_all(map(fetch, list_urls))

    # Scrape articles from every list page and flatten the per-page results
    result = lcat(await wait_all(map(scrape_articles, list_pages)))
    write_to_csv('export.csv', result)  # user-supplied helper, not shown
    return result


async def scrape_articles(list_page):
    urls = list_page.css('#headlines .titleLink').attrs('href')
    abs_urls = [urljoin(list_page.url, url) for url in urls]
    return await wait_all(map(scrape_article, abs_urls))


async def scrape_article(url):
    resp = await fetch(url)
    return resp.root.multi({
        'url': C.const(resp.url),
        'title': C.microdata('headline').first,
        'date': C.microdata('datePublished').first,
        'text': C.microdata('articleBody').first,
        'contacts': C.css('.sidebars .contact p')
                     # html_to_text() is a user-supplied helper, not shown
                     .map(C.inner_html + html_to_text) + lconcat + ''.join,
    })


if __name__ == '__main__':
    main()
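
The settings() context in main() is scoped; here is a sketch of layering them, assuming an inner settings() block overrides the outer one for its duration (names reuse the example above):

with settings(headers=SOME_HEADERS):
    with settings(middleware=[cache]):
        ...  # fetch() here sees both the headers and the cache middleware
    ...  # fetch() here sees only the headers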

TODO

  • Response.follow()
  • Response.abs()
  • Non-GET requests
  • Work with forms
