Skip to main content

Async scraping library

Project description

A scraping library on top of aiohttp and parsechain. Note that this is alpha software.

Installation

pip install aioscrape

Usage

from aioscrape import run, fetch, settings
from aioscrape.middleware import last_fetch, make_filecache
from aioscrape.utils import SOME_HEADERS # To not look like a bot

from urllib.parse import urljoin
from parsechain import C
from funcy import lcat, lconcat


def main():
    """Configure scraping settings, then run the full scrape and print its result."""
    # Settings are scoped: anything inside this "with" sees the cache,
    # headers and middleware; they can be redefined later by a nested "with".
    file_cache = make_filecache('.fcache')
    with settings(headers=SOME_HEADERS, middleware=[file_cache, last_fetch]):
        print(run(scrape_all()))


async def scrape_all():
    """Scrape every article from the start page and all paging links.

    Fetches START_URL, discovers the paging links, scrapes the articles on
    every listing page in parallel, writes the flattened result to
    export.csv, and returns the list of scraped article dicts.
    """
    # All the settings in scope like headers and middleware are applied to fetch()
    start_page = await fetch(START_URL)

    # AioScrape integrates with parsechain to make extracting a breeze
    urls = start_page.css('.pagingLinks a').attrs('href')
    list_urls = [urljoin(start_page.url, page_url) for page_url in urls]

    # Using asyncio.wait() and friends to run requests in parallel
    list_pages = [start_page] + await wait_all(map(fetch, list_urls))

    # Scrape the articles from every listing page and flatten into one list
    result = lcat(await wait_all(map(scrape_articles, list_pages)))
    write_to_csv('export.csv', result)
    # Fix: previously nothing was returned, so main()'s
    # print(run(scrape_all())) always printed None.
    return result


async def scrape_articles(list_page):
    """Scrape every article linked from a single listing page, in parallel."""
    hrefs = list_page.css('#headlines .titleLink').attrs('href')
    # Headline links may be relative; resolve them against the listing page URL.
    absolute_urls = [urljoin(list_page.url, href) for href in hrefs]
    return await wait_all([scrape_article(u) for u in absolute_urls])


async def scrape_article(url):
    """Fetch one article page and extract its fields into a dict."""
    resp = await fetch(url)
    # parsechain: multi() runs each extractor chain against the document
    # root and collects the results under the corresponding keys.
    return resp.root.multi({
        'url': C.const(resp.url),
        'title': C.microdata('headline').first,
        'date': C.microdata('datePublished').first,
        'text': C.microdata('articleBody').first,
        # Select every contact <p>, turn its inner HTML into text, then
        # flatten and join everything into a single string.
        # NOTE(review): `+` here composes parsechain steps, not string
        # concatenation — semantics come from the parsechain library.
        'contacts': C.css('.sidebars .contact p')
                     .map(C.inner_html + html_to_text) + lconcat + ''.join,
    })


# Run the scraper only when executed as a script, not when imported.
if __name__ == '__main__':
    main()

TODO

  • Response.follow()
  • non-GET requests
  • work with forms

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aioscrape-0.0.2.tar.gz (6.5 kB view hashes)

Uploaded source

Built Distribution

aioscrape-0.0.2-py2.py3-none-any.whl (9.0 kB view hashes)

Uploaded py2.py3

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Huawei Huawei PSF Sponsor Microsoft Microsoft PSF Sponsor NVIDIA NVIDIA PSF Sponsor Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page