Async scraping library
A scraping library on top of aiohttp and parsechain. Note that this is alpha software.
Installation
pip install aioscrape
Usage
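Before the full example, here is a minimal sketch of the core API: fetch() downloads and parses a page, run() drives the event loop and returns the result. The URL and CSS selector below are placeholders.

from aioscrape import run, fetch

async def list_links(url):
    page = await fetch(url)             # fetch and parse an HTML page asynchronously
    return page.css('a').attrs('href')  # parsechain-style extraction

print(run(list_links('https://example.com')))  # placeholder URL

A fuller, end-to-end example: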
from aioscrape import run, fetch, settings
from aioscrape.middleware import last_fetch, make_filecache
from aioscrape.utils import SOME_HEADERS  # to not look like a bot

from urllib.parse import urljoin
from parsechain import C
from funcy import lcat, lconcat


def main():
    # Settings are scoped and can be redefined later with another "with"
    cache = make_filecache('.fcache')
    with settings(headers=SOME_HEADERS, middleware=[cache, last_fetch]):
        print(run(scrape_all()))


async def scrape_all():
    # All the settings in scope, such as headers and middleware, are applied to fetch()
    start_page = await fetch(START_URL)

    # AioScrape integrates with parsechain to make extracting a breeze
    urls = start_page.css('.pagingLinks a').attrs('href')
    list_urls = [urljoin(start_page.url, page_url) for page_url in urls]

    # Use asyncio.wait() and friends to run requests in parallel
    list_pages = [start_page] + await wait_all(map(fetch, list_urls))

    # Scrape articles from every list page
    result = lcat(await wait_all(map(scrape_articles, list_pages)))
    write_to_csv('export.csv', result)


async def scrape_articles(list_page):
    urls = list_page.css('#headlines .titleLink').attrs('href')
    abs_urls = [urljoin(list_page.url, url) for url in urls]
    return await wait_all(map(scrape_article, abs_urls))


async def scrape_article(url):
    resp = await fetch(url)
    return resp.root.multi({
        'url': C.const(resp.url),
        'title': C.microdata('headline').first,
        'date': C.microdata('datePublished').first,
        'text': C.microdata('articleBody').first,
        'contacts': C.css('.sidebars .contact p')
                     .map(C.inner_html + html_to_text) + lconcat + ''.join,
    })


if __name__ == '__main__':
    main()
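The example above leaves a few names undefined: START_URL, wait_all(), write_to_csv() and html_to_text(). START_URL is whatever listing page you start from; wait_all() may be provided by your version of aioscrape, and the other helpers are ordinary functions you supply yourself. Purely as a sketch, stand-ins could look like this:

import asyncio
import csv
import re

START_URL = 'https://example.com/news'  # placeholder: the listing page to start from

async def wait_all(aws):
    # Run the awaitables concurrently and return their results in order.
    return await asyncio.gather(*aws)

def html_to_text(html):
    # Crude tag stripper; swap in a real HTML-to-text converter if you need one.
    return re.sub(r'<[^>]+>', ' ', html).strip()

def write_to_csv(filename, rows):
    # rows is a list of dicts as produced by scrape_article()
    with open(filename, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

asyncio.gather() keeps results in the same order as its inputs, which matches how the example combines start_page with the remaining list pages.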
TODO
- Response.follow()
- Response.abs()
- non-GET requests
- work with forms