Pyscalpel
Your easy-to-use, fast and powerful web scraping library.
Why?
I already knew Scrapy, which is the reference library for web scraping in Python, but two things bothered me:
- I feel that Scrapy does not integrate well into an existing project; you have to treat your web scraping code as a project of its own.
- It relies on Twisted, a veteran of asynchronous programming, but I think there are better asynchronous frameworks today. Note that this second point is no longer true as I write this document, since Scrapy has added support for asyncio.
After making these observations, I decided to create pyscalpel. And let's be honest, I also wanted to have my own web scraping library, and it is fun to write one ;)
Installation
```bash
pip install pyscalpel[gevent]  # to install the gevent backend
pip install pyscalpel[trio]    # to install the trio backend
pip install pyscalpel[full]    # to install all the backends
```
If you know poetry, you can use it instead of pip.
```bash
poetry add pyscalpel[gevent]  # to install the gevent backend
poetry add pyscalpel[trio]    # to install the trio backend
poetry add pyscalpel[full]    # to install all the backends
```
pyscalpel works with Python 3.6 and higher. It relies on robust packages:
- configuror: A configuration toolkit.
- httpx: A modern HTTP client.
- selenium: A library for controlling a browser.
- gevent: An asynchronous framework using the synchronous way. (optional)
- trio: A modern asynchronous framework using async/await syntax. (optional)
- parsel: A library to extract elements from HTML/XML documents (see the short example after this list).
- attrs: A library helping to write classes without pain.
- fake-useragent: A simple library to fake a user agent.
- rfc3986: A library for URL parsing and validation.
- msgpack: A library allowing fast serialization/deserialization of data structures.
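The spider responses in the usage examples below expose parsel-style selectors, which is why they call `.xpath()`, `.get()` and `.getall()`. As a minimal, standalone sketch of that selector API (using parsel directly, independently of pyscalpel):

```python
from parsel import Selector

# Parse a small HTML snippet and extract text with an XPath expression.
html = '<div class="quote"><span class="text">Hello world</span></div>'
selector = Selector(text=html)
print(selector.xpath('//span[@class="text"]/text()').get())  # prints: Hello world
```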
Documentation
The documentation is in progress.
Usage
To give you an overview of what can be done, here is a simple example of quote scraping. Don't hesitate to look at the examples folder for more snippets.
with gevent
```python
from pathlib import Path

from scalpel import Configuration
from scalpel.green import StaticSpider, StaticResponse, read_mp


def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        data = {
            'message': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('./span/small/text()').get(),
            'tags': quote.xpath('./div/a/text()').getall()
        }
        spider.save_item(data)

    next_link = response.xpath('//nav/ul/li[@class="next"]/a').xpath('@href').get()
    if next_link is not None:
        response.follow(next_link)


if __name__ == '__main__':
    backup = Path(__file__).parent / 'backup.mp'
    config = Configuration(backup_filename=f'{backup}')
    spider = StaticSpider(urls=['http://quotes.toscrape.com'], parse=parse, config=config)
    spider.run()
    print(spider.statistics())
    # you can do whatever you want with the results
    for quote_data in read_mp(filename=backup, decoder=spider.config.msgpack_decoder):
        print(quote_data)
```
with trio
```python
from pathlib import Path

import trio

from scalpel import Configuration
from scalpel.trionic import StaticResponse, StaticSpider, read_mp


async def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        data = {
            'message': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('./span/small/text()').get(),
            'tags': quote.xpath('./div/a/text()').getall()
        }
        await spider.save_item(data)

    next_link = response.xpath('//nav/ul/li[@class="next"]/a').xpath('@href').get()
    if next_link is not None:
        await response.follow(next_link)


async def main():
    backup = Path(__file__).parent / 'backup.mp'
    config = Configuration(backup_filename=f'{backup}')
    spider = StaticSpider(urls=['http://quotes.toscrape.com'], parse=parse, config=config)
    await spider.run()
    print(spider.statistics())
    # you can do whatever you want with the results
    async for item in read_mp(backup, decoder=spider.config.msgpack_decoder):
        print(item)


if __name__ == '__main__':
    trio.run(main)
```
Known limitations
pyscalpel aims to handle SPAs (single page applications) through the use of selenium. However, due to the synchronous nature of selenium, it is hard to leverage the asynchronous features of trio and gevent. You will notice that the selenium spider is slower than the static spider. For more information, look at the documentation.
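For illustration only, here is a rough sketch of what a selenium-backed spider could look like if it mirrors the static API shown above. The `SeleniumSpider` and `SeleniumResponse` names are assumptions on my part; check the documentation for the actual class names and import paths.

```python
from scalpel import Configuration
# SeleniumSpider / SeleniumResponse are assumed names mirroring the static API; check the docs.
from scalpel.green import SeleniumSpider, SeleniumResponse


def parse(spider: SeleniumSpider, response: SeleniumResponse) -> None:
    # The page is rendered by a real browser, so JavaScript-generated content is selectable.
    for quote in response.xpath('//div[@class="quote"]/span[@class="text"]/text()').getall():
        spider.save_item({'message': quote})


if __name__ == '__main__':
    spider = SeleniumSpider(
        urls=['http://quotes.toscrape.com/js'],  # a JavaScript-rendered version of the quotes site
        parse=parse,
        config=Configuration(),
    )
    spider.run()
```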
Warning
pyscalpel is a young project, so expect breaking changes to the API without respect for semver principles. It is recommended to pin the version you are using for now.
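For example, you can pin the exact version when installing (0.1.0 is used here purely as an illustration):

```bash
pip install "pyscalpel[gevent]==0.1.0"  # pin an exact version until the API stabilizes
```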
File details
Details for the file pyscalpel-0.1.0.tar.gz.
File metadata
- Download URL: pyscalpel-0.1.0.tar.gz
- Upload date:
- Size: 32.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.0.5 CPython/3.6.10 Linux/4.15.0-121-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 14732514d84a78ed69b16c10f88878a23ee3013fcf434c0e5be1a2da22f3cb96
MD5 | 240ffabde60a405b1118618affc7cc65
BLAKE2b-256 | 4d0e4015f8439a252f061f0f52a681235b3d74df68556bcb2a883d3eb1780eb0
File details
Details for the file pyscalpel-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pyscalpel-0.1.0-py3-none-any.whl
- Upload date:
- Size: 40.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.0.5 CPython/3.6.10 Linux/4.15.0-121-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | e558e5e07d9932c8521e8a373e3858705267fd2293936822b66e75666de7730e
MD5 | 6892a24e89d73bf2fd9b49e74335f1c6
BLAKE2b-256 | a99fdba256b244b0d86c32d9ed4bdd9d53b60ecbce7317ec263b5f4335638e99