The fastest web crawler written in Rust, ported to Python.
spider-py
The spider project ported to Python.
Getting Started
pip install spider_rs
import asyncio
from spider_rs import crawl

async def main():
    website = await crawl("https://choosealicense.com")
    print(website.links)
    # print(website.pages)

asyncio.run(main())
Use the Website class to build the crawler you need.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com", False).with_headers({"authorization": "myjwttoken"})
    website.crawl()
    print(website.get_links())

asyncio.run(main())
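Once the crawl finishes, the links can be post-processed with the standard library. The following is a minimal sketch that splits links into internal and external ones. It assumes `get_links()` returns a list of absolute URL strings, which this README does not confirm. `split_links` is a hypothetical helper, and the sample list stands in for real crawl output so the sketch runs without a network.

```python
from urllib.parse import urlparse


def split_links(links, base_url):
    """Split links into (internal, external) by comparing hostnames."""
    base_host = urlparse(base_url).hostname
    internal, external = [], []
    for link in links:
        if urlparse(link).hostname == base_host:
            internal.append(link)
        else:
            external.append(link)
    return internal, external


# Sample data standing in for website.get_links() (assumed shape).
sample_links = [
    "https://choosealicense.com/licenses/",
    "https://github.com/github/choosealicense.com",
]
internal, external = split_links(sample_links, "https://choosealicense.com")
print(internal)
print(external)
```

In a real run, `sample_links` would be replaced by the value returned from `website.get_links()`, provided it has the assumed shape.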
Real-time subscriptions are also supported: pass a callable to crawl and it is invoked with each page.
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")

    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com", False)
    website.crawl(Subscription())

asyncio.run(main())
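A subscription is just a callable that receives each page, so it can record results instead of printing them. The following is a minimal sketch that relies only on the `url` and `status_code` attributes used above, and assumes `status_code` is an integer. `CollectingSubscription` and `FakePage` are hypothetical names for illustration; the stand-in page lets the sketch run without a crawl.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Hypothetical stand-in exposing the two attributes used above."""
    url: str
    status_code: int  # assumed to be an integer


@dataclass
class CollectingSubscription:
    """Hypothetical subscription that stores (url, status_code) pairs."""
    seen: list = field(default_factory=list)

    def __call__(self, page):
        # Called once per page, like Subscription.__call__ above.
        self.seen.append((page.url, page.status_code))

    def failures(self):
        # URLs whose status code is 400 or higher.
        return [url for url, status in self.seen if status >= 400]


subscription = CollectingSubscription()
subscription(FakePage("https://choosealicense.com/", 200))
subscription(FakePage("https://choosealicense.com/missing", 404))
print(subscription.failures())
```

In a real crawl the same object would be passed as `website.crawl(subscription)`, mirroring the `Subscription()` example above.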
Development
Install maturin (pipx install maturin) and Python, then run:
maturin develop
Benchmarks
View bench to see the results. The library should run at least 2x faster than Scrapy for normal workflows and up to 100,000x faster for large websites. The speedup grows with site size because the crawl uses Rust's isolated concurrency.
Todo
- Fix assignment of custom HTTP headers.
- Add better docs.
- Fix benchmarks.
Once these items are done, the base of the module should be complete. Most of the code comes from the initial port to Node.js.
Source Distribution
spider_rs-0.0.8.tar.gz (30.5 kB)
Built Distribution
Hashes for spider_rs-0.0.8-cp39-cp39-macosx_11_0_arm64.whl
Algorithm | Hash digest
---|---
SHA256 | 3c90cd9fbbfe41459049c0dcfeb93aa76471e969cac2215c80e686045f0150cd
MD5 | 44429b8bb205bfc92f87faba2d783758
BLAKE2b-256 | d10f0ffe6bb57870ca60c5558bf8bddbafa51f89aba4a5ae573f2744ab872494