
🔍 A powerful web-crawling framework, based on aiohttp.

Features

  • Write your crawler in one Python script with asyncio

  • Schedule tasks with priority, fingerprint, exetime, recrawl… (see the scheduling sketch after this list)

  • Middleware: add handlers before or after a task’s execution

  • Simple shortcuts to speed up scripting

  • Parse HTML conveniently with Parsel

  • Parse with rules and chained processors

  • Support JavaScript/browser-automation with pyppeteer

  • Stop and Resume: crawl periodically and persistently

  • Distributed work support with Redis
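
A minimal scheduling sketch, assuming the keyword names hinted at in the feature list (priority, exetime, recrawl); the NewsCrawler class and URLs below are hypothetical and the exact acrawler signature is not verified here:

import time

from acrawler import Crawler, Request


class NewsCrawler(Crawler):
    async def start_requests(self):
        # assumed: higher-priority tasks are dispatched first
        yield Request("https://example.com/latest", priority=10, callback=self.parse)
        # assumed: exetime delays execution, recrawl re-queues the task periodically
        yield Request(
            "https://example.com/archive",
            exetime=time.time() + 60,
            recrawl=3600,
            callback=self.parse,
        )

    def parse(self, response):
        # response.sel is a Parsel selector (see the samples below)
        print(response.sel.css("title::text").get())


if __name__ == "__main__":
    NewsCrawler().run()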

Installation

To install, simply use pipenv (or pip):

$ pipenv install acrawler

(Optional)
$ pipenv install uvloop      # only Linux/macOS, for a faster asyncio event loop
$ pipenv install aioredis    # if you need Redis support
$ pipenv install motor       # if you need MongoDB support
$ pipenv install aiofiles    # if you need FileRequest
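
If you install uvloop, a standard way to opt into its faster event loop in your own script looks like the sketch below; whether acrawler enables uvloop automatically is not verified here, so treat the explicit call as an assumption:

try:
    import uvloop
    uvloop.install()  # make asyncio use uvloop's event loop policy
except ImportError:
    pass  # uvloop is Linux/macOS only; fall back to the default asyncio loop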

Documentation

Documentation and tutorial are available online at https://acrawler.readthedocs.io/ and in the docs directory.

Sample Code

Scrape imdb.com

from acrawler import Crawler, Request, ParselItem, Handler, register, get_logger


class MovieItem(ParselItem):
   log = True
   css = {
      # just some normal css rules
      # see Parsel for detailed information
      "date": ".subtext a[href*=releaseinfo]::text",
      "time": ".subtext time::text",
      "rating": "span[itemprop=ratingValue]::text",
      "rating_count": "span[itemprop=ratingCount]::text",
      "metascore": ".metacriticScore span::text",

      # if you provide a list with additional functions,
      # they are treated as field processor functions
      "title": ["h1::text", str.strip],

      # the following four rules extract all matching values;
      # unlike normal rules, they start with [ and end with ]
      "genres": "[.subtext a[href*=genres]::text]",
      "director": "[h4:contains(Director) ~ a[href*=name]::text]",
      "writers": "[h4:contains(Writer) ~ a[href*=name]::text]",
      "stars": "[h4:contains(Star) ~ a[href*=name]::text]",
   }


class IMDBCrawler(Crawler):
   config = {"MAX_REQUESTS": 4, "DOWNLOAD_DELAY": 1}

   async def start_requests(self):
      yield Request("https://www.imdb.com/chart/moviemeter", callback=self.parse)

   def parse(self, response):
      yield from response.follow(
            ".lister-list tr .titleColumn a::attr(href)", callback=self.parse_movie
      )

   def parse_movie(self, response):
      url = response.url_str
      yield MovieItem(response.sel, extra={"url": url.split("?")[0]})


@register()
class HorrorHandler(Handler):
   family = "MovieItem"
   logger = get_logger("horrorlog")

   async def handle_after(self, item):
      if item["genres"] and "Horror" in item["genres"]:
            self.logger.warning(f"({item['title']}) is a horror movie!!!!")


@MovieItem.bind()
def process_time(value):
   # a user-defined field processor function
   # converts the running time string to minutes
   # '3h 1min' -> 181
   if value:
      res = 0
      segs = value.split(" ")
      for seg in segs:
            if seg.endswith("min"):
               res += int(seg.replace("min", ""))
            elif seg.endswith("h"):
               res += 60 * int(seg.replace("h", ""))
      return res
   return value


if __name__ == "__main__":
   IMDBCrawler().run()

Scrape quotes.toscrape.com

# Scrape quotes from http://quotes.toscrape.com/
from acrawler import Parser, Crawler, ParselItem, Request, get_logger


logger = get_logger("quotes")


class QuoteItem(ParselItem):
   log = True
   default = {"type": "quote"}
   css = {"author": "small.author::text"}
   xpath = {"text": ['.//span[@class="text"]/text()', lambda s: s.strip("“")[:20]]}


class AuthorItem(ParselItem):
   log = True
   default = {"type": "author"}
   css = {"name": "h3.author-title::text", "born": "span.author-born-date::text"}

class QuoteCrawler(Crawler):

   main_page = r"quotes.toscrape.com/page/\d+"
   author_page = r"quotes.toscrape.com/author/.*"
   parsers = [
      Parser(
            in_pattern=main_page,
            follow_patterns=[main_page, author_page],
            item_type=QuoteItem,
            css_divider=".quote",
      ),
      Parser(in_pattern=author_page, item_type=AuthorItem),
   ]

   async def start_requests(self):
      yield Request(url="http://quotes.toscrape.com/page/1/")


if __name__ == "__main__":
   QuoteCrawler().run()

See examples.

Todo

  • Add delta_key support for requests

  • Give each crawler a name so crawlers can be distinguished

  • Command-line config support

  • Web-based monitoring of all crawlers

  • Write detailed documentation

  • Write tests

