
🔍 A powerful web-crawling framework, based on aiohttp.

Features

  • Write your crawler in one Python script with asyncio

  • Schedule tasks with priority, fingerprint, exetime, recrawl… (see the scheduling sketch after this list)

  • Middleware: add handlers before or after a task’s execution

  • Simple shortcuts to speed up scripting

  • Parse HTML conveniently with Parsel

  • Parse with rules and chained processors

  • Support JavaScript/browser-automation with pyppeteer

  • Stop and Resume: crawl periodically and persistently

  • Distributed work support with Redis
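
A minimal scheduling sketch, assuming the keyword names hinted at in the feature list (priority, exetime, recrawl); the NewsCrawler class and URLs below are hypothetical and the exact acrawler signature is not verified here:

import time

from acrawler import Crawler, Request


class NewsCrawler(Crawler):
    async def start_requests(self):
        # assumed: higher-priority tasks are dispatched first
        yield Request("https://example.com/latest", priority=10, callback=self.parse)
        # assumed: exetime delays execution, recrawl re-queues the task periodically
        yield Request(
            "https://example.com/archive",
            exetime=time.time() + 60,
            recrawl=3600,
            callback=self.parse,
        )

    def parse(self, response):
        # response.sel is a Parsel selector (see the samples below)
        print(response.sel.css("title::text").get())


if __name__ == "__main__":
    NewsCrawler().run()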

Installation

To install, simply use pipenv (or pip):

$ pipenv install acrawler

(Optional)
$ pipenv install uvloop      # only Linux/macOS, for a faster asyncio event loop
$ pipenv install aioredis    # if you need Redis support
$ pipenv install motor       # if you need MongoDB support
$ pipenv install aiofiles    # if you need FileRequest
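
If you install uvloop, a standard way to opt into its faster event loop in your own script looks like the sketch below; whether acrawler enables uvloop automatically is not verified here, so treat the explicit call as an assumption:

try:
    import uvloop
    uvloop.install()  # make asyncio use uvloop's event loop policy
except ImportError:
    pass  # uvloop is Linux/macOS only; fall back to the default asyncio loop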

Documentation

Documentation and tutorial are available online at https://acrawler.readthedocs.io/ and in the docs directory.

Sample Code

Scrape imdb.com

from acrawler import Crawler, Request, ParselItem, Handler, register, get_logger


class MovieItem(ParselItem):
   log = True
   css = {
      # just some normal css rules
      # see Parsel for detailed information
      "date": ".subtext a[href*=releaseinfo]::text",
      "time": ".subtext time::text",
      "rating": "span[itemprop=ratingValue]::text",
      "rating_count": "span[itemprop=ratingCount]::text",
      "metascore": ".metacriticScore span::text",

      # if you provide a list with additional functions,
      # they are treated as field processor functions
      "title": ["h1::text", str.strip],

      # the following four rules extract all matching values;
      # unlike normal rules, they start with [ and end with ]
      "genres": "[.subtext a[href*=genres]::text]",
      "director": "[h4:contains(Director) ~ a[href*=name]::text]",
      "writers": "[h4:contains(Writer) ~ a[href*=name]::text]",
      "stars": "[h4:contains(Star) ~ a[href*=name]::text]",
   }


class IMDBCrawler(Crawler):
   config = {"MAX_REQUESTS": 4, "DOWNLOAD_DELAY": 1}

   async def start_requests(self):
      yield Request("https://www.imdb.com/chart/moviemeter", callback=self.parse)

   def parse(self, response):
      yield from response.follow(
            ".lister-list tr .titleColumn a::attr(href)", callback=self.parse_movie
      )

   def parse_movie(self, response):
      url = response.url_str
      yield MovieItem(response.sel, extra={"url": url.split("?")[0]})


@register()
class HorrorHandler(Handler):
   family = "MovieItem"
   logger = get_logger("horrorlog")

   async def handle_after(self, item):
      if item["genres"] and "Horror" in item["genres"]:
            self.logger.warning(f"({item['title']}) is a horror movie!!!!")


@MovieItem.bind()
def process_time(value):
   # a user-defined field processor function
   # converts the running time string to minutes
   # '3h 1min' -> 181
   if value:
      res = 0
      segs = value.split(" ")
      for seg in segs:
            if seg.endswith("min"):
               res += int(seg.replace("min", ""))
            elif seg.endswith("h"):
               res += 60 * int(seg.replace("h", ""))
      return res
   return value


if __name__ == "__main__":
   IMDBCrawler().run()

Scrape quotes.toscrape.com

# Scrape quotes from http://quotes.toscrape.com/
from acrawler import Parser, Crawler, ParselItem, Request, get_logger


logger = get_logger("quotes")


class QuoteItem(ParselItem):
   log = True
   default = {"type": "quote"}
   css = {"author": "small.author::text"}
   xpath = {"text": ['.//span[@class="text"]/text()', lambda s: s.strip("“")[:20]]}


class AuthorItem(ParselItem):
   log = True
   default = {"type": "author"}
   css = {"name": "h3.author-title::text", "born": "span.author-born-date::text"}

class QuoteCrawler(Crawler):

   main_page = r"quotes.toscrape.com/page/\d+"
   author_page = r"quotes.toscrape.com/author/.*"
   parsers = [
      Parser(
            in_pattern=main_page,
            follow_patterns=[main_page, author_page],
            item_type=QuoteItem,
            css_divider=".quote",
      ),
      Parser(in_pattern=author_page, item_type=AuthorItem),
   ]

   async def start_requests(self):
      yield Request(url="http://quotes.toscrape.com/page/1/")


if __name__ == "__main__":
   QuoteCrawler().run()

See examples.

Todo

  • Add delta_key support for requests

  • Give each crawler a name so crawlers can be distinguished

  • Command-line config support

  • Web-based monitoring of all crawlers

  • Write detailed documentation

  • Write tests

