A scalable frontier for web crawlers

Project description

new_frontera

Overview

new_frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, which together let you build a large-scale online web crawler.

new_frontera takes care of the logic and policies to follow during the crawl. It stores and prioritizes links extracted by the crawler to decide which pages to visit next, and it can do so in a distributed manner.
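To make the idea of a frontier concrete, here is a minimal sketch of "store links, prioritize them, hand out the next batch". It is a toy written for illustration only: `ToyFrontier`, `add`, `next_batch` and the score values are hypothetical names and are not part of the new_frontera API.

```python
import heapq
from itertools import count


class ToyFrontier:
    """Illustrative only: a hypothetical class, not the new_frontera API."""

    def __init__(self):
        self._heap = []        # entries: (-score, insertion order, url)
        self._seen = set()     # URLs already scheduled, so nothing is queued twice
        self._order = count()  # tie-breaker: equal scores keep insertion order

    def add(self, url, score=0.0):
        """Schedule a URL once; higher scores are handed out earlier."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-score, next(self._order), url))
        return True

    def next_batch(self, max_requests=2):
        """Return up to max_requests URLs, highest score first."""
        batch = []
        while self._heap and len(batch) < max_requests:
            _, _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch


frontier = ToyFrontier()
frontier.add("https://example.com/", score=1.0)
frontier.add("https://example.com/about", score=0.2)
frontier.add("https://example.com/news", score=0.8)
frontier.add("https://example.com/", score=5.0)  # duplicate URL, ignored

first = frontier.next_batch(max_requests=2)
second = frontier.next_batch(max_requests=2)
print(first)
print(second)
```

A real frontier also has to persist its state and share it between processes, which this in-memory sketch does not attempt.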

Main features

  • Online operation: small request batches, with parsing done right after fetching.
  • Pluggable backend architecture: low-level backend access logic is separated from crawling strategy.
  • Two run modes: single process and distributed.
  • Built-in SQLAlchemy, Redis and HBase backends.
  • Built-in Apache Kafka and ZeroMQ message buses.
  • Built-in crawling strategies: breadth-first, depth-first, Discovery (with support for robots.txt and sitemaps).
  • Battle tested: our biggest deployment is 60 spiders/strategy workers delivering 50-60M documents daily for 45 days without downtime.
  • Transparent data flow, making it easy to integrate custom components using Kafka.
  • Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
  • Optional use of Scrapy for fetching and parsing.
  • 3-clause BSD license, allowing use in any commercial product.
  • Python 3 support.
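The separation between backend access logic and crawling strategy mentioned in the list above can be sketched as follows. This is an assumption-laden illustration, not the library's implementation: `MemoryBackend`, `breadth_first`, `depth_first`, `crawl_order` and the `LINKS` graph are all hypothetical names invented for this example. The storage class only knows how to keep and return items, while each strategy decides which item comes next.

```python
from collections import deque


class MemoryBackend:
    """Hypothetical storage layer: keeps items, knows nothing about crawl policy."""

    def __init__(self):
        self._items = deque()

    def push(self, item):
        self._items.append(item)

    def pop_oldest(self):
        return self._items.popleft()

    def pop_newest(self):
        return self._items.pop()

    def __len__(self):
        return len(self._items)


def breadth_first(backend):
    """Strategy: visit pages in the order they were discovered (FIFO)."""
    return backend.pop_oldest()


def depth_first(backend):
    """Strategy: follow the most recently discovered link first (LIFO)."""
    return backend.pop_newest()


def crawl_order(links, start, pick_next):
    """Walk a static link graph, letting the strategy choose each next page."""
    backend = MemoryBackend()
    backend.push(start)
    seen = {start}
    order = []
    while len(backend):
        page = pick_next(backend)
        order.append(page)
        for child in links.get(page, []):
            if child not in seen:
                seen.add(child)
                backend.push(child)
    return order


# Hypothetical site structure standing in for fetched and parsed pages.
LINKS = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/b/1"]}

bfs_order = crawl_order(LINKS, "/", breadth_first)
dfs_order = crawl_order(LINKS, "/", depth_first)
print(bfs_order)
print(dfs_order)
```

Swapping the strategy changes the visiting order without touching the storage code, which is the point of keeping the two layers apart.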

Installation

Development version:

$ pip install git+https://github.com/ZeroCool940711/new_frontera.git

or from PyPI:

$ pip install new-frontera

Documentation

Community

Join our Google group at https://groups.google.com/a/scrapinghub.com/forum/#!forum/frontera or check GitHub issues and pull requests.

Download files

Download the file for your platform.

Source Distribution

new_frontera-0.9.0.tar.gz (128.0 kB)

Uploaded Source

Built Distribution

new_frontera-0.9.0-py3-none-any.whl (125.2 kB)

Uploaded Python 3

File details

Details for the file new_frontera-0.9.0.tar.gz.

File metadata

  • Download URL: new_frontera-0.9.0.tar.gz
  • Upload date:
  • Size: 128.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for new_frontera-0.9.0.tar.gz
Algorithm     Hash digest
SHA256        36fbbfa932c2799463abd2f51b9296410c08f879044000c78d65c9efaeda731e
MD5           aad9dc99a77d5b6f4d84564150f8930d
BLAKE2b-256   66575731e4d6fe79f265ea113c457f4c75c53b717d863d26de5b48a5cb1391f3

File details

Details for the file new_frontera-0.9.0-py3-none-any.whl.

File metadata

  • Download URL: new_frontera-0.9.0-py3-none-any.whl
  • Upload date:
  • Size: 125.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for new_frontera-0.9.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        6a6c1dd1196cf0fab235ecbc058a91a2c344231c2500f5c98d37713df01cc4a2
MD5           cd0b20a217561ba6b0bc48b6a14a929a
BLAKE2b-256   ab72aed55c3d88901f8bb7b83ca83daf73ecc9336b1a0561ec1e89e33eeaf9ff
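
If you download one of the files manually, you can compare it against the SHA256 digests listed above. The following is a minimal sketch using only the Python standard library. The helper names (`sha256_of_stream`, `sha256_of_file`, `matches_published_hash`) are made up for this example, and it assumes the file was saved under its original name.

```python
import hashlib
import hmac
import io

# SHA256 digests copied from the "File hashes" tables above.
EXPECTED_SHA256 = {
    "new_frontera-0.9.0.tar.gz":
        "36fbbfa932c2799463abd2f51b9296410c08f879044000c78d65c9efaeda731e",
    "new_frontera-0.9.0-py3-none-any.whl":
        "6a6c1dd1196cf0fab235ecbc058a91a2c344231c2500f5c98d37713df01cc4a2",
}


def sha256_of_stream(stream, chunk_size=65536):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path):
    """Return the hex SHA256 digest of the file at path."""
    with open(path, "rb") as fh:
        return sha256_of_stream(fh)


def matches_published_hash(actual_hex, filename):
    """Compare a computed digest with the published one for filename."""
    return hmac.compare_digest(actual_hex, EXPECTED_SHA256[filename])


# Self-check of the hashing helper on in-memory data (not a real download).
demo_digest = sha256_of_stream(io.BytesIO(b"abc"))
print(demo_digest)
print(matches_published_hash(demo_digest, "new_frontera-0.9.0.tar.gz"))
```

For a real download, call `matches_published_hash(sha256_of_file(path), filename)`. pip can also enforce hashes itself through its hash-checking mode (`--require-hashes` with `--hash=sha256:...` entries in a requirements file).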
